1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243
|
This file contains a short description about the way states and look-aheads
(LA's) are constructed. The example does not cover new information. In fact,
it follows the construction of the states of the grammar given in example 4.42
in Aho et al.
A more extensive description is found in documentation/manual/algorithm.yo
The grammar has the following production rules (numbered 1-3):
1 S' -> S
2 S -> CC
3 C -> cC
4 C -> d
Production rule 1 is added to the grammar, representing the so-called
augmented grammar. When this rule is recognized at end-of-input (represented
by $) then the input is grammatically correct, and the parser will flag
`success'.
$ is called a `look ahead' (LA) token. LA's may be in sets, which are used to
determine whether a rule (when potentially recognized) will be reduced or
not. So, when recognizing S, it is only reduced to S' (by rule S' -> S) if $
is indeed the next input token (i.e., if the end-of-input has been reached.
With the starting rule an `item' is associated. An items consists of a rule,
having a `dot' and a LA-set. The dot (.) represents the point to where input
has been recognized. So, initially we have, starting from rule 1 and using []
to contain the LA token(s):
S' -> . S [$]
This is the starting point, state 0, and this point of departure defines the
`kernel' of a state. To the kernel additional (production) rules and new
LA-sets may be added as follows:
New rules: If a . is followed by a non-terminal all production rules of
that non-terminal are added to the item as new items having dots in postion
0. This process is followed for the newly added production rules as well until
no more rules are added.
LA-sets: Here is where it gets a bit complicated. To find LA-sets the
FIRST function must be defined first.
The FIRST function defines the set of terminal tokens that can be
encountered at the start of a (series of) grammatical tokens. This FIRST-set
is determined as follows:
- if X is a terminal token x, FIRST(X) is [x].
- if X is empty FIRST(X) is [e] (the empty set)
- if X is a production rule X1 X2 ... Xn then:
FIRST(X) is FIRST(X1), except [e].
If [e] was removed, FIRST(X2) is added to the set,
again removing [e], if it was now added to the set.
This process is repeated until an Xi was without [e], or
until FIRST(Xn) was added to the set. In this final case,
if FIRST(Xn) contains [e] it is kept
For our little grammar, we have the following FIRST sets:
S' -> S FIRST(S') = FIRST(S)
S -> CC FIRST(S) = FIRST(C)
C -> cC FIRST(cC) = [c]
C -> d FIRST(d) = [d]
so, FIRST(S') = FIRST(S) = FIRST(C) = FIRST(S) = [cd]
Back to the LA-sets: In a state having an item:
A -> x . B y, [a]
with A, B non-terminals, x, y possibly empty series of terminal symbols
and [a] the LA-set belonging to this item (so, x . B y is a production rule of
A with a dot), all B's production rules are added to the state having
FIRST(ya) as its LA-set.
Having constructed a state, the dots are moved one symbol to the
right. Every unique grammar symbol thus shifted represents a `shift' in the
parsing process to another state. A shift may result in a shift to an earlier
state. This earlier state may thus see its LA-set expanded, which expansion
must then propagate to all states derived from that state. When no shift is
possible, a reduction takes place. If a state has only a single reduction,
then this becomes the default reduction (on any symbol not resulting in a
shift). If a state has multiple reductions, then this is ok if their LA-sets
are different. Otherwise the grammar contains a reduce-reduce
conflict. Analogously, if a reduction on a certain LA-set is indicated, but
there's also a shift required for (one or more) terminal tokens in the LA-set,
then the grammar contains a shift-reduce conflict.
At each rule in each state actions/gotos are specified: `goto' a state if
the symbol following a dot is a nonterminal, `shift' to a state if the symbol
following a dot is a terminal, or reduce according to a production rule if the
dot is at the end of a production rule. This indicates that a rule has been
recognized, and if an action block has been defined for such a production
rule, it is executed at that point (i.e., at the point where the reduction
takes place).
Now it's time to start constructing states, called S0, S1, S2,
etc. Production rules are numbered by their original number.
S0:
1 S' -> . S [$] On S goto S1
for S' -> . S [$]
compute, matching A -> x . B y, [a],
FIRST(y a)
as FIRST( $), which is: [$], LA for the next S rule:
add S-rule:
2 S -> . C C [$] On C goto S2
for S -> . C C [$]
compute, matching A -> x . B y, [a],
FIRST(y a)
as FIRST(C $), which is: [cd], LA for the next C rules:
add C-rules:
3 C -> . c C [cd] On c shift S3
4 C -> . d [cd] On d shift S4
having (of course) the same LA set. `Of course',
since the LA set does not depend on this production
rule, but on the rule causing this addition. So, once
a nonterminal is added, the LA-set of its production
rules is determined once.
Continuing this way, each of the previous state's items is processed in turn:
S1:
1 S' -> S . [$] Reduce to S' by rule 1.
This is the ACCEPT state
S2:
2 S -> C . C [$] On C goto S5
for S -> C . C [$]
compute, matching A -> x . B y, [a]
FIRST(y a)
as FIRST( $), which is: [$], LA for the next C rules:
3 C -> . c C [$] On c shift S3
(Adding [$] to S3's LA-set)
4 C -> . d [$] On d shift S4
(Adding [$] to S4's LA-set)
S3:
3 C -> c . C [cd]+[$] from S2 On C goto S6
(Adding [cd] to S5's LA-set)
for C -> c . C [cd]+[$]
compute, matching A -> x . B y, [a]
FIRST(y a )
as FIRST( [cd]), which is: [cd]+[$]
3 C -> . c C [cd]+[$] On c shift S3
4 C -> . d [cd]+[$] On d shift S4
S4:
4 C -> d . [cd]+[$] from S2 Reduce to C by rule 4
S5:
2 S -> C C . [$] Reduce to S by rule 2
S6:
3 C -> c C . [cd]+[$] Reduce to C by rule 3
Now, the action/goto table can be summarized:
------------------------------------
action goto
------------------------------------
State c d $ S C
------------------------------------
0 s3 s4 s1 s2
1 acc
2 s3 s4 s5
3 s3 s4 s6
4 r4 r4 r4
5 r2 [d]
6 r3 r3 r3
------------------------------------
[d]: use this by default for every
next token
If a grammatical rule has an empty production the same procedure is followed.
Consider:
1 S' -> S
2 S -> S E
3 S ->
4 E -> n
5 E -> i
(representing a grammar accepting possibly empty series of numbers and
identifiers). The grammar is constructed as follows:
FIRST(E) = [in], FIRST(S') = FIRST(S) = [in$]
S0:
S' -> . S [$] On S: goto S1
Obtaining [$] from the above rule:
S -> . S E [$] On S: goto S1
S -> . [$] Default: reduce to S
S1:
S' -> S . [$] On $: ACCEPT
S -> S . E [$] On E: goto S2
Add E-rules: match A -> b . B y [a]
with: S -> S . E [$]
E -> . n [$] On n shift S3
E -> . i [$] On i shift S4
S2:
S -> S E . [$] REDUCE
S3:
E -> n . [$] REDUCE
S4:
E -> i . [$] REDUCE
Action/Goto table:
------------------------------------
action goto
------------------------------------
State i n $ S E
------------------------------------
0 r3[d] s1
1 s3 s4 acc s2
2 r2[d]
3 r4[d]
4 r5[d]
------------------------------------
|