File: README.states

package info (click to toggle)
bisonc%2B%2B 6.04.01-1
links: PTS, VCS
area: main
in suites: bullseye
size: 5,864 kB
sloc: cpp: 10,112; ansic: 1,434; makefile: 1,069; sh: 522; yacc: 84; lex: 60
file content (243 lines) | stat: -rw-r--r-- 9,326 bytes
parent folder | download | duplicates (9)
This file contains a short description about the way states and look-aheads
(LA's) are constructed. The example does not cover new information. In fact,
it follows the construction of the states of the grammar given in example 4.42
in Aho et al. 

A more extensive description is found in documentation/manual/algorithm.yo

The grammar has the following production rules (numbered 1-3):

1   S'  ->  S
2   S   ->  CC
3   C   ->  cC
4   C   ->  d

Production rule 1 is added to the grammar, representing the so-called
augmented grammar. When this rule is recognized at end-of-input (represented
by $) then the input is grammatically correct, and the parser will flag
`success'.

$ is called a `look ahead' (LA) token. LA's may be in sets, which are used to
determine whether a rule (when potentially recognized) will be reduced or
not. So, when recognizing S, it is only reduced to S' (by rule S' -> S) if $
is indeed the next input token (i.e., if the end-of-input has been reached.


With the starting rule an `item' is associated. An items consists of a rule,
having a `dot' and a LA-set. The dot (.) represents the point to where input
has been recognized. So, initially we have, starting from rule 1 and using  []
to contain the LA token(s):

    S' -> . S   [$]

This is the starting point, state 0, and this point of departure defines the
`kernel' of a state. To the kernel additional (production) rules and new
LA-sets may be added as follows:

    New rules: If a . is followed by a non-terminal all production rules of 
that non-terminal are added to the item as new items having dots in postion
0. This process is followed for the newly added production rules as well until
no more rules are added.

    LA-sets: Here is where it gets a bit complicated. To find LA-sets the
FIRST function must be defined first. 

    The FIRST function defines the set of terminal tokens that can be
encountered at the start of a (series of) grammatical tokens. This FIRST-set
is determined as follows:

    - if X is a terminal token x, FIRST(X) is [x].
    - if X is empty FIRST(X) is [e] (the empty set)
    - if X is a production rule X1 X2 ... Xn then:
            FIRST(X) is FIRST(X1), except [e].
                    If [e] was removed, FIRST(X2) is added to the set, 
                    again removing [e], if it was now added to the set.
                    This process is repeated until an Xi was without [e], or
                    until FIRST(Xn) was added to the set. In this final case,
                    if FIRST(Xn) contains [e] it is kept

    For our little grammar, we have the following FIRST sets:

    S'  ->  S   FIRST(S')   = FIRST(S)
    S   ->  CC  FIRST(S)    = FIRST(C)
    C   ->  cC  FIRST(cC)   = [c]
    C   ->  d   FIRST(d)    = [d]
                
        so, FIRST(S') = FIRST(S) = FIRST(C) = FIRST(S) = [cd]

    Back to the LA-sets: In a state having an item:

        A -> x . B y, [a]

    with A, B non-terminals, x, y possibly empty series of terminal symbols
and [a] the LA-set belonging to this item (so, x . B y is a production rule of
A with a dot), all B's production rules are added to the state having
FIRST(ya) as its LA-set.

    Having constructed a state, the dots are moved one symbol to the
right. Every unique grammar symbol thus shifted represents a `shift' in the
parsing process to another state. A shift may result in a shift to an earlier
state. This earlier state may thus see its LA-set expanded, which expansion
must then propagate to all states derived from that state. When no shift is
possible, a reduction takes place. If a state has only a single reduction,
then this becomes the default reduction (on any symbol not resulting in a
shift). If a state has multiple reductions, then this is ok if their LA-sets
are different. Otherwise the grammar contains a reduce-reduce
conflict. Analogously, if a reduction on a certain LA-set is indicated, but
there's also a shift required for (one or more) terminal tokens in the LA-set,
then the grammar contains a shift-reduce conflict.

    At each rule in each state actions/gotos are specified: `goto' a state if
the symbol following a dot is a nonterminal, `shift' to a state if the symbol
following a dot is a terminal, or reduce according to a production rule if the
dot is at the end of a production rule. This indicates that a rule has been
recognized, and if an action block has been defined for such a production
rule, it is executed at that point (i.e., at the point where the reduction
takes place).

    Now it's time to start constructing states, called S0, S1, S2,
etc. Production rules are numbered by their original number. 


S0:
1       S'  -> . S   [$]                                    On S goto S1
                    for               S' ->   . S    [$] 
                    compute, matching A  -> x . B y, [a], 
                       FIRST(y a)
                    as FIRST(  $), which is: [$], LA for the next S rule:
add S-rule:
2       S   -> . C C [$]                                    On C goto S2
                    for               S  ->   . C C  [$] 
                    compute, matching A  -> x . B y, [a], 
                       FIRST(y a)
                    as FIRST(C $), which is: [cd], LA for the next C rules:
add C-rules:
3       C   -> . c C    [cd]                                On c shift S3
                                
4       C   -> . d      [cd]                                On d shift S4
                    having (of course) the same LA set. `Of course',
                    since the LA set does not depend on this production
                    rule, but on the rule causing this addition. So, once
                    a nonterminal is added, the LA-set of its production
                    rules is determined once.


Continuing this way, each of the previous state's items is processed in turn:

S1:
1       S'  ->  S .     [$]                         Reduce to S' by rule 1. 
                                                    This is the ACCEPT state


S2:
2       S   -> C . C    [$]                                     On C goto S5
                    for               S  -> C .   C  [$] 
                    compute, matching A  -> x . B y, [a]
                       FIRST(y a)
                    as FIRST(  $), which is: [$], LA for the next C rules:

3       C   -> . c C    [$]                                     On c shift S3
                            (Adding [$] to S3's LA-set)

4       C   -> . d      [$]                                     On d shift S4
                            (Adding [$] to S4's LA-set)


S3:
3       C   -> c . C    [cd]+[$] from S2                        On C goto S6
                            (Adding [cd] to S5's LA-set)
                            for               C  -> c . C    [cd]+[$]
                            compute, matching A  -> x . B y, [a]
                               FIRST(y a   )
                            as FIRST(  [cd]), which is: [cd]+[$]
                                
3       C   -> . c C    [cd]+[$]                                On c shift S3

4       C   -> . d      [cd]+[$]                                On d shift S4


S4:
4       C   -> d .      [cd]+[$] from S2            Reduce to C by rule 4

S5:
2       S   -> C C .    [$]                         Reduce to S by rule 2
        
S6:
3       C   -> c C .    [cd]+[$]                    Reduce to C by rule 3



Now, the action/goto table can be summarized:

------------------------------------
             action         goto
------------------------------------
State       c   d   $       S   C
------------------------------------
    0       s3  s4          s1  s2
    1               acc
    2       s3  s4              s5
    3       s3  s4              s6
    4       r4  r4  r4
    5               r2 [d]
    6       r3  r3  r3
------------------------------------
[d]: use this by default for every    
     next token





If a grammatical rule has an empty production the same procedure is followed.

Consider:
1   S' -> S
2   S -> S E
3   S -> 
4   E -> n
5   E -> i

(representing a grammar accepting possibly empty series of numbers and
identifiers). The grammar is constructed as follows:

    FIRST(E) = [in], FIRST(S') = FIRST(S) = [in$]

S0:
    S'  -> . S      [$]             On S: goto S1
Obtaining [$] from the above rule:
    S   -> . S E    [$]             On S: goto S1
    S   -> .        [$]             Default: reduce to S

S1:
    S'  -> S .      [$]             On $: ACCEPT
    S   -> S . E    [$]             On E: goto S2
            Add E-rules: match A -> b . B y  [a]
                         with: S -> S . E    [$]                
    E   -> . n      [$]             On n shift S3
    E   -> . i      [$]             On i shift S4

S2:
    S   -> S E .    [$]             REDUCE

S3:
    E   -> n .      [$]             REDUCE

S4:
    E   -> i .      [$]             REDUCE


Action/Goto table:

------------------------------------
             action         goto
------------------------------------
State       i   n   $       S   E
------------------------------------
    0       r3[d]           s1    
    1       s3  s4  acc         s2
    2       r2[d]
    3       r4[d]                   
    4       r5[d]      
------------------------------------