1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107
|
B() can generate parsers for languages that are em(context-free). This means
that you define one or more em(constructs), commonly called em(nonterminal
symbols), and define em(rules) by which they are recognized (or defined).
For example, bf(C++) defines an `tt(expression)' nonterminal. One rule
defining an expression might be,
quote(
"An expression consists of a minus sign and another expression".
)
Another rule could be
quote(
"An expression can be an integer".
)
Rules often are recursive, but there must be at least one rule ending the
recursion.
A commonly used formal system for describing such rules is the em(Backus-Naur
Form) or `BNF', which was developed for specifying the language Algol 60. Any
grammar expressed in BNF is a context-free grammar. The input to b() is
essentially machine-readable BNF.
Not all context-free languages can be handled by b(), only those that are
LALR(1). In brief, this means that it must be possible to tell how to parse
input while `looking ahead' to just a single token. Strictly speaking, that is
a description of an LR(1) grammar, and LALR(1) involves some additional
restrictions which are not further elaborated in the current context; but it
in real life almost all LR(1) grammars are also LALR(1) grammars. See section
ref(MYSTERIOUS) for more information on this.
In the formal rules defining a language's grammar, the grammar's constructs
are called nonterminals. Some symbols are not further defined, but are
directly made available when matching the input to a series of regular
expressions. Matching the input to regular expressions is not the task of a
parser, but of a em(lexical scanner) (e.g., bf(flexc++)(1)). Lexical scanners
provide the parser with symbols, called em(tokens) or em(terminal symbols),
which are taken for granted by the parser.
Terminal and nonterminal symbols are used to define the rules of grammars.
We can use examples from bf(C++) to illustrate terminal and nonterminal
symbols. bf(C++)'s terminal symbols are identifiers, constants (numeric and
string), various keywords, arithmetic operators, and other simple character
tokens. So the terminal symbols of a grammar for bf(C++) can thus be
`IDENTIFIER', `NUMBER', `STRING', as wel as tokens for keywords
(e.g., tt(FOR, SWITCH), operators (tt('+', BIT_OR)),
and character tokens like tt(';', ':', '(', ')') ).
Here is a simple bf(C++) function subdivided into tokens:
verb(
int square(int x) // INT IDENTIFIER '(' INT IDENTIFIER ')'
{ // '{'
return x * x; // RETURN IDENTIFIER '*' IDENTIFIER ';'
} // '}'
)
bf(C++) rules include expression rules, statement rules, rules defining
declarations, and rules defining functions. These are defined in bf(C++)'s
grammar as nonterminal symbols `expression', `statement', `declaration' and
`function_definition'. The above example shows a function definition as well
as the sequence of tokens (starting from tt(INT), ending at tt('}')), received
by the parser when that definition shows up in its input.
Each nonterminal symbol must be defined by at least one so-called
em(production rule), defining the sequence of symbols that must have been
observed before it can be decided that the nonterminal symbol has been
encountered. For example, one kind of bf(C++) statement is the return
statement; it could be described by the following informal rule:
quote(
A `statement' can be made of a `return' keyword, an `expression' and a
`semicolon'.
)
Formally, such a rule is written like this:
verb(
statement:
RETURN expression ';'
;
)
and in general grammar rules have the following form:
verb(
nonterminal:
production(s)
;
)
where em(nonterminal) is the nonterminal symbol defined by the rule and
em(productions(s)) are one or more sequences of terminal and nonterminal
symbols that define how tt(nonterminal) can be recognized. Concentrating on a
single production rule, the rule's nonterminal is called the rule's left-hand
side (LHS), the production rule itself is called the rule's right-hand side
(RHS). In the tt(statement) example the LHS equals tt(statement), the RHS
equals tt(RETURN expression ';'). The RHS consists of three elements,
numbered, respectively, 1 through 3.
One nonterminal symbol is special in that it defines a syntactically correct
specification of the defined language. It is called the em(start symbol). In a
compiler this means a complete input program. In bf(C++), a nonterminal symbol
`optional-sequence-of-definitions-and-declarations' could be used for this.
The parser generated by b() reads a sequence of tokens, and combines the
tokens using the rules of the specified grammar. If the input is valid, the
end result is that the entire token sequence reduces to the start symbol
nonterminal. If we use bf(C++)'s grammar, then the entire input must reducible
to `optional-sequence-of-definitions-and-declarations'. If not, the parser
reports a syntax error.
|