<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset= ISO-8859-1">
<TITLE>
Lexer and parser generators (ocamllex, ocamlyacc)
</TITLE>
</HEAD>
<BODY >
<A HREF="manual023.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A>
<A HREF="manual025.html"><IMG SRC ="next_motif.gif" ALT="Next"></A>
<A HREF="index.html"><IMG SRC ="contents_motif.gif" ALT="Contents"></A>
<HR>
<H1>Chapter 11: Lexer and parser generators (ocamllex, ocamlyacc)</H1>
<A NAME="c:ocamlyacc"></A>
This chapter describes two program generators: <TT>ocamllex</TT>, which
produces a lexical analyzer from a set of regular expressions with
associated semantic actions, and <TT>ocamlyacc</TT>, which produces a parser
from a grammar with associated semantic actions.<BR>
<BR>
These program generators are very close to the well-known <TT>lex</TT> and
<TT>yacc</TT> commands that can be found in most C programming environments.
This chapter assumes a working knowledge of <TT>lex</TT> and <TT>yacc</TT>: while
it describes the input syntax for <TT>ocamllex</TT> and <TT>ocamlyacc</TT> and the
main differences with <TT>lex</TT> and <TT>yacc</TT>, it does not explain the basics
of writing a lexer or parser description in <TT>lex</TT> and <TT>yacc</TT>. Readers
unfamiliar with <TT>lex</TT> and <TT>yacc</TT> are referred to ``Compilers:
principles, techniques, and tools'' by Aho, Sethi and Ullman
(Addison-Wesley, 1986), or ``Lex &amp; Yacc'', by Levine, Mason and
Brown (O'Reilly, 1992).<BR>
<BR>
<H2>11.1 Overview of <TT>ocamllex</TT></H2>The <TT>ocamllex</TT> command produces a lexical analyzer from a set of regular
expressions with attached semantic actions, in the style of
<TT>lex</TT>. Assuming the input file is <I>lexer</I><TT>.mll</TT>, executing
<PRE>
ocamllex <I>lexer</I>.mll
</PRE>
produces Caml code for a lexical analyzer in file <I>lexer</I><TT>.ml</TT>.
This file defines one lexing function per entry point in the lexer
definition. These functions have the same names as the entry
points. Lexing functions take as argument a lexer buffer, and return
the semantic attribute of the corresponding entry point.<BR>
<BR>
Lexer buffers are an abstract data type implemented in the standard
library module <TT>Lexing</TT>. The functions <TT>Lexing.from</TT><TT>_</TT><TT>channel</TT>,
<TT>Lexing.from</TT><TT>_</TT><TT>string</TT> and <TT>Lexing.from</TT><TT>_</TT><TT>function</TT> create
lexer buffers that read from an input channel, a character string, or
any reading function, respectively. (See the description of module
<TT>Lexing</TT> in chapter <A HREF="manual029.html#c:stdlib">16</A>.)<BR>
<BR>
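As a rough illustration of the buffer constructors named above, the
following sketch builds one buffer over a string and one over standard
input. The names <TT>buf_of_string</TT> and <TT>buf_of_stdin</TT> are
hypothetical, chosen only for this example.
<PRE>
(* A lexer buffer over an in-memory string. *)
let buf_of_string = Lexing.from_string "1 + 2\n"

(* A lexer buffer over standard input; nothing is read until a lexing
   function is applied to the buffer. *)
let buf_of_stdin = Lexing.from_channel stdin
</PRE>
Either buffer can then be passed to any lexing function generated by
<TT>ocamllex</TT>.<BR>
<BR>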
When used in conjunction with a parser generated by <TT>ocamlyacc</TT>, the
semantic actions compute a value belonging to the type <TT>token</TT> defined
by the generated parsing module. (See the description of <TT>ocamlyacc</TT>
below.)<BR>
<BR>
<H2>11.2 Syntax of lexer definitions</H2>The format of lexer definitions is as follows:
<PRE>
{ <I>header</I> }
let <I>ident</I> = <I>regexp</I> ...
rule <I>entrypoint</I> =
parse <I>regexp</I> { <I>action</I> }
| ...
| <I>regexp</I> { <I>action</I> }
and <I>entrypoint</I> =
parse ...
and ...
{ <I>trailer</I> }
</PRE>Comments are delimited by <TT>(*</TT> and <TT>*)</TT>, as in Caml.<BR>
<BR>
<H3>11.2.1 Header and trailer</H3>
The <I>header</I> and <I>trailer</I> sections are arbitrary Caml
text enclosed in curly braces. Either or both can be omitted. If
present, the header text is copied as is at the beginning of the
output file and the trailer text at the end. Typically, the
header section contains the <CODE>open</CODE> directives required
by the actions, and possibly some auxiliary functions used in the
actions.<BR>
<BR>
<H3>11.2.2 Naming regular expressions</H3>Between the header and the entry points, one can give names to
frequently-occurring regular expressions. This is written
<TT><FONT COLOR=blue>let</FONT></TT> <TT><I><FONT COLOR=maroon>ident</FONT></I></TT> <TT><FONT COLOR=blue>=</FONT></TT> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>.
In following regular expressions, the identifier
<I>ident</I> can be used as shorthand for <I>regexp</I>.<BR>
<BR>
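For instance, a lexer definition might bind a few character classes once
and reuse them in later expressions. This is only a sketch; the names
<TT>digit</TT>, <TT>letter</TT> and <TT>ident</TT> are illustrative.
<PRE>
let digit  = ['0'-'9']
let letter = ['A'-'Z' 'a'-'z']
(* ident reuses the two names bound above *)
let ident  = letter (letter | digit | '_')*
</PRE>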
<H3>11.2.3 Entry points</H3>The names of the entry points must be valid identifiers for Caml
values (starting with a lowercase letter).<BR>
<BR>
<H3>11.2.4 Regular expressions</H3>The regular expressions are in the style of <TT>lex</TT>, with a more
Caml-like syntax.<BR>
<BR>
<DL COMPACT=compact>
<DT><TT><FONT COLOR=blue>'</FONT></TT> <TT><I><FONT COLOR=maroon>char</FONT></I></TT> <TT><FONT COLOR=blue>'</FONT></TT><DD>
A character constant, with the same syntax as Objective Caml character
constants. Match the denoted character.<BR>
<BR>
<DT><TT>_</TT><DD>
(Underscore.) Match any character.<BR>
<BR>
<DT><TT>eof</TT><DD>
Match the end of the lexer input.<BR><B>Note:</B> On some systems, with interactive input, an end-of-file
may be followed by more characters. However, <TT>ocamllex</TT> will not
correctly handle regular expressions that contain <TT>eof</TT> followed by
something else.<BR>
<BR>
<DT><TT><FONT COLOR=blue>"</FONT></TT> <TT><I><FONT COLOR=maroon>string</FONT></I></TT> <TT><FONT COLOR=blue>"</FONT></TT><DD>
A string constant, with the same syntax as Objective Caml string
constants. Match the corresponding sequence of characters.<BR>
<BR>
<DT><TT><FONT COLOR=blue>[</FONT></TT> <TT><I><FONT COLOR=maroon>character-set</FONT></I></TT> <TT><FONT COLOR=blue>]</FONT></TT><DD>
Match any single character belonging to the given
character set. Valid character sets are: single
character constants <TT><FONT COLOR=blue>'</FONT></TT> <TT><I><FONT COLOR=maroon>c</FONT></I></TT> <TT><FONT COLOR=blue>'</FONT></TT>; ranges of characters
<TT><FONT COLOR=blue>'</FONT></TT> <TT><I><FONT COLOR=maroon>c</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB> <TT><FONT COLOR=blue>'</FONT></TT> <TT><FONT COLOR=blue>-</FONT></TT> <TT><FONT COLOR=blue>'</FONT></TT> <TT><I><FONT COLOR=maroon>c</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB> <TT><FONT COLOR=blue>'</FONT></TT> (all characters between <I>c</I><SUB><FONT SIZE=2>1</FONT></SUB> and <I>c</I><SUB><FONT SIZE=2>2</FONT></SUB>,
inclusive); and the union of two or more character sets, denoted by
concatenation.<BR>
<BR>
<DT><TT><FONT COLOR=blue>[</FONT></TT> <TT><FONT COLOR=blue>^</FONT></TT> <TT><I><FONT COLOR=maroon>character-set</FONT></I></TT> <TT><FONT COLOR=blue>]</FONT></TT><DD>
Match any single character not belonging to the given character set.<BR>
<BR>
<DT><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT> <TT><FONT COLOR=blue>*</FONT></TT><DD>
(Repetition.) Match the concatenation of zero or more
strings that match <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>. <BR>
<BR>
<DT><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT> <TT><FONT COLOR=blue>+</FONT></TT><DD>
(Strict repetition.) Match the concatenation of one or more
strings that match <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>.<BR>
<BR>
<DT><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT> <TT><FONT COLOR=blue>?</FONT></TT><DD>
(Option.) Match either the empty string, or a string matching <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>.<BR>
<BR>
<DT><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB> <TT><FONT COLOR=blue>|</FONT></TT> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB><DD>
(Alternative.) Match any string that matches either <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB> or <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB>.<BR>
<BR>
<DT><TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB><DD>
(Concatenation.) Match the concatenation of two strings, the first
matching <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>1</FONT></SUB>, the second matching <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT><SUB><FONT SIZE=2>2</FONT></SUB>.<BR>
<BR>
<DT><TT><FONT COLOR=blue>(</FONT></TT> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT> <TT><FONT COLOR=blue>)</FONT></TT><DD>
Match the same strings as <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT>.<BR>
<BR>
<DT><TT><I><FONT COLOR=maroon>ident</FONT></I></TT><DD>
Reference the regular expression bound to <TT><I><FONT COLOR=maroon>ident</FONT></I></TT> by an earlier
<TT><FONT COLOR=blue>let</FONT></TT> <TT><I><FONT COLOR=maroon>ident</FONT></I></TT> <TT><FONT COLOR=blue>=</FONT></TT> <TT><I><FONT COLOR=maroon>regexp</FONT></I></TT> definition.</DL>Concerning the precedences of operators, <TT>*</TT> and <TT>+</TT> have
highest precedence, followed by <TT>?</TT>, then concatenation, then
<TT>|</TT> (alternation).<BR>
<BR>
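The precedence rules above can be checked on a small case. In this
sketch (the names <TT>r1</TT> and <TT>r2</TT> are hypothetical),
repetition binds tighter than concatenation, which in turn binds
tighter than alternation, and parentheses override that grouping.
<PRE>
(* Grouped as "a" | ("b" ("c"*)): matches "a", "b", "bc", "bcc", ...
   but not "ac". *)
let r1 = "a" | "b" "c"*

(* Parentheses change the grouping: this one also matches "ac", "acc", ... *)
let r2 = ("a" | "b") "c"*
</PRE>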
<H3>11.2.5 Actions</H3>The actions are arbitrary Caml expressions. They are evaluated in
a context where the identifier <TT>lexbuf</TT> is bound to the current lexer
buffer. Some typical uses for <TT>lexbuf</TT>, in conjunction with the
operations on lexer buffers provided by the <TT>Lexing</TT> standard library
module, are listed below.<BR>
<BR>
<DL COMPACT=compact>
<DT>
<TT>Lexing.lexeme lexbuf</TT><DD>
Return the matched string.<BR>
<BR>
<DT><TT>Lexing.lexeme</TT><TT>_</TT><TT>char lexbuf </TT><I>n</I><DD>
Return the <I>n</I><SUP><FONT SIZE=2>th</FONT></SUP>
character in the matched string. The first character corresponds to <I>n</I> = 0.<BR>
<BR>
<DT><TT>Lexing.lexeme</TT><TT>_</TT><TT>start lexbuf</TT><DD>
Return the absolute position in the input text of the beginning of the
matched string. The first character read from the input text has
position 0.<BR>
<BR>
<DT><TT>Lexing.lexeme</TT><TT>_</TT><TT>end lexbuf</TT><DD>
Return the absolute position in the input text of the end of the
matched string. The first character read from the input text has
position 0.<BR>
<BR>
<DT><I>entrypoint</I> <TT>lexbuf</TT><DD>
(Where <I>entrypoint</I> is the name of another entry point in the same
lexer definition.) Recursively call the lexer on the given entry point.
Useful for lexing nested comments, for example.</DL>
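The recursive call mentioned in the last item might be used as follows
to skip nested comments. This is a minimal sketch rather than a
complete lexer: both entry points return <TT>()</TT>, and the entry
point names <TT>token</TT> and <TT>comment</TT> are assumptions made
for the example.
<PRE>
rule token = parse
    "(*"   { comment lexbuf; token lexbuf }     (* skip a comment, then resume *)
  | eof    { () }
  | _      { token lexbuf }
and comment = parse
    "*)"   { () }                               (* this comment level is closed *)
  | "(*"   { comment lexbuf; comment lexbuf }   (* finish the inner level, then the outer *)
  | eof    { failwith "unterminated comment" }
  | _      { comment lexbuf }
</PRE>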
<H3>11.2.6 Reserved identifiers</H3>All identifiers starting with <TT>_</TT><TT>_</TT><TT>ocaml</TT><TT>_</TT><TT>lex</TT> are reserved for use by
<TT>ocamllex</TT>; do not use any such identifier in your programs.<BR>
<BR>
<H2>11.3 Overview of <TT>ocamlyacc</TT></H2>The <TT>ocamlyacc</TT> command produces a parser from a context-free grammar
specification with attached semantic actions, in the style of <TT>yacc</TT>.
Assuming the input file is <I>grammar</I><TT>.mly</TT>, executing
<PRE>
ocamlyacc <I>options</I> <I>grammar</I>.mly
</PRE>
produces Caml code for a parser in the file <I>grammar</I><TT>.ml</TT>,
and its interface in file <I>grammar</I><TT>.mli</TT>.<BR>
<BR>
The generated module defines one parsing function per entry point in
the grammar. These functions have the same names as the entry points.
Parsing functions take as arguments a lexical analyzer (a function
from lexer buffers to tokens) and a lexer buffer, and return the
semantic attribute of the corresponding entry point. Lexical analyzer
functions are usually generated from a lexer specification by the
<TT>ocamllex</TT> program. Lexer buffers are an abstract data type
implemented in the standard library module <TT>Lexing</TT>. Tokens are values from
the concrete type <TT>token</TT>, defined in the interface file
<I>grammar</I><TT>.mli</TT> produced by <TT>ocamlyacc</TT>.<BR>
<BR>
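To make the shape of the generated interface concrete, here is roughly
what <I>grammar</I><TT>.mli</TT> could contain for a hypothetical grammar
with a single entry point <TT>main</TT> of type <TT>int</TT>. It is
written as a module type so that it can be compiled on its own; the name
<TT>PARSER</TT> and the token constructors are illustrative only.
<PRE>
module type PARSER = sig
  (* One constructor per declared token. *)
  type token = INT of int | PLUS | EOL

  (* One parsing function per entry point: it takes a lexical analyzer
     and a lexer buffer, and returns the semantic attribute. *)
  val main : (Lexing.lexbuf -> token) -> Lexing.lexbuf -> int
end
</PRE>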
<H2>11.4 Syntax of grammar definitions</H2>Grammar definitions have the following format:
<PRE>
%{
<I>header</I>
%}
<I>declarations</I>
%%
<I>rules</I>
%%
<I>trailer</I>
</PRE>Comments are enclosed between <CODE>/*</CODE> and <CODE>*/</CODE> (as in C) in the
``declarations'' and ``rules'' sections, and between <CODE>(*</CODE> and
<CODE>*)</CODE> (as in Caml) in the ``header'' and ``trailer'' sections.<BR>
<BR>
<H3>11.4.1 Header and trailer</H3>The header and the trailer sections are Caml code that is copied
as is into file <I>grammar</I><TT>.ml</TT>. Both sections are optional. The header
goes at the beginning of the output file; it usually contains
<TT>open</TT> directives and auxiliary functions required by the semantic
actions of the rules. The trailer goes at the end of the output file.<BR>
<BR>
<H3>11.4.2 Declarations</H3>Declarations are given one per line. They all start with a <CODE>%</CODE> sign.<BR>
<BR>
<DL COMPACT=compact>
<DT><TT><FONT COLOR=blue>%token</FONT></TT> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT><DD>
Declare the given symbols as tokens (terminal symbols). These symbols
are added as constant constructors for the <TT>token</TT> concrete type.<BR>
<BR>
<DT><TT><FONT COLOR=blue>%token</FONT></TT> <TT><FONT COLOR=blue>&lt;</FONT></TT> <TT><I><FONT COLOR=maroon>type</FONT></I></TT> <TT><FONT COLOR=blue>&gt;</FONT></TT> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT><DD>
Declare the given symbols as tokens with an attached attribute of the
given type. These symbols are added as constructors with arguments of
the given type for the <TT>token</TT> concrete type. The <TT><I><FONT COLOR=maroon>type</FONT></I></TT> part is
an arbitrary Caml type expression, except that all type
constructor names must be fully qualified (e.g. <TT>Modname.typename</TT>)
for all types except standard built-in types, even if the proper
<CODE>open</CODE> directives (e.g. <CODE>open Modname</CODE>) were given in the
header section. That's because the header is copied only to the <TT>.ml</TT>
output file, but not to the <TT>.mli</TT> output file, while the <TT><I><FONT COLOR=maroon>type</FONT></I></TT> part
of a <CODE>%token</CODE> declaration is copied to both.<BR>
<BR>
<DT><TT><FONT COLOR=blue>%start</FONT></TT> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT><DD>
Declare the given symbols as entry points for the grammar. For each
entry point, a parsing function with the same name is defined in the
output module. Non-terminals that are not declared as entry points
have no such parsing function. Start symbols must be given a type with
the <CODE>%type</CODE> directive below.<BR>
<BR>
<DT><TT><FONT COLOR=blue>%type</FONT></TT> <TT><FONT COLOR=blue>&lt;</FONT></TT> <TT><I><FONT COLOR=maroon>type</FONT></I></TT> <TT><FONT COLOR=blue>&gt;</FONT></TT> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT><DD>
Specify the type of the semantic attributes for the given symbols.
This is mandatory for start symbols only. Other nonterminal symbols
need not be given types by hand: these types will be inferred when
running the output files through the Objective Caml compiler (unless the
<CODE>-s</CODE> option is in effect). The <TT><I><FONT COLOR=maroon>type</FONT></I></TT> part is an arbitrary Caml
type expression, except that all type constructor names must be
fully qualified, as explained above for <TT>%token</TT>.<BR>
<BR>
<DT><TT><FONT COLOR=blue>%left</FONT></TT> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT><DD>
<DT><TT><FONT COLOR=blue>%right</FONT></TT> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT><DD>
<DT><TT><FONT COLOR=blue>%nonassoc</FONT></TT> <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT> ... <TT><I><FONT COLOR=maroon>symbol</FONT></I></TT><DD><BR>
<BR>
Associate precedences and associativities to the given symbols. All
symbols on the same line are given the same precedence. They have
higher precedence than symbols declared before in a <CODE>%left</CODE>,
<CODE>%right</CODE> or <CODE>%nonassoc</CODE> line. They have lower precedence
than symbols declared after in a <CODE>%left</CODE>, <CODE>%right</CODE> or
<CODE>%nonassoc</CODE> line. The symbols are declared to associate to the
left (<CODE>%left</CODE>), to the right (<CODE>%right</CODE>), or to be
non-associative (<CODE>%nonassoc</CODE>). The symbols are usually tokens.
They can also be dummy nonterminals, for use with the <CODE>%prec</CODE>
directive inside the rules.</DL>
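Taken together, a declarations section using the directives above might
look like the following sketch. The module <TT>Ast</TT> and its type
<TT>ident</TT> are hypothetical; they are there only to show a fully
qualified type name in a <TT>%token</TT> declaration.
<PRE>
%token &lt;int&gt; INT            /* token carrying an int attribute */
%token &lt;Ast.ident&gt; IDENT     /* qualified name required, even with "open Ast" in the header */
%token PLUS TIMES EOL
%left PLUS                   /* lower precedence */
%left TIMES                  /* higher precedence */
%start main                  /* entry point: a function named main is generated */
%type &lt;int&gt; main             /* mandatory for start symbols */
</PRE>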
<H3>11.4.3 Rules</H3>The syntax for rules is as usual:
<PRE>
<I>nonterminal</I> :
<I>symbol</I> ... <I>symbol</I> { <I>semantic-action</I> }
| ...
| <I>symbol</I> ... <I>symbol</I> { <I>semantic-action</I> }
;
</PRE>
Rules can also contain the <CODE>%prec </CODE><I>symbol</I> directive in the
right-hand side part, to override the default precedence and
associativity of the rule with the precedence and associativity of the
given symbol.<BR>
<BR>
Semantic actions are arbitrary Caml expressions, that
are evaluated to produce the semantic attribute attached to
the defined nonterminal. The semantic actions can access the
semantic attributes of the symbols in the right-hand side of
the rule with the <CODE>$</CODE> notation: <CODE>$1</CODE> is the attribute for the
first (leftmost) symbol, <CODE>$2</CODE> is the attribute for the second
symbol, etc.<BR>
<BR>
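As a small example of the <CODE>$</CODE> notation, the rule below builds
a pair from the second and fourth symbols of its right-hand side. The
nonterminal <TT>pair</TT> and the tokens used are assumptions made for
this sketch, with <TT>INT</TT> taken to carry an <TT>int</TT> attribute.
<PRE>
pair:
    LPAREN INT COMMA INT RPAREN   { ($2, $4) }   /* $2 and $4 are the two INT attributes */
;
</PRE>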
The rules may contain the special symbol <TT>error</TT> to indicate
resynchronization points, as in <TT>yacc</TT>.<BR>
<BR>
Actions occurring in the middle of rules are not supported.<BR>
<BR>
<H3>11.4.4 Error handling</H3>Error recovery is supported as follows: when the parser reaches an
error state (no grammar rules can apply), it calls a function named
<TT>parse</TT><TT>_</TT><TT>error</TT> with the string <TT>"syntax error"</TT> as argument. The default
<TT>parse</TT><TT>_</TT><TT>error</TT> function does nothing and returns, thus initiating error
recovery (see below). The user can define a customized <TT>parse</TT><TT>_</TT><TT>error</TT>
function in the header section of the grammar file.<BR>
<BR>
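A customized function could be as simple as the following header
fragment, which prints the message on standard error and then returns,
so error recovery still takes place. The message prefix is arbitrary and
this is a sketch, not a recommended reporting scheme.
<PRE>
%{
(* Shadows the default parse_error; called with "syntax error". *)
let parse_error msg =
  prerr_endline ("parser: " ^ msg)
%}
</PRE>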
The parser also enters error recovery mode if one of the grammar
actions raises the <TT>Parsing.Parse</TT><TT>_</TT><TT>error</TT> exception.<BR>
<BR>
In error recovery mode, the parser discards states from the
stack until it reaches a place where the error token can be shifted.
It then discards tokens from the input until it finds three successive
tokens that can be accepted, and starts processing with the first of
these. If no state can be uncovered where the error token can be
shifted, then the parser aborts by raising the <TT>Parsing.Parse</TT><TT>_</TT><TT>error</TT>
exception.<BR>
<BR>
Refer to documentation on <TT>yacc</TT> for more details and guidance on how
to use error recovery.<BR>
<BR>
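One common way to use the <TT>error</TT> symbol is to resynchronize at
the end of a line. In the sketch below, the nonterminal <TT>line</TT> is
hypothetical, and <TT>expr</TT> and <TT>EOL</TT> are assumed to be
defined elsewhere in the grammar.
<PRE>
line:
    expr EOL    { Some $1 }
  | error EOL   { None }      /* after a syntax error, resume at the end of the line */
;
</PRE>
With such a rule, the caller receives <TT>None</TT> for a malformed line
instead of an exception, provided the parser can shift <TT>error</TT> in
the state it falls back to.<BR>
<BR>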
<H2>11.5 Options</H2>The <TT>ocamlyacc</TT> command recognizes the following options:<BR>
<BR>
<DL COMPACT=compact>
<DT>
<TT>-v</TT><DD>
Generate a description of the parsing tables and a report on conflicts
resulting from ambiguities in the grammar. The description is put in
file <I>grammar</I><TT>.output</TT>.<BR>
<BR>
<DT><TT>-b</TT><I>prefix</I><DD>
Name the output files <I>prefix</I><TT>.ml</TT>, <I>prefix</I><TT>.mli</TT>,
<I>prefix</I><TT>.output</TT>, instead of the default naming convention.</DL>
<H2>11.6 A complete example</H2>The all-time favorite: a desk calculator. This program reads
arithmetic expressions on standard input, one per line, and prints
their values. Here is the grammar definition:
<PRE>
/* File parser.mly */
%token &lt;int&gt; INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type &lt;int&gt; main
%%
main:
expr EOL { $1 }
;
expr:
INT { $1 }
| LPAREN expr RPAREN { $2 }
| expr PLUS expr { $1 + $3 }
| expr MINUS expr { $1 - $3 }
| expr TIMES expr { $1 * $3 }
| expr DIV expr { $1 / $3 }
| MINUS expr %prec UMINUS { - $2 }
;
</PRE>
Here is the definition for the corresponding lexer:
<PRE>
(* File lexer.mll *)
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
[' ' '\t'] { token lexbuf } (* skip blanks *)
| ['\n' ] { EOL }
| ['0'-'9']+ { INT(int_of_string(Lexing.lexeme lexbuf)) }
| '+' { PLUS }
| '-' { MINUS }
| '*' { TIMES }
| '/' { DIV }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { raise Eof }
</PRE>
Here is the main program, which combines the parser with the lexer:
<PRE>
(* File calc.ml *)
let _ =
  try
    let lexbuf = Lexing.from_channel stdin in
    while true do
      let result = Parser.main Lexer.token lexbuf in
      print_int result; print_newline(); flush stdout
    done
  with Lexer.Eof ->
    exit 0
</PRE>
To compile everything, execute:
<PRE>
ocamllex lexer.mll # generates lexer.ml
ocamlyacc parser.mly # generates parser.ml and parser.mli
ocamlc -c parser.mli
ocamlc -c lexer.ml
ocamlc -c parser.ml
ocamlc -c calc.ml
ocamlc -o calc lexer.cmo parser.cmo calc.cmo
</PRE>
<H2>11.7 Common errors</H2><DL COMPACT=compact>
<DT>ocamllex: transition table overflow, automaton is too big<DD><BR>
<BR>
The deterministic automata generated by <TT>ocamllex</TT> are limited to at
most 32767 transitions. The message above indicates that your lexer
definition is too complex and overflows this limit. This is commonly
caused by lexer definitions that have separate rules for each of the
alphabetic keywords of the language, as in the following example.
<PRE>
rule token = parse
"keyword1" { KWD1 }
| "keyword2" { KWD2 }
| ...
| "keyword100" { KWD100 }
| ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] *
{ IDENT(Lexing.lexeme lexbuf) }
</PRE>
To keep the generated automata small, rewrite those definitions with
only one general ``identifier'' rule, followed by a hashtable lookup
to separate keywords from identifiers:
<PRE>
{ let keyword_table = Hashtbl.create 53
  let _ =
    List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok)
              [ "keyword1", KWD1;
                "keyword2", KWD2; ...
                "keyword100", KWD100 ]
}
rule token = parse
  ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] *
    { let id = Lexing.lexeme lexbuf in
      try
        Hashtbl.find keyword_table id
      with Not_found ->
        IDENT id }
</PRE></DL>
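The lookup itself can be tried outside of a lexer. The following
self-contained sketch uses a hypothetical <TT>token</TT> type with three
keywords and a helper named <TT>classify</TT> (both are assumptions made
for illustration) to show how a matched string is turned into either a
keyword token or an identifier.
<PRE>
(* Hypothetical token type; a real one would come from the parser. *)
type token = KWD_IF | KWD_THEN | KWD_ELSE | IDENT of string

let keyword_table : (string, token) Hashtbl.t = Hashtbl.create 53

(* Fill the table once, when the module is initialized. *)
let () =
  List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok)
    [ "if", KWD_IF; "then", KWD_THEN; "else", KWD_ELSE ]

(* Keywords are found in the table; anything else is an identifier. *)
let classify s =
  try Hashtbl.find keyword_table s
  with Not_found -> IDENT s
</PRE>
Here <TT>classify "if"</TT> yields <TT>KWD_IF</TT>, while
<TT>classify "iffy"</TT> yields <TT>IDENT "iffy"</TT>.<BR>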
</BODY>
</HTML>