File: manual026.html

package info (click to toggle)
ocaml-doc 3.09-1
  • links: PTS
  • area: non-free
  • in suites: etch, etch-m68k
  • size: 10,428 kB
  • ctags: 4,963
  • sloc: ml: 9,244; makefile: 2,413; ansic: 122; sh: 49; asm: 17
file content (622 lines) | stat: -rw-r--r-- 34,608 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
            "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>



<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<META name="GENERATOR" content="hevea 1.08">
<LINK rel="stylesheet" type="text/css" href="manual.css">
<TITLE>
Lexer and parser generators (ocamllex, ocamlyacc)
</TITLE>
</HEAD>
<BODY >
<A HREF="manual025.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A>
<A HREF="index.html"><IMG SRC ="contents_motif.gif" ALT="Up"></A>
<A HREF="manual027.html"><IMG SRC ="next_motif.gif" ALT="Next"></A>
<HR>

<H1 CLASS="chapter"><A NAME="htoc126">Chapter&nbsp;12</A>&nbsp;&nbsp;Lexer and parser generators (ocamllex, ocamlyacc)</H1>
<A NAME="c:ocamlyacc"></A>

This chapter describes two program generators: <TT>ocamllex</TT>, that
produces a lexical analyzer from a set of regular expressions with
associated semantic actions, and <TT>ocamlyacc</TT>, that produces a parser
from a grammar with associated semantic actions.<BR>
<BR>
These program generators are very close to the well-known <TT>lex</TT> and
<TT>yacc</TT> commands that can be found in most C programming environments.
This chapter assumes a working knowledge of <TT>lex</TT> and <TT>yacc</TT>: while
it describes the input syntax for <TT>ocamllex</TT> and <TT>ocamlyacc</TT> and the
main differences with <TT>lex</TT> and <TT>yacc</TT>, it does not explain the basics
of writing a lexer or parser description in <TT>lex</TT> and <TT>yacc</TT>. Readers
unfamiliar with <TT>lex</TT> and <TT>yacc</TT> are referred to &#8220;Compilers:
principles, techniques, and tools&#8221; by Aho, Sethi and Ullman
(Addison-Wesley, 1986), or &#8220;Lex &amp; Yacc&#8221;, by Levine, Mason and
Brown (O'Reilly, 1992).<BR>
<BR>

<H2 CLASS="section"><A NAME="htoc127">12.1</A>&nbsp;&nbsp;Overview of <TT>ocamllex</TT></H2>
The <TT>ocamllex</TT> command produces a lexical analyzer from a set of regular
expressions with attached semantic actions, in the style of
<TT>lex</TT>. Assuming the input file is <I>lexer</I><TT>.mll</TT>, executing
<PRE>
        ocamllex <I>lexer</I>.mll
</PRE>
produces Caml code for a lexical analyzer in file <I>lexer</I><TT>.ml</TT>.
This file defines one lexing function per entry point in the lexer
definition. These functions have the same names as the entry
points. Lexing functions take as argument a lexer buffer, and return
the semantic attribute of the corresponding entry point.<BR>
<BR>
Lexer buffers are an abstract data type implemented in the standard
library module <TT>Lexing</TT>. The functions <TT>Lexing.from_channel</TT>,
<TT>Lexing.from_string</TT> and <TT>Lexing.from_function</TT> create
lexer buffers that read from an input channel, a character string, or
any reading function, respectively. (See the description of module
<TT>Lexing</TT> in chapter&nbsp;<A HREF="manual034.html#c:stdlib">20</A>.)<BR>
<BR>
When used in conjunction with a parser generated by <TT>ocamlyacc</TT>, the
semantic actions compute a value belonging to the type <TT>token</TT> defined
by the generated parsing module. (See the description of <TT>ocamlyacc</TT>
below.)<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc128">12.1.1</A>&nbsp;&nbsp;Options</H3>
The following command-line options are recognized by <TT>ocamllex</TT>.
<DL CLASS="description" COMPACT=compact><DT CLASS="dt-description">
<B><TT>-o</TT> <I>output-file</I></B><DD CLASS="dd-description">
Specify the name of the output file produced by <TT>ocamllex</TT>.
Default is <I>lexer</I><TT>.ml</TT>, <TT>ocamllex</TT> being invoked as
<TT>ocamllex</TT> <I>lexer</I><TT>.mll</TT>.<BR>
<BR>
<DT CLASS="dt-description"><TT><B>-ml</B></TT><DD CLASS="dd-description">
Output code that does not use the Caml built-in automata
interpreter. Instead, the automaton is encoded by Caml functions.
This option is useful for debugging <TT>ocamllex</TT>, using it for
production lexers is not recommended.<BR>
<BR>
<DT CLASS="dt-description"><TT><B>-q</B></TT><DD CLASS="dd-description">
Quiet mode. <TT>ocamllex</TT> normally outputs informational messages
to standard output. They are suppressed if option <TT>-q</TT> is used.<BR>
<BR>
<DT CLASS="dt-description" style="background-color:yellow; color:red"><TT><B>-version</B></TT><DD CLASS="dd-description" style="background-color:yellow; color:red"> 
Print version and exit.
</DL>

<H2 CLASS="section"><A NAME="htoc129">12.2</A>&nbsp;&nbsp;Syntax of lexer definitions</H2>
The format of lexer definitions is as follows:

<PRE>
{ <I>header</I> }
let <I>ident</I> = <I>regexp</I> ...
rule <I>entrypoint</I> [<I>arg<SUB>1</SUB></I>... <I>arg<SUB>n</SUB></I>] =
  parse <I>regexp</I> { <I>action</I> }
      | ...
      | <I>regexp</I> { <I>action</I> }
and <I>entrypoint</I> [<I>arg<SUB>1</SUB></I>... <I>arg<SUB>n</SUB></I>] =
  parse ...
and ...
{ <I>trailer</I> }
</PRE>
Comments are delimited by <TT>(*</TT> and <TT>*)</TT>, as in Caml.
The <TT>parse</TT> keyword, can be replaced by the <TT>shortest</TT> keyword, with
the semantic consequences explained below.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc130">12.2.1</A>&nbsp;&nbsp;Header and trailer</H3>
The <I>header</I> and <I>trailer</I> sections are arbitrary Caml
text enclosed in curly braces. Either or both can be omitted. If
present, the header text is copied as is at the beginning of the
output file and the trailer text at the end. Typically, the
header section contains the <CODE>open</CODE> directives required
by the actions, and possibly some auxiliary functions used in the
actions.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc131">12.2.2</A>&nbsp;&nbsp;Naming regular expressions</H3>
Between the header and the entry points, one can give names to
frequently-occurring regular expressions. This is written
<FONT COLOR=blue><TT>let</TT></FONT> <FONT COLOR=maroon><TT><I>ident</I></TT> <FONT COLOR=blue><TT>=</TT></FONT> &nbsp;<TT><I>regexp</I></TT></FONT>.
In regular expressions that follow this declaration, the identifier
<I>ident</I> can be used as shorthand for <I>regexp</I>.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc132">12.2.3</A>&nbsp;&nbsp;Entry points</H3>
The names of the entry points must be valid identifiers for Caml
values (starting with a lowercase letter).
Similarily, the arguments <TT><I>arg<SUB>1</SUB></I>...
<I>arg<SUB>n</SUB></I></TT> must be valid identifiers for Caml.
Each entry point becomes a
Caml function that takes <I>n</I>+1 arguments,
the extra implicit last argument being of type <TT>Lexing.lexbuf</TT>.
Characters are read from the <TT>Lexing.lexbuf</TT> argument and matched
against the regular expressions provided in the rule, until a prefix
of the input matches one of the rule. The corresponding action is
then evaluated and returned as the result of the function.<BR>
<BR>
If several regular expressions match a prefix of the input, the
&#8220;longest match&#8221; rule applies: the regular expression that matches
the longest prefix of the input is selected. In case of tie, the
regular expression that occurs earlier in the rule is selected.<BR>
<BR>
However, if lexer rules are introduced with the <TT>shortest</TT> keyword in
place of the <TT>parse</TT> keyword, then the &#8220;shortest match&#8221; rule applies:
the shortest prefix of the input is selected. In case of tie, the
regular expression that occurs earlier in the rule is still selected.
This feature is not intended for use in ordinary lexical analyzers, it
may facilitate the use of <TT>ocamllex</TT> as a simple text processing tool.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc133">12.2.4</A>&nbsp;&nbsp;Regular expressions</H3>
The regular expressions are in the style of <TT>lex</TT>, with a more
Caml-like syntax.
<DL CLASS="description" COMPACT=compact><DT CLASS="dt-description"><FONT COLOR=blue><B><TT>'</TT> <FONT COLOR=maroon><TT><I>char</I></TT></FONT> <TT>'</TT></B></FONT><DD CLASS="dd-description">
A character constant, with the same syntax as Objective Caml character
constants. Match the denoted character.<BR>
<BR>
<DT CLASS="dt-description"><TT><B>_</B></TT><DD CLASS="dd-description">
(Underscore.) Match any character.<BR>
<BR>
<DT CLASS="dt-description"><TT><B>eof</B></TT><DD CLASS="dd-description">
Match the end of the lexer input.<BR>
<B>Note:</B> On some systems, with interactive input, an end-of-file
may be followed by more characters. However, <TT>ocamllex</TT> will not
correctly handle regular expressions that contain <TT>eof</TT> followed by
something else.<BR>
<BR>
<DT CLASS="dt-description"><FONT COLOR=blue><B><TT>"</TT> <FONT COLOR=maroon><TT><I>string</I></TT></FONT> <TT>"</TT></B></FONT><DD CLASS="dd-description">
A string constant, with the same syntax as Objective Caml string
constants. Match the corresponding sequence of characters.<BR>
<BR>
<DT CLASS="dt-description"><FONT COLOR=blue><B><TT>[</TT> <FONT COLOR=maroon><TT><I>character-set</I></TT></FONT> <TT>]</TT></B></FONT><DD CLASS="dd-description">
Match any single character belonging to the given
character set. Valid character sets are: single
character constants <FONT COLOR=blue><TT>'</TT> <FONT COLOR=maroon><TT><I>c</I></TT></FONT> <TT>'</TT></FONT>; ranges of characters
<FONT COLOR=blue><TT>'</TT></FONT> <FONT COLOR=maroon><I><TT>c</TT></I></FONT><SUB>1</SUB> <FONT COLOR=blue><TT>'</TT> <TT>-</TT> <TT>'</TT></FONT> &nbsp;<FONT COLOR=maroon><I><TT>c</TT></I></FONT><SUB>2</SUB> <FONT COLOR=blue><TT>'</TT></FONT> (all characters between <I>c</I><SUB>1</SUB> and <I>c</I><SUB>2</SUB>,
inclusive); and the union of two or more character sets, denoted by
concatenation.<BR>
<BR>
<DT CLASS="dt-description"><FONT COLOR=blue><B><TT>[</TT> <TT>^</TT> <FONT COLOR=maroon><TT><I>character-set</I></TT></FONT> <TT>]</TT></B></FONT><DD CLASS="dd-description">
Match any single character not belonging to the given character set.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=maroon><TT><I>regexp</I></TT></FONT><SUB>1</SUB> <FONT COLOR=blue><TT>#</TT></FONT> &nbsp;<FONT COLOR=maroon><TT><I>regexp</I></TT></FONT><SUB>2</SUB></B><DD CLASS="dd-description">
(Difference of character sets).
Regular expressions <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>1</SUB> and <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>2</SUB> must be character sets
defined with <FONT COLOR=blue><TT>[</TT></FONT>&hellip; <FONT COLOR=blue><TT>]</TT></FONT> (or a a single character expression or
underscore <TT>_</TT>).
Match the difference of the two specified character sets.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=maroon><TT><I>regexp</I></TT></FONT> <FONT COLOR=blue><TT>*</TT></FONT></B><DD CLASS="dd-description">
(Repetition.) Match the concatenation of zero or more
strings that match <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT>. <BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=maroon><TT><I>regexp</I></TT></FONT> <FONT COLOR=blue><TT>+</TT></FONT></B><DD CLASS="dd-description">
(Strict repetition.) Match the concatenation of one or more
strings that match <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT>.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=maroon><TT><I>regexp</I></TT></FONT> <FONT COLOR=blue><TT>?</TT></FONT></B><DD CLASS="dd-description">
(Option.) Match either the empty string, or a string matching <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT>.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=maroon><TT><I>regexp</I></TT></FONT><SUB>1</SUB> <FONT COLOR=blue><TT>|</TT></FONT> &nbsp;<FONT COLOR=maroon><TT><I>regexp</I></TT></FONT><SUB>2</SUB></B><DD CLASS="dd-description">
(Alternative.) Match any string that matches either <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>1</SUB> or <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>2</SUB><BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=maroon><TT><I>regexp</I></TT></FONT><SUB>1</SUB> &nbsp;<FONT COLOR=maroon><TT><I>regexp</I></TT></FONT><SUB>2</SUB></B><DD CLASS="dd-description">
(Concatenation.) Match the concatenation of two strings, the first
matching <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>1</SUB>, the second matching <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>2</SUB>.<BR>
<BR>
<DT CLASS="dt-description"><FONT COLOR=blue><B><TT>(</TT> <FONT COLOR=maroon><TT><I>regexp</I></TT></FONT> <TT>)</TT></B></FONT><DD CLASS="dd-description">
Match the same strings as <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT>.<BR>
<BR>
<DT CLASS="dt-description"><FONT COLOR=maroon><I><TT><B>ident</B></TT></I></FONT><DD CLASS="dd-description">
Reference the regular expression bound to <FONT COLOR=maroon><I><TT>ident</TT></I></FONT> by an earlier
<FONT COLOR=blue><TT>let</TT></FONT> <FONT COLOR=maroon><TT><I>ident</I></TT> <FONT COLOR=blue><TT>=</TT></FONT> &nbsp;<TT><I>regexp</I></TT></FONT> definition.<BR>
<BR>
<DT CLASS="dt-description"><FONT COLOR=maroon><B><TT><I>regexp</I></TT> <FONT COLOR=blue><TT>as</TT></FONT> &nbsp;<TT><I>ident</I></TT></B></FONT><DD CLASS="dd-description">
Bind the substring matched by <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT> to identifier <FONT COLOR=maroon><I><TT>ident</TT></I></FONT>.
</DL>
Concerning the precedences of operators, <TT>*</TT> and <TT>+</TT> have
highest precedence, followed by <TT>?</TT>, then concatenation, then
<TT>|</TT> (alternation), then <TT>as</TT>.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc134">12.2.5</A>&nbsp;&nbsp;Actions</H3>
The actions are arbitrary Caml expressions. They are evaluated in
a context where the identifiers defined by using the <TT>as</TT> construct
are bound to subparts of the matched string.
Additionally, <TT>lexbuf</TT> is bound to the current lexer
buffer. Some typical uses for <TT>lexbuf</TT>, in conjunction with the
operations on lexer buffers provided by the <TT>Lexing</TT> standard library
module, are listed below.
<DL CLASS="description" COMPACT=compact><DT CLASS="dt-description">
<TT><B>Lexing.lexeme lexbuf</B></TT><DD CLASS="dd-description">
Return the matched string.<BR>
<BR>
<DT CLASS="dt-description"><B><TT>Lexing.lexeme_char lexbuf </TT><I>n</I></B><DD CLASS="dd-description">
Return the <I>n</I><FONT SIZE=2><SUP>th</SUP></FONT>
character in the matched string. The first character corresponds to <I>n</I> = 0.<BR>
<BR>
<DT CLASS="dt-description"><TT><B>Lexing.lexeme_start lexbuf</B></TT><DD CLASS="dd-description">
Return the absolute position in the input text of the beginning of the
matched string. The first character read from the input text has
position 0.<BR>
<BR>
<DT CLASS="dt-description"><TT><B>Lexing.lexeme_end lexbuf</B></TT><DD CLASS="dd-description">
Return the absolute position in the input text of the end of the
matched string. The first character read from the input text has
position 0.<BR>
<BR>
<DT CLASS="dt-description"><B><I>entrypoint</I> [<I>exp<SUB>1</SUB></I>... <I>exp<SUB>n</SUB></I>] <TT>lexbuf</TT></B><DD CLASS="dd-description">
(Where <I>entrypoint</I> is the name of another entry point in the same
lexer definition.) Recursively call the lexer on the given entry point.
Notice that <TT>lexbuf</TT> is the last argument.
Useful for lexing nested comments, for example.</DL>

<H3 CLASS="subsection"><A NAME="htoc135">12.2.6</A>&nbsp;&nbsp;Variables in regular expressions</H3>
The <TT>as</TT> construct is similar to &#8220;<EM>groups</EM>&#8221; as provided by
numerous regular expression packages.
The type of these variables can be <TT>string</TT>, <TT>char</TT>, <TT>string option</TT>
or <TT>char option</TT>.<BR>
<BR>
We first consider the case of linear patterns, that is the case when
all <TT>as</TT> bound variables are distinct.
In <FONT COLOR=maroon><TT><I>regexp</I></TT> <FONT COLOR=blue><TT>as</TT></FONT> &nbsp;<TT><I>ident</I></TT></FONT>, the type of <FONT COLOR=maroon><I><TT>ident</TT></I></FONT> normally is <TT>string</TT> (or
<TT>string option</TT>) except
when <FONT COLOR=maroon><I><TT>regexp</TT></I></FONT> is a character constant, an underscore, a string
constant of length one, a character set specification, or an
alternation of those. Then, the type of <FONT COLOR=maroon><I><TT>ident</TT></I></FONT> is <TT>char</TT> (or <TT>char option</TT>).
Option types are introduced when overall rule matching does not
imply matching of the bound sub-pattern. This is in particular the
case of <FONT COLOR=blue><TT>(</TT> <FONT COLOR=maroon><TT><I>regexp</I></TT></FONT> <TT>as</TT> &nbsp;<TT><FONT COLOR=maroon><I>indent</I></FONT>)</TT> <TT>?</TT></FONT> and of
<FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>1</SUB> <FONT COLOR=blue><TT>|</TT> <TT>(</TT></FONT> &nbsp;<FONT COLOR=maroon><I><TT>regexp</TT></I></FONT><SUB>2</SUB> <FONT COLOR=blue><TT>as</TT> &nbsp;<FONT COLOR=maroon><TT><I>ident</I></TT></FONT> <TT>)</TT></FONT>.<BR>
<BR>
There is no linearity restriction over <TT>as</TT> bound variables.
When a variable is bound more than once, the previous rules are to be
extended as follows:
<UL CLASS="itemize"><LI CLASS="li-itemize">
A variable is a <TT>char</TT> variable when all its occurrences bind
<TT>char</TT> occurrences in the previous sense.
<LI CLASS="li-itemize">A variable is an <TT>option</TT> variable when the overall expression
can be matched without binding this variable.
</UL>
For instance, in
<CODE>('a' as x) | ( 'a' (_ as x) )</CODE> the variable <CODE>x</CODE> is of type
<TT>char</TT>, whereas in 
<CODE>("ab" as x) | ( 'a' (_ as x) ? )</CODE> the variable <CODE>x</CODE> is of type
<TT>string option</TT>.<BR>
<BR>
In some cases, a sucessful match may not yield a unique set of bindings.
For instance the matching of <CODE>aba</CODE> by the regular expression
<CODE>(('a'|"ab") as x) (("ba"|'a') as y)</CODE> may result in binding
either
<CODE>x</CODE> to <CODE>"ab"</CODE> and <CODE>y</CODE> to <CODE>"a"</CODE>, or
<CODE>x</CODE> to <CODE>"a"</CODE> and <CODE>y</CODE> to <CODE>"ba"</CODE>.
The automata produced <TT>ocamllex</TT> on such ambiguous regular
expressions will select one of the possible resulting sets of
bindings.
The selected set of bindings is purposely left unspecified.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc136">12.2.7</A>&nbsp;&nbsp;Reserved identifiers</H3>
All identifiers starting with <TT>__ocaml_lex</TT> are reserved for use by
<TT>ocamllex</TT>; do not use any such identifier in your programs.<BR>
<BR>

<H2 CLASS="section"><A NAME="htoc137">12.3</A>&nbsp;&nbsp;Overview of <TT>ocamlyacc</TT></H2>
The <TT>ocamlyacc</TT> command produces a parser from a context-free grammar
specification with attached semantic actions, in the style of <TT>yacc</TT>.
Assuming the input file is <I>grammar</I><TT>.mly</TT>, executing
<PRE>
        ocamlyacc <I>options grammar</I>.mly
</PRE>
produces Caml code for a parser in the file <I>grammar</I><TT>.ml</TT>,
and its interface in file <I>grammar</I><TT>.mli</TT>.<BR>
<BR>
The generated module defines one parsing function per entry point in
the grammar. These functions have the same names as the entry points.
Parsing functions take as arguments a lexical analyzer (a function
from lexer buffers to tokens) and a lexer buffer, and return the
semantic attribute of the corresponding entry point. Lexical analyzer
functions are usually generated from a lexer specification by the
<TT>ocamllex</TT> program. Lexer buffers are an abstract data type
implemented in the standard library module <TT>Lexing</TT>. Tokens are values from
the concrete type <TT>token</TT>, defined in the interface file
<I>grammar</I><TT>.mli</TT> produced by <TT>ocamlyacc</TT>.<BR>
<BR>

<H2 CLASS="section"><A NAME="htoc138">12.4</A>&nbsp;&nbsp;Syntax of grammar definitions</H2>
Grammar definitions have the following format:
<PRE>
%{
  <I>header</I>
%}
  <I>declarations</I>
%%
  <I>rules</I>
%%
  <I>trailer</I>
</PRE>
Comments are enclosed between <CODE>/*</CODE> and <CODE>*/</CODE> (as in C) in the
&#8220;declarations&#8221; and &#8220;rules&#8221; sections, and between <CODE>(*</CODE> and
<CODE>*)</CODE> (as in Caml) in the &#8220;header&#8221; and &#8220;trailer&#8221; sections.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc139">12.4.1</A>&nbsp;&nbsp;Header and trailer</H3>
The header and the trailer sections are Caml code that is copied
as is into file <I>grammar</I><TT>.ml</TT>. Both sections are optional. The header
goes at the beginning of the output file; it usually contains
<TT>open</TT> directives and auxiliary functions required by the semantic
actions of the rules. The trailer goes at the end of the output file.<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc140">12.4.2</A>&nbsp;&nbsp;Declarations</H3>
Declarations are given one per line. They all start with a <CODE>%</CODE> sign.
<DL CLASS="description" COMPACT=compact><DT CLASS="dt-description"><B><FONT COLOR=blue><TT>%token</TT></FONT> <FONT COLOR=maroon><TT><I>symbol</I></TT></FONT> &hellip; &nbsp;<FONT COLOR=maroon><TT><I>symbol</I></TT></FONT></B><DD CLASS="dd-description">
Declare the given symbols as tokens (terminal symbols). These symbols
are added as constant constructors for the <TT>token</TT> concrete type.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=blue><TT>%token</TT></FONT> <FONT COLOR=blue><TT>&lt;</TT></FONT> <FONT COLOR=maroon><I><TT>type</TT></I> <FONT COLOR=blue><TT>&gt;</TT></FONT> &nbsp;<I><TT>symbol</TT></I></FONT> &hellip; &nbsp;<FONT COLOR=maroon><TT><I>symbol</I></TT></FONT></B><DD CLASS="dd-description">
Declare the given symbols as tokens with an attached attribute of the
given type. These symbols are added as constructors with arguments of
the given type for the <TT>token</TT> concrete type. The <FONT COLOR=maroon><I><TT>type</TT></I></FONT> part is
an arbitrary Caml type expression, except that all type
constructor names must be fully qualified (e.g. <TT>Modname.typename</TT>)
for all types except standard built-in types, even if the proper
<CODE>open</CODE> directives (e.g. <CODE>open Modname</CODE>) were given in the
header section. That's because the header is copied only to the <TT>.ml</TT>
output file, but not to the <TT>.mli</TT> output file, while the <FONT COLOR=maroon><I><TT>type</TT></I></FONT> part
of a <CODE>%token</CODE> declaration is copied to both.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=blue><TT>%start</TT></FONT> <FONT COLOR=maroon><TT><I>symbol</I></TT></FONT> &hellip; &nbsp;<FONT COLOR=maroon><TT><I>symbol</I></TT></FONT></B><DD CLASS="dd-description">
Declare the given symbols as entry points for the grammar. For each
entry point, a parsing function with the same name is defined in the
output module. Non-terminals that are not declared as entry points
have no such parsing function. Start symbols must be given a type with
the <CODE>%type</CODE> directive below.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=blue><TT>%type</TT></FONT> <FONT COLOR=blue><TT>&lt;</TT></FONT> <FONT COLOR=maroon><I><TT>type</TT></I> <FONT COLOR=blue><TT>&gt;</TT></FONT> &nbsp;<I><TT>symbol</TT></I></FONT> &hellip; &nbsp;<FONT COLOR=maroon><TT><I>symbol</I></TT></FONT></B><DD CLASS="dd-description">
Specify the type of the semantic attributes for the given symbols.
This is mandatory for start symbols only. Other nonterminal symbols
need not be given types by hand: these types will be inferred when
running the output files through the Objective Caml compiler (unless the
<CODE>-s</CODE> option is in effect). The <FONT COLOR=maroon><I><TT>type</TT></I></FONT> part is an arbitrary Caml
type expression, except that all type constructor names must be
fully qualified, as explained above for <TT>%token</TT>.<BR>
<BR>
<DT CLASS="dt-description"><B><FONT COLOR=blue><TT>%left</TT></FONT> <FONT COLOR=maroon><TT><I>symbol</I></TT></FONT> &hellip; &nbsp;<FONT COLOR=maroon><TT><I>symbol</I></TT></FONT></B><DD CLASS="dd-description">
<DT CLASS="dt-description"><B><FONT COLOR=blue><TT>%right</TT></FONT> <FONT COLOR=maroon><TT><I>symbol</I></TT></FONT> &hellip; &nbsp;<FONT COLOR=maroon><TT><I>symbol</I></TT></FONT></B><DD CLASS="dd-description">
<DT CLASS="dt-description"><B><FONT COLOR=blue><TT>%nonassoc</TT></FONT> <FONT COLOR=maroon><TT><I>symbol</I></TT></FONT> &hellip; &nbsp;<FONT COLOR=maroon><TT><I>symbol</I></TT></FONT></B><DD CLASS="dd-description"><BR>
<BR>
Associate precedences and associativities to the given symbols. All
symbols on the same line are given the same precedence. They have
higher precedence than symbols declared before in a <CODE>%left</CODE>,
<CODE>%right</CODE> or <CODE>%nonassoc</CODE> line. They have lower precedence
than symbols declared after in a <CODE>%left</CODE>, <CODE>%right</CODE> or
<CODE>%nonassoc</CODE> line. The symbols are declared to associate to the
left (<CODE>%left</CODE>), to the right (<CODE>%right</CODE>), or to be
non-associative (<CODE>%nonassoc</CODE>). The symbols are usually tokens.
They can also be dummy nonterminals, for use with the <CODE>%prec</CODE>
directive inside the rules.<BR>
<BR>
The precedence declarations are used in the following way to
resolve reduce/reduce and shift/reduce conflicts:
<UL CLASS="itemize"><LI CLASS="li-itemize">
Tokens and rules have precedences. By default, the precedence
 of a rule is the precedence of its rightmost terminal. You
 can override this default by using the <FONT COLOR=blue><TT>%prec</TT></FONT> directive in the rule.
<LI CLASS="li-itemize">A reduce/reduce conflict
 is resolved in favor of the first rule (in the order given by the
 source file), and <TT>ocamlyacc</TT> outputs a warning.
<LI CLASS="li-itemize">A shift/reduce conflict
 is resolved by comparing the precedence of the rule to be
 reduced with the precedence of the token to be shifted. If the
 precedence of the rule is higher, then the rule will be reduced;
 if the precedence of the token is higher, then the token will
 be shifted.
<LI CLASS="li-itemize">A shift/reduce conflict between a rule and a token with the
 same precedence will be resolved using the associativity: if the
 token is left-associative, then the parser will reduce; if the
 token is right-associative, then the parser will shift. If the
 token is non-associative, then the parser will declare a syntax
 error.
<LI CLASS="li-itemize">When a shift/reduce conflict cannot be resolved using the above
 method, then <TT>ocamlyacc</TT> will output a warning and the parser will
 always shift.
</UL></DL>

<H3 CLASS="subsection"><A NAME="htoc141">12.4.3</A>&nbsp;&nbsp;Rules</H3>
The syntax for rules is as usual:
<PRE>
<I>nonterminal</I> :
    <I>symbol</I> ... <I>symbol</I> { <I>semantic-action</I> }
  | ...
  | <I>symbol</I> ... <I>symbol</I> { <I>semantic-action</I> }
;
</PRE>
Rules can also contain the <CODE>%prec </CODE><I>symbol</I> directive in the
right-hand side part, to override the default precedence and
associativity of the rule with the precedence and associativity of the
given symbol.<BR>
<BR>
Semantic actions are arbitrary Caml expressions, that
are evaluated to produce the semantic attribute attached to
the defined nonterminal. The semantic actions can access the
semantic attributes of the symbols in the right-hand side of
the rule with the <CODE>$</CODE> notation: <CODE>$1</CODE> is the attribute for the
first (leftmost) symbol, <CODE>$2</CODE> is the attribute for the second
symbol, etc.<BR>
<BR>
The rules may contain the special symbol <TT>error</TT> to indicate
resynchronization points, as in <TT>yacc</TT>.<BR>
<BR>
Actions occurring in the middle of rules are not supported.<BR>
<BR>
Nonterminal symbols are like regular Caml symbols, except that they
cannot end with <TT>'</TT> (single quote).<BR>
<BR>

<H3 CLASS="subsection"><A NAME="htoc142">12.4.4</A>&nbsp;&nbsp;Error handling</H3>
Error recovery is supported as follows: when the parser reaches an
error state (no grammar rules can apply), it calls a function named
<TT>parse_error</TT> with the string <TT>"syntax error"</TT> as argument. The default
<TT>parse_error</TT> function does nothing and returns, thus initiating error
recovery (see below). The user can define a customized <TT>parse_error</TT>
function in the header section of the grammar file.<BR>
<BR>
The parser also enters error recovery mode if one of the grammar
actions raises the <TT>Parsing.Parse_error</TT> exception.<BR>
<BR>
In error recovery mode, the parser discards states from the
stack until it reaches a place where the error token can be shifted.
It then discards tokens from the input until it finds three successive
tokens that can be accepted, and starts processing with the first of
these. If no state can be uncovered where the error token can be
shifted, then the parser aborts by raising the <TT>Parsing.Parse_error</TT>
exception.<BR>
<BR>
Refer to documentation on <TT>yacc</TT> for more details and guidance in how
to use error recovery.<BR>
<BR>

<H2 CLASS="section"><A NAME="htoc143">12.5</A>&nbsp;&nbsp;Options</H2>
The <TT>ocamlyacc</TT> command recognizes the following options:
<DL CLASS="description" COMPACT=compact><DT CLASS="dt-description"><B><TT>-b</TT><I>prefix</I></B><DD CLASS="dd-description">
Name the output files <I>prefix</I><TT>.ml</TT>, <I>prefix</I><TT>.mli</TT>,
<I>prefix</I><TT>.output</TT>, instead of the default naming convention.<BR>
<BR>
<DT CLASS="dt-description"><TT><B>-v</B></TT><DD CLASS="dd-description">
Generate a description of the parsing tables and a report on conflicts
resulting from ambiguities in the grammar. The description is put in
file <I>grammar</I><TT>.output</TT>.<BR>
<BR>
<DT CLASS="dt-description" style="background-color:yellow; color:red"><TT><B>-version</B></TT><DD CLASS="dd-description" style="background-color:yellow; color:red"> 
Print version and exit.</DL>
At run-time, the <TT>ocamlyacc</TT>-generated parser can be debugged by
setting the <TT>p</TT> option in the <TT>OCAMLRUNPARAM</TT> environment variable
(see section&nbsp;<A HREF="manual024.html#ocamlrun-options">10.2</A>). This causes the pushdown
automaton executing the parser to print a trace of its action (tokens
shifted, rules reduced, etc). The trace mentions rule numbers and
state numbers that can be interpreted by looking at the file
<I>grammar</I><TT>.output</TT> generated by <TT>ocamlyacc -v</TT>.<BR>
<BR>

<H2 CLASS="section"><A NAME="htoc144">12.6</A>&nbsp;&nbsp;A complete example</H2>
The all-time favorite: a desk calculator. This program reads
arithmetic expressions on standard input, one per line, and prints
their values. Here is the grammar definition:
<PRE CLASS="verbatim">
        /* File parser.mly */
        %token &lt;int&gt; INT
        %token PLUS MINUS TIMES DIV
        %token LPAREN RPAREN
        %token EOL
        %left PLUS MINUS        /* lowest precedence */
        %left TIMES DIV         /* medium precedence */
        %nonassoc UMINUS        /* highest precedence */
        %start main             /* the entry point */
        %type &lt;int&gt; main
        %%
        main:
            expr EOL                { $1 }
        ;
        expr:
            INT                     { $1 }
          | LPAREN expr RPAREN      { $2 }
          | expr PLUS expr          { $1 + $3 }
          | expr MINUS expr         { $1 - $3 }
          | expr TIMES expr         { $1 * $3 }
          | expr DIV expr           { $1 / $3 }
          | MINUS expr %prec UMINUS { - $2 }
        ;
</PRE>Here is the definition for the corresponding lexer:
<PRE CLASS="verbatim">
        (* File lexer.mll *)
        {
        open Parser        (* The type token is defined in parser.mli *)
        exception Eof
        }
        rule token = parse
            [' ' '\t']     { token lexbuf }     (* skip blanks *)
          | ['\n' ]        { EOL }
          | ['0'-'9']+ as lxm { INT(int_of_string lxm) }
          | '+'            { PLUS }
          | '-'            { MINUS }
          | '*'            { TIMES }
          | '/'            { DIV }
          | '('            { LPAREN }
          | ')'            { RPAREN }
          | eof            { raise Eof }
</PRE>Here is the main program, that combines the parser with the lexer:
<PRE CLASS="verbatim">
        (* File calc.ml *)
        let _ =
          try
            let lexbuf = Lexing.from_channel stdin in
            while true do
              let result = Parser.main Lexer.token lexbuf in
                print_int result; print_newline(); flush stdout
            done
          with Lexer.Eof -&gt;
            exit 0
</PRE>To compile everything, execute:
<PRE CLASS="verbatim">
        ocamllex lexer.mll       # generates lexer.ml
        ocamlyacc parser.mly     # generates parser.ml and parser.mli
        ocamlc -c parser.mli
        ocamlc -c lexer.ml
        ocamlc -c parser.ml
        ocamlc -c calc.ml
        ocamlc -o calc lexer.cmo parser.cmo calc.cmo
</PRE>

<H2 CLASS="section"><A NAME="htoc145">12.7</A>&nbsp;&nbsp;Common errors</H2>
<DL CLASS="description" COMPACT=compact><DT CLASS="dt-description"><B>ocamllex: transition table overflow, automaton is too big</B><DD CLASS="dd-description"><BR>
<BR>
The deterministic automata generated by <TT>ocamllex</TT> are limited to at
most 32767 transitions. The message above indicates that your lexer
definition is too complex and overflows this limit. This is commonly
caused by lexer definitions that have separate rules for each of the
alphabetic keywords of the language, as in the following example.
<PRE CLASS="verbatim">
rule token = parse
  "keyword1"   { KWD1 }
| "keyword2"   { KWD2 }
| ...
| "keyword100" { KWD100 }
| ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
               { IDENT id}
</PRE>To keep the generated automata small, rewrite those definitions with
only one general &#8220;identifier&#8221; rule, followed by a hashtable lookup
to separate keywords from identifiers:
<PRE CLASS="verbatim">
{ let keyword_table = Hashtbl.create 53
  let _ =
    List.iter (fun (kwd, tok) -&gt; Hashtbl.add keyword_table kwd tok)
              [ "keyword1", KWD1;
                "keyword2", KWD2; ...
                "keyword100", KWD100 ]
}
rule token = parse
  ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id
               { try
                   Hashtbl.find keyword_table id
                 with Not_found -&gt;
                   IDENT id }
</PRE><BR>
<BR>
<DT CLASS="dt-description"><B>ocamllex: Position memory overflow, too many bindings</B><DD CLASS="dd-description">
The deterministic automata generated by <TT>ocamllex</TT> maintains a table of
positions inside the scanned lexer buffer. The size of this table is
limited to at most 255 cells. This error should not show up in normal
situations.</DL>



<HR>
<A HREF="manual025.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A>
<A HREF="index.html"><IMG SRC ="contents_motif.gif" ALT="Up"></A>
<A HREF="manual027.html"><IMG SRC ="next_motif.gif" ALT="Next"></A>
</BODY>
</HTML>