<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<TITLE>
Module Genlex: a generic lexical analyzer
</TITLE>
</HEAD>
<BODY >
<A HREF="manual039.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A>
<A HREF="manual041.html"><IMG SRC ="next_motif.gif" ALT="Next"></A>
<A HREF="manual030.html"><IMG SRC ="contents_motif.gif" ALT="Contents"></A>
<HR>
<H2>17.10 Module <TT>Genlex</TT>: a generic lexical analyzer</H2><A NAME="s:Genlex"></A>
<A NAME="@manual321"></A><BLOCKQUOTE>
This module implements a simple ``standard'' lexical analyzer, presented
as a function from character streams to token streams. It implements
roughly the lexical conventions of Caml, but is parameterized by the
set of keywords of your language.
</BLOCKQUOTE>
<PRE>
type token =
    Kwd of string
  | Ident of string
  | Int of int
  | Float of float
  | String of string
  | Char of char
</PRE>
<BLOCKQUOTE>
The type of tokens. The lexical classes are: <CODE>Int</CODE> and <CODE>Float</CODE>
for integer and floating-point numbers; <CODE>String</CODE> for
string literals, enclosed in double quotes; <CODE>Char</CODE> for
character literals, enclosed in single quotes; <CODE>Ident</CODE> for
identifiers (either sequences of letters, digits, underscores
and quotes, or sequences of ``operator characters'' such as
<CODE>+</CODE>, <CODE>*</CODE>, etc); and <CODE>Kwd</CODE> for keywords (either identifiers or
single ``special characters'' such as <CODE>(</CODE>, <CODE>}</CODE>, etc).
</BLOCKQUOTE>
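<BLOCKQUOTE>
As an illustration (not part of the interface), here is how a short input is
split into these classes, assuming a lexer built with <CODE>make_lexer</CODE>
(described below) from the arbitrary keyword list <CODE>["+"]</CODE>:
<PRE>
open Genlex

(* Sketch only: the keyword list ["+"] is an arbitrary choice. *)
let lexer = make_lexer ["+"]
let tokens = lexer (Stream.of_string "1.5 + x \"hi\"")

(* Stream.npeek 4 tokens is expected to yield
   [Float 1.5; Kwd "+"; Ident "x"; String "hi"] *)
</PRE>
</BLOCKQUOTE>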
<PRE>
val make_lexer: string list -> (char Stream.t -> token Stream.t)
</PRE>
<A NAME="@manual322"></A><BLOCKQUOTE>
Construct the lexer function. The first argument is the list of
keywords. An identifier <CODE>s</CODE> is returned as <CODE>Kwd s</CODE> if <CODE>s</CODE>
belongs to this list, and as <CODE>Ident s</CODE> otherwise.
A special character <CODE>s</CODE> is returned as <CODE>Kwd s</CODE> if <CODE>s</CODE>
belongs to this list, and causes a lexical error (exception
<CODE>Parse_error</CODE>) otherwise. Blanks and newlines are skipped.
Comments delimited by <CODE>(*</CODE> and <CODE>*)</CODE> are skipped as well,
and can be nested.
</BLOCKQUOTE>
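<BLOCKQUOTE>
A minimal sketch of this behaviour (the keyword list is chosen purely for
illustration): identifiers appearing in the list come back as <CODE>Kwd</CODE>,
the others as <CODE>Ident</CODE>, and nested comments are skipped:
<PRE>
open Genlex

let lexer = make_lexer ["let"; "="; "+"]
let tokens =
  lexer (Stream.of_string "let x = 1 + 2 (* a (* nested *) comment *)")

(* Stream.npeek 6 tokens is expected to yield
   [Kwd "let"; Ident "x"; Kwd "="; Int 1; Kwd "+"; Int 2] *)
</PRE>
</BLOCKQUOTE>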
<BLOCKQUOTE>
Example: a lexer suitable for a desk calculator is obtained by
<PRE>
let lexer = make_lexer ["+";"-";"*";"/";"let";"="; "("; ")"]
</PRE>
The associated parser would be a function from <CODE>token Stream.t</CODE>
to, for instance, <CODE>int</CODE>, and would have rules such as:
<PRE>
let rec parse_expr = parser
    [< 'Int n >] -> n
  | [< 'Kwd "("; n = parse_expr; 'Kwd ")" >] -> n
  | [< n1 = parse_expr; n2 = parse_remainder n1 >] -> n2
and parse_remainder n1 = parser
    [< 'Kwd "+"; n2 = parse_expr >] -> n1 + n2
  | ...
</PRE>
</BLOCKQUOTE>
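<BLOCKQUOTE>
With these definitions (the <CODE>parser</CODE> notation for stream parsers may
require a syntax extension such as camlp4, depending on the compiler version),
lexer and parser could be combined roughly as follows. <CODE>parse_string</CODE>
is a hypothetical helper, and the input is kept trivial because
<CODE>parse_remainder</CODE> above is only sketched:
<PRE>
(* Hypothetical driver, reusing the lexer and parse_expr defined above. *)
let parse_string s = parse_expr (lexer (Stream.of_string s))

(* parse_string "(3)" is expected to return 3. *)
</PRE>
</BLOCKQUOTE>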
<HR>
<A HREF="manual039.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A>
<A HREF="manual041.html"><IMG SRC ="next_motif.gif" ALT="Next"></A>
<A HREF="manual030.html"><IMG SRC ="contents_motif.gif" ALT="Contents"></A>
</BODY>
</HTML>