File: parsers.html

package info (click to toggle)
camlp5 6.06-1
links: PTS, VCS
area: main
in suites: wheezy
size: 7,428 kB
sloc: ml: 77,055; sh: 1,417; makefile: 1,211
file content (719 lines) | stat: -rw-r--r-- 25,903 bytes
parent folder | download | duplicates (2)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
 "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <!-- $Id: parsers.html,v 6.4 2012-01-09 14:22:20 deraugla Exp $ -->
  <!-- Copyright (c) INRIA 2007-2012 -->
  <title>parsers</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <link rel="stylesheet" type="text/css" href="styles/base.css"
        title="Normal" />
</head>
<body>

<div id="menu">
</div>

<div id="content">

<h1 class="top">Stream parsers</h1>

<p>We describe here the syntax and the semantics of the parsers of
  streams of Camlp5. Streams are kinds of lazy lists. The parsers of
  these streams use recursive descendent method without backtracking,
  which is the most natural one in functional languages. In
  particular, parsers are normal functions.</p>

<p>Notice that the parsers have existed in OCaml since many years (the
  beginning of the 90ies), but some new features have been added in
  2007 (lookahead, "no error" optimization, let..in statement and left
  factorization) in Camlp5 distribution. This chapter describes them
  also.</p>

<div id="tableofcontents">
</div>

<h2>Introduction</h2>

<p>Parsers apply to values of type "Stream.t" defined in the module
  "Stream" of the standard library of OCaml. Like the type "list", the
  type "Stream.t" has a type parameter, indicating the type of its
  elements. They differ from the lists that they are lazy (the
  elements are evaluated as long as the parser need them for its
  actions), and imperative (parsers deletes their first elements when
  they take their parsing decisions): notice that purely functional
  parsers exist in Camlp5, where the corresponding streams are lazy
  and functional, the analyzed elements remaining in the initial
  stream and the semantic action returning the resulting stream
  together with the normal result, which allow natural limited
  backtrack but have the drawback that it is not easy to find the
  position of parsing errors when they happen.</p>

<p>Parsers of lazy+imperative streams, which are described here, use a
  method named "recursive descendent": they look at the first element,
  they decide what to do in function of its value, and continue the
  parsing with the remaining elements. Parsers can call other parsers,
  and can be recursive, like normal functions.</p>

<p>Actually, parsers are just pure syntactic sugar. When writing a
  parser in the syntax of the parser, Camlp5 transforms them into
  normal call to functions, use of patterns matchings and try..with
  statements.  The pretty printer of Camlp5, by default, displays this
  expanded result, without syntax of parsers. A pretty printing kit,
  when added, can rebuild the parsers in their initial syntax and
  display it.</p>

<h2>Syntax</h2>

<p>The syntax of the parsers, when loading "pa_rp.cmo" (or already
  included in the command "camlp5r"), is the following:</p>

<pre>
            expression ::= parser
                         | match-with-parser
                parser ::= "parser" pos-opt "[" parser-cases "]"
                         | "parser" pos-opt parser-case
     match-with-parser ::= "match" expression "with" parser
          parser-cases ::= parser-cases parser-case
                         | &lt;nothing&gt;
           parser-case ::= "[:" stream-pattern ":]" pos-opt "->" expression
        stream-pattern ::= stream-patt-comp
                         | stream-patt-comp ";" stream-patt-cont
                         | "let" LIDENT "=" expression "in" stream-pattern
                         | &lt;nothing&gt;
      stream-patt-cont ::= stream-patt-comp-err
                         | stream-patt-comp-err ";" stream-patt-cont
                         | "let" LIDENT "=" expression "in" stream-patt-cont
  stream-patt-comp-err ::= stream-patt-comp
                         | stream-patt-comp "?" expression
                         | stream-patt-comp "!"
      stream-patt-comp ::= "`" pattern
                         | "`" pattern "when" expression
                         | "?=" lookaheads
                         | pattern "=" expression
                         | pattern
            lookaheads ::= lookaheads "|" lookahead
                         | lookahead
             lookahead ::= "[" patterns "]"
              patterns ::= patterns pattern
                         | pattern
               pos-opt ::= pattern
                         | &lt;nothing&gt;
</pre>

<h2>Streams</h2>

<p>The parsers are functions taking streams as parameter. Streams are
  are values of type "<code>Stream.t a</code>" for some type
  "<code>a</code>". It is possible to build streams using the
  functions defined in the module "<code>Stream</code>":</p>

<h3>Stream.from</h3>

<p>"<code>Stream.from f</code>" returns a stream built from the
  function "<code>f</code>". To create a new stream element, the
  function "<code>f</code>" is called with the current stream count,
  starting with zero. The user function "<code>f</code>" must return
  either "<code>Some &lt;value&gt;</code>" for a value or
  "<code>None</code>" to specify the end of the stream.</p>

<h3>Stream.of_list</h3>

<p>Return a stream built from the list in the same order.</p>

<h3>Stream.of_string</h3>

<p>Return a stream of the characters of the string parameter.</p>

<h3>Stream.of_channel</h3>

<p>Return a stream of the characters read from the input channel
  parameter.</p>

<h2>Semantics of parsers</h2>

<h3>Parser</h3>

<p>A parser, defined with the syntax "parser" above, is of type
  "<code>Stream.t a -> b</code>" where "a" is the type of the elements
  of the streams and "b" the type of the result. The parser cases are
  tested in the order they are defined until one of them applies. The
  result is the semantic action of the parser case which applies. If
  no parser case applies, the exception "<code>Stream.Failure</code>"
  is raised.</p>

<p>When testing a parser case, if the first stream pattern component
  matches, all remaining stream pattern components of the stream
  pattern must match also. If one does not match, the parser raises
  the exception "<code>Stream.Error</code>" which has a parameter of
  type string: by default, this string is the empty string, but if the
  stream pattern component which does not match is followed by a
  question mark and an expression, this expression is evaluated and
  given as parameter to "<code>Stream.Error</code>".</p>

<p>In short, a parser can return with three ways:</p>

<ul>
  <li>A normal result, of type "<code>b</code>" for a parser of type
    "<code>Stream.t a -> b</code>".</li>
  <li>Raising the exception "<code>Stream.Failure</code>".</li>
  <li>Raising the exception "<code>Stream.Error</code>".</li>
</ul>

<p>Fundamentally, the exception "<code>Stream.Failure</code>" means
  "this parser does not apply and no element have been removed from
  the initial stream". This is a normal case when parsing: the parser
  locally fails, but the parsing can continue.</p>

<p>Conversely, the exception "<code>Stream.Error</code>" means that
  "this parser encountered a syntax error and elements have probably
  been removed from the stream". In this case, there is no way to
  recover the parsing, and it definitively fails.</p>

<h3>Left factorization</h3>

<p>In parsers, <em>consecutive</em> rules starting with the same
  components are left factorized. It means that they are transformed
  into one only rule starting with the common path, and continuing
  with a call to a parser separating the two cases. The order is
  kept, except that the possible empty rule is inserted at the
  end.</p>

<p>For example, the parser:</p>

<pre>
  parser
  [ [: `If; e1 = expr; `Then; e2 = expr; `Else; e3 = expr :] -> f e1 e2 e3
  | [: `If; e1 = expr; `Then; e2 = expr :] -> g e1 e2 ]
</pre>

<p>is transformed into:</p>

<pre>
  parser
    [: `If; e1 = expr; `Then; e2 = expr;
       a =
         parser
         [ [: `Else; e3 = expr :] -> f e1 e2 e3
         | [: :] -> g e1 e2 ] :] -> a
</pre>

<p>The version where rules are inverted:</p>

<pre>
  parser
  [ [: `If; e1 = expr; `Then; e2 = expr :] -> g e1 e2
  | [: `If; e1 = expr; `Then; e2 = expr; `Else; e3 = expr :] -> f e1 e2 e3 ]
</pre>

<p>is transformed into the same parser.</p>

<p>Notice that:

<ul>
  <li>Only <em>consecutive</em> rules are left factorized. In the
    following parser:
<pre>
  parser
  [ [: `If; e1 = expr; `Then; e2 = expr; `Else; e3 = expr :] -> ...
  | [: a = b :] -> a
  | [: `If; e1 = expr; `Then; e2 = expr :] -> ... ]
</pre>
    the two rules starting with "<tt>If</tt>" are not left factorized,
    and the second "<tt>If</tt>" rule will never work.
  </li>
  <li>The components which are not <em>identical</em> are not
    factorized. In the following parser:
<pre>
  parser
  [ [: `If; e1 = expr; `Then; e2 = expr; `Else; e3 = expr :] -> ...
  | [: `If; e4 = expr; `Then; e2 = expr :] -> ... ]
</pre>
    only the first component, "<tt>If</tt>" is factorized, the second
    one being different because of different patterns ("<tt>e1</tt>" and
      "<tt>e4</tt>").
  </li>
</ul>

<h3>Match with parser</h3>

<p>The syntax "match expression with parser" allows to match a stream
  against a parser. It is, for "parser", the equivalent of "match
  expression with" for "fun". The same way we could say:</p>

<pre>
  match expression with ...
</pre>

<p>could be considered as an equivalent to:</p>

<pre>
  (fun ...) expression
</pre>

<p>we could consider that:</p>

<pre>
  match expression with parser ...
</pre>

<p>is an equivalent to:</p>

<pre>
  (parser ...) expression
</pre>

<h3>Error messages</h3>

<p>A "<code>Stream.Error</code>" exception is raised when a stream
  pattern component does not match and that it is not the first one of
  the parser case. This exception has a parameter of type string,
  useful to specify the error message. By default, this is the empty
  string. To specify an error message, add a question mark and an
  expression after the stream pattern component. A typical error
  message is "that stream pattern component expected". Example with
  the parser of "if..then..else.." above:</p>

<pre>
  parser
    [: `If; e1 = expr ? "expression expected after 'if'";
       `Then ? "'then' expected";
       e2 = expr ? "expression expected after 'then'";
       a =
         parser
         [ [: `Else; e3 = expr ? "expression expected" :] -> f e1 e2 e3
         | [: :] -> g e1 e2 ] :] -> a
</pre>

<p>Notice that the expression after the question mark is evaluated
  only in case of syntax error. Therefore, it can be a complicated
  call to a complicated function without slowing down the normal
  parsing.</p>


<h3>Stream pattern component</h3>

<p>In a stream pattern (starting with "<code>[:</code>" and ending
  with "<code>:]</code>"), the stream pattern components are separated
  with the semicolon character. There are three cases of stream
  pattern components with some sub-cases for some of them, and an
  extra syntax can be used with a "let..in" construction. The three
  cases are:</p>

<ul>
  <li>A direct test of one or several stream elements
    (called <b>terminal</b> symbol), in three ways:
    <ol>
      <li>The character "backquote" followed by a pattern, meaning: if
        the stream starts with an element which is matched by this
        pattern, the stream pattern component matches, and the stream
        element is removed from the stream.</li>
      <li>The character "backquote" followed by a pattern, the keyword
        "when" and an expression of type "<code>bool</code>", meaning:
        if the stream starts with an element which is matched by this
        pattern and if the evaluation of the expression is
        "<code>True</code>", the stream pattern component matches, and
        the first element of the stream is removed.</li>
      <li>The character "question mark" followed by the character
        "equal" and a lookahead expression (see further), meaning: if
        the lookahead applies, the stream pattern component
        matches. The lookahead may unfreeze one or several elements on
        the stream, but does not remove them.</li>
    </ol>
  </li>
  <li>A pattern followed by the "equal" sign and an expression of type
    "<code>Stream.t x -> y</code>" for some types "<code>x</code>" and
    "<code>y</code>". This expression is called a <b>non terminal</b>
    symbol. It means: call the expression (which is a parser) with the
    current stream. If this sub-parser:

    <ol>
      <li>Returns an element, the pattern is bound to this result and
        the next stream pattern component is tested.</li>
      <li>Raises the exception "<code>Stream.Failure</code>", there
        are two cases:

        <ul>
          <li>if the stream pattern component is the first one of the
            stream case, the current parser also fails with the
            exception "<code>Stream.Failure</code>".</li>
          <li>if the stream pattern component is not the first one of
            the stream case, the current parser fails with the
            exception "<code>Stream.Error</code>".</li>
        </ul>

        In this second case:

        <ul>
          <li>If the stream pattern component is followed by a
            "question mark" and an expression (which must be of type
            "<code>string</code>"), the expression is evaluated and
            given as parameter of the exception
            "<tt>Stream.Error</tt>".</li>
          <li>If the expression is followed by an "exclamation mark",
            the test and conversion from "<code>Stream.Failure</code>"
            to "<code>Stream.Error</code>" is not done, and the parser
            just raises "<code>Stream.Failure</code>" again. This is
            an optimization which must be assumed by the programmer,
            in general when he knows that the sub-parser called never
            raises "<code>Stream.Failure</code>" (for example if the
            called parser ends with a parser case containing an empty
            stream pattern). See "no error optionization" below.</li>
          <li>Otherwise the exception parameter is the empty string.</li>
        </ul>
      </li>
    </ol>
  </li>
  <li>A pattern, which is bound to the current stream.</li>
</ul>

<p>Notice that patterns are bound immediately and can be used in the
  next stream pattern component.</p>

<h3>Let statement</h3>

<p>Between stream pattern components, it is possible to use the
  "let..in" construction. This is not considered as a real stream
  pattern component, in the fact that is is not tested against the
  exception "<code>Stream.Failure</code>" it may raise. It can be
  useful for intermediate computation. In particular, it is used
  internally by the lexers (see chapter
  about <a href="lexers.html">lexers</a> as character stream
  parsers).</p>

<p>Example of use, when an expression have to be used several times
  (in the example, "<code>d a</code>", which is bound to the variable
  "<code>c</code>"):</p>

<pre>
  parser
    [: a = b;
       let c = d a in
       e =
         parser
         [ [: f = g :] -> h c
         | [: :] -> c ] :] -> e
</pre>

<h3>Lookahead</h3>

<p>The lookahead feature allows to look at several terminals in the
  stream without removing them, in order to take decisions when more
  than one terminal is necessary.</p>

<p>For example, when parsing the normal syntax of the OCaml language,
  there is a problem, in recursing descendent parsing, for the cases
  where to treat and differentiate the following inputs:</p>

<pre>
  (-x+1)
  (-)
</pre>

<p>The first case is treated in a rule, telling: "a left parenthesis,
  followed by an expression, and a right parenthesis". The second one
  is "a left parenthesis, an operator, a right
  parenthesis". Programming it like this (left factorizing the first
  parenthesis):</p>

<pre>
  parser
    [: `Lparen;
       e =
         parser
         [ [: e = expr; `Rparen :] -> e
         | [: `Minus; `Rparen :] -> minus_op ] :] -> e
</pre>

<p>does not work if the input is "<code>(-)</code>" because the rule
  "<code>e = expr</code>" accepts the minus sign as expression start,
  removing it from the input stream and fails as parsing error, while
  encountering the right parenthesis.</p>

<p>Conversely, writing it this way:</p>

<pre>
  parser
    [: `Lparen;
       e =
         parser
         [ [: `Minus; `Rparen :] -> minus_op
         | [: e = expr; `Rparen :] -> e ] :] -> e
</pre>

<p>does not help, because if the input is "<code>(-x+1)</code>" the
  rule above starting with "<code>`Minus</code>" is accepted and the
  exception "<code>Stream.Error</code>" is raised while encountering
  the variable "<code>x</code>" since a right parenthesis is
  expected.</p>

<p>In general, this kind of situation is best resolved by a left
  factorization of the parser cases (see the section "Semantics"
  above), but that is not possible in this case. The solution is to
  test whether the character after the minus sign is a right
  parenthesis:</p>

<pre>
  parser
    [: `Lparen;
       e =
         parser
         [ [: ?= [ _ Rparen ]; `Minus; `Rparen :] -> minus_op
         | [: e = expr; `Rparen :] -> e ] :] -> e
</pre>

<p>It is possible to put several lists of patterns separated by a
  vertical bar in the lookahead construction, but with a limitation
  (due to the implementation): all lists of patterns must have the
  same number of elements.</p>

<h3>No error optimization</h3>

<p>The "no error optimization" is the fact to end a stream pattern
  component of kind "non-terminal" ("pattern" "equal" "expression") by
  the character "exclamation mark". Like said above, this inhibits the
  transformation of the exception "<code>Stream.Failure</code>",
  possibly raised by the called parser, into the exception
  "<code>Stream.Error</code>".</p>

<p>The code:</p>

<pre>
  parser [: a = b; c = d ! :] -> e
</pre>

<p>is equivalent to:</p>

<pre>
  parser [: a = b; s :] -> let c = d s in e
</pre>

<p>One interest of the first syntax is that it shows to readers that
  "<code>d</code>" is indeed a syntactic sub-parser. In the second
  syntax, it is called in the semantic action, which makes the parser
  case not so clear, as far as readability is concerned.</p>

<p>If the stream pattern component is at end of the stream pattern,
  this allow possible tail recursion by the OCaml compiler, in the
  following case:</p>

<pre>
  parser [: a = b; c = d ! :] -> c
</pre>

<p>since it is equivalent (with the fact that "<code>c</code>" is at
  the same time the pattern of the last case and the expression of the
  parser case semantic action) to:</p>

<pre>
  parser [: a = b; s :] -> d s
</pre>

<p>The call to "<code>d s</code>" can be a tail recursive
  call. Without the use of the "exclamation mark" in the rule, the
  equivalent code is:</p>

<pre>
  parser [: a = b; s :] ->
    try d s with [ Stream.Failure -> raise (Stream.Error "") ]
</pre>

<p>which is not tail recursive (due to the "try..with" construction
  pushes a context), preventing the compiler to optimize its
  code. This can be important when many recursive calls happen, since
  it can overflow the OCaml stack.</p>

<h3>Position</h3>

<p>The optional "pattern" before and after a stream pattern is bound
  to the current stream count. Indeed, streams internally contain a
  count of their elements. At the beginning the count is zero. When an
  element is removed, the count is incremented. The example:</p>

<pre>
  parser [: a = b :] ep -> c
</pre>

<p>is equivalent to:</p>

<pre>
  parser [: a = b; s :] -> let ep = Stream.count s in c
</pre>

<p>There is no direct syntax equivalent to the optional pattern at
  beginning of the stream pattern:</p>

<pre>
  parser bp [: a = b :] -> c
</pre>

<p>These optional patterns allow disposal of the stream count at the
  beginning and at the end of the parser case, allowing to compute
  locations of the rule in the source. In particular, if the stream is
  a stream of characters, these counts are the source location in
  number of characters.</p>

<h3>Semantic action</h3>

<p>In a parser case, after the stream pattern, there is an "arrow" and
  an expression, called the "semantic action". If the parser case is
  matched the parser returns with the evaluated expression whose
  environment contains all values bound in the stream pattern.</p>

<h2>Remarks</h2>

<h3>Simplicity vs Associativity</h3>

<p>This parsing technology has the advantage of simplicity of use and
  understanding, but it does not treat the associativity of
  operators. For example, if you write a parser like this (to compute
  arithmetic expressions):</p>

<pre>
  value rec expr =
    parser
    [ [: e1 = expr; `'+'; e2 = expr :] -> e1 + e2
    | [: `('0'..'9' as c) :] -> Char.code c - Char.code '0' ]
</pre>

<p>this would loop endlessly, exactly as if you wrote code starting
  with:</p>

<pre>
  value rec expr e =
    let e1 = expr e in
    ...
</pre>

<p>One solution is to treat the associativity "by hand": by reading a
  sub-expression, then looping with a parser which parses the operator
  and another sub-expression, and so on.</p>

<p>An alternative solution is to write parsing "combinators". Indeed,
  parsers being normal functions, it is possible to make a function
  which takes a parser as parameter and returning a parser using
  it. For example, left and right associativity parsing
  combinators:</p>

<pre>
  value rec left_assoc op elem =
    let rec op_elem x =
      parser
      [ [: t = op; y = elem; r = op_elem (t x y) :] -> r
      | [: :] -> x ]
    in
    parser [: x = elem; r = op_elem x :] -> r
  ;

  value rec right_assoc op elem =
    let rec op_elem x =
      parser
      [ [: t = op; y = elem; r = op_elem y :] -> t x r
      | [: :] -> x ]
    in
    parser [: x = elem; r = op_elem x :] -> r
  ;
</pre>

<p>which can be used, e.g. like this:</p>

<pre>
  value expr =
    List.fold_right (fun op elem -> op elem)
      [left_assoc (parser [: `'+' :] -> fun x y -> x +. y);
       left_assoc (parser [: `'*' :] -> fun x y -> x *. y);
       right_assoc (parser [: `'^' :] -> fun x y -> x ** y)]
      (parser [: `('0'..'9' as c) :] -> float (Char.code c - Char.code '0'))
  ;
</pre>

<p>and tested, e.g. in the toplevel, like that:</p>

<pre>
  expr (Stream.of_string "2^3^2+1");
</pre>

<p>The same way, it is possible to parse non-context free grammars, by
  programming parsers returning other parsers.</p>

<p>A third solution, to resolve the problem of associativity, is to
  use the grammars of Camlp5, which have the other advantage that they
  are extensible.</p>

<h3>Lexing vs Parsing</h3>

<p>In general, while analyzing a language, there are two levels:</p>

<ul>
  <li>The level where the input, considered as a stream of characters,
    is read to make a stream of tokens (for example "words", if it is
    a human language, or punctuation). This level is generally called
    "lexing".</li>
  <li>The level where the input is a stream of tokens where grammar
    rules are parsed. This level is generally called "parsing".</li>
</ul>

<p>The "parser" construction described here can be used for both,
  thanks to the polymorphism of OCaml:</p>

<ul>
  <li>The lexing level is a "parser" of streams of characters
    returning tokens.</li>
  <li>The parsing level is a "parser" of streams of tokens returning
    syntax trees.</li>
</ul>

<p>By comparison, the programs "lex" and "yacc" use two different
  technologies. With "parser"s, it is possible to use the same one for
  both.</p>

<h3>Lexer syntax vs Parser syntax</h3>

<p>For "lexers", i.e. for the specific case of parsers when the input
  is a stream of characters, it is possible to use a shorter
  syntax. See the chapter on <a href="lexers.html">lexers</a>. They
  have another syntax, shorter and adapted for the specific type
  "<code>char</code>". But they still are internally parsers of
  streams with the same semantics.</p>

<h3>Purely functional parsers</h3>

<p>This system of parsers is imperative: while parsing, the stream
  advances and the already parsed terminals disappear from the stream
  structure. This is useful because it is not necessary to return the
  remaining stream together with the normal result. This is the reason
  there is this "<tt>Stream.Error</tt>" exception: when it is raised,
  it means that some terminals have been consummed from the stream,
  which are definitively lost, and therefore that are no more possible
  parser cases to try.</p>

<p>An alternative is to use <a href="fparsers.html">functional
  parsers</a> which use a new stream type, lazy but not
  destructive. Their advantage is that they use a limited backtrack:
  the case of "if..then..else.." and the shorter "if..then.." work
  without having to left factorize the parser cases, and there is no
  need to lookahead. They have no equivalent to the exception
  "<code>Stream.Error</code>": when all cases are tested, and have
  failed, the parsers return the value "<code>None</code>". The
  drawback is that, when a parsing error happens, it is not easily
  possible to know the location of the error in the input, as the
  initial stream has not been modified: the system would indicate a
  failure at the first character of the first line: this is a general
  drawback of backtracking parsers. See the solutions found to this
  problem in the chapter about <a href="fparsers.html">purely
  functional parsers</a>.</p>

<p>A second alternative is to use the
  <a href="bparsers.html">backtracking parsers</a>. They use the same
  stream type as the functional parsers, but they test more cases than
  them. They have the same advantages and drawbacks than the functional
  parsers.</p>

<div class="trailer">
</div>

</div>

</body>
</html>