1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
|
=================================================
Parsley Tutorial Part II: Parsing Structured Data
=================================================
Now that you are familiar with the basics of Parsley syntax, let's
look at a more realistic example: a JSON parser.
The JSON spec on http://json.org/ describes the format, and we can
adapt its description to a parser. We'll write the Parsley rules in
the same order as the grammar rules in the right sidebar on the JSON
site, starting with the top-level rule, 'object'.
::
object = ws '{' members:m ws '}' -> dict(m)
Parsley defines a builtin rule ``ws`` which consumes any spaces, tabs,
or newlines it can.
Since JSON objects are represented in Python as dicts, and ``dict``
takes a list of pairs, we need a rule to collect name/value pairs
inside an object expression.
::
members = (pair:first (ws ',' pair)*:rest -> [first] + rest)
| -> []
This handles the three cases for object contents: one, multiple, or
zero pairs. A name/value pair is separated by a colon. We use the
builtin rule ``spaces`` to consume any whitespace after the colon::
pair = ws string:k ws ':' value:v -> (k, v)
Arrays, similarly, are sequences of array elements, and are
represented as Python lists.
::
array = '[' elements:xs ws ']' -> xs
elements = (value:first (ws ',' value)*:rest -> [first] + rest) | -> []
Values can be any JSON expression.
::
value = ws (string | number | object | array
| 'true' -> True
| 'false' -> False
| 'null' -> None)
Strings are sequences of zero or more characters between double
quotes. Of course, we need to deal with escaped characters as
well. This rule introduces the operator ``~``, which does negative
lookahead; if the expression following it succeeds, its parse will
fail. If the expression fails, the rest of the parse continues. Either
way, no input will be consumed.
::
string = '"' (escapedChar | ~'"' anything)*:c '"' -> ''.join(c)
This is a common pattern, so let's examine it step by step. This will
match leading whitespace and then a double quote character. It then
matches zero or more characters. If it's not an ``escapedChar`` (which
will start with a backslash), we check to see if it's a double quote,
in which case we want to end the loop. If it's not a double quote, we
match it using the rule ``anything``, which accepts a single character
of any kind, and continue. Finally, we match the ending double quote
and return the characters in the string. We cannot use the ``<>``
syntax in this case because we don't want a literal slice of the input
-- we want escape sequences to be replaced with the character they
represent.
It's very common to use ``~`` for "match until" situations where you
want to keep parsing only until an end marker is found. Similarly,
``~~`` is positive lookahead: it succeed if its expression succeeds
but not consume any input.
The ``escapedChar`` rule should not be too surprising: we match a
backslash then whatever escape code is given.
::
escapedChar = '\\' (('"' -> '"') |('\\' -> '\\')
|('/' -> '/') |('b' -> '\b')
|('f' -> '\f') |('n' -> '\n')
|('r' -> '\r') |('t' -> '\t')
|('\'' -> '\'') | escapedUnicode)
Unicode escapes (of the form ``\u2603``) require matching four hex
digits, so we use the repetition operator ``{}``, which works like +
or * except taking either a ``{min, max}`` pair or simply a
``{number}`` indicating the exact number of repetitions.
::
hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))
With strings out of the way, we advance to numbers, both integer and
floating-point.
::
number = spaces ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
| -> int(sign + ds)))
Here we vary from the json.org description a little and move sign
handling up into the ``number`` rule. We match either an ``intPart``
followed by a ``floatPart`` or just an ``intPart`` by itself.
::
digit = :x ?(x in '0123456789') -> x
digits = <digit*>
digit1_9 = :x ?(x in '123456789') -> x
intPart = (digit1_9:first digits:rest -> first + rest) | digit
floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
-> float(sign + ds + tail)
exponent = ('e' | 'E') ('+' | '-')? digits
In JSON, multi-digit numbers cannot start with 0 (since that is
Javascript's syntax for octal numbers), so ``intPart`` uses ``digit1_9``
to exclude it in the first position.
The ``floatPart`` rule takes two parameters, ``sign`` and ``ds``. Our
``number`` rule passes values for these when it invokes ``floatPart``,
letting us avoid duplication of work within the rule. Note that
pattern matching on arguments to rules works the same as on the string
input to the parser. In this case, we provide no pattern, just a name:
``:ds`` is the same as ``anything:ds``.
(Also note that our float rule cheats a little: it does not really
parse floating-point numbers, it merely recognizes them and passes
them to Python's ``float`` builtin to actually produce the value.)
The full version of this parser and its test cases can be found in the
``examples`` directory in the Parsley distribution.
|