[comment {-*- text -*-}]
[section {PE serialization format}]
Here we specify the format used by the Parser Tools to serialize
Parsing Expressions as immutable values for transport, comparison,
etc.
[para]
We distinguish between [term regular] and [term canonical]
serializations.
While a parsing expression may have more than one regular
serialization, exactly one of them is [term canonical].
[list_begin definitions][comment {-- serializations --}]
[def {Regular serialization}]
[list_begin definitions][comment {-- regular points --}]
[def [const {Atomic Parsing Expressions}]]
[list_begin enumerated][comment {-- atomic points --}]
[enum]
The string [const epsilon] is an atomic parsing expression. It matches
the empty string.
[enum]
The string [const dot] is an atomic parsing expression. It matches
any character.
[enum]
The string [const alnum] is an atomic parsing expression. It matches
any Unicode alphabet or digit character. This is a custom extension of
PEs based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const alpha] is an atomic parsing expression. It matches
any Unicode alphabet character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const ascii] is an atomic parsing expression. It matches
any Unicode character below U+0080. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const control] is an atomic parsing expression. It matches
any Unicode control character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].
[enum]
The string [const digit] is an atomic parsing expression. It matches
any Unicode digit character. Note that this includes characters
outside of the [lb]0..9[rb] range. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const graph] is an atomic parsing expression. It matches
any Unicode printing character, except for space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const lower] is an atomic parsing expression. It matches
any Unicode lower-case alphabet character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const print] is an atomic parsing expression. It matches
any Unicode printing character, including space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const punct] is an atomic parsing expression. It matches
any Unicode punctuation character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const space] is an atomic parsing expression. It matches
any Unicode space character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].
[enum]
The string [const upper] is an atomic parsing expression. It matches
any Unicode upper-case alphabet character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const wordchar] is an atomic parsing expression. It
matches any Unicode word character. This is any alphanumeric character
(see [const alnum]), and any connector punctuation characters (e.g.
underscore). This is a custom extension of PEs based on Tcl's builtin
command [cmd {string is}].
[enum]
The string [const xdigit] is an atomic parsing expression. It matches
any hexadecimal digit character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].
[enum]
The string [const ddigit] is an atomic parsing expression. It matches
any decimal digit character. This is a custom extension of PEs based
on Tcl's builtin command [cmd regexp].
[enum]
The expression
[lb]list t [var x][rb]
is an atomic parsing expression. It matches the terminal string [var x].
[enum]
The expression
[lb]list n [var A][rb]
is an atomic parsing expression. It matches the nonterminal [var A].
[list_end][comment {-- atomic points --}]
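[para]
As a minimal sketch, assuming nothing beyond Tcl's builtin
[cmd list] command, the atomic forms above can be written down
directly as Tcl values. The variable names are illustrative only and
not part of the format.
[example {
    # Named atomic expressions are plain strings.
    set empty  epsilon               ;# matches the empty string
    set letter alpha                 ;# matches any Unicode alphabet character

    # Terminals and nonterminals are two-element lists.
    set plus   [list t +]            ;# the terminal string "+"
    set ref    [list n Expression]   ;# the nonterminal Expression

    puts $plus   ;# prints: t +
    puts $ref    ;# prints: n Expression
}]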
[def [const {Combined Parsing Expressions}]]
[list_begin enumerated][comment {-- combined points --}]
[enum]
For parsing expressions [var e1], [var e2], ... the result of
[lb]list / [var e1] [var e2] ... [rb]
is a parsing expression as well.
This is the [term {ordered choice}], aka [term {prioritized choice}].
[enum]
For parsing expressions [var e1], [var e2], ... the result of
[lb]list x [var e1] [var e2] ... [rb]
is a parsing expression as well.
This is the [term {sequence}].
[enum]
For a parsing expression [var e] the result of
[lb]list * [var e][rb]
is a parsing expression as well.
This is the [term {Kleene closure}], describing zero or more
repetitions.
[enum]
For a parsing expression [var e] the result of
[lb]list + [var e][rb]
is a parsing expression as well.
This is the [term {positive Kleene closure}], describing one or more
repetitions.
[enum]
For a parsing expression [var e] the result of
[lb]list & [var e][rb]
is a parsing expression as well.
This is the [term {and lookahead predicate}].
[enum]
For a parsing expression [var e] the result of
[lb]list ! [var e][rb]
is a parsing expression as well.
This is the [term {not lookahead predicate}].
[enum]
For a parsing expression [var e] the result of
[lb]list ? [var e][rb]
is a parsing expression as well.
This is the [term {optional input}].
[list_end][comment {-- combined points --}]
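[para]
The operators above nest like any other Tcl list. The following sketch
builds an optionally signed number from smaller expressions, again
using only the builtin [cmd list] command. The expression itself and
the variable names are made up for illustration.
[example {
    # Optional sign: ordered choice between the terminals "+" and "-".
    set sign   [list ? [list / [list t +] [list t -]]]

    # One or more digits.
    set digits [list + digit]

    # Sequence of the two parts.
    set number [list x $sign $digits]

    puts $number   ;# prints: x {? {/ {t +} {t -}}} {+ digit}
}]
Building the value with [cmd list] at every level has the side effect
that the resulting string contains no superfluous whitespace.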
[list_end][comment {-- regular points --}]
[def {Canonical serialization}]
The canonical serialization of a parsing expression has the format
specified in the previous item and additionally satisfies the
constraints below, which make it unique among all possible
serializations of this parsing expression.
[list_begin enumerated][comment {-- canonical points --}]
[enum]
The string representation of the value is the canonical representation
of a pure Tcl list. I.e. it does not contain superfluous whitespace.
[enum]
Terminals are [emph not] encoded as ranges (where start and end of the
range are identical).
[comment {
Thinking about this I am not sure if that was a good move.
There are a lot more equivalent encodings around than just
the one I used above. Examples
{x {t a} {t b} {t c} {t d}}
{x {x {t a} {t b}} {x {t c} {t d}}}
{x {x {t a} {t b} {t c} {t d}}}
etc. Having the t/.. equivalence added it can now be argued
that we should handle these as well. Which essentially
amounts to a wholesale system to simplify parsing
expressions. This moves expression equality from intensional
to extensional, or as near as is possible.
The only counter-argument I have is that the t/.. equivalence
is restricted to leaves of the tree, or alternatively, to
terminal symbol operators.
}]
[list_end][comment {-- canonical points --}]
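[para]
The whitespace constraint can be approximated by rebuilding every list
level with Tcl's list commands. The helper [cmd canon] below is
hypothetical and not part of the Parser Tools API. It is a sketch that
covers only the whitespace constraint, not the one about ranges, and it
assumes its argument is already a valid regular serialization.
[example {
    # Hypothetical helper, not part of the Parser Tools API.
    # Rebuilds each list level so that no superfluous whitespace remains.
    proc canon {pe} {
        set op [lindex $pe 0]
        if {$op in {/ x * + & ! ?}} {
            # Operator: canonicalize each operand recursively.
            set out [list $op]
            foreach child [lrange $pe 1 end] {
                lappend out [canon $child]
            }
            return $out
        }
        # Atomic expression: re-list the words, keeping arguments as data.
        return [list {*}$pe]
    }

    set regular   "x   {t a}   {+  digit}"
    set canonical [canon $regular]

    puts $canonical                        ;# prints: x {t a} {+ digit}
    puts [expr {$regular eq $canonical}]   ;# prints: 0
}]
A plain string comparison between a serialization and its rebuilt form
then tells whether the input already met the whitespace constraint.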
[list_end][comment {-- serializations --}]
[para]
[subsection Example]
Consider the parsing expression on the right-hand side of the rule
[para]
[include ../example/expr_pe.inc]
[para]
Its canonical serialization (except for whitespace) is
[para]
[include ../example/expr_pe_serial.inc]
[para]
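Any serialization, including the one above, is an ordinary nested Tcl
list and can be taken apart with the core list commands. The sketch
below collects the nonterminals an expression refers to. The command
[cmd nonterminals] and the sample expression are hypothetical and serve
only as an illustration.
[example {
    # Hypothetical helper, not part of the Parser Tools API.
    # Returns the names of all nonterminals referenced by a serialized PE.
    proc nonterminals {pe} {
        set op [lindex $pe 0]
        if {$op eq "n"} {
            return [list [lindex $pe 1]]
        }
        set result {}
        if {$op in {/ x * + & ! ?}} {
            # Only operators have sub-expressions worth descending into.
            foreach child [lrange $pe 1 end] {
                lappend result {*}[nonterminals $child]
            }
        }
        return $result
    }

    # Made-up expression: a Head followed by zero or more ",Item" pairs.
    set pe {x {n Head} {* {x {t ,} {n Item}}}}
    puts [lsort -unique [nonterminals $pe]]   ;# prints: Head Item
}]
[para]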