[comment {-*- text -*-}]
[section {PE serialization format}]

Here we specify the format used by the Parser Tools to serialize
Parsing Expressions as immutable values for transport, comparison,
etc.

[para]

We distinguish between [term regular] and [term canonical]
serializations.

While a parsing expression may have more than one regular
serialization, exactly one of them is [term canonical].

[list_begin definitions][comment {-- serializations --}]
[def {Regular serialization}]

[list_begin definitions][comment {-- regular points --}]

[def [const {Atomic Parsing Expressions}]]
[list_begin enumerated][comment {-- atomic points --}]

[enum]
The string [const epsilon] is an atomic parsing expression. It matches
the empty string.

[enum]
The string [const dot] is an atomic parsing expression. It matches
any character.

[enum]
The string [const alnum] is an atomic parsing expression. It matches
any Unicode alphabetic or digit character. This is a custom extension
of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const alpha] is an atomic parsing expression. It matches
any Unicode alphabetic character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const ascii] is an atomic parsing expression. It matches
any Unicode character below U+0080. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const control] is an atomic parsing expression. It matches
any Unicode control character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].

[enum]
The string [const digit] is an atomic parsing expression. It matches
any Unicode digit character. Note that this includes characters
outside of the [lb]0..9[rb] range. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const graph] is an atomic parsing expression. It matches
any Unicode printing character, except for space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const lower] is an atomic parsing expression. It matches
any Unicode lower-case alphabetic character. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const print] is an atomic parsing expression. It matches
any Unicode printing character, including space. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const punct] is an atomic parsing expression. It matches
any Unicode punctuation character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const space] is an atomic parsing expression. It matches
any Unicode space character. This is a custom extension of PEs based
on Tcl's builtin command [cmd {string is}].

[enum]
The string [const upper] is an atomic parsing expression. It matches
any Unicode upper-case alphabetic character. This is a custom
extension of PEs based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const wordchar] is an atomic parsing expression. It
matches any Unicode word character, that is, any alphanumeric
character (see [const alnum]) or connector punctuation character
(e.g. underscore). This is a custom extension of PEs based on Tcl's
builtin command [cmd {string is}].

[enum]
The string [const xdigit] is an atomic parsing expression. It matches
any hexadecimal digit character. This is a custom extension of PEs
based on Tcl's builtin command [cmd {string is}].

[enum]
The string [const ddigit] is an atomic parsing expression. It matches
any decimal digit character. This is a custom extension of PEs based
on Tcl's builtin command [cmd regexp].

[enum]
The expression
    [lb]list t [var x][rb]
is an atomic parsing expression. It matches the terminal string [var x].

[enum]
The expression
    [lb]list n [var A][rb]
is an atomic parsing expression. It matches the nonterminal [var A].

[list_end][comment {-- atomic points --}]

[def [const {Combined Parsing Expressions}]]
[list_begin enumerated][comment {-- combined points --}]

[enum]
For parsing expressions [var e1], [var e2], ... the result of

    [lb]list / [var e1] [var e2] ... [rb]

is a parsing expression as well.

This is the [term {ordered choice}], aka [term {prioritized choice}].

[enum]
For parsing expressions [var e1], [var e2], ... the result of

    [lb]list x [var e1] [var e2] ... [rb]

is a parsing expression as well.

This is the [term {sequence}].

[enum]
For a parsing expression [var e] the result of

    [lb]list * [var e][rb]

is a parsing expression as well.

This is the [term {Kleene closure}], describing zero or more
repetitions.

[enum]
For a parsing expression [var e] the result of

    [lb]list + [var e][rb]

is a parsing expression as well.

This is the [term {positive Kleene closure}], describing one or more
repetitions.

[enum]
For a parsing expression [var e] the result of

    [lb]list & [var e][rb]

is a parsing expression as well.

This is the [term {and lookahead predicate}].

[enum]
For a parsing expression [var e] the result of

    [lb]list ! [var e][rb]

is a parsing expression as well.

This is the [term {not lookahead predicate}].


[enum]
For a parsing expression [var e] the result of

    [lb]list ? [var e][rb]

is a parsing expression as well.

This is the [term {optional input}].


[list_end][comment {-- combined points --}]
[list_end][comment {-- regular points --}]

[def {Canonical serialization}]

The canonical serialization of a parsing expression has the format
specified in the previous item and additionally satisfies the
constraints below, which make it unique among all the possible
serializations of this parsing expression.

[list_begin enumerated][comment {-- canonical points --}]
[enum]

The string representation of the value is the canonical representation
of a pure Tcl list. I.e. it does not contain superfluous whitespace.

[enum]

A terminal is [emph not] encoded as a range whose start and end are
identical.

[comment {
	 Thinking about this I am not sure if that was a good move.
	 There are a lot more equivalent encodings around than just
	 the one I used above. Examples

	 	 {x {t a} {t b} {t c} {t d}}
	 	 {x {x {t a} {t b}} {x {t c} {t d}}}
	 	 {x {x {t a} {t b} {t c} {t d}}}

	 etc. Having the t/.. equivalence added it can now be argued
	 that we should handle these as well. Which essentially
	 amounts to a wholesale system to simplify parsing
	 expressions. This moves expression equality from intensional
	 to extensional, or as near as is possible.

	 The only counter-argument I have is that the t/.. equivalence
	 is restricted to leaves of the tree, or alternatively, to
	 terminal symbol operators.
}]

[list_end][comment {-- canonical points --}]
[list_end][comment {-- serializations --}]
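[para]

The rules for regular serializations translate directly into a
recursive check. The following is a minimal sketch, assuming Tcl 8.5
or later. The procedure name pe_is_regular is hypothetical and not
part of the Parser Tools API. It accepts only the forms listed under
[const {Atomic Parsing Expressions}] and
[const {Combined Parsing Expressions}], and it assumes that ordered
choice and sequence take at least one operand.

[example {
```tcl
# Hypothetical helper, not part of Parser Tools: report whether a
# value is a regular PE serialization per the rules listed above.
proc pe_is_regular {pe} {
    # Atomic expressions that are written as plain strings.
    set atoms {
        epsilon dot alnum alpha ascii control digit graph lower
        print punct space upper wordchar xdigit ddigit
    }
    # The value must be a well-formed, non-empty Tcl list.
    if {[catch {llength $pe} n] || $n == 0} { return 0 }
    set rest [lassign $pe op]
    if {$n == 1} { return [expr {$op in $atoms}] }
    switch -exact -- $op {
        t - n { return [expr {$n == 2}] }
        * - + - & - ! - ? {
            return [expr {$n == 2 && [pe_is_regular [lindex $rest 0]]}]
        }
        / - x {
            foreach e $rest {
                if {![pe_is_regular $e]} { return 0 }
            }
            return 1
        }
    }
    return 0
}

puts [pe_is_regular {/ {t a} {x dot {n A}}}]   ;# 1
puts [pe_is_regular {* dot dot}]               ;# 0
```
}]

The second call fails because the Kleene closure takes exactly one
operand.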
[para]

[subsection Example]

Consider the parsing expression shown on the right-hand side of the
rule

[para]
[include ../example/expr_pe.inc]
[para]

Its canonical serialization (except for whitespace) is

[para]
[include ../example/expr_pe_serial.inc]
[para]
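The whitespace constraint of the canonical form can be illustrated
separately from the example above. The following is a minimal sketch,
assuming Tcl 8.5 or later and a regular serialization as input. The
procedure name pe_canon and the rule in the comment are hypothetical
and unrelated to the included files. The sketch rebuilds every level
of the expression with [cmd list], so that the result has Tcl's
canonical list formatting. It does not handle the constraint about
ranges.

[example {
```tcl
# Hypothetical helper, not part of Parser Tools: rebuild each level of
# a regular serialization with [list] to drop superfluous whitespace.
proc pe_canon {pe} {
    set rest [lassign $pe op]
    switch -exact -- $op {
        t - n { return [list $op [lindex $rest 0]] }
        / - x - * - + - & - ! - ? {
            set res [list $op]
            foreach e $rest { lappend res [pe_canon $e] }
            return $res
        }
    }
    # Anything else is taken to be an atomic string such as "dot".
    return $op
}

# Hypothetical rule:  Sum <- Number ('+' Number)*
# Its right-hand side, written by hand with irregular spacing.
set loose {x  {n Number}   { * {x {t  +} {n Number} } } }
puts [pe_canon $loose]
# x {n Number} {* {x {t +} {n Number}}}
```
}]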