File: PREPROCESSOR

pxp 1.1.96-8
******************************************************************************
The Preprocessor for PXP
******************************************************************************


==============================================================================
The Preprocessor for PXP
==============================================================================

Since PXP-1.1.95, there is a preprocessor as part of the PXP distribution. It 
allows you to compose XML trees and event lists dynamically, which is very 
handy for writing XML transformations.

To enable the preprocessor, compile your source files as in: 

ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...

The package pxp-pp contains the preprocessor. The -syntax option enables 
camlp4, on which the preprocessor is based. It is also possible to use it 
together with the revised syntax; use "-syntax camlp4r" in this case.

Important: Up to version 1.0.4, findlib (ocamlfind) has a problem with the 
definition for pxp-pp. There is an easy workaround: Use "-syntax camlp4o,byte".

In the toploop, type 

ocaml
# #use "topfind";;
# #camlp4o;;
# #require "pxp-pp";;
# #require "pxp";;



The preprocessor defines the following new syntax notations, explained below in 
detail: 

<:pxp_charset< CHARSET_DECL >>
<:pxp_tree< EXPR >>
<:pxp_vtree< EXPR >>
<:pxp_evlist< EXPR >>
<:pxp_evpull< EXPR >>
<:pxp_text< TEXT >>

The basic notation is "pxp_tree" which creates a tree of PXP document nodes as 
described in EXPR. "pxp_vtree" is the variant where the tree is immediately 
validated. "pxp_evlist" creates a list of PXP events instead of nodes, useful 
together with the event-based parser. "pxp_evpull" is a variation of the 
latter: Instead of an event list an event generator is created that works like 
a pull parser.

The "pxp_charset" notation only configures the character sets to assume. 
Finally, "pxp_text" is a notation for string literals.

------------------------------------------------------------------------------
Creating constant XML
------------------------------------------------------------------------------

The following examples are all written for "pxp_tree". You can also use one of 
the other XML composers instead, but see the notes below.

In order to use "pxp_tree", you must define two variables in the environment: 
"spec" and "dtd": 

let spec = Pxp_tree_parser.default_spec;;
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

These variables occur in the code generated by the preprocessor. The "dtd" 
variable is the DTD object. Note that you need it even in well-formedness mode 
(validation turned off). The "spec" variable controls which classes are 
instantiated as node representation (see PXP manual).

Now you can create XML trees as in 

let book = 
  <:pxp_tree< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

As you can see, the syntax is somewhat XML-related but not really XML. (Many 
ideas are borrowed from CDUCE, by the way.) In particular, there are start tags 
like <title> but no end tags. Instead, we are using square brackets to denote 
the children of an XML element. Furthermore, character data must be put into 
double quotes.

You may ask why the well-known XML syntax has been modified for this 
preprocessor. There are many reasons, and they will become clearer in the 
following explanations. For now, you can see the advantage that the syntax is 
less verbose, as you need not repeat the element names in end tags. 
Furthermore, you can exactly control which characters are part of the data 
nodes without having to make compromises with indentation.
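
To make the correspondence with conventional XML concrete, here is a minimal 
sketch in plain O'Caml that does not use PXP at all. The "node" type and the 
"to_xml" function are hypothetical stand-ins invented for this illustration 
(PXP's real nodes are objects created via "spec" and "dtd"); the sketch only 
shows how a bracketed child list maps to nested start and end tags.

```ocaml
(* Toy model, NOT the PXP API: an element has a name, attributes, children. *)
type node =
  | Element of string * (string * string) list * node list
  | Data of string

(* Render a toy node in conventional XML syntax (no escaping, for brevity). *)
let rec to_xml = function
  | Data s -> s
  | Element (name, atts, children) ->
      let att_str =
        String.concat ""
          (List.map (fun (n, v) -> Printf.sprintf " %s=\"%s\"" n v) atts)
      in
      let body = String.concat "" (List.map to_xml children) in
      Printf.sprintf "<%s%s>%s</%s>" name att_str body name

(* Counterpart of: <book>[ <title>[ "..." ] <author>[ "..." ] ] *)
let book =
  Element ("book", [],
    [ Element ("title", [], [ Data "The Lord of The Rings" ]);
      Element ("author", [], [ Data "J.R.R. Tolkien" ]) ])

let () = print_endline (to_xml book)
```

Because the data strings are written as explicit literals, no whitespace from 
the source layout ends up in the rendered text.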

Attributes are written as in XML: 

let book = 
  <:pxp_tree< 
    <book id="BOOK_001">
      [ <title lang="en">[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>



An element without children can be written 

<element>[]

or slightly shorter: 

<element/>



You can also create processing instructions and comment nodes: 

let list =
  <:pxp_tree<
    <list>
      [ <!>"Now the list of books follows!"
        <?>"formatter_directive" "one book per page"
        book
      ]
  >>

The notation "<!>" creates a comment node with the following string as 
contents. The notation "<?>" needs two strings, first the target, then the 
value (here, this results in "<?formatter_directive one book per page?>"). 

Look again at the last example: The O'Caml variable "book" occurs, and it 
inserts its tree into the list of books. Identifiers without "decoration" just 
refer to O'Caml variables. We will see more examples below.

The preprocessor syntax knows a number of shortcuts and variations. First, you 
can omit the square brackets when an element has exactly one child: 

<element><child>"Data inside child"

This is the same as 

<element>[ <child>[ "Data inside child" ] ]

Second, you are already used to a common abbreviation: Strings are 
automatically converted to data nodes. The "expanded" syntax is 

<*>"Data string"

where "<*>" denotes a data node, and the following string is used as contents. 
Usually, you can omit "<*>". However, there are a few occasions where this 
notation is still useful, see below.

In strings, the usual entity references can be used: "Double quotes: &quot;". 
For a newline character, write &#10;.

The preprocessor knows two operators: "^" concatenates strings, and "@" 
concatenates lists. Examples: 

<element>[ "Word1" ^ "Word2" ]
<element>([ <a/> ] @ [ <b/> ])



Parentheses can be used to clarify precedence. For example: 

<element>(l1 @ l2)

Here, the concatenation operator "@" could also be parsed as 

(<element> l1) @ l2

Parentheses may be used in every expression.

Rarely used, there is also a notation for the "super root" nodes (see the PXP 
manual for their meaning): 

<^>[ <element> ... ]



------------------------------------------------------------------------------
Dynamic XML
------------------------------------------------------------------------------

Let us begin with an example. The task is to convert O'Caml values of type 

type book = 
  { title : string;
    author : string;
    isbn : string;
  }

to XML trees like 

<book id="BOOK_'isbn'">
  <title>'title'</title>
  <author>'author'</author>
</book>

(conventional syntax). When b is the book variable, the solution is 

let book = 
  let title = b.title
  and author = b.author
  and isbn = b.isbn in
  <:pxp_tree<
    <book id=("BOOK_" ^ isbn)>
      [ <title><*>title
        <author><*>author
      ]
  >>

First, we bind the simple O'Caml variables "title", "author", and "isbn". The 
reason is that the preprocessor syntax does not allow expressions like 
"b.title" directly in the XML tree (but see below for a better workaround).

The XML tree contains the O'Caml variables. The "id" attribute is a 
concatenation of the fixed prefix "BOOK_" and the contents of "isbn". The 
"title" and "author" elements each contain a data node whose contents are the 
O'Caml strings "title" and "author", respectively.

Why "<*>"? If we just wrote "<title>title", the generated code would assume 
that the "title" variable is an XML node, and not a string. From this point of 
view, "<*>" works like a type annotation, as it specialises the type of the 
following expression.

Here is an alternate solution: 

let book = 
  <:pxp_tree<
    <book id=("BOOK_" ^ (: b.isbn :))>
      [ <title><*>(: b.title :)
        <author><*>(: b.author :)
      ]
  >>

The notation "(: ... :)" allows you to include arbitrary O'Caml expressions 
into the tree. In this solution it is no longer necessary to create artificial 
O'Caml variables for the only purpose of injecting values into trees.  

It is possible to create XML elements with dynamic names: Just put parentheses 
around the expression. Example: 

let name = "book" in
<:pxp_tree< <(name)> ... >>

With the same notation, one can also set attribute names dynamically: 

let att_name = "id" in
<:pxp_tree< <book (att_name)=...> ... >>

Finally, it is also possible to include complete attribute lists dynamically: 

let att_list = [ "id", ("BOOK_" ^ b.isbn) ] in
<:pxp_tree< <book (: att_list :) > ... >>



Typing: Depending on where a variable or O'Caml expression occurs, different 
types are assumed. Compare the following examples: 

<:pxp_tree< <element>x1 >>
<:pxp_tree< <element>[x2] >>
<:pxp_tree< <element><*>x3 >>

As a rule of thumb, the most general type is assumed that would make sense at a 
certain location. As x1 could be replaced by a list of children, its type is 
assumed to be a node list. As x2 could be replaced by a single node, its type 
is assumed to be a node. And x3 is a string, as in the case discussed above. 
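
The typing rule can be sketched with a toy model in plain O'Caml. The "node" 
type and the three helper functions below are hypothetical and exist only for 
this illustration; they are not what the preprocessor generates. Each helper 
mirrors one of the three notations and makes the assumed type of the variable 
explicit in its signature.

```ocaml
(* Toy model, NOT the PXP API. *)
type node =
  | Element of string * node list
  | Data of string

(* <element>x1 : x1 stands where a whole child list could be written. *)
let with_children (x1 : node list) : node = Element ("element", x1)

(* <element>[x2] : x2 is one item inside the child list. *)
let with_child (x2 : node) : node = Element ("element", [ x2 ])

(* <element><*>x3 : "<*>" marks x3 as the string contents of a data node. *)
let with_text (x3 : string) : node = Element ("element", [ Data x3 ])
```

Passing a single node where a node list is expected is rejected by the O'Caml 
type checker in this model, which is the kind of error the rule of thumb helps 
to anticipate.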

------------------------------------------------------------------------------
Character Encodings
------------------------------------------------------------------------------

As the preprocessor generates code that builds XML trees, it must know two 
character encodings:

-  Which encoding is used in the source code (in the .ml file) 
   
-  Which encoding is used in the XML representation, i.e. in the O'Caml values 
   representing the XML trees
   
Both encodings can be set independently. The syntax is: 

<:pxp_charset< source="ENC" representation="ENC" >>

The default is ISO-8859-1 for both encodings. For example, to set the 
representation encoding to UTF-8, use: 

<:pxp_charset< representation="UTF-8" >>

The "pxp_charset" notation is a constant expression that always evaluates to 
"()". (A requirement by camlp4 that looks artificial.) 

When you set the representation encoding, it is required that the encoding 
stored in the DTD object is the same. Remember that we need a DTD object like 

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;

Of course, we must change this to the representation encoding, too, in our 
example: 

let dtd = Pxp_dtd.create_dtd `Enc_utf8;;

The preprocessor cannot check this at compile time, and for performance 
reasons, a runtime check is not generated. So it is up to the programmer to 
ensure that the character encodings are used in a consistent way. 
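
To see why the two encodings matter, consider how the same character is stored 
in each. The following sketch is plain O'Caml and independent of PXP; the 
function "latin1_to_utf8" is a hypothetical helper written for this 
illustration. It only shows the byte-level difference between ISO-8859-1 and 
UTF-8, and it does not claim to be how the preprocessor converts literals.

```ocaml
(* Convert an ISO-8859-1 string to UTF-8. Every ISO-8859-1 byte is a code
   point below 256, so it needs at most two bytes in UTF-8. *)
let latin1_to_utf8 (s : string) : string =
  let b = Buffer.create (String.length s * 2) in
  String.iter
    (fun c ->
      let code = Char.code c in
      if code < 0x80 then Buffer.add_char b c
      else begin
        (* Two-byte sequence: 110xxxxx 10xxxxxx *)
        Buffer.add_char b (Char.chr (0xC0 lor (code lsr 6)));
        Buffer.add_char b (Char.chr (0x80 lor (code land 0x3F)))
      end)
    s;
  Buffer.contents b

(* "caf\xE9" is "cafe" with an acute accent, written in ISO-8859-1. *)
let utf8_cafe = latin1_to_utf8 "caf\xE9"
```

A string stored in one encoding but interpreted in the other yields wrong 
characters, which is why the DTD object and the representation encoding have 
to agree.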

------------------------------------------------------------------------------
Validated Trees
------------------------------------------------------------------------------

In order to validate trees, you need a filled DTD object. In principle, you can 
create this object by a number of methods. For example, you can parse an 
external file: 

let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_file "sample.dtd")

It is, however, often more convenient to include the DTD literally into the 
program. This works by 

let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string "...")

As the double quotes are often used inside DTDs, O'Caml string literals are a 
bit impractical, as they are also delimited by double quotes, and one needs to 
add backslashes as escape characters. The "pxp_text" notation is often more 
readable here: <:pxp_text<STRING>> is just another way of writing "STRING". In 
our DTD, we have 

let dtd_text =
  <:pxp_text<
    <!ELEMENT book (title,author)>
    <!ATTLIST book id CDATA #REQUIRED>
    <!ELEMENT title (#PCDATA)>
    <!ATTLIST title lang CDATA "en">
    <!ELEMENT author (#PCDATA)>
  >>;;
let config = default_config;;
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;

Note that "pxp_text" is not restricted to DTDs, as it can be used for any kind 
of string.

After we have the DTD, we can validate the trees. One option is to call the 
"validate" function: 

let book = 
  <:pxp_tree< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>;;
Pxp_document.validate book;;

(This example is invalid, as the "id" attribute is missing.)

Note that it is a misunderstanding that "pxp_tree" builds XML trees in 
well-formed mode. You can create any tree with it, and the fact is that 
"pxp_tree" just does not invoke the validator. So if the DTD enforces 
validation, the tree is validated when the "validate" function is called. If 
the DTD is in well-formedness mode, the tree is effectively not validated, even 
when the "validate" function is invoked. Btw, the following statements would 
create a DTD in well-formedness mode: 

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # allow_arbitrary;;

As an alternative to calling the "validate" function, one can also use 
"pxp_vtree" instead. It immediately validates every XML element it creates. 
However, "injected" subtrees are not validated, i.e. validation does not 
proceed recursively to subnodes as the "validate" function does.

------------------------------------------------------------------------------
Generating Events
------------------------------------------------------------------------------

As PXP also has an event model to represent XML, the preprocessor can also 
produce such events. In particular, there are two modes: The "pxp_evlist" 
notation outputs lists of events (type "event list") representing the XML 
expression. The "pxp_evpull" notation creates an automaton from which one can 
"pull" events (like from a pull parser).

These two notations work very much like "pxp_tree". For example, 

let book = 
  <:pxp_evlist< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

generates 

[ E_start_tag ("book", [], None, <obj>);
  E_start_tag ("title", [], None, <obj>);
  E_char_data "The Lord of The Rings"; 
  E_end_tag ("title", <obj>);
  E_start_tag ("author", [], None, <obj>); 
  E_char_data "J.R.R. Tolkien";
  E_end_tag ("author", <obj>); 
  E_end_tag ("book", <obj>)
]

Note that you need neither a "dtd" variable nor a "spec" variable. There is one 
important difference, however: Both nodes and lists of nodes are represented by 
the same type, "event list". As a consequence, x1 and x2 in the following 
example have the same type "event list": 

<:pxp_evlist< <element>x1 >>
<:pxp_evlist< <element>[x2] >>
<:pxp_evlist< <element><*>x3 >>

In principle, it could be checked at runtime whether x1 and x2 have the right 
structure. However, this is not done for performance reasons.
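
The reason why nodes and node lists collapse into one type can be sketched in 
plain O'Caml. The "node" and "event" types below are simplified, hypothetical 
stand-ins (the real PXP events carry more arguments, as the output above 
shows). Flattening a single node and flattening a list of nodes both produce 
an event list, so the distinction is no longer visible in the type.

```ocaml
(* Toy model, NOT the PXP types. *)
type node = Element of string * node list | Data of string
type event = Start_tag of string | Char_data of string | End_tag of string

(* One node becomes a list of events ... *)
let rec events_of_node = function
  | Data s -> [ Char_data s ]
  | Element (name, children) ->
      [ Start_tag name ] @ events_of_nodes children @ [ End_tag name ]

(* ... and so does a list of nodes: the results are simply concatenated. *)
and events_of_nodes nodes = List.concat (List.map events_of_node nodes)

let book_events =
  events_of_node
    (Element ("book",
       [ Element ("title", [ Data "The Lord of The Rings" ]);
         Element ("author", [ Data "J.R.R. Tolkien" ]) ]))
```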

As mentioned, "pxp_evpull" works like a pull parser. After defining 

let book = 
  <:pxp_evpull< 
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>

"book" is a function 'a -> event option. One can call it to get the events one 
after the other: 

let e1 = book();;       (* = Some(E_start_tag ("book", [], None, <obj>)) *)
let e2 = book();;       (* = Some(E_start_tag ("title", [], None, <obj>)) *)
...

After the last event, "book" returns None to indicate the end of the event 
stream.

As with "pxp_evlist", it is not possible to distinguish between nodes and node 
lists. In this example, both x1 and x2 are assumed to have type 
'a -> event option: 

<:pxp_evpull< <element>x1 >>
<:pxp_evpull< <element>[x2] >>
<:pxp_evpull< <element><*>x3 >>

Note that "<element>x1" actually means to build a new pull automaton around the 
existing pull automaton x1: The children of "element" are retrieved by pulling 
events from x1 until "None" is returned.
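
The idea of building one pull automaton around another can be sketched in 
plain O'Caml. Everything below is a hypothetical toy model: the "event" type is 
simplified, and "pull_of_list" and "wrap_element" are names invented for this 
illustration, not functions of PXP or of the generated code.

```ocaml
(* Toy model, NOT the PXP types. *)
type event = Start_tag of string | Char_data of string | End_tag of string
type phase = Before | Inside | After

(* Turn a list into a pull function: each call returns the next event. *)
let pull_of_list (evs : event list) : unit -> event option =
  let rest = ref evs in
  fun () ->
    match !rest with
    | [] -> None
    | e :: tl -> rest := tl; Some e

(* Wrap an existing pull function: emit the start tag, then forward the
   inner events until the inner function returns None, then the end tag. *)
let wrap_element (name : string) (inner : unit -> event option)
    : unit -> event option =
  let state = ref Before in
  fun () ->
    match !state with
    | Before -> state := Inside; Some (Start_tag name)
    | Inside ->
        (match inner () with
         | Some e -> Some e
         | None -> state := After; Some (End_tag name))
    | After -> None

let x1 = pull_of_list [ Char_data "child data" ]
let element = wrap_element "element" x1
```

Calling "element ()" repeatedly yields the start tag, the events of x1, the end 
tag, and then None.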

A consequence of the pull semantics is that once an event is obtained from an 
automaton, the state of the automaton is modified such that it is not possible 
to get the same event again. If you need an automaton that can be reset to the 
beginning, just wrap the "pxp_evpull" notation into a functional abstraction: 

let book_maker() =
  <:pxp_evpull< <book ...> ... >>;;
let book1 = book_maker();;
let book2 = book_maker();;

This way, "book1" and "book2" are independent event streams.

There is another implication of the nature of the automatons: Subexpressions 
are lazily evaluated. For example, in 

<:pxp_evpull< <element>[ <*> (: get_data_contents() :) ] >>

the call of get_data_contents is performed just before the event for the data 
node is constructed.
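
Both the lazy evaluation and the "maker" idiom can be observed in a small toy 
model in plain O'Caml. The "event" type, the call counter, and "make_stream" 
are hypothetical and written only for this illustration; the code actually 
generated by the preprocessor may be organised differently.

```ocaml
(* Toy model, NOT the PXP types. *)
type event = Start_tag of string | Char_data of string | End_tag of string

(* Count how often the data is computed, to make laziness observable. *)
let calls = ref 0
let get_data_contents () = incr calls; "payload"

(* Each step is a thunk, so its body runs only when the event is pulled.
   Every call of make_stream creates a fresh, independent stream. *)
let make_stream () : unit -> event option =
  let steps =
    ref [ (fun () -> Start_tag "element");
          (fun () -> Char_data (get_data_contents ()));
          (fun () -> End_tag "element") ]
  in
  fun () ->
    match !steps with
    | [] -> None
    | step :: tl -> steps := tl; Some (step ())
```

In this model, "get_data_contents" is not called when the stream is created, 
only when the second event is pulled, and two streams obtained from 
"make_stream" do not share state.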

------------------------------------------------------------------------------
Namespaces
------------------------------------------------------------------------------

By default, the preprocessor does not generate nodes or events that support 
namespaces. It can, however, be configured to create namespace-aware XML 
aggregations.  

In any case, you need a namespace manager. This is an object that tracks the 
usage of namespace prefixes in XML nodes. For example, we can create a 
namespace manager that knows the "html" prefix: 

let mng = new Pxp_dtd.namespace_manager;;
mng # add_namespace "html" "http://www.w3.org/1999/xhtml";;

Here, we declare that we want to use the "html" prefix for the internal 
representation of the XML nodes. This kind of prefix is called normalized 
prefix, or normprefix for short. It is possible to configure different prefixes 
for the external representation, i.e. when the XML tree is printed to a file. 
This other kind of prefix is called display prefix. We will have a look at them 
later.

Next, we must tell the DTD object that we have a namespace manager: 

let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # set_namespace_manager mng;;



For "pxp_evlist" and "pxp_evpull" we are now prepared (note that we now need a 
"dtd" variable, as the DTD object knows the namespace manager). For "pxp_tree" 
and "pxp_vtree", it is required to use a namespace-aware specification: 

let spec = Pxp_tree_parser.default_namespace_spec 

(Normal specifications do not work; you would get "Namespace method not 
applicable" errors if you tried to use them.)

The special notation "<:autoscope>" enables namespace mode in this example: 

let list =
  <:pxp_tree<
    <:autoscope>
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>

In particular, "<:autoscope>" defines a new O'Caml variable for its 
subexpression: "scope". This variable contains the namespace scope object, 
which contains the namespace declarations for the subexpression. "<:autoscope>" 
initialises this variable from the namespace manager so that it now contains 
a declaration for the "html" prefix.

In general, the namespace scope object contains the prefixes to use for the 
external representation. For this simple example, we have chosen to use the 
same prefixes as for the internal representation, and "<:autoscope>" performs 
the right initialisations for this.

Print the tree by 

list # display (`Out_channel stdout) `Enc_iso88591

The point is to call the "display" method and not the "write" method. The 
latter would not respect the display prefixes.  

Alternatively, we can also create the "scope" variable manually: 

let scope = Pxp_dtd.create_namespace_scope
              ~decl:[ "", "http://www.w3.org/1999/xhtml" ]
              mng;;
let list =
  <:pxp_tree<
    <:scope>
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>

Note that we now use "<:scope>". In this simple form, this construct just 
enables namespace mode, and takes the "scope" variable from the environment.

Furthermore, the namespace scope now contains a different namespace 
declaration: The display prefix "" is used for HTML. The empty prefix just 
means to declare a default namespace (by xmlns="URI"). The effect can be seen 
when the XML tree is printed by calling the "display" method.
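
The relation between normalized prefixes and display prefixes can be sketched 
with a toy model in plain O'Caml. The association lists and the functions 
"split_name" and "display_name" are hypothetical and only illustrate the idea; 
the real namespace manager and scope are PXP objects, and the "display" method 
also emits the necessary xmlns declarations, which this sketch omits.

```ocaml
let xhtml = "http://www.w3.org/1999/xhtml"

(* normprefix -> URI, as a namespace manager would record it *)
let toy_manager = [ ("html", xhtml) ]

(* display prefix -> URI, as a scope would declare it; "" is the default *)
let toy_scope = [ ("", xhtml) ]

(* Split "prefix:local" into its parts; no colon means an empty prefix. *)
let split_name (name : string) : string * string =
  match String.index_opt name ':' with
  | None -> ("", name)
  | Some i ->
      (String.sub name 0 i,
       String.sub name (i + 1) (String.length name - i - 1))

(* Map an internal name to the name shown in the external representation. *)
let display_name manager scope (name : string) : string =
  let normprefix, local = split_name name in
  match List.assoc_opt normprefix manager with
  | None -> name
  | Some uri ->
      (match List.find_opt (fun (_, u) -> u = uri) scope with
       | None -> name
       | Some ("", _) -> local
       | Some (p, _) -> p ^ ":" ^ local)
```

With the scope above, the internal name "html:ul" is shown as plain "ul"; with 
a scope that declares the display prefix "h" for the same URI, it would be 
shown as "h:ul".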

Here is a third variant of the same example: 

let scope = Pxp_dtd.create_namespace_scope mng ;;
let list =
  <:pxp_tree<
    <:scope ("")="http://www.w3.org/1999/xhtml">
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>

The "scope" is now initially empty. The "<:scope>" notation is used to extend 
the scope for the time the subexpression is evaluated.

There is also a notation "<:emptyscope>" that creates an empty scope object, so 
one could even write 

let list =
  <:pxp_tree<
    <:emptyscope>
      <:scope ("")="http://www.w3.org/1999/xhtml">
        <html:ul>
          [ <html:li>"Item1"
            <html:li>"Item2"
          ]
  >>



It is recommended to create the "scope" variable manually with a reasonable 
initial declaration, and to use "<:scope>" to enable namespace processing, and 
to extend the scope when necessary. The advantage of this approach is that the 
same scope object can be shared by many XML nodes, so you need less memory.

One tip: To get a namespace scope that is initialised with all prefixes of the 
namespace manager (as <:autoscope> does), define 

let scope = create_namespace_scope ~decl: mng#as_declaration mng



For event-based processing of XML, the namespace mode works in the same way as 
described here; there is no difference.