******************************************************************************
The Preprocessor for PXP
******************************************************************************
==============================================================================
The Preprocessor for PXP
==============================================================================
Since PXP-1.1.95, the PXP distribution includes a preprocessor. It allows you
to compose XML trees and event lists dynamically, which is very handy for
writing XML transformations.
To enable the preprocessor, compile your source files as in:
ocamlfind ocamlc -syntax camlp4o -package pxp-pp,... ...
The package pxp-pp contains the preprocessor. The -syntax option enables
camlp4, on which the preprocessor is based. It is also possible to use it
together with the revised syntax; use "-syntax camlp4r" in this case.
Important: Up to version 1.0.4, findlib (ocamlfind) has a problem with the
definition for pxp-pp. There is an easy workaround: Use "-syntax camlp4o,byte".
In the toploop, type
ocaml
# #use "topfind";;
# #camlp4o;;
# #require "pxp-pp";;
# #require "pxp";;
The preprocessor defines the following new syntax notations, explained below in
detail:
<:pxp_charset< CHARSET_DECL >>
<:pxp_tree< EXPR >>
<:pxp_vtree< EXPR >>
<:pxp_evlist< EXPR >>
<:pxp_evpull< EXPR >>
<:pxp_text< TEXT >>
The basic notation is "pxp_tree" which creates a tree of PXP document nodes as
described in EXPR. "pxp_vtree" is the variant where the tree is immediately
validated. "pxp_evlist" creates a list of PXP events instead of nodes, useful
together with the event-based parser. "pxp_evpull" is a variation of the
latter: Instead of an event list an event generator is created that works like
a pull parser.
The "pxp_charset" notation only configures the character sets to assume.
Finally, "pxp_text" is a notation for string literals.
------------------------------------------------------------------------------
Creating constant XML
------------------------------------------------------------------------------
The following examples are all written for "pxp_tree". You can also use one of
the other XML composers instead, but see the notes below.
In order to use "pxp_tree", you must define two variables in the environment:
"spec" and "dtd":
let spec = Pxp_tree_parser.default_spec;;
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
These variables occur in the code generated by the preprocessor. The "dtd"
variable is the DTD object. Note that you need it even in well-formedness mode
(validation turned off). The "spec" variable controls which classes are
instantiated as node representation (see PXP manual).
Now you can create XML trees as in
let book =
  <:pxp_tree<
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>
As you can see, the syntax is somewhat XML-related but not really XML. (Many
ideas are borrowed from CDUCE, by the way.) In particular, there are start tags
like <title> but no end tags. Instead, we use square brackets to denote the
children of an XML element. Furthermore, character data must be put into
double quotes.
You may ask why the well-known XML syntax has been modified for this
preprocessor. There are many reasons, and they will become clearer in the
following explanations. For now, you can see the advantage that the syntax is
less verbose, as you need not repeat the element names in end tags.
Furthermore, you can control exactly which characters are part of the data
nodes without having to make compromises with indentation.
Attributes are written as in XML:
let book =
  <:pxp_tree<
    <book id="BOOK_001">
      [ <title lang="en">[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>
An element without children can be written
<element>[]
or slightly shorter:
<element/>
You can also create processing instructions and comment nodes:
let list =
  <:pxp_tree<
    <list>
      [ <!>"Now the list of books follows!"
        <?>"formatter_directive" "one book per page"
        book
      ]
  >>
The notation "<!>" creates a comment node with the following string as
contents. The notation "<?>" needs two strings, first the target, then the
value (here, this results in "<?formatter_directive one book per page?>").
Look again at the last example: The O'Caml variable "book" occurs, and it
inserts its tree into the list of books. Identifiers without "decoration" just
refer to O'Caml variables. We will see more examples below.
The preprocessor syntax knows a number of shortcuts and variations. First, you
can omit the square brackets when an element has exactly one child:
<element><child>"Data inside child"
This is the same as
<element>[ <child>[ "Data inside child" ] ]
Second, you are already used to a common abbreviation: Strings are
automatically converted to data nodes. The "expanded" syntax is
<*>"Data string"
where "<*>" denotes a data node, and the following string is used as contents.
Usually, you can omit "<*>". However, there are a few occasions where this
notation is still useful, see below.
In strings, the usual entity references can be used: "Double quotes: &quot;".
For a newline character, write &#10;.
The preprocessor knows two operators: "^" concatenates strings, and "@"
concatenates lists. Examples:
<element>[ "Word1" ^ "Word2" ]
<element>([ <a/> ] @ [ <b/> ])
Parentheses can be used to clarify precedence. For example:
<element>(l1 @ l2)
Here, the concatenation operator "@" could also be parsed as
(<element> l1) @ l2
Parentheses may be used in every expression.
Rarely used, there is also a notation for the "super root" nodes (see the PXP
manual for their meaning):
<^>[ <element> ... ]
------------------------------------------------------------------------------
Dynamic XML
------------------------------------------------------------------------------
Let us begin with an example. The task is to convert O'Caml values of type
type book =
  { title : string;
    author : string;
    isbn : string;
  }
to XML trees like
<book id="BOOK_'isbn'">
  <title>'title'</title>
  <author>'author'</author>
</book>
(conventional syntax). When b is the book variable, the solution is
let book =
  let title = b.title
  and author = b.author
  and isbn = b.isbn in
  <:pxp_tree<
    <book id=("BOOK_" ^ isbn)>
      [ <title><*>title
        <author><*>author
      ]
  >>
First, we bind the simple O'Caml variables "title", "author", and "isbn". The
reason is that the preprocessor syntax does not allow expressions like
"b.title" directly in the XML tree (but see below for a better workaround).
The XML tree contains the O'Caml variables. The "id" attribute is a
concatenation of the fixed prefix "BOOK_" and the contents of "isbn". The
"title" and "author" elements contain a data node whose contents are the O'Caml
strings "title", and "author", respectively.
Why "<*>"? If we just wrote "<title>title", the generated code would assume
that the "title" variable is an XML node, and not a string. From this point of
view, "<*>" works like a type annotation, as it specialises the type of the
following expression.
Here is an alternate solution:
let book =
  <:pxp_tree<
    <book id=("BOOK_" ^ (: b.isbn :))>
      [ <title><*>(: b.title :)
        <author><*>(: b.author :)
      ]
  >>
The notation "(: ... :)" allows you to include arbitrary O'Caml expressions
into the tree. In this solution it is no longer necessary to create artificial
O'Caml variables for the only purpose of injecting values into trees.
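To make the intended result concrete, here is a minimal sketch in plain O'Caml
that does not use PXP or the preprocessor at all. The "node" type and the
functions "tree_of_book" and "to_string" are hypothetical stand-ins introduced
only for illustration; they mirror the shape of the tree built by the
"pxp_tree" expressions above.

```ocaml
type book = { title : string; author : string; isbn : string }

(* Toy stand-in for an XML tree; not the PXP node classes. *)
type node =
  | Element of string * (string * string) list * node list
  | Data of string

(* Same shape as the pxp_tree expression: an "id" attribute built by
   concatenation, and two children that each hold one data node. *)
let tree_of_book (b : book) : node =
  Element
    ( "book",
      [ ("id", "BOOK_" ^ b.isbn) ],
      [ Element ("title", [], [ Data b.title ]);
        Element ("author", [], [ Data b.author ]) ] )

(* Naive serializer for illustration only: it performs no escaping. *)
let rec to_string = function
  | Data s -> s
  | Element (name, atts, children) ->
      let att (n, v) = Printf.sprintf " %s=\"%s\"" n v in
      Printf.sprintf "<%s%s>%s</%s>" name
        (String.concat "" (List.map att atts))
        (String.concat "" (List.map to_string children))
        name
```

Calling "to_string (tree_of_book b)" yields the conventional syntax shown at
the beginning of this section, without any whitespace between the elements.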
It is possible to create XML elements with dynamic names: Just put parentheses
around the expression. Example:
let name = "book" in
<:pxp_tree< <(name)> ... >>
With the same notation, one can also set attribute names dynamically:
let att_name = "id" in
<:pxp_tree< <book (att_name)=...> ... >>
Finally, it is also possible to include complete attribute lists dynamically:
let att_list = [ "id", ("BOOK_" ^ b.isbn) ] in
<:pxp_tree< <book (: att_list :) > ... >>
Typing: Depending on where a variable or O'Caml expression occurs, different
types are assumed. Compare the following examples:
<:pxp_tree< <element>x1 >>
<:pxp_tree< <element>[x2] >>
<:pxp_tree< <element><*>x3 >>
As a rule of thumb, the most general type is assumed that would make sense at a
certain location. As x1 could be replaced by a list of children, its type is
assumed to be a node list. As x2 could be replaced by a single node, its type
is assumed to be a node. Finally, x3 is a string, as discussed above.
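The rule of thumb can be mimicked in plain O'Caml with a toy node type. The
sketch below does not use PXP; the type "node" and the three helper functions
are hypothetical and only show which type each position expects.

```ocaml
(* Toy stand-in for PXP nodes; not the real Pxp_document types. *)
type node =
  | Element of string * node list
  | Data of string

(* <element>x1 : x1 fills the whole child list, so it is a node list. *)
let with_children (x1 : node list) : node = Element ("element", x1)

(* <element>[x2] : x2 is one member of the list, so it is a node. *)
let with_child (x2 : node) : node = Element ("element", [ x2 ])

(* <element><*>x3 : x3 is the contents of a data node, so it is a string. *)
let with_data (x3 : string) : node = Element ("element", [ Data x3 ])
```

Passing a value of the wrong type, for example a string to "with_child", is
rejected by the type checker, which is the same kind of error the generated
code produces when "<*>" is forgotten.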
------------------------------------------------------------------------------
Character Encodings
------------------------------------------------------------------------------
As the preprocessor generates code that builds XML trees, it must know two
character encodings:
- Which encoding is used in the source code (in the .ml file)
- Which encoding is used in the XML representation, i.e. in the O'Caml values
representing the XML trees
Both encodings can be set independently. The syntax is:
<:pxp_charset< source="ENC" representation="ENC" >>
The default is ISO-8859-1 for both encodings. For example, to set the
representation encoding to UTF-8, use:
<:pxp_charset< representation="UTF-8" >>
The "pxp_charset" notation is a constant expression that always evaluates to
"()". (A requirement by camlp4 that looks artificial.)
When you set the representation encoding, it is required that the encoding
stored in the DTD object is the same. Remember that we need a DTD object like
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
Of course, we must change this to the representation encoding, too, in our
example:
let dtd = Pxp_dtd.create_dtd `Enc_utf8;;
The preprocessor cannot check this at compile time, and for performance
reasons, a runtime check is not generated. So it is up to the programmer to
ensure that the character encodings are used in a consistent way.
------------------------------------------------------------------------------
Validated Trees
------------------------------------------------------------------------------
In order to validate trees, you need a filled DTD object. In principle, you can
create this object by a number of methods. For example, you can parse an
external file:
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_file "sample.dtd")
It is, however, often more convenient to include the DTD literally into the
program. This works by
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string "...")
As the double quotes are often used inside DTDs, O'Caml string literals are a
bit impractical, as they are also delimited by double quotes, and one needs to
add backslashes as escape characters. The "pxp_text" notation is often more
readable here: <:pxp_text<STRING>> is just another way of writing "STRING". For
our DTD, we can write
let dtd_text =
  <:pxp_text<
    <!ELEMENT book (title,author)>
    <!ATTLIST book id CDATA #REQUIRED>
    <!ELEMENT title (#PCDATA)>
    <!ATTLIST title lang CDATA "en">
    <!ELEMENT author (#PCDATA)>
  >>;;
let config = default_config;;
let dtd = Pxp_dtd_parser.parse_dtd_entity config (from_string dtd_text);;
Note that "pxp_text" is not restricted to DTDs, as it can be used for any kind
of string.
After we have the DTD, we can validate the trees. One option is to call the
"validate" function:
let book =
  <:pxp_tree<
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>;;
Pxp_document.validate book;;
(This example is invalid, as the "id" attribute is missing.)
Note that it is a common misunderstanding that "pxp_tree" builds XML trees in
well-formedness mode. You can create any tree with it; "pxp_tree" simply does
not invoke the validator. So if the DTD enforces validation, the tree is
validated when the "validate" function is called. If the DTD is in
well-formedness mode, the tree is effectively not validated, even when the
"validate" function is invoked. Incidentally, the following statements would
create a DTD in well-formedness mode:
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # allow_arbitrary;;
As an alternative to calling the "validate" function, one can also use
"pxp_vtree" instead. It immediately validates every XML element it creates.
However, "injected" subtrees are not validated, i.e. validation does not
proceed recursively to subnodes as it does with the "validate" function.
------------------------------------------------------------------------------
Generating Events
------------------------------------------------------------------------------
As PXP also has an event model to represent XML, the preprocessor can also
produce such events. In particular, there are two modes: The "pxp_evlist"
notation outputs lists of events (type "event list") representing the XML
expression. The "pxp_evpull" notation creates an automaton from which one can
"pull" events (like from a pull parser).
These two notations work very much like "pxp_tree". For example,
let book =
  <:pxp_evlist<
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>
generates
[ E_start_tag ("book", [], None, <obj>);
  E_start_tag ("title", [], None, <obj>);
  E_char_data "The Lord of The Rings";
  E_end_tag ("title", <obj>);
  E_start_tag ("author", [], None, <obj>);
  E_char_data "J.R.R. Tolkien";
  E_end_tag ("author", <obj>);
  E_end_tag ("book", <obj>)
]
Note that you need neither a "dtd" variable nor a "spec" variable. There is one
important difference, however: Both nodes and lists of nodes are represented by
the same type, "event list". As a consequence, in the following example x1 and
x2 have the same type "event list":
<:pxp_evlist< <element>x1 >>
<:pxp_evlist< <element>[x2] >>
<:pxp_evlist< <element><*>x3 >>
In principle, it could be checked at runtime whether x1 and x2 have the right
structure. However, this is not done for performance reasons.
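Why nodes and node lists collapse into one type can be seen in a small model
written in plain O'Caml. This is only a sketch: the "event" and "node" types
are simplified stand-ins (the real PXP event constructors carry more fields),
and "events_of_node" and "events_of_nodes" are hypothetical helpers.

```ocaml
(* Simplified events; the real PXP event constructors carry more fields. *)
type event =
  | Start_tag of string
  | End_tag of string
  | Char_data of string

(* Toy tree type, used only to have something to flatten. *)
type node =
  | Element of string * node list
  | Data of string

(* One node becomes a list of events; a list of nodes becomes the
   concatenation of those lists. Both results have type event list. *)
let rec events_of_node (n : node) : event list =
  match n with
  | Data s -> [ Char_data s ]
  | Element (name, children) ->
      (Start_tag name :: events_of_nodes children) @ [ End_tag name ]

and events_of_nodes (nodes : node list) : event list =
  List.concat (List.map events_of_node nodes)
```

Applied to a toy version of the book tree, "events_of_node" produces the start
tag, character data, and end tag events in the same order as the list shown
above.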
As mentioned, "pxp_evpull" works like a pull parser. After defining
let book =
  <:pxp_evpull<
    <book>
      [ <title>[ "The Lord of The Rings" ]
        <author>[ "J.R.R. Tolkien" ]
      ]
  >>
"book" is a function of type 'a -> event option. One can call it to get the
events one after the other:
let e1 = book();; (* = Some(E_start_tag ("book", [], None, <obj>)) *)
let e2 = book();; (* = Some(E_start_tag ("title", [], None, <obj>)) *)
...
After the last event, "book" returns None to indicate the end of the event
stream.
As with "pxp_evlist", it is not possible to distinguish between nodes and node
lists. In this example, both x1 and x2 are assumed to have type
'a -> event option:
<:pxp_evpull< <element>x1 >>
<:pxp_evpull< <element>[x2] >>
<:pxp_evpull< <element><*>x3 >>
Note that "<element>x1" actually means to build a new pull automaton around the
existing pull automaton x1: The children of "element" are retrieved by pulling
events from x1 until "None" is returned.
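The wrapping behaviour can be modelled in a few lines of plain O'Caml. The
following is a sketch under simplifying assumptions, not the code the
preprocessor generates: "event" is a reduced stand-in for the PXP event type,
and "pull_of_list" and "wrap" are hypothetical helpers.

```ocaml
(* Simplified events; the real PXP event constructors carry more fields. *)
type event = Start_tag of string | End_tag of string | Char_data of string

(* Progress of the wrapping automaton. *)
type state = Start | Children | Done

(* Turn a list into a pull function that yields each item once. *)
let pull_of_list (items : 'a list) : unit -> 'a option =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl ->
        rest := tl;
        Some x

(* Model of <element>x1: emit the start tag, forward everything x1
   yields, then emit the end tag once x1 has returned None. *)
let wrap (name : string) (x1 : unit -> event option) : unit -> event option =
  let state = ref Start in
  fun () ->
    match !state with
    | Start ->
        state := Children;
        Some (Start_tag name)
    | Children -> (
        match x1 () with
        | Some e -> Some e
        | None ->
            state := Done;
            Some (End_tag name))
    | Done -> None
```

Pulling from "wrap "element" x1" first yields the start tag, then every event
of x1, then the end tag, and finally None.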
A consequence of the pull semantics is that once an event is obtained from an
automaton, the state of the automaton is modified such that it is not possible
to get the same event again. If you need an automaton that can be reset to the
beginning, just wrap the "pxp_evpull" notation into a functional abstraction:
let book_maker() =
<:pxp_evpull< <book ...> ... >>;;
let book1 = book_maker();;
let book2 = book_maker();;
This way, "book1" and "book2" are independent event streams.
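The effect of the functional abstraction can be reproduced without PXP. In
this sketch the events are plain strings, and "pull_of_list" is a hypothetical
helper that plays the role of the generated automaton; only the state handling
is of interest.

```ocaml
(* A pull function over a fixed list; each call consumes one item. *)
let pull_of_list (items : 'a list) : unit -> 'a option =
  let rest = ref items in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl ->
        rest := tl;
        Some x

(* Every call of the maker allocates fresh state. *)
let book_maker () = pull_of_list [ "start book"; "end book" ]

let book1 = book_maker ()
let book2 = book_maker ()
```

Exhausting "book1" has no effect on "book2", because each closure owns its own
reference cell.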
There is another implication of the nature of the automatons: Subexpressions
are lazily evaluated. For example, in
<:pxp_evpull< <element>[ <*> (: get_data_contents() :) ] >>
the call of get_data_contents is performed just before the event for the data
node is constructed.
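The following plain O'Caml sketch illustrates what lazy evaluation means here.
It does not use PXP: events are modelled as strings, "pull_of_thunks" is a
hypothetical helper, and "get_data_contents" is a made-up function that
records when it is called.

```ocaml
(* Records the order in which side effects happen. *)
let log : string list ref = ref []

(* Made-up stand-in for an expensive or effectful computation. *)
let get_data_contents () =
  log := "computed data" :: !log;
  "payload"

(* A pull stream built from thunks: each thunk runs only when its
   event is requested. *)
let pull_of_thunks (thunks : (unit -> string) list) : unit -> string option =
  let rest = ref thunks in
  fun () ->
    match !rest with
    | [] -> None
    | f :: tl ->
        rest := tl;
        Some (f ())

let stream =
  pull_of_thunks
    [ (fun () -> "start element");
      (fun () -> get_data_contents ());
      (fun () -> "end element") ]
```

Creating "stream" and pulling the first event leaves "log" empty; the entry is
added only when the second event is pulled.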
------------------------------------------------------------------------------
Namespaces
------------------------------------------------------------------------------
By default, the preprocessor does not generate nodes or events that support
namespaces. It can, however, be configured to create namespace-aware XML
aggregations.
In any case, you need a namespace manager. This is an object that tracks the
usage of namespace prefixes in XML nodes. For example, we can create a
namespace manager that knows the "html" prefix:
let mng = new namespace_manager;;
mng # add_namespace "html" "http://www.w3.org/1999/xhtml";;
Here, we declare that we want to use the "html" prefix for the internal
representation of the XML nodes. This kind of prefix is called normalized
prefix, or normprefix for short. It is possible to configure different prefixes
for the external representation, i.e. when the XML tree is printed to a file.
This other kind of prefix is called display prefix. We will have a look at them
later.
Next, we must tell the DTD object that we have a namespace manager:
let dtd = Pxp_dtd.create_dtd `Enc_iso88591;;
dtd # set_namespace_manager mng;;
For "pxp_evlist" and "pxp_evpull" we are now prepared (note that we now need a
"dtd" variable, as the DTD object knows the namespace manager). For "pxp_tree"
and "pxp_vtree", it is required to use a namespace-aware specification:
let spec = Pxp_tree_parser.default_namespace_spec
(Normal specifications do not work; you would get "Namespace method not
applicable" errors if you tried to use them.)
The special notation "<:autoscope>" enables namespace mode in this example:
let list =
  <:pxp_tree<
    <:autoscope>
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>
In particular, "<:autoscope>" defines a new O'Caml variable for its
subexpression: "scope". This variable contains the namespace scope object,
which contains the namespace declarations for the subexpression. "<:autoscope>"
initialises this variable from the namespace manager such that it now contains
a declaration for the "html" prefix.
In general, the namespace scope object contains the prefixes to use for the
external representation. For this simple example, we have chosen to use the
same prefixes as for the internal representation, and "<:autoscope>" performs
the right initialisations for this.
Print the tree by
list # display (`Out_channel stdout) `Enc_iso88591
The point is to call the "display" method and not the "write" method. The
latter would not respect the display prefixes.
Alternatively, we can also create the "scope" variable manually:
let scope = Pxp_dtd.create_namespace_scope
              ~decl:[ "", "http://www.w3.org/1999/xhtml" ]
              mng;;
let list =
  <:pxp_tree<
    <:scope>
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>
Note that we now use "<:scope>". In this simple form, this construct just
enables namespace mode, and takes the "scope" variable from the environment.
Furthermore, the namespace scope now contains a different namespace
declaration: The display prefix "" is used for HTML. The empty prefix just
means to declare a default namespace (by xmlns="URI"). The effect can be seen
when the XML tree is printed by calling the "display" method.
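The relationship between normprefixes and display prefixes can be pictured
with a toy model in plain O'Caml. This is only a sketch of the idea, not of
the PXP implementation: "manager" and "scope" are simple association lists
standing in for the real objects, and "split_name" and "display_name" are
hypothetical helpers.

```ocaml
(* Toy model: the manager maps normprefixes to URIs, the scope maps
   display prefixes to URIs. Not the real Pxp_dtd objects. *)
let manager = [ ("html", "http://www.w3.org/1999/xhtml") ]
let scope = [ ("", "http://www.w3.org/1999/xhtml") ]

(* Split "prefix:local" at the first colon; no colon means no prefix. *)
let split_name (name : string) : string * string =
  match String.index_opt name ':' with
  | None -> ("", name)
  | Some i ->
      ( String.sub name 0 i,
        String.sub name (i + 1) (String.length name - i - 1) )

(* Rewrite a normalized name into the name used for display.
   Raises Not_found if the prefix or its URI is not declared. *)
let display_name (name : string) : string =
  let normprefix, local = split_name name in
  let uri = List.assoc normprefix manager in
  let display_prefix, _ = List.find (fun (_, u) -> u = uri) scope in
  if display_prefix = "" then local else display_prefix ^ ":" ^ local
```

With the scope shown, the internal name "html:ul" is displayed as "ul",
because the empty display prefix is bound to the XHTML namespace URI.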
Here is a third variant of the same example:
let scope = Pxp_dtd.create_namespace_scope mng ;;
let list =
  <:pxp_tree<
    <:scope ("")="http://www.w3.org/1999/xhtml">
      <html:ul>
        [ <html:li>"Item1"
          <html:li>"Item2"
        ]
  >>
The "scope" is now initially empty. The "<:scope>" notation is used to extend
the scope for the time the subexpression is evaluated.
There is also a notation "<:emptyscope>" that creates an empty scope object, so
one could even write
let list =
  <:pxp_tree<
    <:emptyscope>
      <:scope ("")="http://www.w3.org/1999/xhtml">
        <html:ul>
          [ <html:li>"Item1"
            <html:li>"Item2"
          ]
  >>
It is recommended to create the "scope" variable manually with a reasonable
initial declaration, and to use "<:scope>" to enable namespace processing, and
to extend the scope when necessary. The advantage of this approach is that the
same scope object can be shared by many XML nodes, so you need less memory.
One tip: To get a namespace scope that is initialised with all prefixes of the
namespace manager (as <:autoscope> does), define
let scope = Pxp_dtd.create_namespace_scope ~decl:(mng#as_declaration) mng
For event-based processing of XML, the namespace mode works in the same way as
described here; there is no difference.