1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011
|
@code{(require 'xml-parse)} or @code{(require 'ssax)}
@noindent
The XML standard document referred to in this module is@*
@url{http://www.w3.org/TR/1998/REC-xml-19980210.html}.
@noindent
The present frameworks fully supports the XML Namespaces
Recommendation@*
@url{http://www.w3.org/TR/REC-xml-names}.
@subsection String Glue
@defun ssax:reverse-collect-str list-of-frags
Given the list of fragments (some of which are text strings),
reverse the list and concatenate adjacent text strings. If
LIST-OF-FRAGS has zero or one element, the result of the procedure
is @code{equal?} to its argument.
@end defun
@defun ssax:reverse-collect-str-drop-ws list-of-frags
Given the list of fragments (some of which are text strings),
reverse the list and concatenate adjacent text strings while
dropping "unsignificant" whitespace, that is, whitespace in front,
behind and between elements. The whitespace that is included in
character data is not affected.
Use this procedure to "intelligently" drop "insignificant"
whitespace in the parsed SXML. If the strict compliance with the
XML Recommendation regarding the whitespace is desired, use the
@code{ssax:reverse-collect-str} procedure instead.
@end defun
@subsection Character and Token Functions
The following functions either skip, or build and return tokens,
according to inclusion or delimiting semantics. The list of
characters to expect, include, or to break at may vary from one
invocation of a function to another. This allows the functions to
easily parse even context-sensitive languages.
Exceptions are mentioned specifically. The list of expected
characters (characters to skip until, or break-characters) may
include an EOF "character", which is coded as symbol *eof*
The input stream to parse is specified as a PORT, which is the last
argument.
@defun ssax:assert-current-char char-list string port
Reads a character from the @var{port} and looks it up in the
@var{char-list} of expected characters. If the read character was
found among expected, it is returned. Otherwise, the
procedure writes a message using @var{string} as a comment
and quits.
@end defun
@defun ssax:skip-while char-list port
Reads characters from the @var{port} and disregards them, as long as they
are mentioned in the @var{char-list}. The first character (which may be EOF)
peeked from the stream that is @emph{not} a member of the @var{char-list} is
returned.
@end defun
@defun ssax:init-buffer
Returns an initial buffer for @code{ssax:next-token*} procedures.
@code{ssax:init-buffer} may allocate a new buffer at each invocation.
@end defun
@defun ssax:next-token prefix-char-list break-char-list comment-string port
Skips any number of the prefix characters (members of the @var{prefix-char-list}), if
any, and reads the sequence of characters up to (but not including)
a break character, one of the @var{break-char-list}.
The string of characters thus read is returned. The break character
is left on the input stream. @var{break-char-list} may include the symbol @code{*eof*};
otherwise, EOF is fatal, generating an error message including a
specified @var{comment-string}.
@end defun
@noindent
@code{ssax:next-token-of} is similar to @code{ssax:next-token}
except that it implements an inclusion rather than delimiting
semantics.
@defun ssax:next-token-of inc-charset port
Reads characters from the @var{port} that belong to the list of characters
@var{inc-charset}. The reading stops at the first character which is not a member
of the set. This character is left on the stream. All the read
characters are returned in a string.
@end defun
@defun ssax:next-token-of pred port
Reads characters from the @var{port} for which @var{pred} (a procedure of
one argument) returns non-#f. The reading stops at the first
character for which @var{pred} returns #f. That character is left
on the stream. All the results of evaluating of @var{pred} up to #f
are returned in a string.
@var{pred} is a procedure that takes one argument (a character or
the EOF object) and returns a character or #f. The returned
character does not have to be the same as the input argument to the
@var{pred}. For example,
@example
(ssax:next-token-of (lambda (c)
(cond ((eof-object? c) #f)
((char-alphabetic? c) (char-downcase c))
(else #f)))
(current-input-port))
@end example
will try to read an alphabetic token from the current input port,
and return it in lower case.
@end defun
@defun ssax:read-string len port
Reads @var{len} characters from the @var{port}, and returns them in a string. If
EOF is encountered before @var{len} characters are read, a shorter string
will be returned.
@end defun
@subsection Data Types
@table @code
@item TAG-KIND
A symbol @samp{START}, @samp{END}, @samp{PI}, @samp{DECL},
@samp{COMMENT}, @samp{CDSECT}, or @samp{ENTITY-REF} that identifies
a markup token
@item UNRES-NAME
a name (called GI in the XML Recommendation) as given in an XML
document for a markup token: start-tag, PI target, attribute name.
If a GI is an NCName, UNRES-NAME is this NCName converted into a
Scheme symbol. If a GI is a QName, @samp{UNRES-NAME} is a pair of
symbols: @code{(@var{PREFIX} . @var{LOCALPART})}.
@item RES-NAME
An expanded name, a resolved version of an @samp{UNRES-NAME}. For
an element or an attribute name with a non-empty namespace URI,
@samp{RES-NAME} is a pair of symbols,
@code{(@var{URI-SYMB} . @var{LOCALPART})}.
Otherwise, it's a single symbol.
@item ELEM-CONTENT-MODEL
A symbol:
@table @samp
@item ANY
anything goes, expect an END tag.
@item EMPTY-TAG
no content, and no END-tag is coming
@item EMPTY
no content, expect the END-tag as the next token
@item PCDATA
expect character data only, and no children elements
@item MIXED
@item ELEM-CONTENT
@end table
@item URI-SYMB
A symbol representing a namespace URI -- or other symbol chosen by
the user to represent URI. In the former case, @code{URI-SYMB} is
created by %-quoting of bad URI characters and converting the
resulting string into a symbol.
@item NAMESPACES
A list representing namespaces in effect. An element of the list
has one of the following forms:
@table @code
@item (@var{prefix} @var{uri-symb} . @var{uri-symb}) or
@item (@var{prefix} @var{user-prefix} . @var{uri-symb})
@var{user-prefix} is a symbol chosen by the user to represent the URI.
@item (#f @var{user-prefix} . @var{uri-symb})
Specification of the user-chosen prefix and a URI-SYMBOL.
@item (*DEFAULT* @var{user-prefix} . @var{uri-symb})
Declaration of the default namespace
@item (*DEFAULT* #f . #f)
Un-declaration of the default namespace. This notation
represents overriding of the previous declaration
@end table
A NAMESPACES list may contain several elements for the same @var{prefix}.
The one closest to the beginning of the list takes effect.
@item ATTLIST
An ordered collection of (@var{NAME} . @var{VALUE}) pairs, where
@var{NAME} is a RES-NAME or an UNRES-NAME. The collection is an ADT.
@item STR-HANDLER
A procedure of three arguments: @var{string1} @var{string2}
@var{seed} returning a new @var{seed}. The procedure is supposed to
handle a chunk of character data @var{string1} followed by a chunk
of character data @var{string2}. @var{string2} is a short string,
often @samp{"\n"} and even @samp{""}.
@item ENTITIES
An assoc list of pairs:
@lisp
(@var{named-entity-name} . @var{named-entity-body})
@end lisp
where @var{named-entity-name} is a symbol under which the entity was
declared, @var{named-entity-body} is either a string, or (for an
external entity) a thunk that will return an input port (from which
the entity can be read). @var{named-entity-body} may also be #f.
This is an indication that a @var{named-entity-name} is currently
being expanded. A reference to this @var{named-entity-name} will be
an error: violation of the WFC nonrecursion.
@item XML-TOKEN
This record represents a markup, which is, according to the XML
Recommendation, "takes the form of start-tags, end-tags,
empty-element tags, entity references, character references,
comments, CDATA section delimiters, document type declarations, and
processing instructions."
@table @asis
@item kind
a TAG-KIND
@item head
an UNRES-NAME. For XML-TOKENs of kinds 'COMMENT and 'CDSECT, the
head is #f.
@end table
For example,
@example
<P> => kind=START, head=P
</P> => kind=END, head=P
<BR/> => kind=EMPTY-EL, head=BR
<!DOCTYPE OMF ...> => kind=DECL, head=DOCTYPE
<?xml version="1.0"?> => kind=PI, head=xml
&my-ent; => kind=ENTITY-REF, head=my-ent
@end example
Character references are not represented by xml-tokens as these
references are transparently resolved into the corresponding
characters.
@item XML-DECL
The record represents a datatype of an XML document: the list of
declared elements and their attributes, declared notations, list of
replacement strings or loading procedures for parsed general
entities, etc. Normally an XML-DECL record is created from a DTD or
an XML Schema, although it can be created and filled in in many
other ways (e.g., loaded from a file).
@table @var
@item elems
an (assoc) list of decl-elem or #f. The latter instructs
the parser to do no validation of elements and attributes.
@item decl-elem
declaration of one element:
@code{(@var{elem-name} @var{elem-content} @var{decl-attrs})}
@var{elem-name} is an UNRES-NAME for the element.
@var{elem-content} is an ELEM-CONTENT-MODEL.
@var{decl-attrs} is an @code{ATTLIST}, of
@code{(@var{attr-name} . @var{value})} associations.
This element can declare a user procedure to handle parsing of an
element (e.g., to do a custom validation, or to build a hash of IDs
as they're encountered).
@item decl-attr
an element of an @code{ATTLIST}, declaration of one attribute:
@code{(@var{attr-name} @var{content-type} @var{use-type} @var{default-value})}
@var{attr-name} is an UNRES-NAME for the declared attribute.
@var{content-type} is a symbol: @code{CDATA}, @code{NMTOKEN},
@code{NMTOKENS}, @dots{} or a list of strings for the enumerated
type.
@var{use-type} is a symbol: @code{REQUIRED}, @code{IMPLIED}, or
@code{FIXED}.
@var{default-value} is a string for the default value, or #f if not
given.
@end table
@end table
@subsection Low-Level Parsers and Scanners
@noindent
These procedures deal with primitive lexical units (Names,
whitespaces, tags) and with pieces of more generic productions.
Most of these parsers must be called in appropriate context. For
example, @code{ssax:complete-start-tag} must be called only when the
start-tag has been detected and its GI has been read.
@defun ssax:skip-s port
Skip the S (whitespace) production as defined by
@example
[3] S ::= (#x20 | #x09 | #x0D | #x0A)
@end example
@code{ssax:skip-s} returns the first not-whitespace character it encounters while
scanning the @var{port}. This character is left on the input stream.
@end defun
@defun ssax:read-ncname port
Read a NCName starting from the current position in the @var{port} and
return it as a symbol.
@example
[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':'
| CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*
@end example
This code supports the XML Namespace Recommendation REC-xml-names,
which modifies the above productions as follows:
@example
[4] NCNameChar ::= Letter | Digit | '.' | '-' | '_'
| CombiningChar | Extender
[5] NCName ::= (Letter | '_') (NCNameChar)*
@end example
As the Rec-xml-names says,
@quotation
"An XML document conforms to this specification if all other tokens
[other than element types and attribute names] in the document which
are required, for XML conformance, to match the XML production for
Name, match this specification's production for NCName."
@end quotation
Element types and attribute names must match the production QName,
defined below.
@end defun
@defun ssax:read-qname port
Read a (namespace-) Qualified Name, QName, from the current position
in @var{port}; and return an UNRES-NAME.
From REC-xml-names:
@example
[6] QName ::= (Prefix ':')? LocalPart
[7] Prefix ::= NCName
[8] LocalPart ::= NCName
@end example
@end defun
@defun ssax:read-markup-token port
This procedure starts parsing of a markup token. The current
position in the stream must be @samp{<}. This procedure scans
enough of the input stream to figure out what kind of a markup token
it is seeing. The procedure returns an XML-TOKEN structure
describing the token. Note, generally reading of the current markup
is not finished! In particular, no attributes of the start-tag
token are scanned.
Here's a detailed break out of the return values and the position in
the PORT when that particular value is returned:
@table @asis
@item PI-token
only PI-target is read. To finish the Processing-Instruction and
disregard it, call @code{ssax:skip-pi}. @code{ssax:read-attributes}
may be useful as well (for PIs whose content is attribute-value
pairs).
@item END-token
The end tag is read completely; the current position is right after
the terminating @samp{>} character.
@item COMMENT
is read and skipped completely. The current position is right after
@samp{-->} that terminates the comment.
@item CDSECT
The current position is right after @samp{<!CDATA[}. Use
@code{ssax:read-cdata-body} to read the rest.
@item DECL
We have read the keyword (the one that follows @samp{<!})
identifying this declaration markup. The current position is after
the keyword (usually a whitespace character)
@item START-token
We have read the keyword (GI) of this start tag. No attributes are
scanned yet. We don't know if this tag has an empty content either.
Use @code{ssax:complete-start-tag} to finish parsing of the token.
@end table
@end defun
@defun ssax:skip-pi port
The current position is inside a PI. Skip till the rest of the PI
@end defun
@defun ssax:read-pi-body-as-string port
The current position is right after reading the PITarget. We read
the body of PI and return is as a string. The port will point to
the character right after @samp{?>} combination that terminates PI.
@example
[16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
@end example
@end defun
@defun ssax:skip-internal-dtd port
The current pos in the port is inside an internal DTD subset (e.g.,
after reading @samp{#\[} that begins an internal DTD subset) Skip
until the @samp{]>} combination that terminates this DTD.
@end defun
@defun ssax:read-cdata-body port str-handler seed
This procedure must be called after we have read a string
@samp{<![CDATA[} that begins a CDATA section. The current position
must be the first position of the CDATA body. This function reads
@emph{lines} of the CDATA body and passes them to a @var{str-handler}, a character
data consumer.
@var{str-handler} is a procedure taking arguments: @var{string1}, @var{string2},
and @var{seed}. The first @var{string1} argument to @var{str-handler} never
contains a newline; the second @var{string2} argument often will.
On the first invocation of @var{str-handler}, @var{seed} is the one passed to @code{ssax:read-cdata-body} as the
third argument. The result of this first invocation will be passed
as the @var{seed} argument to the second invocation of the line
consumer, and so on. The result of the last invocation of the @var{str-handler} is
returned by the @code{ssax:read-cdata-body}. Note a similarity to the fundamental @dfn{fold}
@cindex fold
iterator.
Within a CDATA section all characters are taken at their face value,
with three exceptions:
@itemize @bullet
@item
CR, LF, and CRLF are treated as line delimiters, and passed
as a single @samp{#\newline} to @var{str-handler}
@item
@samp{]]>} combination is the end of the CDATA section.
@samp{>} is treated as an embedded @samp{>} character.
@item
@samp{<} and @samp{&} are not specially recognized (and are
not expanded)!
@end itemize
@end defun
@defun ssax:read-char-ref port
@example
[66] CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
@end example
This procedure must be called after we we have read @samp{&#} that
introduces a char reference. The procedure reads this reference and
returns the corresponding char. The current position in PORT will
be after the @samp{;} that terminates the char reference.
Faults detected:@*
WFC: XML-Spec.html#wf-Legalchar
According to Section @cite{4.1 Character and Entity References}
of the XML Recommendation:
@quotation
"[Definition: A character reference refers to a specific character
in the ISO/IEC 10646 character set, for example one not directly
accessible from available input devices.]"
@end quotation
@c Therefore, we use a @code{ucscode->char} function to convert a
@c character code into the character -- *regardless* of the current
@c character encoding of the input stream.
@end defun
@defun ssax:handle-parsed-entity port name entities content-handler str-handler seed
Expands and handles a parsed-entity reference.
@var{name} is a symbol, the name of the parsed entity to expand.
@c entities - see ENTITIES
@var{content-handler} is a procedure of arguments @var{port}, @var{entities}, and
@var{seed} that returns a seed.
@var{str-handler} is called if the entity in question is a pre-declared entity.
@code{ssax:handle-parsed-entity} returns the result returned by @var{content-handler} or @var{str-handler}.
Faults detected:@*
WFC: XML-Spec.html#wf-entdeclared@*
WFC: XML-Spec.html#norecursion
@end defun
@defun attlist-add attlist name-value
Add a @var{name-value} pair to the existing @var{attlist}, preserving its sorted ascending
order; and return the new list. Return #f if a pair with the same
name already exists in @var{attlist}
@end defun
@defun attlist-remove-top attlist
Given an non-null @var{attlist}, return a pair of values: the top and the rest.
@end defun
@defun ssax:read-attributes port entities
This procedure reads and parses a production @dfn{Attribute}.
@cindex Attribute
@example
[41] Attribute ::= Name Eq AttValue
[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
| "'" ([^<&'] | Reference)* "'"
[25] Eq ::= S? '=' S?
@end example
The procedure returns an ATTLIST, of Name (as UNRES-NAME), Value (as
string) pairs. The current character on the @var{port} is a non-whitespace
character that is not an NCName-starting character.
Note the following rules to keep in mind when reading an
@dfn{AttValue}:
@cindex AttValue
@quotation
Before the value of an attribute is passed to the application or
checked for validity, the XML processor must normalize it as
follows:
@itemize @bullet
@item
A character reference is processed by appending the referenced
character to the attribute value.
@item
An entity reference is processed by recursively processing the
replacement text of the entity. The named entities @samp{amp},
@samp{lt}, @samp{gt}, @samp{quot}, and @samp{apos} are pre-declared.
@item
A whitespace character (#x20, #x0D, #x0A, #x09) is processed by
appending #x20 to the normalized value, except that only a single
#x20 is appended for a "#x0D#x0A" sequence that is part of an
external parsed entity or the literal entity value of an internal
parsed entity.
@item
Other characters are processed by appending them to the normalized
value.
@end itemize
@end quotation
Faults detected:@*
WFC: XML-Spec.html#CleanAttrVals@*
WFC: XML-Spec.html#uniqattspec
@end defun
@defun ssax:resolve-name port unres-name namespaces apply-default-ns?
Convert an @var{unres-name} to a RES-NAME, given the appropriate @var{namespaces} declarations.
The last parameter, @var{apply-default-ns?}, determines if the default namespace applies
(for instance, it does not for attribute names).
Per REC-xml-names/#nsc-NSDeclared, the "xml" prefix is considered
pre-declared and bound to the namespace name
"http://www.w3.org/XML/1998/namespace".
@code{ssax:resolve-name} tests for the namespace constraints:@*
@url{http://www.w3.org/TR/REC-xml-names/#nsc-NSDeclared}
@end defun
@defun ssax:complete-start-tag tag port elems entities namespaces
Complete parsing of a start-tag markup. @code{ssax:complete-start-tag} must be called after the
start tag token has been read. @var{tag} is an UNRES-NAME. @var{elems} is an
instance of the ELEMS slot of XML-DECL; it can be #f to tell the
function to do @emph{no} validation of elements and their
attributes.
@code{ssax:complete-start-tag} returns several values:
@itemize @bullet
@item ELEM-GI:
a RES-NAME.
@item ATTRIBUTES:
element's attributes, an ATTLIST of (RES-NAME . STRING) pairs.
The list does NOT include xmlns attributes.
@item NAMESPACES:
the input list of namespaces amended with namespace
(re-)declarations contained within the start-tag under parsing
@item ELEM-CONTENT-MODEL
@end itemize
On exit, the current position in @var{port} will be the first character
after @samp{>} that terminates the start-tag markup.
Faults detected:@*
VC: XML-Spec.html#enum@*
VC: XML-Spec.html#RequiredAttr@*
VC: XML-Spec.html#FixedAttr@*
VC: XML-Spec.html#ValueType@*
WFC: XML-Spec.html#uniqattspec (after namespaces prefixes are resolved)@*
VC: XML-Spec.html#elementvalid@*
WFC: REC-xml-names/#dt-NSName
@emph{Note}: although XML Recommendation does not explicitly say it,
xmlns and xmlns: attributes don't have to be declared (although they
can be declared, to specify their default value).
@end defun
@defun ssax:read-external-id port
Parses an ExternalID production:
@example
[75] ExternalID ::= 'SYSTEM' S SystemLiteral
| 'PUBLIC' S PubidLiteral S SystemLiteral
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar* '"'
| "'" (PubidChar - "'")* "'"
[13] PubidChar ::= #x20 | #x0D | #x0A | [a-zA-Z0-9]
| [-'()+,./:=?;!*#@@$_%]
@end example
Call @code{ssax:read-external-id} when an ExternalID is expected; that is, the current
character must be either #\S or #\P that starts correspondingly a
SYSTEM or PUBLIC token. @code{ssax:read-external-id} returns the @var{SystemLiteral} as a
string. A @var{PubidLiteral} is disregarded if present.
@end defun
@subsection Mid-Level Parsers and Scanners
@noindent
These procedures parse productions corresponding to the whole
(document) entity or its higher-level pieces (prolog, root element,
etc).
@defun ssax:scan-misc port
Scan the Misc production in the context:
@example
[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedec l Misc*)?
[27] Misc ::= Comment | PI | S
@end example
Call @code{ssax:scan-misc} in the prolog or epilog contexts. In these contexts,
whitespaces are completely ignored. The return value from @code{ssax:scan-misc} is
either a PI-token, a DECL-token, a START token, or *EOF*. Comments
are ignored and not reported.
@end defun
@defun ssax:read-char-data port expect-eof? str-handler iseed
Read the character content of an XML document or an XML element.
@example
[43] content ::=
(element | CharData | Reference | CDSect | PI | Comment)*
@end example
To be more precise, @code{ssax:read-char-data} reads CharData, expands CDSect and character
entities, and skips comments. @code{ssax:read-char-data} stops at a named reference, EOF,
at the beginning of a PI, or a start/end tag.
@var{expect-eof?} is a boolean indicating if EOF is normal; i.e., the character
data may be terminated by the EOF. EOF is normal while processing a
parsed entity.
@var{iseed} is an argument passed to the first invocation of @var{str-handler}.
@code{ssax:read-char-data} returns two results: @var{seed} and @var{token}. The @var{seed}
is the result of the last invocation of @var{str-handler}, or the original @var{iseed} if @var{str-handler}
was never called.
@var{token} can be either an eof-object (this can happen only if @var{expect-eof?}
was #t), or:
@itemize @bullet
@item
an xml-token describing a START tag or an END-tag;
For a start token, the caller has to finish reading it.
@item
an xml-token describing the beginning of a PI. It's up to an
application to read or skip through the rest of this PI;
@item
an xml-token describing a named entity reference.
@end itemize
CDATA sections and character references are expanded inline and
never returned. Comments are silently disregarded.
As the XML Recommendation requires, all whitespace in character data
must be preserved. However, a CR character (#x0D) must be
disregarded if it appears before a LF character (#x0A), or replaced
by a #x0A character otherwise. See Secs. 2.10 and 2.11 of the XML
Recommendation. See also the canonical XML Recommendation.
@end defun
@defun ssax:assert-token token kind gi error-cont
Make sure that @var{token} is of anticipated @var{kind} and has anticipated @var{gi}. Note
that the @var{gi} argument may actually be a pair of two symbols,
Namespace-URI or the prefix, and of the localname. If the assertion
fails, @var{error-cont} is evaluated by passing it three arguments: @var{token} @var{kind} @var{gi}. The
result of @var{error-cont} is returned.
@end defun
@subsection High-level Parsers
These procedures are to instantiate a SSAX parser. A user can
instantiate the parser to do the full validation, or no validation,
or any particular validation. The user specifies which PI he wants
to be notified about. The user tells what to do with the parsed
character and element data. The latter handlers determine if the
parsing follows a SAX or a DOM model.
@defun ssax:make-pi-parser my-pi-handlers
Create a parser to parse and process one Processing Element (PI).
@var{my-pi-handlers} is an association list of pairs
@code{(@var{pi-tag} . @var{pi-handler})} where @var{pi-tag} is an
NCName symbol, the PI target; and @var{pi-handler} is a procedure
taking arguments @var{port}, @var{pi-tag}, and @var{seed}.
@var{pi-handler} should read the rest of the PI up to and including
the combination @samp{?>} that terminates the PI. The handler
should return a new seed. One of the @var{pi-tag}s may be the
symbol @code{*DEFAULT*}. The corresponding handler will handle PIs
that no other handler will. If the *DEFAULT* @var{pi-tag} is not
specified, @code{ssax:make-pi-parser} will assume the default handler that skips the body of
the PI.
@code{ssax:make-pi-parser} returns a procedure of arguments @var{port}, @var{pi-tag}, and
@var{seed}; that will parse the current PI according to @var{my-pi-handlers}.
@end defun
@defun ssax:make-elem-parser my-new-level-seed my-finish-element my-char-data-handler my-pi-handlers
Create a parser to parse and process one element, including its
character content or children elements. The parser is typically
applied to the root element of a document.
@table @asis
@item @var{my-new-level-seed}
is a procedure taking arguments:
@var{elem-gi} @var{attributes} @var{namespaces} @var{expected-content} @var{seed}
where @var{elem-gi} is a RES-NAME of the element about to be
processed.
@var{my-new-level-seed} is to generate the seed to be passed to handlers that process the
content of the element.
@item @var{my-finish-element}
is a procedure taking arguments:
@var{elem-gi} @var{attributes} @var{namespaces} @var{parent-seed} @var{seed}
@var{my-finish-element} is called when parsing of @var{elem-gi} is finished.
The @var{seed} is the result from the last content parser (or
from @var{my-new-level-seed} if the element has the empty content).
@var{parent-seed} is the same seed as was passed to @var{my-new-level-seed}.
@var{my-finish-element} is to generate a seed that will be the result
of the element parser.
@item @var{my-char-data-handler}
is a STR-HANDLER as described in Data Types above.
@item @var{my-pi-handlers}
is as described for @code{ssax:make-pi-handler} above.
@end table
The generated parser is a procedure taking arguments:
@var{start-tag-head} @var{port} @var{elems} @var{entities} @var{namespaces} @var{preserve-ws?} @var{seed}
The procedure must be called after the start tag token has been
read. @var{start-tag-head} is an UNRES-NAME from the start-element
tag. ELEMS is an instance of ELEMS slot of XML-DECL.
Faults detected:@*
VC: XML-Spec.html#elementvalid@*
WFC: XML-Spec.html#GIMatch
@end defun
@defun ssax:make-parser user-handler-tag user-handler @dots{}
Create an XML parser, an instance of the XML parsing framework.
This will be a SAX, a DOM, or a specialized parser depending on the
supplied user-handlers.
@code{ssax:make-parser} takes an even number of arguments; @var{user-handler-tag} is a symbol that identifies
a procedure (or association list for @code{PROCESSING-INSTRUCTIONS})
(@var{user-handler}) that follows the tag. Given below are tags and signatures of
the corresponding procedures. Not all tags have to be specified.
If some are omitted, reasonable defaults will apply.
@table @samp
@item DOCTYPE
handler-procedure: @var{port} @var{docname} @var{systemid} @var{internal-subset?} @var{seed}
If @var{internal-subset?} is #t, the current position in the port is
right after we have read @samp{[} that begins the internal DTD
subset. We must finish reading of this subset before we return (or
must call @code{skip-internal-dtd} if we aren't interested in
reading it). @var{port} at exit must be at the first symbol after
the whole DOCTYPE declaration.
The handler-procedure must generate four values:
@quotation
@var{elems} @var{entities} @var{namespaces} @var{seed}
@end quotation
@var{elems} is as defined for the ELEMS slot of XML-DECL. It may be
#f to switch off validation. @var{namespaces} will typically
contain @var{user-prefix}es for selected @var{uri-symb}s. The
default handler-procedure skips the internal subset, if any, and
returns @code{(values #f '() '() seed)}.
@item UNDECL-ROOT
procedure: @var{elem-gi} @var{seed}
where @var{elem-gi} is an UNRES-NAME of the root element. This
procedure is called when an XML document under parsing contains
@emph{no} DOCTYPE declaration.
The handler-procedure, as a DOCTYPE handler procedure above,
must generate four values:
@quotation
@var{elems} @var{entities} @var{namespaces} @var{seed}
@end quotation
The default handler-procedure returns (values #f '() '() seed)
@item DECL-ROOT
procedure: @var{elem-gi} @var{seed}
where @var{elem-gi} is an UNRES-NAME of the root element. This
procedure is called when an XML document under parsing does contains
the DOCTYPE declaration. The handler-procedure must generate a new
@var{seed} (and verify that the name of the root element matches the
doctype, if the handler so wishes). The default handler-procedure
is the identity function.
@item NEW-LEVEL-SEED
procedure: see ssax:make-elem-parser, my-new-level-seed
@item FINISH-ELEMENT
procedure: see ssax:make-elem-parser, my-finish-element
@item CHAR-DATA-HANDLER
procedure: see ssax:make-elem-parser, my-char-data-handler
@item PROCESSING-INSTRUCTIONS
association list as is passed to @code{ssax:make-pi-parser}.
The default value is '()
@end table
The generated parser is a procedure of arguments @var{port} and
@var{seed}.
This procedure parses the document prolog and then exits to an
element parser (created by @code{ssax:make-elem-parser}) to handle
the rest.
@example
[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedec | Misc*)?
[27] Misc ::= Comment | PI | S
[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S?
('[' (markupdecl | PEReference | S)* ']' S?)? '>'
[29] markupdecl ::= elementdecl | AttlistDecl
| EntityDecl
| NotationDecl | PI
| Comment
@end example
@end defun
@subsection Parsing XML to SXML
@defun ssax:xml->sxml port namespace-prefix-assig
This is an instance of the SSAX parser that returns an SXML
representation of the XML document to be read from @var{port}. @var{namespace-prefix-assig} is a list
of @code{(@var{user-prefix} . @var{uri-string})} that assigns
@var{user-prefix}es to certain namespaces identified by particular
@var{uri-string}s. It may be an empty list. @code{ssax:xml->sxml} returns an SXML
tree. The port points out to the first character after the root
element.
@end defun
|