\input texiplus @c -*-texinfo-*-
@c %**start of header
@setfilename xmlada.info
@settitle The Ada95 XML Library
@syncodeindex fn cp
@dircategory GNU Ada tools
@direntry
* XML/Ada: (xmlada). The Ada95 Unicode and XML Library
@end direntry
@set XMLVersion 1.0
@titlepage
@title The Ada95 Unicode and XML Library
@subtitle Version @value{XMLVersion}
@subtitle Document revision level $Revision: 1.7 $
@subtitle Date: $Date: 2003/01/10 13:07:44 $
@author Emmanuel Briot
@page
@vskip 0pt plus 1filll
Copyright @copyright{} 2000-2002, Emmanuel Briot
This document may be copied, in whole or in part, in any form or by any
means, as is or with alterations, provided that (1) alterations are clearly
marked as alterations and (2) this copyright notice is included
unmodified in any copy.
@end titlepage
@ifinfo
@node Top, Introduction, (dir), (dir)
@top The Ada95 Unicode and XML Library
The Ada95 XML Library
Version @value{XMLVersion}
Date: $Date: 2003/01/10 13:07:44 $
Copyright @copyright{} 2000-2002, Emmanuel Briot
This document may be copied, in whole or in part, in any form or by any
means, as is or with alterations, provided that (1) alterations are clearly
marked as alterations and (2) this copyright notice is included
unmodified in any copy.
@menu
* Introduction::
* The Unicode module::
* The Input module::
* The SAX module::
* The DOM module::
* Using the library::
@detailmenu
--- The Detailed Node Listing ---
The Unicode module
* Glyphs::
* Repertoires and subsets::
* Character sets::
* Character encoding schemes::
* Misc. functions::
The Input module
The SAX module
* SAX Description::
* SAX Examples::
* SAX Parser::
* SAX Handlers::
The DOM module
Using the library
@end detailmenu
@end menu
@end ifinfo
@c -------------------------------------------------------------------
@node Introduction
@chapter Introduction
@c -------------------------------------------------------------------
@noindent
The Extensible Markup Language (XML) is a subset of SGML that is
completely described in the XML 1.0 specification. Its goal is to enable generic
SGML to be served, received, and processed on the Web in the way that is
now possible with HTML. XML has been designed for ease of implementation
and for interoperability with both SGML and HTML.
This library includes a set of Ada95 packages to manipulate XML input. It
implements the XML 1.0 standard (see the references at the end of this
document), as well as support for namespaces and a number of other
optional standards related to XML.
We have tried to follow the XML standard as closely as possible, so that
you can easily analyze and reuse documents produced by or for other XML tools.
This document isn't a tutorial on what XML is, nor on the various
standards like DOM and SAX. Although we will try to give a few
examples, we refer the reader to the standards themselves, which are all
easily readable.
@b{??? Explain what XML is}
@c -------------------------------------------------------------------
@node The Unicode module
@chapter The Unicode module
@c -------------------------------------------------------------------
@c --- The following comes directly from www.unicode.org ----
@noindent
Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language.
Fundamentally, computers just deal with numbers. They store letters and
other characters by assigning a number for each one. Before Unicode was
invented, there were hundreds of different encoding systems for
assigning these numbers. No single encoding could contain enough
characters: for example, the European Union alone requires several
different encodings to cover all its languages. Even for a single
language like English no single encoding was adequate for all the
letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two
encodings can use the same number for two different characters, or use
different numbers for the same character. Any given computer (especially
servers) needs to support many different encodings; yet whenever data is
passed between different encodings or platforms, that data always runs
the risk of corruption.
Unicode provides a unique number for every character, no matter what the
platform, no matter what the program, no matter what the language. The
Unicode Standard has been adopted by such industry leaders as Apple, HP,
IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many
others. Unicode is required by modern standards such as XML, Java,
ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official
way to implement ISO/IEC 10646. It is supported in many operating
systems, all modern browsers, and many other products. The emergence of
the Unicode Standard, and the availability of tools supporting it, are
among the most significant recent global software technology trends.
@c --- End of www.unicode.org ---
The following sections explain the basic vocabulary and concepts
associated with Unicode and encodings.
Most of the information comes from the official Unicode Web site, at
@url{http://www.unicode.org/unicode/reports/tr17}.
Part of this documentation comes from @url{http://www.unicode.org}, the
official web site for Unicode.
Some information was also extracted from the "UTF-8 and Unicode FAQ"
by M. Kuhn, available at @url{???}.
@menu
* Glyphs::
* Repertoires and subsets::
* Character sets::
* Character encoding schemes::
* Misc. functions::
@end menu
@c -------------------------------------------------------------------
@node Glyphs
@section Glyphs
@c -------------------------------------------------------------------
@noindent
A glyph is a particular representation of a character or part of a
character.
Several representations are possible, mostly depending on the exact font
used at that time. A single glyph can correspond to a sequence of characters,
or a single character to a sequence of glyphs.
The Unicode standard doesn't deal with glyphs, although a suggested
representation is given for each character in the standard. Likewise, this
module doesn't provide any graphical support for Unicode, and will just
deal with textual memory representation and encodings.
Take a look at the @b{GtkAda} library, which provides graphical support
for Unicode in its upcoming 2.0 version.
@c -------------------------------------------------------------------
@node Repertoires and subsets
@section Repertoires and subsets
@c -------------------------------------------------------------------
@noindent
A repertoire is a set of abstract characters to be encoded, normally
a familiar alphabet or symbol set. For instance, the alphabet used to
spell English words and the Russian alphabet are two such repertoires.
There exist two types of repertoires, closed and open ones. The former
is the most common one, and the two examples above are such repertoires.
No character is ever added to them.
Unicode is also a repertoire, but an open one. New entries are
added to it. However, it is guaranteed that none will ever be deleted from it.
Unicode intends to be a universal repertoire, with all possible
characters currently used in the world. It currently contains all the
alphabets, including a number of alphabets associated with dead languages
like hieroglyphs. It also contains a number of often used symbols, like
mathematical signs.
The goal of this Unicode module is to convert all characters to entries in
the Unicode repertoire, so that any applications can communicate with each
other in a portable manner.
Given its size, most applications will only support a subset of Unicode.
Some of the scripts, most notably Arabic and Asian languages, require
special support in the application (right-to-left writing, ...), and thus will
not be supported by some applications.
The Unicode standard includes a set of internal catalogs, called
collections. Each character in these collections is given a special name,
in addition to its code, to improve readability.
Several child packages (@b{Unicode.Names.*}) define those names. For
instance:
@table @b
@item Unicode.Names.Basic_Latin
This contains the basic characters used in most western European languages,
including the standard ASCII subset.
@item Unicode.Names.Cyrillic
This contains the Russian alphabet.
@item Unicode.Names.Mathematical_Operators
This contains several mathematical symbols.
@end table
More than 80 such packages exist.
@c -------------------------------------------------------------------
@node Character sets
@section Character sets
@c -------------------------------------------------------------------
@noindent
A character set is a mapping from a set of abstract characters to some
non-negative integers. The integer associated with a character is called
its code point, and the character itself is called the encoded character.
There exist a number of standard character sets, unfortunately not compatible
with each other. For instance, ASCII is one of these character sets, and
contains 128 characters. A super-set of it is the ISO/8859-1 character set.
Another character set is the JIS X 0208, used to encode Japanese characters.
Note that a character set is different from a repertoire. For instance, the
same character C with cedilla doesn't have the same integer value in the
ISO/8859-1 character set and in an EBCDIC-based character set.
Unicode is also such a character set; it contains all the possible
characters and associates a standard integer with each of them. A similar and
fully compatible character set is ISO/10646. The only addition that Unicode
makes over ISO/10646 is that it also specifies algorithms for rendering
presentation forms of some scripts (say Arabic), handling of bi-directional
texts that mix for instance Latin and Hebrew, algorithms for sorting and
string comparison, and much more.
Currently, our Unicode package doesn't include any support for these
algorithms.
Unicode and ISO 10646 formally define a 31-bit character set. However,
of this huge code space, characters have so far been assigned only to
the first 65534 positions (0x0000 to 0xFFFD). The characters that are
expected to be encoded outside the 16-bit range all belong to rather
exotic scripts (e.g., Hieroglyphics) that are only used by specialists
for historic and scientific purposes.
The Unicode module contains a set of packages to provide conversion from some
of the most common character sets to and from Unicode. These are the
@b{Unicode.CCS.*} packages.
All these packages have a common structure:
@enumerate
@item They define a global variable of type @code{Character_Set} with two
fields, i.e. the two conversion functions between the given character set and
Unicode.
These functions convert one character (actually its code point) at a time.
@item They also define a number of standard names associated with this
character set. For instance, the ISO/8859-1 set is also known as Latin1.
The function @code{Unicode.CCS.Get_Character_Set} can be used to find a
character set by its standard name.
@end enumerate
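For instance, converting a single code point from Latin1 to Unicode might
look like the following sketch. The exact component names of
@code{Character_Set} (here assumed to be @code{To_Unicode}) and the profile
of @code{Get_Character_Set} are assumptions; refer to the
@file{unicode-ccs.ads} specification for the actual declarations.
@smallexample
with Unicode;     use Unicode;
with Unicode.CCS; use Unicode.CCS;

procedure Convert_One_Char is
   --  Look the character set up by one of its standard names
   Latin1 : constant Character_Set := Get_Character_Set ("Latin1");

   --  Convert a single Latin1 code point (16#E7#, a 'c' with cedilla)
   --  to the corresponding Unicode code point.  The component name
   --  To_Unicode is an assumption.
   Code : constant Unicode_Char := Latin1.To_Unicode (16#E7#);
begin
   null;  --  Code now holds the Unicode code point of that character
end Convert_One_Char;
@end smallexample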
Currently, the following sets are supported:
@table @b
@item ISO/8859-1 aka Latin1
This is the standard character set used to represent most Western
European languages including: Albanian, Catalan, Danish, Dutch, English,
Faroese, Finnish, French, Galician, German, Irish, Icelandic, Italian,
Norwegian, Portuguese, Spanish and Swedish.
@item ISO/8859-2 aka Latin2
This character set supports the Slavic languages of Central Europe
which use the Latin alphabet. The ISO-8859-2 set is used for the following
languages: Czech, Croat, German, Hungarian, Polish, Romanian, Slovak and
Slovenian.
@item ISO/8859-3
This character set is used for Esperanto, Galician, Maltese and Turkish.
@item ISO/8859-4
Some letters were added to the ISO-8859-4 to support languages such as
Estonian, Latvian and Lithuanian. It is an incomplete precursor of the
Latin 6 set.
@end table
@c -------------------------------------------------------------------
@node Character encoding schemes
@section Character encoding schemes
@c -------------------------------------------------------------------
@noindent
We now know how each encoded character can be represented by an integer
value (code point) depending on the character set.
Character encoding schemes deal with the mapping of a sequence of such
integers to a sequence of code units. A code unit is the unit of storage
(one or more bytes) used on a given computer architecture.
There exists a number of possible encoding schemes. Some of them encode
all integers on the same number of bytes. They are called fixed-width
encoding forms, and include the standard encoding for Internet emails
(@b{7bits}, but it can't encode all characters), as well as the simple
@b{8bits} scheme, or the @b{EBCDIC} scheme. Among them is also the
@b{UTF-32} scheme which is defined in the Unicode standard.
Another set of encoding schemes encode integers on a variable number of
bytes. These include two schemes that are also defined in the Unicode
standard, namely @b{Utf-8} and @b{Utf-16}.
Unicode doesn't impose any specific encoding. However, it is most often
associated with one of the Utf encodings. They each have their own
properties and advantages:
@table @b
@item Utf32
This is the simplest of all these encodings. It simply encodes all the
characters on 32 bits (4 bytes). This encodes all the possible characters
in Unicode, and is obviously straightforward to manipulate. However, given
that the first 65535 characters in Unicode are enough to encode all known
languages currently in use, Utf32 is also a waste of space in most cases.
@item Utf16
For the above reason, Utf16 was defined. Most characters are only encoded
on two bytes (which is enough for the first 65535 and most current
characters). In addition, a number of special code points have been
defined, known as @i{surrogate pairs}, that make the encoding of integers
greater than 65535 possible. The integers are then encoded on four bytes.
Utf16 is thus much more memory-efficient than Utf32 for encoding sequences
of characters. However, it is also more complex to decode.
@item Utf8
This is an even more space-efficient encoding, but it is also more complex
to decode. More importantly, it is backward compatible with 7-bit ASCII,
the most widely used simple encoding.
Utf8 has the following properties:
@itemize
@item Characters 0 to 127 (ASCII) are encoded simply as a single byte.
This means that files and strings which contain only 7-bit ASCII
characters have the same encoding under both ASCII and UTF-8.
@item Characters greater than 127 are encoded as a sequence of several
bytes, each of which has the most significant bit set. Therefore,
no ASCII byte can appear as part of any other character.
@item The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD and it indicates how
many bytes follow for this character. All further bytes in a
multibyte sequence are in the range 0x80 to 0xBF. This allows easy
resynchronization and makes the encoding stateless and robust
against missing bytes.
@item UTF-8 encoded characters may theoretically be up to six bytes
long; however, characters in the 16-bit range are encoded in at most
three bytes.
@end itemize
@end table
Note that the Utf encodings above, except for Utf8, have two versions, depending
on the chosen byte order on the machine.
The Ada95 Unicode module provides a set of packages that provide an easy
conversion between all the encoding schemes, as well as basic manipulations
of these byte sequences. These are the @b{Unicode.CES.*} packages.
Currently, four encoding schemes are supported, the three Utf schemes and
the basic 8bit encoding which corresponds to the standard Ada strings.
The module also provides routines to convert from one byte order to another.
The following examples show a possible use of these packages:
@smallexample
Converting a latin1 string coded on 8 bits to a Utf8 latin2 file
involves the following steps:
Latin1 string (bytes associated with code points in Latin1)
| "use Unicode.CES.Basic_8bit.To_Utf32"
v
Utf32 latin1 string (contains code points in Latin1)
| "Convert argument to To_Utf32 should be
v Unicode.CCS.Iso_8859_1.Convert"
Utf32 Unicode string (contains code points in Unicode)
| "use Unicode.CES.Utf8.From_Utf32"
v
Utf8 Unicode string (contains code points in Unicode)
| "Convert argument to From_Utf32 should be
v Unicode.CCS.Iso_8859_2.Convert"
Utf8 Latin2 string (contains code points in Latin2)
@end smallexample
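In Ada, the same chain might look like the following sketch. The exact
profiles of @code{To_Utf32} and @code{From_Utf32} (in particular whether the
conversion functions are passed directly or through @code{'Access}), and the
assumption that the Utf32 and Utf8 string types are subtypes of
@code{String}, should be checked against the @b{Unicode.CES.*} and
@b{Unicode.CCS.*} specifications of your installation.
@smallexample
with Unicode.CES.Basic_8bit;
with Unicode.CES.Utf8;
with Unicode.CCS.Iso_8859_1;
with Unicode.CCS.Iso_8859_2;

procedure Latin1_To_Utf8_Latin2 is
   --  An 8-bit string containing Latin1 code points ("cafe" with an accent)
   Latin1 : constant String := "caf" & Character'Val (16#E9#);

   --  Decode the 8-bit bytes and convert the Latin1 code points to
   --  Unicode code points (first two steps of the diagram above)
   Utf32_Unicode : constant String := Unicode.CES.Basic_8bit.To_Utf32
     (Latin1, Unicode.CCS.Iso_8859_1.Convert'Access);

   --  Convert the Unicode code points to Latin2 and encode the result
   --  as Utf8 (last two steps of the diagram above)
   Utf8_Latin2 : constant String := Unicode.CES.Utf8.From_Utf32
     (Utf32_Unicode, Unicode.CCS.Iso_8859_2.Convert'Access);
begin
   null;  --  Utf8_Latin2 now holds the Utf8-encoded Latin2 string
end Latin1_To_Utf8_Latin2;
@end smallexample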
@c -------------------------------------------------------------------
@node Misc. functions
@section Misc. functions
@c -------------------------------------------------------------------
@noindent
The package @b{Unicode} contains a series of @code{Is_*} functions,
matching the Unicode standard.
@table @b
@item Is_White_Space
Return True if the character argument is a space character, i.e. a space,
horizontal tab, line feed or carriage return.
@item Is_Letter
Return True if the character argument is a letter. This includes the
standard English letters, as well as some less common cases defined in the
standard.
@item Is_Base_Char
Return True if the character is a base character, i.e. a character whose
meaning can be modified with a combining character.
@item Is_Digit
Return True if the character is a digit (numeric character).
@item Is_Combining_Char
Return True if the character is a combining character. Combining characters
are accents or other diacritical marks that are added to the previous
character.
The most important accented characters, like those used in the
orthographies of common languages, have codes of their own in Unicode to
ensure backwards compatibility with older character sets. Accented
characters that have their own code position, but could also be
represented as a pair of another character followed by a combining
character, are known as precomposed characters. Precomposed characters
are available in Unicode for backwards compatibility with older encodings
such as ISO 8859 that had no combining characters. The combining
character mechanism allows accents and other diacritical marks to be added
to any character.
Note however that your application must provide specific support for
combining characters, at least if you want to represent them visually.
@item Is_Extender
True if Char is an extender character.
@item Is_Ideographic
True if Char is an ideographic character. This is defined only for
Asian languages.
@end table
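As a small illustration, and assuming the @code{Is_*} functions take an
argument of type @code{Unicode.Unicode_Char}, classifying a couple of code
points could be done as follows:
@smallexample
with Ada.Text_IO; use Ada.Text_IO;
with Unicode;     use Unicode;

procedure Classify is
   C : constant Unicode_Char := Character'Pos ('A');  --  code point 16#41#
begin
   if Is_Letter (C) and then Is_Base_Char (C) then
      Put_Line ("'A' is a letter and a base character");
   end if;
   if Is_White_Space (Character'Pos (' ')) then
      Put_Line ("the space character is white space");
   end if;
end Classify;
@end smallexample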
@c -------------------------------------------------------------------
@node The Input module
@chapter The Input module
@c -------------------------------------------------------------------
@noindent
This module provides a set of packages with a common interface to access
the characters contained in a stream. Various implementations are
provided to access files and manipulate standard Ada strings.
A top-level tagged type is provided that must be extended for the
various streams. It is assumed that the pointer to the current character
in the stream can only go forward, and never backward. As a result, it
is possible to implement this package for sockets or other streams where
it isn't even possible to go backward. This also means that no buffering
is needed in such cases, and thus that memory-efficient readers can be
provided.
Two predefined readers are available, namely @code{String_Input} to read
characters from a standard Ada string, and @code{File_Input} to read
characters from a standard text file.
They all provide the following primitive operations:
@table @code
@item Open
Although this operation isn't exactly overridden, since its parameters
depend on the type of stream you want to read from, it is nice to
use a standard name for this constructor.
@item Close
This terminates the stream reader and frees any associated memory. It
is no longer possible to read from the stream afterwards.
@item Next_Char
Return the next Unicode character in the stream. Note that this character
is not necessarily encoded as a single byte; that depends on the encoding
chosen for the stream (see the Unicode module documentation for more
information).
The next time this function is called, it returns the following character
from the stream.
@item Eof
This function should return True when the reader has already returned the
last character from the stream. Note that it is not guaranteed that a second
call to Eof will also return True.
@end table
It is the responsibility of this stream reader to correctly call the
decoding functions in the unicode module so as to return one single
valid unicode character. No further processing is done on the result
of @code{Next_Char}. Note that the standard @code{File_Input} and
@code{String_Input} streams can automatically detect the encoding to
use for a file, based on a header read directly from the file.
Based on the first four bytes of the stream (assuming this is valid
XML), they will automatically detect whether the file was encoded as
Utf8, Utf16,... If you are writing your own input streams, consider
adding this automatic detection as well.
However, it is always possible to override the default through a call to
@code{Set_Encoding}. This allows you to specify both the character set
(Latin1, ...) and the character encoding scheme (Utf8,...).
The user is also encouraged to set the identifiers for the stream they
are parsing, through calls to @code{Set_System_Id} and
@code{Set_Public_Id}. These are used when reporting error messages.
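As an illustration, the following sketch reads every character of an XML
file through a @code{File_Input}. The package name @b{Input_Sources.File}
and the exact subprogram profiles (in particular the fact that
@code{Next_Char} is written here as a procedure with an out parameter) are
assumptions; check the specifications installed with the library.
@smallexample
with Unicode;            use Unicode;
with Input_Sources.File; use Input_Sources.File;

procedure Dump_Chars is
   Input : File_Input;
   C     : Unicode_Char;
begin
   Open ("test.xml", Input);   --  also auto-detects the encoding
   --  A call to Set_Encoding could override the detection here
   while not Eof (Input) loop
      Next_Char (Input, C);    --  one full Unicode character
      --  ... process C ...
   end loop;
   Close (Input);
end Dump_Chars;
@end smallexample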
@c -------------------------------------------------------------------
@node The SAX module
@chapter The SAX module
@c -------------------------------------------------------------------
@menu
* SAX Description::
* SAX Examples::
* SAX Parser::
* SAX Handlers::
@end menu
@c -------------------------------------------------------------------
@node SAX Description
@section Description
@c -------------------------------------------------------------------
@noindent
Parsing XML streams can be done with two different methods, each with its
own pros and cons. The simplest, and probably most usual, way to manipulate
XML files is to represent them as a tree and manipulate it through the DOM
interface (see next chapter).
The @b{Simple API for XML} (SAX) is another method that can be used for
parsing. It is based on a callback mechanism, and doesn't store any data in
memory (unless of course you choose to do so in your callbacks). It can thus
be more efficient to use SAX than DOM for some specialized algorithms.
In fact, this whole Ada XML library is based on such a SAX parser, which
creates the DOM tree through callbacks.
Note that this module supports the second release of SAX (SAX2), that fully
supports namespaces as defined in the XML standard.
SAX can also be used in cases where a tree would not be the most efficient
representation for your data. There is no point in building a tree with DOM,
then extracting the data and freeing the memory occupied by the tree. It is
much more efficient to directly store your data through SAX callbacks.
With SAX, you register a number of callback routines that the parser will
call when certain conditions occur.
This documentation is in no way a full documentation on SAX. Instead,
you should refer to the standard itself, available at
@url{http://www.megginson.com/SAX/}.
Some of the more useful callbacks are @code{Start_Document},
@code{End_Document}, @code{Start_Element}, @code{End_Element},
@code{Get_Entity} and @code{Characters}. Most of these are
quite self-explanatory. The @code{Characters} callback is called when
characters outside a tag are parsed.
Consider the following XML file:
@smallexample
<?xml version="1.0"?>
<body>
<h1>Title</h1>
</body>
@end smallexample
The following events would then be generated when this file is parsed:
@smallexample
Start_Document Start parsing the file
Start_Prefix_Mapping (handling of namespaces for "xml")
Start_Prefix_Mapping Parameter is "xmlns"
Processing_Instruction Parameters are "xml" and "version="1.0""
Start_Element Parameter is "body"
Characters Parameter is ASCII.LF & " "
Start_Element Parameter is "h1"
Characters Parameter is "Title"
End_Element Parameter is "h1"
Characters Parameter is ASCII.LF & " "
End_Element Parameter is "body"
End_Prefix_Mapping Parameter is "xmlns"
End_Prefix_Mapping Parameter is "xml"
End_Document End of parsing
@end smallexample
As you can see, there are a number of events even for a very small file.
However, you can easily choose to ignore the events you don't care
about, for instance the ones related to namespace handling.
@c -------------------------------------------------------------------
@node SAX Examples
@section Examples
@c -------------------------------------------------------------------
@noindent
There are several cases where using a SAX parser rather than a DOM
parser would make sense. Here are some examples, although obviously
this doesn't include all the possible cases. These examples are taken
from the documentation of libxml, a GPL C toolkit for manipulating XML files.
@itemize @bullet
@item Using XML files as a database
One of the common usages for XML files is as a kind of basic database.
They obviously provide a strongly structured format, and you could for
instance store a series of numbers with the following format:
@smallexample
<array> <value>1</value> <value>2</value> ....</array>
@end smallexample
In this case, rather than reading this file into a tree, it would obviously
be easier to manipulate it through a SAX parser, that would directly create
a standard Ada array while reading the values.
This can be extended to much more complex cases that would map to Ada
records for instance.
@item Large repetitive XML files
Sometimes we have XML files with many subtrees of the same format
describing different things. An example of this is an index file for a
documentation similar to this one. This contains a lot (maybe thousands)
of similar entries, each containing for instance the name of the symbol
and a list of locations.
If the user is looking for a specific entry, there is no point in loading
the whole file in memory and then traversing the resulting tree. The memory
usage increases very quickly with the size of the file, and this might even
be unfeasible for a 35 megabyte file.
@item Simple XML files
Even for simple XML files, it might make sense to use a SAX parser. For
instance, if there are some known constraints in the input file, say
there are no attributes for elements, you can save quite a lot of memory,
and maybe time, by rebuilding your own tree rather than using the full
DOM tree.
@end itemize
However, there are also a number of drawbacks to using SAX:
@itemize @bullet
@item SAX parsers generally require you to write a little bit more code than
the DOM interface does
@item There is no easy way to write the XML data back to a file, unless you
build your own internal tree to save the XML.
As a result, SAX is probably not the best interface if you want to load,
modify and dump back an XML file.
Note however that in this Ada implementation, the DOM tree is built through
a set of SAX callbacks anyway, so you do not lose any power or speed by using
SAX.
@end itemize
@c -------------------------------------------------------------------
@node SAX Parser
@section The SAX parser
@c -------------------------------------------------------------------
@noindent
The basic component of the SAX module is the @b{SAX.Readers} package. It
defines a tagged type, called @code{Reader}, that represents the SAX
parser itself.
Several features are defined in the SAX standard for the parsers. They
indicate which behavior can be expected from the parser. The package
@code{SAX.Readers} defines a number of constant strings for each of
these features. Some of these features are read-only, whereas others can
be modified by the user to adapt the parser. See the @code{Set_Feature}
and @code{Get_Feature} subprograms for how to manipulate them.
The main primitive operation for the parser is @code{Parse}. It takes
an input stream as argument, associated with some XML data, and then
parses it and calls the appropriate callbacks. It returns once there are
no more characters left in the stream.
Several other primitive subprograms are defined for the parser, which are
called the @b{callbacks}. They get called automatically by the @code{Parse}
procedure when some events are seen.
As a result, you should always override at least some of these subprograms
to get something done. The default implementation for these is to do nothing,
except for the error handler, which raises Ada exceptions appropriately.
An example of such an implementation of a SAX parser is available in the
DOM module, and it creates a tree in memory. As you will see if you look at
the code, the callbacks are actually very short.
Note that internally, all the strings are encoded with a single character
encoding scheme, which is defined in the file @file{sax-encodings.ads}. The input
stream is converted on the fly to this internal encoding, and all the
subprograms from then on will receive and pass parameters with this new
encoding. You can of course freely change the encoding defined in the file
@file{sax-encodings.ads}.
The encoding used for the input stream is either automatically
detected by the stream itself (@pxref{The Input module}), or by parsing the
@smallexample
<?xml version='1.0' encoding='UTF-8' ?>
@end smallexample
processing instruction at the beginning of the document. The list of
supported encodings is the same as for the Unicode module (@pxref{The
Unicode module}).
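Putting this together, driving the parser might look like the following
sketch. It assumes a package @code{My_Readers} declaring a @code{My_Reader}
type derived from @code{Reader} (a sketch of such a package is given at the
end of the next section); the package name @b{Input_Sources.File} and the
exact profile of @code{Parse} are also assumptions.
@smallexample
with My_Readers;          --  hypothetical reader extension, see next section
with Input_Sources.File;  use Input_Sources.File;

procedure Run_Sax is
   Parser : My_Readers.My_Reader;
   Input  : File_Input;
begin
   Open ("test.xml", Input);
   --  Features could be adjusted here with Set_Feature before parsing
   My_Readers.Parse (Parser, Input);
   Close (Input);
end Run_Sax;
@end smallexample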
@c -------------------------------------------------------------------
@node SAX Handlers
@section The SAX handlers
@c -------------------------------------------------------------------
@noindent
We do not intend to document the whole set of possible callbacks associated
with a SAX parser. These are all fully documented in the standard itself, and
there is little point in duplicating this information.
However, here is a list of the most frequently used callbacks, that you will
probably need to override in most of your applications.
@table @code
@item Start_Document
This callback, which doesn't receive any parameters, is called once, just before
parsing the document. It should generally be used to initialize internal
data needed later on. It is also guaranteed to be called only once per input
stream.
@item End_Document
This one is the reverse of the previous one, and will also be called only
once per input stream. It should be used to release the memory you have
allocated in Start_Document.
@item Start_Element
This callback is called every time the parser encounters the start of an
element in the XML file. It is passed the name of the element, as well as
the relevant namespace information. The attributes defined in this element
are also passed as a list. Thus, you get all the required information for
this element in a single function call.
@item End_Element
This is the opposite of the previous callback, and will be called once per
element. Calls to @code{Start_Element} and @code{End_Element} are guaranteed
to be properly nested (i.e. you can't see the end of an element before seeing
the end of all its nested children).
@item Characters and Ignorable_Whitespace
These procedures are called every time characters that are not part of an
element declaration are encountered. The characters themselves are passed as
an argument to the callback. Note that white spaces (and tabulations)
are reported separately, in the @code{Ignorable_Whitespace} callback, in case
the XML attribute @code{xml:space} was set to something other than
@code{preserve} for this element.
@end table
You should compile and run the @file{testsax} executable found in this
module to visualize the SAX events that are generated for a given XML file.
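As a minimal sketch, a reader that only prints element names and character
data could look like the following package. The callback profiles (the
@code{Byte_Sequence} parameters, assumed here to be a subtype of
@code{String}, and the @code{Sax.Attributes} type) are assumptions modelled
on the SAX2 interface; check @file{sax-readers.ads} for the actual
declarations.
@smallexample
with Ada.Text_IO;    use Ada.Text_IO;
with Sax.Readers;    use Sax.Readers;
with Sax.Attributes;
with Unicode.CES;    use Unicode.CES;

package My_Readers is
   type My_Reader is new Reader with null record;

   --  Called for every start tag
   procedure Start_Element
     (Handler       : in out My_Reader;
      Namespace_URI : Byte_Sequence := "";
      Local_Name    : Byte_Sequence := "";
      Qname         : Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class);

   --  Called for character data between tags
   procedure Characters
     (Handler : in out My_Reader;
      Ch      : Byte_Sequence);
end My_Readers;

package body My_Readers is
   procedure Start_Element
     (Handler       : in out My_Reader;
      Namespace_URI : Byte_Sequence := "";
      Local_Name    : Byte_Sequence := "";
      Qname         : Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class) is
   begin
      Put_Line ("start of element: " & Local_Name);
   end Start_Element;

   procedure Characters
     (Handler : in out My_Reader;
      Ch      : Byte_Sequence) is
   begin
      Put_Line ("characters: " & Ch);
   end Characters;
end My_Readers;
@end smallexample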
@c -------------------------------------------------------------------
@node The DOM module
@chapter The DOM module
@c -------------------------------------------------------------------
@noindent
A default SAX implementation is provided in the tree_readers file, through
its Parse function. This reads an XML stream and creates a tree in memory.
The tree can then be manipulated through the DOM module.
Note that the encodings.ads file specifies the encoding to use to store
the tree in memory. Full compatibility with the XML standard requires that
this be UTF16; however, it is generally much more memory-efficient for European
languages to use UTF8. You can freely change this and recompile.
What is the Document Object Model?
The Document Object Model is a platform- and language-neutral interface that
will allow programs and scripts to dynamically access and update the content,
structure and style of documents. The document can be further processed and
the results of that processing can be incorporated back into the presented
page.
Why the Document Object Model?
"Dynamic HTML" is a term used by some vendors to describe the combination of
HTML, style sheets and scripts that allows documents to be animated. The W3C
has received several submissions from member companies on the way in which
the object model of HTML documents should be exposed to scripts. These
submissions do not propose any new HTML tags or style sheet technology. The
W3C DOM WG is working hard to make sure interoperable and scripting-language
neutral solutions are agreed upon.
The DOM (Document Object Model) is a set of subprograms to create and
manipulate XML trees in memory.
You can create such a tree through the tree_readers.Parse function.
Only the Core module of the DOM standard is currently implemented, other
modules will follow.
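A minimal sketch of building a tree and getting hold of the document follows.
The names @code{Tree_Reader}, @code{Get_Tree}, @code{DOM.Core.Document} and
@b{Input_Sources.File} are assumptions based on the description above; check
the installed specifications for the actual declarations.
@smallexample
with Input_Sources.File; use Input_Sources.File;
with Tree_Readers;       use Tree_Readers;
with DOM.Core;           use DOM.Core;

procedure Read_Tree is
   Input   : File_Input;
   Builder : Tree_Reader;   --  the SAX reader that builds the DOM tree
   Doc     : Document;
begin
   Open ("test.xml", Input);
   Parse (Builder, Input);     --  reads the stream and builds the tree
   Close (Input);

   Doc := Get_Tree (Builder);  --  assumed accessor returning the tree
   --  Doc can now be manipulated through the DOM Core subprograms
end Read_Tree;
@end smallexample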
@c -------------------------------------------------------------------
@node Using the library
@chapter Using the library
@c -------------------------------------------------------------------
@noindent
XML/Ada is a library. When compiling an application that uses it, you
thus need to specify where the specifications are to be found, as well
as where the libraries are installed.
There are several ways to do it:
@itemize @bullet
@item The simplest is to use the @command{xmlada-config} script, and let it
provide the list of switches for @command{gnatmake}. This is more
convenient on Unix systems, where you can simply compile your application
with
@smallexample
gnatmake main.adb `xmlada-config`
@end smallexample
Note the use of backticks. This means that @command{xmlada-config} is
first executed, and then the command line is replaced with the output of
the script, thus finally executing something like:
@smallexample
gnatmake main.adb -Iprefix/include/xmlada -largs -Lprefix/lib \
-lxmlada_input_sources -lxmlada_sax -lxmlada_unicode -lxmlada_dom
@end smallexample
Unfortunately, this behavior is not available on Windows (unless of course
you use a Unix shell). The simplest in that case is to create a
@file{Makefile}, to be used with the @command{make} command, and copy-paste
the output of @command{xmlada-config} into it.
@command{xmlada-config} has several switches that might be useful:
@enumerate
@item @option{--sax}: If you use this flag, your application will not be
linked against the DOM module. This might save some space, particularly
if linking statically. This also reduces the dependencies on external
tools.
@item @option{--static}: Return the list of flags to use to link your
application statically against Xml/Ada. Your application is then
standalone, and you don't need to distribute XML/Ada at the same time.
@item @option{--static_sax}: Combines both of the above flags.
@end enumerate
@item On Windows systems, you might also simply want to register the library
once and for all in the Windows registry, with the command @command{gnatreg}.
This means that @command{GNAT} will automatically find the installation
directory for XML/Ada.
@item If you are working on a big project, particularly one that includes
sources in languages other than Ada, you generally have to run the three
steps of the compilation process separately (compile, bind and then link).
@command{xmlada-config} can also be used, provided you use one of the
following switches:
@enumerate
@item @option{--cflags}: This returns the compiler flags only, to be used
for instance with @command{gcc}.
@item @option{--libs}: This returns the linker flags only, to be used for
instance with @command{gnatlink}.
@end enumerate
@end itemize
@contents
@bye