\documentclass{howto}
\newcommand{\element}[1]{\code{#1}}
\newcommand{\attribute}[1]{\code{#1}}
\title{Python/XML HOWTO}
\release{0.05}
\author{The Python/XML Special Interest Group}
\authoraddress{\email{xml-sig@python.org}\break (edited by \email{akuchling@acm.org})}
\begin{document}
\maketitle
\begin{abstract}
\noindent
XML is the eXtensible Markup Language, a subset of SGML, intended to
allow the creation and processing of application-specific markup
languages. Python makes an excellent language for processing XML
data. This document is a tutorial for the Python/XML package. It
assumes you're already familiar with the structure and terminology of
XML.
This is a draft document; 'XXX' in the text indicates that something
has to be filled in later, or rewritten, or verified, or something.
\end{abstract}
\tableofcontents
\section{Introduction to XML}
XML, the eXtensible Markup Language, is a simplified dialect of SGML,
the Standard Generalized Markup Language. XML is intended to be
reasonably simple to implement and use, and is already being used for
specifying markup languages for various new standards: MathML for
expressing mathematical equations, SMIL (Synchronized Multimedia
Integration Language) for synchronizing multimedia objects, and so forth.
SGML and XML represent a document by tagging the document's various
components with their function, or meaning. For example, an academic
paper contains several parts: it has a title, one or more authors, an
abstract, the actual text of the paper, a list of references, and so
forth. A markup language for writing such papers would therefore have
tags for indicating what the contents of the abstract are, what the
title is, and so forth. This should not be confused with the physical
details of how the document is actually printed on paper. The
abstract might be printed with narrow margins in a smaller font than
the rest of the document, but the markup usually won't be concerned
with details such as this; other software will translate from the
markup language to a typesetting language such as \TeX, and will
handle the details.
A markup language specified using XML looks a lot like HTML; a
document consists of a single \dfn{element}, which contains
sub-elements, which can have further sub-elements inside them.
Elements are indicated by \dfn{tags} in the text. Tags are always
inside angle brackets \code{<}~\code{>}. There are two forms of
elements. An element can contain content between opening and closing
tags, as in \code{<name>Euryale</name>}, which is a \element{name}
element containing the data \samp{Euryale}. This content may be text
data, other XML elements, or a mixture of both. Elements can also be
empty, in which case they contain nothing, and are represented as a
single tag ended with a slash, as in \code{<stop/>}, which is an empty
\element{stop} element. Unlike HTML, XML element names are
case-sensitive; \element{stop} and \element{Stop} are two different
element types.
Opening and empty tags can also contain attributes, which specify
values associated with an element. For example, in text such as
\code{<name lang='greek'>Herakles</name>}, the \element{name} element
has a \attribute{lang} attribute which has a value of \samp{greek}.
This would contrast with \code{<name lang='latin'>Hercules</name>},
where the attribute's value is \samp{latin}.
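
As a rough illustration of how elements and attributes look from
Python, the following sketch parses a tiny document and reads each
\element{name} element's \attribute{lang} attribute. It uses the
\module{xml.dom.minidom} module found in current Python standard
libraries, which is an assumption on our part rather than part of the
package described in this HOWTO; the \element{names} wrapper element
is made up for the example.

\begin{verbatim}
# Sketch only: relies on the standard library's xml.dom.minidom,
# not on the Python/XML package this HOWTO describes.
from xml.dom.minidom import parseString

doc = parseString("<names><name lang='greek'>Herakles</name>"
                  "<name lang='latin'>Hercules</name><stop/></names>")

pairs = []
for elem in doc.documentElement.getElementsByTagName("name"):
    lang = elem.getAttribute("lang")   # attribute value as a string
    text = elem.firstChild.data        # content lives in a child text node
    pairs.append((lang, text))
    print(lang, text)

# Element names are case-sensitive: there is a <stop/> but no <Stop/>
n_stop = len(doc.getElementsByTagName("stop"))
n_Stop = len(doc.getElementsByTagName("Stop"))
print(n_stop, n_Stop)
\end{verbatim}
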
A given XML language is specified with a Document Type Definition, or
\dfn{DTD}. The DTD declares the element names that are allowed, and
how elements can be nested inside each other. The DTD also specifies
the attributes that can be provided for each element, their default
values, and whether they can be omitted. To take an example
from HTML, the \element{LI} element, representing an entry in a list,
can only occur inside certain elements which represent lists, such as
\element{OL} or \element{UL}. A \dfn{validating parser} can be given
a DTD and a document, and verify whether a given document is legal
according to the DTD's rules, or determine that one or more rules have
been violated.
Applications that process XML can be classed into two types. The
simplest class is an application that only handles one particular
markup language. For example, a chemistry program may only need to
process Chemical Markup Language, but not MathML. This
application can therefore be written specifically for a single DTD,
and doesn't need to be capable of handling multiple markup
languages. This type is simpler to write, and can easily be
implemented with the available Python software.
The second type of application is less common, and has to be able to
handle any markup language you throw at it. An example might be a
smart XML editor that helps you to write XML that conforms to a
selected DTD; it might do so by not letting you enter an element where
it would be illegal, or by suggesting elements that can be placed at
the current cursor location. Such an application needs to handle any
possible XML-defined markup, and therefore must be able to obtain a
data structure embodying the DTD in use. XXX This type of application
can't currently be implemented in Python without difficulty (XXX but
wait and see if a DTD module is included...)
For the full details of XML's syntax, the one definitive source is the
XML 1.0 specification, available on the Web at
\url{http://www.w3.org/TR/xml-spec.html}. However, like all
specifications, it's quite formal and isn't intended to be a friendly
introduction or a tutorial. The annotated version of the standard, at
\url{http://www.xml.com/XXX}, is quite helpful in clarifying the
specification's intent. There are also various informal tutorials and
books available to introduce you to XML.
The rest of this HOWTO will assume that you're familiar with the
relevant terminology. Most sections will use XML terms such as
\emph{element} and \emph{attribute}; section~\ref{DOM} on the Document
Object Model will assume that you've read the relevant Working Draft,
and are familiar with things like Iterators and Nodes.
Section~\ref{SAX} does not require that you have experience with the
Java SAX implementations.
\subsection{Related Links}
\section{Installing the XML Toolkit}
Windows users should get the precompiled version at \url{XXX}; Mac
users will use the corresponding precompiled version at \url{XXX}.
Linux users may wish to use either the Debian package from \url{XXX},
or the RPM from \url{XXX}. To compile from source on a \UNIX{} platform,
simply perform the following steps.
\begin{enumerate}
\item Get a copy of the source distribution from \url{http://www.python.org/topics/xml/download.html}. Unpack it with the following command.
\begin{verbatim}
gzip -dc xml-package.tgz | tar -xvf -
\end{verbatim}
\item
Run:
\begin{verbatim}
make -f Makefile.pre.in boot
\end{verbatim}
This creates the \file{Makefile} and \file{config.c} (producing various
other intermediate files in the process), incorporating the values for
\code{sys.prefix}, \code{sys.exec_prefix}
and \code{sys.version} from the installed Python binary. For this to work,
the Python interpreter must be on your path. If this fails, try
\begin{verbatim}
make -f Makefile.pre.in Makefile VERSION=1.5 installdir=<prefix>
\end{verbatim}
where \samp{<prefix>} is the value of \samp{installdir} used when
installing Python. You may also have to set
\samp{exec_installdir} to the value of \samp{exec_prefix}.
\item
Once the Makefile has been constructed, just run \samp{make} to
compile the C modules. There's no test suite yet, but there will be
one someday.
\item
To install the code, run \samp{make install}.
The code will be installed under the \file{site-packages/} directory
as a package named \file{xml/}.
\end{enumerate}
If you have difficulty installing this software, send a problem report
to \email{xml-sig@python.org} describing the problem.
There are various demonstration programs in the \file{demo/} directory
of the source distribution. You may wish to look at them next to get
an impression of what's possible with the XML tools, and as a source
of example code.
% package layout
\subsection{Related Links}
\begin{definitions}
\term{\url{http://www.python.org/topics/xml/}}
%
This is the starting point for Python-related XML topics; it is
updated to refer to all software, mailing lists, documentation, etc.
\end{definitions}
\section{SAX: The Simple API for XML}
\label{SAX}
The Simple API for XML isn't a standard in the formal sense, but an
informal specification designed by David Megginson, with input from
many people on the xml-dev mailing list. SAX defines an event-driven
interface for parsing XML. To use SAX, you must create Python class
instances which implement a specified interface, and the parser will
then call various methods of those objects.
SAX is most suitable for purposes where you want to read through an
entire XML document from beginning to end, and perform some
computation, such as building a data structure representing a
document, or summarizing information in a document (computing an
average value of a certain element, for example). It's not very
useful if you want to modify the document structure in some
complicated way that involves changing how elements are nested, though
it could be used if you simply wish to change element contents or
attributes. For example, you would not want to re-order chapters in a
book using SAX, but you might want to change the contents of any
\element{name} elements with the attribute \attribute{lang} equal to
'greek' into Greek letters.
One advantage of SAX is speed and simplicity. Let's say
you've defined a complicated DTD for listing comic books, and you wish
to scan through your collection and list everything written by Neil
Gaiman. For this specialized task, there's no need to expend effort
examining elements for artists and editors and colourists, because
they're irrelevant to the search. You can therefore write a class
instance which ignores all elements that aren't \element{writer}.
Another advantage is that you don't have the whole document resident
in memory at any one time, which matters if you are processing really
huge documents.
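
To make the ``summarizing'' use case concrete, here is a minimal
sketch that averages the numeric content of a certain element while
streaming through a document. It assumes the \module{xml.sax} module
from current Python standard libraries and its
\class{ContentHandler} class, whose \method{characters()} method
takes a single string argument; the \element{price} element and the
sample data are hypothetical.

\begin{verbatim}
# Sketch only: uses the standard library's xml.sax, and a made-up
# <price> element, to average values without building a tree.
import xml.sax
from xml.sax.handler import ContentHandler

class AveragePrice(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.in_price = False
        self.buffer = ""
        self.total = 0.0
        self.count = 0

    def startElement(self, name, attrs):
        if name == "price":
            self.in_price = True
            self.buffer = ""

    def characters(self, content):
        # Content may arrive in several pieces, so accumulate it
        if self.in_price:
            self.buffer += content

    def endElement(self, name):
        if name == "price":
            self.in_price = False
            self.total += float(self.buffer)
            self.count += 1

sample = (b"<collection><comic><price>1.50</price></comic>"
          b"<comic><price>2.50</price></comic></collection>")
handler = AveragePrice()
xml.sax.parseString(sample, handler)
average = handler.total / handler.count
print(average)
\end{verbatim}

Only the running total and count are kept, so memory use does not
grow with the size of the document.
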
SAX defines four basic interfaces; a SAX-compliant XML parser can be
passed any objects that support these interfaces, and will call
various methods as data is processed. Your task, therefore, is to
implement those interfaces that are relevant to your application.
The SAX interfaces are:
\begin{tableii}{c|p{4in}}{code}{Interface}{Purpose}
\lineii{DocumentHandler}{Called for general document events. This
interface is the heart of SAX; its methods are called for the start of
the document, the start and end of elements, and for the characters of
data contained inside elements.
}
\lineii{DTDHandler}{Called to handle DTD events required for basic
parsing. This means notation declarations (XML spec section XXX) and
unparsed entity declarations (XML spec section XXX).
}
\lineii{EntityResolver}{Called to resolve references to external
entities. If your documents will have no external entity references,
you won't need to implement this interface. }
\lineii{ErrorHandler}{Called for error handling. The parser will call
methods from this interface to report all warnings and errors.}
\end{tableii}
Python doesn't support the concept of interfaces, so the interfaces
listed above are implemented as Python classes. The default method
implementations are defined to do nothing (the method body is just a
Python \code{pass} statement), so usually you can simply ignore methods
that aren't relevant to your application. The one big exception is
the \class{ErrorHandler} interface; if you don't provide methods that
print a message or otherwise take some action, errors in the XML data
will be silently ignored. This is almost certainly \emph{not} what
you want your application to do, so always implement at least the
\method{error()} and \method{fatalError()} methods.
\module{xml.sax.saxutils} provides an \class{ErrorPrinter} class which
sends error messages to standard error, and an \class{ErrorRaiser}
class which raises an exception for any warnings or errors.
Pseudo-code for using SAX looks something like this:
\begin{verbatim}
# Define your specialized handler classes
import sys
from xml.sax import saxlib

class docHandler(saxlib.DocumentHandler):
    ...

# Create an instance of the handler classes
dh = docHandler()

# Create an XML parser
parser = ...

# Tell the parser to use your handler instance
parser.setDocumentHandler(dh)

# Parse the file; your handler's methods will get called
parser.parseFile(sys.stdin)
\end{verbatim}
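
For readers whose Python installation provides only the standard
library's \module{xml.sax} module, the same outline might look
roughly like the sketch below. It is an approximation under stated
assumptions, not the API described in this HOWTO:
\class{ContentHandler} and \method{setContentHandler()} stand in for
\class{DocumentHandler} and \method{setDocumentHandler()}, and the
\class{ElementCounter} class and its input are invented for the
example.

\begin{verbatim}
# Sketch only: standard-library xml.sax names, which differ from
# the saxlib names used in the pseudo-code above.
import io
import xml.sax
from xml.sax.handler import ContentHandler

class ElementCounter(ContentHandler):
    "Count how many times each element name starts"
    def __init__(self):
        ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

# Create the handler and the parser, then associate them
handler = ElementCounter()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Parse from a file-like object; the handler's methods get called
parser.parse(io.BytesIO(b"<a><b/><b/><c/></a>"))
print(handler.counts)
\end{verbatim}
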
\subsection{Starting Out}
Following the earlier example, let's consider a simple XML format for
storing information about a comic book collection. Here's a sample
document for a collection consisting of a single issue:
\begin{verbatim}
<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>
\end{verbatim}
An XML document must have a single root element; this is the
\samp{collection} element. It has one child \element{comic} element
for each issue; the book's title and number are given as attributes of
the \element{comic} element, which can have one or more children
containing the issue's writer and artists. There may be several
artists or writers for a single issue.
Let's start off with something simple: a document handler named
\class{FindIssue} that reports whether a given issue is in the
collection.
\begin{verbatim}
from xml.sax import saxlib

class FindIssue(saxlib.HandlerBase):
    def __init__(self, title, number):
        self.search_title, self.search_number = title, number
\end{verbatim}
The \class{HandlerBase} class inherits from all four interfaces:
\class{DocumentHandler}, \class{DTDHandler}, \class{EntityResolver},
and \class{ErrorHandler}. This is what you should use if you
want to use one class for everything. When you want separate classes
for each purpose, you can just subclass each interface individually.
Neither of the two approaches is always ``better'' than the other;
their suitability depends on what you're trying to do, and on what you
prefer.
Since this class is doing a search, an instance needs to know what to
search for. The desired title and issue number are passed to the
\class{FindIssue} constructor, and stored as part of the instance.
Now let's look at the function which actually does all the work.
This simple task only requires looking at the attributes of a given
element, so only the \method{startElement} method is relevant.
\begin{verbatim}
    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return

        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            print title, '#'+str(number), 'found'
\end{verbatim}
The \method{startElement()} method is passed a string giving the name
of the element, and an instance containing the element's attributes.
The latter implements the \class{AttributeList} interface, which
includes most of the semantics of Python dictionaries. Therefore, the
function looks for \element{comic} elements, and compares the
specified \attribute{title} and \attribute{number} attributes to the
search values. If they match, a message is printed out.
\method{startElement()} is called for every single element in the
document. If you added \code{print 'Starting element:', name} to the
top of \method{startElement()}, you would get the following output.
\begin{verbatim}
Starting element: collection
Starting element: comic
Starting element: writer
Starting element: penciller
Starting element: penciller
\end{verbatim}
To actually use the class, we need top-level code that creates
instances of a parser and of \class{FindIssue}, associates them, and
then calls a parser method to process the input.
\begin{verbatim}
import sys
from xml.sax import saxexts

if __name__ == '__main__':
    # Create a parser
    parser = saxexts.make_parser()
    # Create the handler
    dh = FindIssue('Sandman', '62')
    # Tell the parser to use our handler
    parser.setDocumentHandler(dh)
    # Parse the input, read here from standard input
    parser.parseFile(sys.stdin)
\end{verbatim}
The \class{ParserFactory} class can automate the job of creating
parsers. There are already several XML parsers available to Python,
and more might be added in future. \file{xmllib.py} is included with
Python 1.5, so it's always available, but it's also not particularly
fast. A faster version of \file{xmllib.py} is included in
\module{xml.parsers}. The \module{pyexpat} module is faster still, so
it's obviously a preferred choice if it's available.
\class{ParserFactory}'s \method{make_parser} method determines
which parsers are available and chooses the fastest one, so you don't
have to know what the different parsers are, or how they differ. (You
can also tell \method{make_parser} to use a given parser, if you want
to use a specific one.)
Once you've created a parser instance, calling
\method{setDocumentHandler} tells the parser what to use as the handler.
If you run the above code with the sample XML document, it'll output
\code{Sandman \#62 found}.
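
Putting the pieces together, a self-contained version of the search
might look like the following sketch. It assumes the standard
library's \module{xml.sax} module rather than the \module{saxlib}
and \module{saxexts} modules used above, so the base class is
\class{ContentHandler} and the handler is registered with
\method{setContentHandler()}; the \code{found} and \code{seen}
attributes are additions made only so the result can be inspected.

\begin{verbatim}
# Sketch only: the FindIssue idea on top of standard-library xml.sax.
import io
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""

class FindIssue(ContentHandler):
    def __init__(self, title, number):
        ContentHandler.__init__(self)
        self.search_title, self.search_number = title, number
        self.found = False
        self.seen = []      # element names, in document order

    def startElement(self, name, attrs):
        self.seen.append(name)
        # If it's not a comic element, ignore it
        if name != "comic":
            return
        # attrs supports dictionary-style get()
        title = attrs.get("title")
        number = attrs.get("number")
        if title == self.search_title and number == self.search_number:
            self.found = True
            print(title, "#" + number, "found")

parser = xml.sax.make_parser()
dh = FindIssue("Sandman", "62")
parser.setContentHandler(dh)
parser.parse(io.BytesIO(SAMPLE))
\end{verbatim}
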
\subsection{Error Handling}
Now, try running the above code with this file as input:
\begin{verbatim}
<collection>
&foo;
<comic title="Sandman" number='62'>
</collection>
\end{verbatim}
The \code{\&foo;} entity is unknown, and the \element{comic} element
isn't closed (if it were empty, there would be a \samp{/} before the
closing \samp{>}). Why did the file get processed without complaint?
Because the default code for the \class{ErrorHandler} interface does
nothing, and no different implementation was provided, so the errors
are silently ignored.
The \class{ErrorRaiser} class automatically raises an exception for
any error; you'll usually set an instance of this class as the error
handler. Otherwise, you should provide your own version of the
\class{ErrorHandler} interface, and at minimum override the
\method{error()} and \method{fatalError()} methods. The minimal
implementation for each method can be a single line. The three methods
in the \class{ErrorHandler} interface (\method{warning()},
\method{error()}, and \method{fatalError()}) are all passed a single
argument, an exception instance. The exception will always be a subclass of
\exception{SAXException}, and calling \code{str()} on it will produce
a readable error message explaining the problem.
So, to re-implement a variant of \class{ErrorRaiser}, simply define
two of the three methods to raise the exception they're passed:
\begin{verbatim}
    def error(self, exception):
        raise exception

    def fatalError(self, exception):
        raise exception
\end{verbatim}
\method{warning()} might simply print the exception to \code{sys.stderr}
and return without raising the exception. Now the same incorrect XML
file will cause a traceback to be printed, with the error message
``xml.sax.saxlib.SAXException: reference to unknown entity''.
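
The following sketch shows one way this could look as a complete
program. It assumes the standard library's \module{xml.sax} module,
whose \class{ErrorHandler} class and \exception{SAXParseException}
exception are used here in place of the \module{saxlib} versions; the
\class{RaisingErrorHandler} name is made up, and the exact wording of
the error message depends on the underlying parser.

\begin{verbatim}
# Sketch only: an error handler that raises, using standard-library
# xml.sax names rather than the saxlib ones described above.
import sys
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler

BROKEN = b"""<collection>
  &foo;
  <comic title="Sandman" number='62'>
</collection>"""

class RaisingErrorHandler(ErrorHandler):
    def warning(self, exception):
        # Report the warning, but keep parsing
        print(exception, file=sys.stderr)

    def error(self, exception):
        raise exception

    def fatalError(self, exception):
        raise exception

caught = None
try:
    xml.sax.parseString(BROKEN, ContentHandler(), RaisingErrorHandler())
except xml.sax.SAXParseException as exc:
    caught = exc
    print("Parse failed:", exc)
\end{verbatim}
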
\subsection{Searching Element Content}
Let's tackle a slightly more complicated task, printing out all issues
written by a certain author. This now requires looking at element
content, because the writer's name is inside a \element{writer}
element: \code{<writer>Peter Milligan</writer>}.
The search will be performed using the following algorithm:
\begin{enumerate}
\item
The \method{startElement} method will be more complicated. For
\element{comic} elements, the handler has to save the title and
number, in case this comic is later found to match the search
criterion. For \element{writer} elements, it sets an
\code{inWriterContent} flag to true, and sets a \code{writerName}
attribute to the empty string.
\item Characters outside of XML tags must be processed. When
\code{inWriterContent} is true, these characters must be added to the
\code{writerName} string.
\item When the \element{writer} element is finished, we've now
collected all of the element's content in the \code{writerName}
attribute, so we can check if the name matches the one we're searching
for, and if so, print the information about this comic. We must also
set \code{inWriterContent} back to false.
\end{enumerate}
Here's the first part of the code; this implements step 1.
\begin{verbatim}
from xml.sax import saxlib
import string

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return string.join( string.split(text), ' ')

class FindWriter(saxlib.HandlerBase):
    def __init__(self, search_name):
        # Save the name we're looking for
        self.search_name = normalize_whitespace( search_name )

        # Initialize the flag to false
        self.inWriterContent = 0

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            title = normalize_whitespace( attrs.get('title', "") )
            number = normalize_whitespace( attrs.get('number', "") )
            self.this_title = title
            self.this_number = number

        # If it's the start of a writer element, set flag
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""
\end{verbatim}
The \method{startElement()} method has been discussed previously. Now
we have to look at how the content of elements is processed.
The \function{normalize_whitespace()} function is important, and
you'll probably use it in your own code. XML treats whitespace very
flexibly; you can include extra spaces or newlines wherever you like.
This means that you must normalize the whitespace before comparing
attribute values or element content; otherwise the comparison might
produce a wrong result due to the content of two elements having
different amounts of whitespace.
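
In current Python versions the same idea is usually written with
string methods instead of the \module{string} module functions. The
sketch below is such a variant, with made-up input strings, showing
that two differently spaced names compare equal once normalized.

\begin{verbatim}
# Sketch only: normalize_whitespace() written with string methods.
def normalize_whitespace(text):
    "Collapse runs of whitespace to single spaces and trim the ends"
    return " ".join(text.split())

a = normalize_whitespace("Neil   Gaiman")
b = normalize_whitespace("\n  Neil\n  Gaiman  ")
print(a == b, repr(b))
\end{verbatim}
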
\begin{verbatim}
    def characters(self, ch, start, length):
        if self.inWriterContent:
            self.writerName = self.writerName + ch[start:start+length]
\end{verbatim}
The \method{characters()} method is called for characters that aren't
inside XML tags. \var{ch} is a string of characters, and \var{start}
is the point in the string where the characters
start. \var{length} is the length of the character data. You should
not assume that \var{start} is equal to 0, or that all of \var{ch} is
the character data. An XML parser could be implemented to read the
entire document into memory as a string, and then operate by indexing
into the string. This would mean that \var{ch} would always contain
the entire document, and only the values of \var{start} and
\var{length} would be changed.
You also shouldn't assume that all the characters are passed in a
single function call. In the example above, there might be only one
call to \method{characters()} for the string \samp{Peter Milligan}, or
it might call \method{characters()} once for each character. More
realistically, if the content contains an entity reference, as in
\samp{Wagner \&amp; Seagle}, the parser might call the method three
times: once for \samp{Wagner\ }, once for \samp{\&}, represented by the
entity reference, and again for \samp{\ Seagle}.
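
You can watch this happen with a handler that simply records every
piece it is given. The sketch below assumes the standard library's
\module{xml.sax} module, where \method{characters()} receives one
string argument instead of \var{ch}, \var{start}, and \var{length};
the \class{ChunkRecorder} class is made up for the example. How many
pieces arrive is up to the parser, so the code only relies on their
concatenation.

\begin{verbatim}
# Sketch only: record each characters() call and join the pieces.
import xml.sax
from xml.sax.handler import ContentHandler

class ChunkRecorder(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

recorder = ChunkRecorder()
xml.sax.parseString(b"<writer>Wagner &amp; Seagle</writer>", recorder)

# The number of chunks may vary from parser to parser
print(recorder.chunks)

# Only the joined text is dependable
name = "".join(recorder.chunks)
print(name)
\end{verbatim}
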
For step 2 of \class{FindWriter}, \method{characters()} only has to
check \code{inWriterContent}, and if it's true, add the characters to
the string being built up.
Finally, when the \element{writer} element ends, the entire name has
been collected, so we can compare it to the name we're searching for.
\begin{verbatim}
    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number
\end{verbatim}
This is an unrealistically simple comparison: after whitespace
normalization the two names must match exactly, including case, but
it's good enough for an example.
End tags can't have attributes on them, so there's no \var{attrs}
parameter. Empty elements with attributes, such as \samp{<arc
name="Season of Mists"/>}, will result in a call to
\method{startElement()}, followed immediately by a call to \method{endElement()}.
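
For reference, the three steps can be assembled into one runnable
program along the following lines. This is a sketch under the
assumption that the standard library's \module{xml.sax} module is
used in place of \module{saxlib}, so \method{characters()} takes a
single argument; the \code{matches} list, the second
\element{comic} entry, and its title are invented so that the result
can be checked.

\begin{verbatim}
# Sketch only: the FindWriter steps on standard-library xml.sax.
import xml.sax
from xml.sax.handler import ContentHandler

def normalize_whitespace(text):
    return " ".join(text.split())

class FindWriter(ContentHandler):
    def __init__(self, search_name):
        ContentHandler.__init__(self)
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        self.matches = []

    def startElement(self, name, attrs):
        # Step 1: remember the issue, or start collecting a name
        if name == "comic":
            self.this_title = normalize_whitespace(attrs.get("title", ""))
            self.this_number = normalize_whitespace(attrs.get("number", ""))
        elif name == "writer":
            self.inWriterContent = True
            self.writerName = ""

    def characters(self, content):
        # Step 2: accumulate text while inside a writer element
        if self.inWriterContent:
            self.writerName += content

    def endElement(self, name):
        # Step 3: compare the collected name and reset the flag
        if name == "writer":
            self.inWriterContent = False
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                self.matches.append((self.this_title, self.this_number))

SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil
       Gaiman</writer>
  </comic>
  <comic title="Example" number="1">
    <writer>Peter Milligan</writer>
  </comic>
</collection>"""

fw = FindWriter("Neil Gaiman")
xml.sax.parseString(SAMPLE, fw)
print(fw.matches)
\end{verbatim}
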
XXX how are external entities handled? Anything special need to be
done for them?
\subsection{Related Links}
\begin{definitions}
\term{\url{http://www.megginson.com/SAX/}}
%
The SAX home page. This has the most recent copy of the
specification, and lists SAX implementations for various languages and
platforms. At the moment it's somewhat Java-centric.
\end{definitions}
\section{DOM: The Document Object Model}
\label{DOM}
\emph{The Document Object Model is currently at first draft stage, and
isn't even close to being a standard. The Python DOM is therefore not
yet documented. If you want to look at the code and use it anyway,
feel free (report any bugs you find), but be aware that your code may
need to be changed for future DOM drafts.}
The Document Object Model specifies a tree-based representation for an
XML document. A top-level Document instance is the root of the tree,
and has a single child which is the top-level Element instance; this
instance has child nodes representing the content and any
sub-elements. These sub-element nodes can have further children, and
so forth. Functions are defined which let you traverse the resulting
tree any way you like, access element and attribute values, insert and
delete nodes, and convert the tree back into XML.
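
Since the Python DOM described here is not yet documented, the
following sketch uses the \module{xml.dom.minidom} module from current
Python standard libraries instead; treat it as an illustration of the
tree model under that assumption, not as the interface of this
package.

\begin{verbatim}
# Sketch only: walking a small tree with xml.dom.minidom.
from xml.dom.minidom import parseString

doc = parseString("<collection><comic title='Sandman' number='62'>"
                  "<writer>Neil Gaiman</writer></comic></collection>")

root = doc.documentElement     # the top-level Element
comic = root.firstChild        # its first (and only) child node
title = comic.getAttribute("title")
writer = comic.getElementsByTagName("writer")[0].firstChild.data
print(root.tagName, title, writer)

# Convert the tree back into XML text
xml_text = root.toxml()
print(xml_text)
\end{verbatim}
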
The DOM is useful for modifying the tree; you can remove a node from
one place in the tree, and insert it somewhere else. You can also
construct a DOM tree yourself, and convert it to XML; this is often a
more flexible way of producing XML output than simply writing
\code{<tag1>}...\code{</tag1>} to a file.
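
Both operations can be sketched with the standard library's
\module{xml.dom.minidom} module, which is assumed here in place of the
not yet documented Python DOM of this package. The \element{book} and
\element{chapter} elements are made up. Note that the tree takes care
of escaping special characters when it is converted to XML, which is
one reason it can be more convenient than writing tags by hand.

\begin{verbatim}
# Sketch only: moving a node and building a tree with minidom.
from xml.dom.minidom import Document, parseString

doc = parseString("<book><chapter>One</chapter>"
                  "<chapter>Two</chapter></book>")
book = doc.documentElement
first, second = book.getElementsByTagName("chapter")

# Inserting a node that is already in the tree moves it
book.insertBefore(second, first)
reordered = book.toxml()
print(reordered)

# Build a small tree from scratch and convert it to XML
new_doc = Document()
tag1 = new_doc.createElement("tag1")
tag1.appendChild(new_doc.createTextNode("a < b"))
new_doc.appendChild(tag1)
built = tag1.toxml()
print(built)
\end{verbatim}
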
While the DOM doesn't require that the entire tree be resident in
memory at one time, the Python implementation currently keeps the
whole tree in RAM. This means you may not have enough memory to
process very large documents, measuring tens or hundreds of megabytes.
It's possible to write a DOM implementation that stores most of the
tree on disk or in a database, and reads in new sections as they're
accessed, but this hasn't been done yet, and such implementations
often impose limitations on how the tree can be accessed.
%Explanations, sample code, ...
\subsection{Related Links}
\begin{definitions}
\term{\url{http://www.w3.org/DOM/}}
%
The World Wide Web Consortium's DOM page.
\end{definitions}
\section{xmlarch: Architectural Forms}
The xmlarch module contains an XML architectural forms processor
written in Python. It allows you to process XML architectural forms
using any parser that uses the SAX interfaces. The module allows
you to process several architectures in one parsing
pass. Architectural document events for an architecture can even be
broadcast to multiple \class{DocumentHandler} instances (e.g., you can
have two handlers for the RDF architecture, three for the XLink
architecture, and perhaps one for the HyTime architecture).
The architecture processor uses the SAX \class{DocumentHandler} interface,
which means that you can register the architecture handler
(\class{ArchDocHandler}) with any SAX 1.0 compliant parser.
It currently does not process any meta document type definition
documents (meta-DTDs). When a DTD parser module is available the code
will be modified to use that in order to process meta-DTD information.
Please note that validating and well-formed parsers may report
different SAX events when parsing documents.
The \module{xmlarch} module contains six classes:
\class{ArchDocHandler}, \class{Architecture}, \class{ArchParseState},
\class{ArchException}, \class{AttributeParser} and \class{Normalizer}.
\begin{itemize}
\item \class{ArchDocHandler} is a subclass of the \class{saxlib.DocumentHandler}
interface. This is the class used for processing an architectural
document.
\item \class{Architecture} contains information about an architecture.
\item \class{ArchParseState} holds information about an architecture's parse
state when parsing a document.
\item \class{AttributeParser} parses architecture use declaration PIs (attribute
strings).
\item \class{ArchException} holds information about an architectural exception
thrown by an \class{ArchDocHandler} instance.
\item \class{Normalizer} is a document handler that outputs ``normalized'' XML.
\end{itemize}
Using the xmlarch module usually means that you have to do the
following things:
\begin{itemize}
\item Import the required SAX modules: \module{saxexts}, \module{saxlib}, and \module{saxutils}.
\item Import the xmlarch module.
\item Create a SAX compliant parser object.
\item Create an XML architectures processor handler.
\item Register this handler with the parser.
\item Add document handlers for the architectures you want to process.
\item Register a default document handler with the architecture
processor handler.
\item Parse a document.
\end{itemize}
A simple example:
Python code:
\begin{verbatim}
# Import needed modules
from xml.sax import saxexts, saxlib, saxutils
import sys, xmlarch
# Create architecture processor handler
arch_handler = xmlarch.ArchDocHandler()
# Create parser and register architecture processor with it
parser = saxexts.XMLParserFactory.make_parser()
parser.setDocumentHandler(arch_handler)
# Add a document handler to process the html architecture
arch_handler.addArchDocumentHandler("html", xmlarch.Normalizer(sys.stdout))
# Parse (and process) the document
parser.parse("simple.xml")
\end{verbatim}
A sample XML document:
\begin{verbatim}
<?xml version="1.0"?>
<?IS10744:arch name="html"?>
<doc>
<title html="h1">My first architectual document</title>
<author html="address">Geir Ove Gronmo, grove@infotek.no</author>
<para>This is the first paragraph in this document</para>
<para html="p">This is the second paragraph</para>
</doc>
\end{verbatim}
The result:
\begin{verbatim}
<html>
<h1>My first architectual document</h1>
<address>Geir Ove Gronmo, grove@infotek.no</address>
<p>This is the second paragraph</p>
</html>
\end{verbatim}
See also the files \file{simple.py} and \file{simple.xml} in the
\file{demo/arch} directory of the Python/XML distribution.
If you instead process the persons architecture in this document,
you get the following output:
\begin{verbatim}
<persons>
<author>Geir Ove Gronmo</author><mentioned>Eliot Kimber</mentioned><mentioned>David Megginson</mentioned><mentioned>Lars Marius Garshol</mentioned>
</persons>
\end{verbatim}
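The mapping performed in the simple example can be sketched in a few
lines of current Python. The code below is only an illustration under
stated assumptions: it uses the standard \module{xml.sax} module rather
than \module{saxexts} and \module{saxlib}, the class name
\class{MiniArchHandler} and the function \function{process} are
hypothetical, and the rules it applies (the document element takes the
architecture's name, other elements are renamed through the attribute
named after the architecture, and unmapped elements are dropped together
with their text) are inferred from the sample output above, not taken
from the \module{xmlarch} source.

\begin{verbatim}
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

SAMPLE = b"""<?xml version="1.0"?>
<?IS10744:arch name="html"?>
<doc>
<title html="h1">My first architectual document</title>
<para>This is the first paragraph in this document</para>
<para html="p">This is the second paragraph</para>
</doc>
"""

class MiniArchHandler(xml.sax.ContentHandler):
    """Hypothetical sketch: forward only elements mapped to one architecture."""

    def __init__(self, arch_name, out):
        super().__init__()
        self.arch_name = arch_name
        self.out = out              # handler receiving the architectural events
        self.declared = False       # True once the use declaration was seen
        self.open_elements = []     # architectural name (or None) per open element

    def processingInstruction(self, target, data):
        # Crude check for the architecture use declaration PI.
        if target == "IS10744:arch" and 'name="%s"' % self.arch_name in data:
            self.declared = True

    def startElement(self, name, attrs):
        mapped = None
        if self.declared:
            if not self.open_elements:
                mapped = self.arch_name            # document element
            else:
                mapped = attrs.get(self.arch_name)  # e.g. html="h1"
        self.open_elements.append(mapped)
        if mapped is not None:
            self.out.startElement(mapped, AttributesImpl({}))

    def endElement(self, name):
        mapped = self.open_elements.pop()
        if mapped is not None:
            self.out.endElement(mapped)

    def characters(self, content):
        # Text is kept only when its enclosing element is mapped.
        if self.open_elements and self.open_elements[-1] is not None:
            self.out.characters(content)

def process(document, arch_name):
    """Parse document and return the architectural document as a string."""
    out = io.StringIO()
    xml.sax.parseString(document, MiniArchHandler(arch_name, XMLGenerator(out)))
    return out.getvalue()

print(process(SAMPLE, "html"))
\end{verbatim}

Because the sample declares only the html architecture, asking this sketch
for any other architecture name produces no output at all.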
A more complex example:
Python code:
\begin{verbatim}
# Import needed modules
from xml.sax import saxexts, saxlib, saxutils
import sys, xmlarch
# Create architecture processor handler
arch_handler = xmlarch.ArchDocHandler()
# Create parser and register architecture processor with it
parser = saxexts.XMLParserFactory.make_parser()
parser.setDocumentHandler(arch_handler)
# Add document handlers to process the html and biblio architectures
arch_handler.addArchDocumentHandler(
    "html", xmlarch.Normalizer(open("html.out", "w")))
arch_handler.addArchDocumentHandler(
    "biblio", saxutils.ESISDocHandler(open("biblio1.out", "w")))
arch_handler.addArchDocumentHandler(
    "biblio", saxutils.Canonizer(open("biblio2.out", "w")))
# Register a default document handler that just passes through any
# incoming events
arch_handler.setDefaultDocumentHandler(xmlarch.Normalizer(sys.stdout))
# Parse (and process) the document
parser.parse("complex.xml")
\end{verbatim}
Because this example produces a lot of output, I have not included the
XML document and the results. See instead the files \file{complex.py} and
\file{complex.xml} in the \file{demo/arch} directory of the Python/XML
distribution and try it yourself.
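The complex example registers two handlers for the biblio architecture,
so the same stream of events has to reach both of them. One way such a
fan-out could work is sketched below in current Python with the standard
\module{xml.sax} module. The classes \class{TeeHandler} and
\class{ElementCounter} are hypothetical names introduced for this
example, and nothing here is taken from the \module{xmlarch} source.

\begin{verbatim}
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class TeeHandler(xml.sax.ContentHandler):
    """Hypothetical helper: forward element and text events to several handlers."""

    def __init__(self, *handlers):
        super().__init__()
        self.handlers = handlers

    def startElement(self, name, attrs):
        for handler in self.handlers:
            handler.startElement(name, attrs)

    def endElement(self, name):
        for handler in self.handlers:
            handler.endElement(name)

    def characters(self, content):
        for handler in self.handlers:
            handler.characters(content)

class ElementCounter(xml.sax.ContentHandler):
    """Second consumer: counts how often each element type occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

out = io.StringIO()
counter = ElementCounter()
xml.sax.parseString(b"<bib><entry>A</entry><entry>B</entry></bib>",
                    TeeHandler(XMLGenerator(out), counter))
print(out.getvalue())
print(counter.counts)
\end{verbatim}

Each handler sees the same events and stays unaware of the others, which
is why one architecture can be written out in several formats during a
single parse.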
\subsection{Related Links}
\section{Glossary}
XML has given rise to a sea of acronyms and terms. This section will
list the most significant terms, and sketch their relevance.
Many of the following definitions are taken from Lars Marius Garshol's
SGML glossary, at \url{http://www.stud.ifi.uio.no/\~larsga/download/diverse/sgmlglos.html}.
\begin{definitions}
\term{DOM (Document Object Model)}
%
The Document Object Model is intended to be a platform- and
language-neutral interface that will allow programs and scripts to
dynamically access and update the content, structure and style of
documents. Documents will be represented as tree structures which can
be traversed and modified.
\term{DTD (Document Type Definition)}
%
A Document Type Definition (nearly always called DTD) defines
an XML document type: its element types, their attributes,
and its entities.
In other words: a DTD completely describes one particular kind
of XML document, such as, for instance, HTML 3.2.
\term{SAX (Simple API for XML)}
%
SAX is a simple standardized API for XML parsers developed by the
contributors to the xml-dev mailing list. The interface is mostly
language-independent, as long as the language is object-oriented; the
first implementation was written for Java, but a Python implementation
is also available. SAX is supported by many XML parsers.
\term{XML (eXtensible Markup Language)}
%
XML is an SGML application profile specialized for use on the
web and has its own standards for linking and stylesheets under development.
%XML-Data
\term{XSL (eXtensible Style Language)}
%
XSL is a proposal for a stylesheet language for XML, which
enables browsers to lay out XML documents in an attractive
manner, and also provides a way to convert XML documents to
HTML.
\end{definitions}
%\section{Related Links}
%
%This section collects all
%the links from the preceding sections.
\end{document}