\documentclass{howto}

\newcommand{\element}[1]{\code{#1}}
\newcommand{\attribute}[1]{\code{#1}}

\title{Python/XML HOWTO}

\release{0.6.1}

\author{The Python/XML Special Interest Group}
\authoraddress{\email{xml-sig@python.org}\break (edited by \email{akuchling@acm.org})}

\begin{document}
\maketitle

\begin{abstract}
\noindent
XML is the eXtensible Markup Language, a subset of SGML, intended to
allow the creation and processing of application-specific markup
languages.  Python makes an excellent language for processing XML
data.  This document is a tutorial for the Python/XML package.  It
assumes you're already familiar with the structure and terminology of
XML.

This is a draft document; 'XXX' in the text indicates that something
has to be filled in later, or rewritten, or verified, or something.  
\end{abstract}

\tableofcontents

\section{Introduction to XML}

XML, the eXtensible Markup Language, is a simplified dialect of SGML,
the Standard Generalized Markup Language.  XML is intended to be
reasonably simple to implement and use, and is already being used for
specifying markup languages for various new standards: MathML for
expressing mathematical equations, Synchronized Multimedia Integration
Language for multimedia presentations, and so forth.

SGML and XML represent a document by tagging the document's various
components with their function, or meaning.  For example, an academic
paper contains several parts: it has a title, one or more authors, an
abstract, the actual text of the paper, a list of references, and so
forth.  A markup language for writing such papers would therefore have
tags for indicating what the contents of the abstract are, what the
title is, and so forth.  This should not be confused with the physical
details of how the document is actually printed on paper.  The
abstract might be printed with narrow margins in a smaller font than
the rest of the document, but the markup usually won't be concerned
with details such as this; other software will translate from the
markup language to a typesetting language such as \TeX, and will
handle the details.

A markup language specified using XML looks a lot like HTML; a
document consists of a single \dfn{element}, which contains
sub-elements, which can have further sub-elements inside them.
Elements are indicated by \dfn{tags} in the text.  Tags are always
inside angle brackets \code{<}~\code{>}.  There are two forms of
elements.  An element can contain content between opening and closing
tags, as in \code{<name>Euryale</name>}, which is a \element{name}
element containing the data \samp{Euryale}. This content may be text
data, other XML elements, or a mixture of both.  Elements can also be
empty, in which case they contain nothing, and are represented as a
single tag ended with a slash, as in \code{<stop/>}, which is an empty
\element{stop} element.  Unlike HTML, XML element names are
case-sensitive; \element{stop} and \element{Stop} are two different
element types.

Opening and empty tags can also contain attributes, which specify
values associated with an element.  For example, in text such as
\code{<name lang='greek'>Herakles</name>}, the \element{name} element
has a \attribute{lang} attribute which has a value of \samp{greek}.
This would contrast with \code{<name lang='latin'>Hercules</name>},
where the attribute's value is \samp{latin}.
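
The pieces just described (element name, attribute, and content) can be
inspected from Python.  The following is a minimal sketch, assuming a
Python 3 interpreter and the standard library's
\module{xml.dom.minidom} module rather than the PyXML package this HOWTO
describes; the helper name \function{describe} is made up for
illustration.

```python
from xml.dom.minidom import parseString


def describe(xml_text):
    """Return (element name, lang attribute, text content) of the root element."""
    root = parseString(xml_text).documentElement
    # Concatenate the data of all Text children of the root element.
    text = "".join(child.data for child in root.childNodes
                   if child.nodeType == child.TEXT_NODE)
    return root.tagName, root.getAttribute("lang"), text


print(describe("<name lang='greek'>Herakles</name>"))
print(describe("<name lang='latin'>Hercules</name>"))
```

The two calls differ only in the attribute value and the element
content; the element name is \samp{name} in both cases.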

A given XML language is specified with a Document Type Definition, or
\dfn{DTD}.  The DTD declares the element names that are allowed, and
how elements can be nested inside each other.  The DTD also specifies
the attributes that can be provided for each element, their default
values, and whether they can be omitted.  To take an example
from HTML, the \element{LI} element, representing an entry in a list,
can only occur inside certain elements which represent lists, such as
\element{OL} or \element{UL}.  A \dfn{validating parser} can be given
a DTD and a document, and verify whether a given document is legal
according to the DTD's rules, or determine that one or more rules have
been violated.

Applications that process XML can be classed into two types.  The
simplest class is an application that only handles one particular
markup language.  For example, a chemistry program may only need to
process Chemical Markup Language, but not MathML.  This
application can therefore be written specifically for a single DTD,
and doesn't need to be capable of handling multiple markup
languages.  This type is simpler to write, and can easily be
implemented with the available Python software.

The second type of application is less common, and has to be able to
handle any markup language you throw at it.  An example might be a
smart XML editor that helps you to write XML that conforms to a
selected DTD; it might do so by not letting you enter an element where
it would be illegal, or by suggesting elements that can be placed at
the current cursor location.  Such an application needs to handle any
possible XML-defined markup, and therefore must be able to obtain a
data structure embodying the DTD in use.  XXX This type of application
can't currently be implemented in Python without difficulty (XXX but
wait and see if a DTD module is included...)

For the full details of XML's syntax, the one definitive source is the
XML 1.0 specification, available on the Web at
\url{http://www.w3.org/TR/xml-spec.html}.  However, like all
specifications, it's quite formal and isn't intended to be a friendly
introduction or a tutorial.  The annotated version of the standard, at
\url{http://www.xml.com/xml/pub/axml/axmlintro.html}, is quite helpful
in clarifying the specification's intent.  There are also various
informal tutorials and books available to introduce you to XML.

The rest of this HOWTO will assume that you're familiar with the
relevant terminology.  Most sections will use XML terms such as
\emph{element} and \emph{attribute}; section~\ref{DOM} on the Document
Object Model will assume that you've read the relevant Working Draft,
and are familiar with things like Iterators and Nodes.
Section~\ref{SAX} does not require that you have experience with the
Java SAX implementations.

\subsection{Related Links}

\section{Installing the XML Toolkit}

Windows users should get the precompiled version at
\url{http://sourceforge.net/projects/pyxml}; Mac users will use the
corresponding precompiled version at \url{XXX}.  Linux users may wish
to use either the Debian package from \url{XXX}, or the RPM from
\url{http://sourceforge.net/projects/pyxml}.  To compile from source
on a \UNIX{} platform, simply perform the following steps.

\begin{enumerate}
\item If you are using Python 1.5, you need to install the
distutils first, which are available from
\url{http://www.python.org/sigs/distutils-sig}. Python 1.6 and later
already include the distutils, so you can skip this step.

\item Get a copy of the source distribution from
\url{http://sourceforge.net/projects/pyxml}.  Unpack it with the
following command.

\begin{verbatim}
gzip -dc xml-package.tgz | tar -xvf -
\end{verbatim}

\item
 Run:
\begin{verbatim}
python setup.py install
\end{verbatim}

To properly execute this operation, a C compiler is required, namely
the same one that was used to build Python itself. On a Unix system,
this operation may require superuser permissions. \code{setup.py}
supports a number of different commands and options; invoke
\code{setup.py} without any arguments to obtain help.

\end{enumerate}

If you have difficulty installing this software, send a problem report
to \email{xml-sig@python.org} describing the problem, or submit a bug report
at \url{http://sourceforge.net/projects/pyxml}.

There are various demonstration programs in the \file{demo/} directory
of the source distribution.  You may wish to look at them next to get
an impression of what's possible with the XML tools, and as a source
of example code.

% package layout

\subsection{Related Links}

\begin{definitions}
\term{\url{http://www.python.org/topics/xml/}}
%
This is the starting point for Python-related XML topics; it is
updated to refer to all software, mailing lists, documentation, etc. 

\end{definitions}

\section{SAX: The Simple API for XML}
\label{SAX}

The Simple API for XML isn't a standard in the formal sense, but an
informal specification designed by David Megginson, with input from
many people on the xml-dev mailing list.  SAX defines an event-driven
interface for parsing XML.  To use SAX, you must create Python class
instances which implement a specified interface, and the parser will
then call various methods of those objects.

This HOWTO describes version 2 of SAX (also referred to as
SAX2). Earlier versions of this text explained SAX1, which is now
primarily of historical interest.

SAX is most suitable for purposes where you want to read through an
entire XML document from beginning to end, and perform some
computation, such as building a data structure representing a
document, or summarizing information in a document (computing an
average value of a certain element, for example).  It's not very
useful if you want to modify the document structure in some
complicated way that involves changing how elements are nested, though
it could be used if you simply wish to change element contents or
attributes.  For example, you would not want to re-order chapters in a
book using SAX, but you might want to change the contents of any
\element{name} elements with the attribute \attribute{lang} equal to
'greek' into Greek letters.

One advantage of SAX is speed and simplicity.  Let's say
you've defined a complicated DTD for listing comic books, and you wish
to scan through your collection and list everything written by Neil
Gaiman.  For this specialized task, there's no need to expend effort
examining elements for artists and editors and colourists, because
they're irrelevant to the search.  You can therefore write a handler
class which ignores all elements other than \element{writer}.

Another advantage is that you don't have the whole document resident
in memory at any one time, which matters if you are processing really
huge documents.

SAX defines four basic interfaces; a SAX-compliant XML parser can be
passed any objects that support these interfaces, and will call
various methods as data is processed.  Your task, therefore, is to
implement those interfaces that are relevant to your application.

The SAX interfaces are:

\begin{tableii}{c|p{4in}}{code}{Interface}{Purpose}

\lineii{ContentHandler}{Called for general document events.  This
interface is the heart of SAX; its methods are called for the start of
the document, the start and end of elements, and for the characters of
data contained inside elements.
}

\lineii{DTDHandler}{Called to handle DTD events required for basic
parsing.  This means notation declarations (XML spec section 4.7) and
unparsed entity declarations (XML spec section 4).
}

\lineii{EntityResolver}{Called to resolve references to external
entities.  If your documents will have no external entity references,
you won't need to implement this interface. }

\lineii{ErrorHandler}{Called for error handling.  The parser will call
methods from this interface to report all warnings and errors.}

\end{tableii}

Python doesn't support the concept of interfaces, so the interfaces
listed above are implemented as Python classes.  The default method
implementations are defined to do nothing (the method body is just a
Python \code{pass} statement), so usually you can simply ignore methods
that aren't relevant to your application. 

Pseudo-code for using SAX looks something like this:
\begin{verbatim}
# Define your specialized handler classes
import sys
from xml.sax import ContentHandler, ...
class docHandler(ContentHandler):
    ...

# Create an instance of the handler class
dh = docHandler()

# Create an XML parser
parser = ...

# Tell the parser to use your handler instance
parser.setContentHandler(dh)

# Parse the file; your handler's methods will get called
parser.parse(sys.stdin)

\end{verbatim}
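
The outline above can be turned into something runnable.  The following
is a sketch only, assuming Python 3 and the \module{xml.sax} package
from the standard library instead of PyXML; the class name
\class{DocHandler} and the function \function{element_names} are
hypothetical, and the input is an in-memory byte stream rather than
standard input.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class DocHandler(ContentHandler):
    """Collect the name of every element as it is opened."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def element_names(stream):
    """Parse a binary file-like object and return the element names seen."""
    handler = DocHandler()
    parser = make_parser()             # create an XML parser
    parser.setContentHandler(handler)  # tell it to use our handler
    parser.parse(stream)               # handler methods get called here
    return handler.names


sample = io.BytesIO(b"<collection><comic/><comic/></collection>")
print(element_names(sample))  # → ['collection', 'comic', 'comic']
```

Only \method{startElement} is overridden; every other
\class{ContentHandler} method keeps its do-nothing default.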

\subsection{Starting Out}

Following the earlier example, let's consider a simple XML format for
storing information about a comic book collection.  Here's a sample
document for a collection consisting of a single issue:

\begin{verbatim}
<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>
\end{verbatim}

An XML document must have a single root element; this is the
\samp{collection} element.  It has one child \element{comic} element
for each issue; the book's title and number are given as attributes of
the \element{comic} element, which can have one or more children
containing the issue's writer and artists.  There may be several
artists or writers for a single issue.

Let's start off with something simple: a document handler named
\class{FindIssue} that reports whether a given issue is in the
collection.

\begin{verbatim}
from xml.sax import saxutils

class FindIssue(saxutils.DefaultHandler):
    def __init__(self, title, number):
        self.search_title, self.search_number = title, number
\end{verbatim}

The \class{DefaultHandler} class inherits from all four interfaces:
\class{ContentHandler}, \class{DTDHandler}, \class{EntityResolver},
and \class{ErrorHandler}.  This is what you should use if you want to
use one class for everything.  When you want separate classes for each
purpose, or if you want to implement only a single interface, you can
just subclass each interface individually.  Neither of the two
approaches is always ``better'' than the other; their suitability
depends on what you're trying to do, and on what you prefer.

Since this class is doing a search, an instance needs to know what to
search for.  The desired title and issue number are passed to the
\class{FindIssue} constructor, and stored as part of the instance.

Now let's look at the method which actually does all the work.
This simple task only requires looking at the attributes of a given
element, so only the \method{startElement} method is relevant.

\begin{verbatim}
    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return

        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            print title, '#'+str(number), 'found'
\end{verbatim}

The \method{startElement()} method is passed a string giving the name
of the element, and an instance containing the element's attributes.
The latter implements the \class{Attributes} interface, which
includes most of the semantics of Python dictionaries.  Therefore, the
method looks for \element{comic} elements, and compares the
specified \attribute{title} and \attribute{number} attributes to the
search values.  If they match, a message is printed out.

\method{startElement()} is called for every single element in the
document.  If you added \code{print 'Starting element:', name} to the
top of  \method{startElement()}, you would get the following output.

\begin{verbatim}
Starting element: collection
Starting element: comic
Starting element: writer
Starting element: penciller
Starting element: penciller
\end{verbatim}

To actually use the class, we need top-level code that creates 
instances of a parser and of \class{FindIssue}, associates them, and
then calls a parser method to process the input.

\begin{verbatim}
import sys
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces

if __name__ == '__main__':
    # Create a parser
    parser = make_parser()
    # Tell the parser we are not interested in XML namespaces
    parser.setFeature(feature_namespaces, 0)

    # Create the handler
    dh = FindIssue('Sandman', '62')

    # Tell the parser to use our handler
    parser.setContentHandler(dh)

    # Parse the input, read here from standard input
    parser.parse(sys.stdin)
\end{verbatim}

The \function{make_parser} function can automate the job of creating
parsers.  There are already several XML parsers available to Python,
and more might be added in future.  \file{xmllib.py} is included with
Python 1.5, so it's always available, but it's also not particularly
fast.  A faster version of \file{xmllib.py} is included in
\module{xml.parsers}.  The \module{xml.parsers.expat} module is faster
still, so it's obviously a preferred choice if it's available.
\function{make_parser} determines which parsers are available and
chooses the fastest one, so you don't have to know what the different
parsers are, or how they differ. (You can also tell
\function{make_parser} to try a list of parsers, if you want to use a
specific one).

In SAX2, XML namespaces are supported. If namespace processing is
active, parsers will call \method{startElementNS} instead of
\method{startElement}. Since our content handler does not implement the
namespace-aware methods, we request that namespace processing be
deactivated. The default of this setting varies from parser to parser,
so you should always set it to a safe value, unless your handlers
support both sets of methods.
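
For comparison, here is what a handler looks like when namespace
processing is switched on instead.  This is a sketch under assumptions:
it uses Python 3 and the standard library's \module{xml.sax}, and the
class \class{NamespaceAwareHandler}, the function
\function{qualified_names}, and the namespace URI
\samp{http://example.org/comics} are all invented for the example.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces


class NamespaceAwareHandler(ContentHandler):
    """Record the (namespace URI, local name) pair of each element."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing active, name is a (uri, localname) tuple.
        self.names.append(name)


def qualified_names(data):
    handler = NamespaceAwareHandler()
    parser = make_parser()
    parser.setFeature(feature_namespaces, 1)  # ask for namespace processing
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.names


doc = b'<c:comic xmlns:c="http://example.org/comics" title="Sandman"/>'
print(qualified_names(doc))  # → [('http://example.org/comics', 'comic')]
```

A \method{startElement} method defined on this handler would not be
called while the feature is enabled, which is why the setting and the
handler have to agree.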

Once you've created a parser instance, calling
\method{setContentHandler} tells the parser what to use as the
handler.

If you run the above code with the sample XML document, it'll output
\code{Sandman \#62 found}.
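
The whole search can also be written as one self-contained script.  The
version below is a sketch, not the code from this HOWTO: it assumes
Python 3 and the standard library's \module{xml.sax}, subclasses
\class{ContentHandler} directly, and records the result in a
\code{found} attribute instead of printing it.  The helper
\function{has_issue} is a hypothetical name.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class FindIssue(ContentHandler):
    def __init__(self, title, number):
        ContentHandler.__init__(self)
        self.search_title, self.search_number = title, number
        self.found = False

    def startElement(self, name, attrs):
        # Only comic elements carry the title and number attributes.
        if name != 'comic':
            return
        if (attrs.get('title') == self.search_title and
                attrs.get('number') == self.search_number):
            self.found = True


def has_issue(document, title, number):
    """Return True if the collection contains the given issue."""
    handler = FindIssue(title, number)
    xml.sax.parseString(document, handler)
    return handler.found


print(has_issue(SAMPLE, 'Sandman', '62'))  # → True
print(has_issue(SAMPLE, 'Sandman', '63'))  # → False
```

Note that the issue number is compared as a string, because attribute
values always arrive as strings.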

\subsection{Error Handling}

Now, try running the above code with this file as input:
\begin{verbatim}
<collection>
  &foo;
  <comic title="Sandman" number='62'>
</collection>
\end{verbatim}

The \code{\&foo;} entity is unknown, and the \element{comic} element
isn't closed (if it were empty, there would be a \samp{/} before the
closing \samp{>}). As a result, you get a
\exception{SAXParseException}, e.g.

\begin{verbatim}
xml.sax._exceptions.SAXParseException: undefined entity at None:2:2
\end{verbatim}

The default code for the \class{ErrorHandler} interface automatically
raises an exception for any error; if that is what you want in case of
an error, you don't need to change the error handler.  Otherwise, you
should provide your own version of the \class{ErrorHandler} interface,
and at minimum override the \method{error()} and \method{fatalError()}
methods.  The minimal implementation for each method can be a single
line.  The methods in the \class{ErrorHandler}
interface--\method{warning}, \method{error}, and
\method{fatalError}--are all passed a single argument, an exception
instance.  The exception will always be a subclass of
\exception{SAXException}, and calling \code{str()} on it will produce
a readable error message explaining the problem.

So, to re-implement a variant of \class{ErrorRaiser}, simply define
one of the three methods to print the exception it is passed:

\begin{verbatim}
    def error(self, exception):
        import sys
        sys.stderr.write("%s\n" % exception)
\end{verbatim}

With this definition, non-fatal errors will result in an error message,
whereas fatal errors will continue to produce a traceback.
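
A complete error handler might look like the sketch below.  It assumes
Python 3 and the standard library's \module{xml.sax}; the class
\class{RecordingErrorHandler} and the function \function{check} are
hypothetical names.  Messages are stored in a list rather than written
to \code{sys.stderr}, and \method{fatalError} re-raises the exception
after recording it, which is a design choice of this example rather
than a requirement of SAX.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler


class RecordingErrorHandler(ErrorHandler):
    """Keep a readable message for every problem the parser reports."""

    def __init__(self):
        self.messages = []

    def warning(self, exception):
        self.messages.append("warning: %s" % exception)

    def error(self, exception):
        self.messages.append("error: %s" % exception)

    def fatalError(self, exception):
        self.messages.append("fatal: %s" % exception)
        raise exception  # stop parsing; the document is not well-formed


BROKEN = b"""<collection>
  &foo;
  <comic title="Sandman" number='62'>
</collection>"""


def check(document):
    """Parse a document and return the list of recorded messages."""
    errors = RecordingErrorHandler()
    try:
        xml.sax.parseString(document, ContentHandler(), errors)
    except xml.sax.SAXParseException:
        pass  # already recorded by fatalError()
    return errors.messages


for message in check(BROKEN):
    print(message)
```

Calling \code{str()} on the exception, which the \code{\%s} formatting
does implicitly, is what produces the readable text.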

\subsection{Searching Element Content}

Let's tackle a slightly more complicated task, printing out all issues
written by a certain author.  This now requires looking at element
content, because the writer's name is inside a \element{writer}
element: \code{<writer>Peter Milligan</writer>}.

The search will be performed using the following algorithm:

\begin{enumerate}
\item 
The \method{startElement} method will be more complicated.  For
\element{comic} elements, the handler has to save the title and
number, in case this comic is later found to match the search
criterion.  For \element{writer} elements, it sets an
\code{inWriterContent} flag to true, and sets a \code{writerName}
attribute to the empty string.

\item Characters outside of XML tags must be processed.  When
\code{inWriterContent} is true, these characters must be added to the
\code{writerName} string.

\item When the \element{writer} element is finished, we've now
collected all of the element's content in the \code{writerName}
attribute, so we can check if the name matches the one we're searching 
for, and if so, print the information about this comic.  We must also
set \code{inWriterContent} back to false.
\end{enumerate}

Here's the first part of the code; this implements step 1.

\begin{verbatim}
from xml.sax import ContentHandler
import string

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return string.join(string.split(text), ' ')

class FindWriter(ContentHandler):
    def __init__(self, search_name):
        # Save the name we're looking for
        self.search_name = normalize_whitespace(search_name)

        # Initialize the flag to false
        self.inWriterContent = 0

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            title = normalize_whitespace(attrs.get('title', ""))
            number = normalize_whitespace(attrs.get('number', ""))
            self.this_title = title
            self.this_number = number

        # If it's the start of a writer element, set flag
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""
\end{verbatim}

The \method{startElement()} method has been discussed previously.  Now
we have to look at how the content of elements is processed.  

The \function{normalize_whitespace()} function is important, and
you'll probably use it in your own code.  XML treats whitespace very
flexibly; you can include extra spaces or newlines wherever you like.
This means that you must normalize the whitespace before comparing
attribute values or element content; otherwise the comparison might
produce a wrong result due to the content of two elements having
different amounts of whitespace.
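
To make the effect concrete, here is a small sketch of the same
normalization written for Python 3, where the \module{string} module
functions used above are replaced by string methods.  The sample
strings are made up.

```python
def normalize_whitespace(text):
    "Collapse runs of whitespace to single spaces and strip both ends."
    return " ".join(text.split())


print(repr(normalize_whitespace("  Peter\n   Milligan ")))  # → 'Peter Milligan'

# Two spellings that differ only in whitespace compare equal afterwards.
a = normalize_whitespace("Peter Milligan")
b = normalize_whitespace("Peter \t Milligan\n")
print(a == b)  # → True
```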

\begin{verbatim}
    def characters(self, ch):
        if self.inWriterContent:
            self.writerName = self.writerName + ch
\end{verbatim}

The \method{characters()} method is called for characters that aren't
inside XML tags.  \var{ch} is a string of characters. It is not
necessarily a byte string; parsers may also provide a buffer object
that is a slice of the full document, or they may pass Unicode
objects (as the expat parser does in Python 2.0).

You also shouldn't assume that all the characters are passed in a
single function call.  In the example above, there might be only one
call to \method{characters()} for the string \samp{Peter Milligan}, or
the parser might call \method{characters()} once for each character.
More realistically, if the content contains an entity reference, as in
\samp{Wagner \&amp; Seagle}, the parser might call the method three
times: once for \samp{Wagner\ }, once for \samp{\&}, represented by the
entity reference, and again for \samp{\ Seagle}.
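
You can observe this behaviour by recording each call.  The sketch
below assumes Python 3 and the standard library's \module{xml.sax};
\class{ChunkRecorder} and \function{text_chunks} are hypothetical
names.  How many chunks you see depends on the parser, so the only
thing the example relies on is that joining them gives the full text.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ChunkRecorder(ContentHandler):
    """Remember the argument of every characters() call."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def text_chunks(document):
    handler = ChunkRecorder()
    xml.sax.parseString(document, handler)
    return handler.chunks


chunks = text_chunks(b"<writer>Wagner &amp; Seagle</writer>")
print(len(chunks), "call(s):", chunks)
print("".join(chunks))  # → Wagner & Seagle
```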

For step 2 of \class{FindWriter}, \method{characters()} only has to
check \code{inWriterContent}, and if it's true, add the characters to
the string being built up.

Finally, when the \element{writer} element ends, the entire name has
been collected, so we can compare it to the name we're searching for.

\begin{verbatim}
    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number
\end{verbatim}

To avoid being confused by differing whitespace, the
\function{normalize_whitespace()} function is called.  This can be
done because we know that leading and trailing whitespace are
insignificant for this element, in this DTD.  

End tags can't have attributes on them, so there's no \var{attrs}
parameter.  Empty elements with attributes, such as \samp{<arc
name="Season of Mists"/>}, will result in a call to
\method{startElement()}, followed immediately by a call to \method{endElement()}.
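
Putting the three steps together gives the following self-contained
sketch.  It assumes Python 3 and the standard library's
\module{xml.sax}, and it differs from the fragments above in collecting
matches in a list instead of printing them.  The function
\function{issues_by} and the second \element{comic} entry in the sample
data are illustrative additions.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def normalize_whitespace(text):
    return " ".join(text.split())


class FindWriter(ContentHandler):
    def __init__(self, search_name):
        ContentHandler.__init__(self)
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        self.matches = []

    def startElement(self, name, attrs):
        if name == 'comic':            # step 1: remember the current issue
            self.this_title = normalize_whitespace(attrs.get('title', ""))
            self.this_number = normalize_whitespace(attrs.get('number', ""))
        elif name == 'writer':
            self.inWriterContent = True
            self.writerName = ""

    def characters(self, content):     # step 2: accumulate the name
        if self.inWriterContent:
            self.writerName += content

    def endElement(self, name):        # step 3: compare and reset the flag
        if name == 'writer':
            self.inWriterContent = False
            if normalize_whitespace(self.writerName) == self.search_name:
                self.matches.append((self.this_title, self.this_number))


def issues_by(document, writer):
    handler = FindWriter(writer)
    xml.sax.parseString(document, handler)
    return handler.matches


COLLECTION = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil
       Gaiman</writer>
  </comic>
  <comic title="Shade, the Changing Man" number="7">
    <writer>Peter Milligan</writer>
  </comic>
</collection>"""

print(issues_by(COLLECTION, "Neil Gaiman"))  # → [('Sandman', '62')]
```

The line break inside the first \element{writer} element shows why the
collected name is normalized before the comparison.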

XXX how are external entities handled?  Anything special need to be
done for them?

\subsection{Related Links}

\begin{definitions}
\term{\url{http://www.megginson.com/SAX/}}
%
The SAX home page.  This has the most recent copy of the
specification, and lists SAX implementations for various languages and
platforms.  At the moment it's somewhat Java-centric.

\end{definitions}

\section{DOM: The Document Object Model}
\label{DOM}

The Document Object Model specifies a tree-based representation for an
XML document.  A top-level \class{Document} instance is the root of
the tree, and has a single child which is the top-level
\class{Element} instance; this \class{Element} has child nodes
representing the content and any sub-elements, which may have further
children, and so forth.  Functions are defined which let you traverse
the resulting tree any way you like, access element and attribute
values, insert and delete nodes, and convert the tree back into XML.

The DOM is useful for modifying XML documents, because you can create
a DOM tree, modify it by adding new nodes and moving subtrees around,
and then produce a new XML document as output.  You can also construct
a DOM tree yourself, and convert it to XML; this is often a more
flexible way of producing XML output than simply writing
\code{<tag1>}...\code{</tag1>} to a file.
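
As a sketch of building a tree by hand and converting it to XML, the
example below uses \module{xml.dom.minidom} from the Python 3 standard
library; this is an assumption made so the code is runnable without
PyXML.  The function \function{build_collection} is a hypothetical
helper, and the element names are borrowed from the comic collection
format used in the SAX section.

```python
from xml.dom.minidom import getDOMImplementation


def build_collection(title, number, writer):
    """Create a one-issue collection document and return it."""
    impl = getDOMImplementation()
    doc = impl.createDocument(None, "collection", None)

    comic = doc.createElement("comic")
    comic.setAttribute("title", title)
    comic.setAttribute("number", number)

    writer_el = doc.createElement("writer")
    writer_el.appendChild(doc.createTextNode(writer))

    comic.appendChild(writer_el)
    doc.documentElement.appendChild(comic)
    return doc


doc = build_collection("Sandman", "62", "Neil Gaiman")
print(doc.documentElement.toxml())
```

Because the output is generated from the tree, tags are always
balanced and special characters in text are escaped for you, which is
the flexibility the paragraph above refers to.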

While the DOM doesn't require that the entire tree be resident in
memory at one time, the Python DOM implementation currently does keep
the whole tree in RAM.  It's possible to write an implementation that
stores most of the tree on disk or in a database, and reads in new
sections as they're accessed, but this hasn't been done yet.
This means you may not have enough memory to process very large
documents as a DOM tree.  A SAX handler, on the other hand, can
potentially churn through amounts of data far larger than the
available RAM.

\subsection{Getting A DOM Tree}

The easiest way to get a DOM tree is to have it built for you. PyXML
offers two alternative implementations of the DOM,
\module{xml.dom.minidom} and \code{4DOM}. \module{xml.dom.minidom} is
included in Python 2. It is a minimalistic implementation, which means
it does not provide all interfaces and operations required by the DOM
standard. \code{4DOM} (XXX reference) is a complete implementation of
DOM Level 2 (which is currently work in progress), so we will use that
in the examples.

One of the modules in the \module{xml.dom} package is
\module{xml.dom.ext.reader.Sax2}, which provides the functions
\function{FromXmlStream}, \function{FromXml}, \function{FromXmlFile},
and \function{FromXmlUrl} which will construct a DOM tree from their
input (a file-like object, a string, a file name, and a URL,
respectively). They all return a DOM \class{Document} object.

\begin{verbatim}
import sys
from xml.dom.ext.reader.Sax2 import FromXmlStream
from xml.dom.ext import PrettyPrint

# parse the document
doc = FromXmlStream(sys.stdin)
\end{verbatim}
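
If \code{4DOM} is not installed, \module{xml.dom.minidom} can also build
a tree.  The following sketch assumes Python 3 and uses
\function{parseString} from that module; the helper
\function{root_and_children} is a made-up name, and the input is a
shortened document invented for the example.

```python
from xml.dom.minidom import parseString


def root_and_children(xml_text):
    """Return the root element's name and the names of its child elements."""
    doc = parseString(xml_text)   # returns a DOM Document object
    root = doc.documentElement
    children = [node.tagName for node in root.childNodes
                if node.nodeType == node.ELEMENT_NODE]
    return root.tagName, children


print(root_and_children("<xbel><desc>No description</desc><folder/></xbel>"))
# → ('xbel', ['desc', 'folder'])
```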

\subsection{Manipulating The Tree}

This HOWTO can't be a complete introduction to the Document Object
Model, because there are lots of interfaces and lots of
methods. Luckily, the DOM Recommendation is quite a readable document,
so I'd recommend that you read it to get a complete picture of the
available interfaces; this will only be a partial overview.

The Document Object Model represents an XML document as a tree of
nodes, each of which is an instance of some subclass of the \class{Node}
class.  Some subclasses of \class{Node} are \class{Element},
\class{Text}, and \class{Comment}.

We'll use a single example document throughout this section.  Here's the sample:

\begin{verbatim}
<?xml version="1.0" encoding="iso-8859-1"?>
<xbel>  
  <?processing instruction?>
  <desc>No description</desc>
  <folder>
    <title>XML bookmarks</title>
    <bookmark href="http://www.python.org/sigs/xml-sig/" >
      <title>SIG for XML Processing in Python</title>
    </bookmark>
  </folder>
</xbel>
\end{verbatim}

Converted to a DOM tree, this document could produce the following tree:

\begin{verbatim}
Element xbel None
   Text #text '  \012  '
   ProcessingInstruction processing 'instruction'
   Text #text '\012  '
   Element desc None
      Text #text 'No description'
   Text #text '\012  '
   Element folder None
      Text #text '\012    '
      Element title None
         Text #text 'XML bookmarks'
      Text #text '\012    '
      Element bookmark None
         Text #text '\012      '
         Element title None
            Text #text 'SIG for XML Processing in Python'
         Text #text '\012    '
      Text #text '\012  '
   Text #text '\012'
\end{verbatim}

This isn't the only possible tree, because different parsers may
differ in how they generate \class{Text} nodes; any of the
\class{Text} nodes in the above tree might be split into multiple nodes.
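
A listing in this format can be produced with a short recursive
function.  The following is only a sketch under a few assumptions: it
uses the standard library's \module{xml.dom.minidom} rather than the
reader shown earlier, \function{dump} is a hypothetical helper name,
and the XML declaration is left off the sample so that it can be
passed in as an ordinary string.  Other DOM implementations may split
the text into different \class{Text} nodes.

\begin{verbatim}
from xml.dom import minidom

SAMPLE = """<xbel>
  <?processing instruction?>
  <desc>No description</desc>
  <folder>
    <title>XML bookmarks</title>
    <bookmark href="http://www.python.org/sigs/xml-sig/" >
      <title>SIG for XML Processing in Python</title>
    </bookmark>
  </folder>
</xbel>"""

def dump(node, depth=0, out=None):
    """Collect one line per node: class name, nodeName, nodeValue."""
    if out is None:
        out = []
    out.append("%s%s %s %r" % ("   " * depth, node.__class__.__name__,
                                node.nodeName, node.nodeValue))
    # Recurse into the children, indenting one level deeper.
    for child in node.childNodes:
        dump(child, depth + 1, out)
    return out

doc = minidom.parseString(SAMPLE)
lines = dump(doc.documentElement)
print("\n".join(lines))
\end{verbatim}

Only \member{nodeName}, \member{nodeValue}, and \member{childNodes}
are used, so the same function should work on any tree whose nodes
provide the basic \class{Node} interface.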

\subsubsection{The \class{Node} class}

We'll start by considering the basic \class{Node} class.  All the
other DOM nodes --- \class{Document}, \class{Element}, \class{Text},
and so forth --- are subclasses of \class{Node}.  It's possible to
perform many tasks using just the interface provided by \class{Node}.

The DOM Level 1 Recommendation defines the attributes and methods of
\class{Node} in IDL as follows:

\begin{verbatim}
readonly attribute  DOMString            nodeName;
         attribute  DOMString            nodeValue;
                                   // raises(DOMException) on setting
                                   // raises(DOMException) on retrieval
readonly attribute  unsigned short       nodeType;
readonly attribute  Node                 parentNode;
readonly attribute  NodeList             childNodes;
readonly attribute  Node                 firstChild;
readonly attribute  Node                 lastChild;
readonly attribute  Node                 previousSibling;
readonly attribute  Node                 nextSibling;
readonly attribute  NamedNodeMap         attributes;
readonly attribute  Document             ownerDocument;

Node                      insertBefore(in Node newChild,
                                       in Node refChild)
                                       raises(DOMException);
Node                      replaceChild(in Node newChild,
                                       in Node oldChild)
                                       raises(DOMException);
Node                      removeChild(in Node oldChild)
                                      raises(DOMException);
Node                      appendChild(in Node newChild)
                                      raises(DOMException);
boolean                   hasChildNodes();
Node                      cloneNode(in boolean deep);
\end{verbatim}
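
As a rough illustration of how these attributes and methods fit
together, the sketch below navigates and then rearranges a small tree.
It assumes the standard library's \module{xml.dom.minidom} as the DOM
implementation, and the tiny document is made up for the example.
Note that \method{createElement} belongs to \class{Document}, not to
the \class{Node} interface.

\begin{verbatim}
from xml.dom import minidom, Node

doc = minidom.parseString(
    "<folder><title>XML bookmarks</title><bookmark/></folder>")
folder = doc.documentElement

# Navigation: children and siblings are plain attributes.
title = folder.firstChild
bookmark = folder.lastChild
names_before = [child.nodeName for child in folder.childNodes]
is_element = (title.nodeType == Node.ELEMENT_NODE)
same_node = (title.nextSibling is bookmark)

# Modification: insert a new node, then detach an old one.
desc = doc.createElement("desc")      # a Document method, not Node
folder.insertBefore(desc, bookmark)   # desc now precedes bookmark
removed = folder.removeChild(title)   # returns the detached node
names_after = [child.nodeName for child in folder.childNodes]

# A deep clone copies the whole subtree; the copy has no parent.
clone = folder.cloneNode(True)

print(names_before, names_after)
\end{verbatim}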


\subsubsection{\class{Document}, \class{Element}, and \class{Text} nodes}

The base of the entire tree is the \class{Document} node.  Its
\member{documentElement} attribute contains the \class{Element} node
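for the root element.  The \class{Document} node may have additional
children, such as \class{ProcessingInstruction} nodes; in DOM Level 1
the permitted children are \class{Element} (at most one),
\class{ProcessingInstruction}, \class{Comment}, and
\class{DocumentType} nodes.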
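
The relationship between these three node types can be sketched as
follows.  The example assumes the standard library's
\module{xml.dom.minidom}, and the small document, with a comment
placed before the root element, is invented for illustration.

\begin{verbatim}
from xml.dom import minidom, Node

# A top-level comment gives the Document a child besides the root.
doc = minidom.parseString(
    "<!-- my bookmarks -->"
    "<xbel><desc>No description</desc>"
    "<bookmark href='http://www.python.org/sigs/xml-sig/'/></xbel>")

root = doc.documentElement            # the Element node for <xbel>
top_level = [child.nodeType for child in doc.childNodes]

desc = root.getElementsByTagName("desc")[0]
text = desc.firstChild                # the Text node inside <desc>
bookmark = root.getElementsByTagName("bookmark")[0]

print(root.tagName)                   # name of the root element
print(text.data)                      # character data in the Text node
print(bookmark.getAttribute("href"))  # attribute value as a string
\end{verbatim}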


\subsection{Walking Over The Entire Tree}

The \module{xml.dom} package also includes various helper classes for
common tasks such as walking over trees.

The \class{Walker} class

Introduction to the walker class
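
Until that introduction is written, a tree can also be traversed by
hand using nothing but \member{childNodes}.  The generator below is a
minimal sketch and not the \class{Walker} class itself: \function{walk}
is a hypothetical helper name, and the standard library's
\module{xml.dom.minidom} is assumed as the DOM implementation.

\begin{verbatim}
from xml.dom import minidom, Node

def walk(node):
    """Yield node, then all of its descendants, in document order."""
    yield node
    for child in node.childNodes:
        yield from walk(child)

doc = minidom.parseString(
    "<xbel><desc>No description</desc><folder>"
    "<title>XML bookmarks</title>"
    "<bookmark href='http://www.python.org/sigs/xml-sig/'>"
    "<title>SIG for XML Processing in Python</title>"
    "</bookmark></folder></xbel>")

# Names of all elements, in the order their start tags appear.
elements = [n.nodeName for n in walk(doc)
            if n.nodeType == Node.ELEMENT_NODE]

# Every href attribute found anywhere in the tree.
hrefs = [n.getAttribute("href") for n in walk(doc)
         if n.nodeType == Node.ELEMENT_NODE and n.hasAttribute("href")]

print(elements)
print(hrefs)
\end{verbatim}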

\subsection{Building A Document}

Intro to builder
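
A document can also be assembled directly from the factory methods
that the DOM defines on \class{Document}.  The sketch below does not
use the builder mentioned above; it assumes the standard library's
\module{xml.dom.minidom}, and \method{toxml} is a convenience method
of that module rather than part of the DOM Recommendation.

\begin{verbatim}
from xml.dom.minidom import getDOMImplementation

# Create an empty document whose root element is <xbel>.
impl = getDOMImplementation()
doc = impl.createDocument(None, "xbel", None)
root = doc.documentElement

# New nodes are created by the Document, then attached to a parent.
folder = doc.createElement("folder")
root.appendChild(folder)

title = doc.createElement("title")
title.appendChild(doc.createTextNode("XML bookmarks"))
folder.appendChild(title)

bookmark = doc.createElement("bookmark")
bookmark.setAttribute("href", "http://www.python.org/sigs/xml-sig/")
folder.appendChild(bookmark)

xml_text = root.toxml()   # serialize the tree back to markup
print(xml_text)
\end{verbatim}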

\subsection{Processing HTML}

Intro to HTML builder

%Explanations, sample code, ...

\subsection{Related Links}

\begin{definitions}
\term{\url{http://www.w3.org/DOM/}}
%
The World Wide Web Consortium's DOM page.

\term{\url{http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/}}
%
The DOM Level 1 Recommendation.  Unlike most standards, this one is
actually pretty readable, particularly if you're only interested in
the Core XML interfaces.  

\end{definitions}



\section{Glossary}

XML has given rise to a sea of acronyms and terms.  This section will
list the most significant terms, and sketch their relevance.

Many of the following definitions are taken from Lars Marius Garshol's
SGML glossary, at \url{http://www.stud.ifi.uio.no/\~larsga/download/diverse/sgmlglos.html}.

\begin{definitions}
\term{DOM (Document Object Model)}
%
The Document Object Model is intended to be a platform- and
language-neutral interface that will allow programs and scripts to
dynamically access and update the content, structure and style of
documents. Documents will be represented as tree structures which can
be traversed and modified.

\term{DTD (Document Type Definition)}
%
A Document Type Definition (nearly always called a DTD) defines
an XML document type, complete with element types, attributes,
and entities.
       
In other words: a DTD completely describes one particular kind
of XML document, such as, for instance, HTML 3.2.
        
\term{SAX (Simple API for XML)}
%
SAX is a simple standardized API for XML parsers developed by the
contributors to the xml-dev mailing list. The interface is mostly
language-independent, as long as the language is object-oriented; the
first implementation was written for Java, but a Python implementation
is also available.  SAX is supported by many XML parsers.
          
\term{XML (eXtensible Markup Language)}
%
XML is an SGML application profile specialized for use on the
web; its own standards for linking and stylesheets are under development.
          
%XML-Data

\term{XSL (eXtensible Style Language)}
%
XSL is a proposal for a stylesheet language for XML, which
enables browsers to lay out XML documents in an attractive
manner, and also provides a way to convert XML documents to
HTML.
\end{definitions}
          
%\section{Related Links}
%
%This section collects all
%the links from the preceding sections.

\end{document}