\documentclass{howto}

\newcommand{\element}[1]{\code{#1}}
\newcommand{\attribute}[1]{\code{#1}}

\title{Python/XML HOWTO}

\release{0.05}

\author{The Python/XML Special Interest Group}
\authoraddress{\email{xml-sig@python.org}\break (edited by \email{akuchling@acm.org})}

\begin{document}
\maketitle

\begin{abstract}
\noindent
XML is the eXtensible Markup Language, a subset of SGML, intended to
allow the creation and processing of application-specific markup
languages.  Python makes an excellent language for processing XML
data.  This document is a tutorial for the Python/XML package.  It
assumes you're already familiar with the structure and terminology of
XML.

This is a draft document; 'XXX' in the text indicates that something
has to be filled in later, or rewritten, or verified, or something.  
\end{abstract}

\tableofcontents

\section{Introduction to XML}

XML, the eXtensible Markup Language, is a simplified dialect of SGML,
the Standard Generalized Markup Language.  XML is intended to be
reasonably simple to implement and use, and is already being used for
specifying markup languages for various new standards: MathML for
expressing mathematical equations, SMIL (the Synchronized Multimedia
Integration Language) for synchronizing multimedia objects, and so forth.

SGML and XML represent a document by tagging the document's various
components with their function, or meaning.  For example, an academic
paper contains several parts: it has a title, one or more authors, an
abstract, the actual text of the paper, a list of references, and so
forth.  A markup language for writing such papers would therefore have
tags for indicating what the contents of the abstract are, what the
title is, and so forth.  This should not be confused with the physical
details of how the document is actually printed on paper.  The
abstract might be printed with narrow margins in a smaller font than
the rest of the document, but the markup usually won't be concerned
with details such as this; other software will translate from the
markup language to a typesetting language such as \TeX, and will
handle the details.

A markup language specified using XML looks a lot like HTML; a
document consists of a single \dfn{element}, which contains
sub-elements, which can have further sub-elements inside them.
Elements are indicated by \dfn{tags} in the text.  Tags are always
inside angle brackets \code{<}~\code{>}.  There are two forms of
elements.  An element can contain content between opening and closing
tags, as in \code{<name>Euryale</name>}, which is a \element{name}
element containing the data \samp{Euryale}. This content may be text
data, other XML elements, or a mixture of both.  Elements can also be
empty, in which case they contain nothing, and are represented as a
single tag ended with a slash, as in \code{<stop/>}, which is an empty
\element{stop} element.  Unlike HTML, XML element names are
case-sensitive; \element{stop} and \element{Stop} are two different
element types.

Opening and empty tags can also contain attributes, which specify
values associated with an element.  For example, in text such as
\code{<name lang='greek'>Herakles</name>}, the \element{name} element
has a \attribute{lang} attribute which has a value of \samp{greek}.
This would contrast with \code{<name lang='latin'>Hercules</name>},
where the attribute's value is \samp{latin}.
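
As a minimal sketch of these ideas, the following reads element names,
content and attributes using \module{xml.etree.ElementTree} from the
present-day Python standard library.  That module is an assumption made
for illustration; it is not part of the package this HOWTO describes,
and the \element{names} wrapper element is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the <names> wrapper is made up for this sketch.
SAMPLE = """<names>
  <name lang='greek'>Herakles</name>
  <name lang='latin'>Hercules</name>
  <stop/>
</names>"""

root = ET.fromstring(SAMPLE)

# Collect (element name, lang attribute, text content) for each child.
# An empty element such as <stop/> has no text and no attributes.
summary = [(child.tag, child.get('lang'), child.text) for child in root]

for tag, lang, text in summary:
    print(tag, lang, text)

# Element names are case-sensitive, so looking for 'Name' finds nothing.
wrong_case = root.find('Name')
```

Note that the empty \element{stop} element reports neither text nor a
\attribute{lang} value, and that searching for \samp{Name} with a
capital letter fails because element names are case-sensitive.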

A given XML language is specified with a Document Type Definition, or
\dfn{DTD}.  The DTD declares the element names that are allowed, and
how elements can be nested inside each other.  The DTD also specifies
the attributes that can be provided for each element, their default
values, and whether they can be omitted.  To take an example
from HTML, the \element{LI} element, representing an entry in a list,
can only occur inside certain elements which represent lists, such as
\element{OL} or \element{UL}.  A \dfn{validating parser} can be given
a DTD and a document, and verify whether a given document is legal
according to the DTD's rules, or determine that one or more rules have
been violated.

Applications that process XML can be classed into two types.  The
simplest class is an application that only handles one particular
markup language.  For example, a chemistry program may only need to
process Chemical Markup Language, but not MathML.  This
application can therefore be written specifically for a single DTD,
and doesn't need to be capable of handling multiple markup
languages.  This type is simpler to write, and can easily be
implemented with the available Python software.

The second type of application is less common, and has to be able to
handle any markup language you throw at it.  An example might be a
smart XML editor that helps you to write XML that conforms to a
selected DTD; it might do so by not letting you enter an element where
it would be illegal, or by suggesting elements that can be placed at
the current cursor location.  Such an application needs to handle any
possible XML-defined markup, and therefore must be able to obtain a
data structure embodying the DTD in use.  XXX This type of application
can't currently be implemented in Python without difficulty (XXX but
wait and see if a DTD module is included...)

For the full details of XML's syntax, the one definitive source is the
XML 1.0 specification, available on the Web at
\url{http://www.w3.org/TR/xml-spec.html}.  However, like all
specifications, it's quite formal and isn't intended to be a friendly
introduction or a tutorial.  The annotated version of the standard, at
\url{http://www.xml.com/XXX}, is quite helpful in clarifying the
specification's intent.  There are also various informal tutorials and
books available to introduce you to XML.

The rest of this HOWTO will assume that you're familiar with the
relevant terminology.  Most sections will use XML terms such as
\emph{element} and \emph{attribute}; section~\ref{DOM} on the Document
Object Model will assume that you've read the relevant Working Draft,
and are familiar with things like Iterators and Nodes.
Section~\ref{SAX} does not require that you have experience with the
Java SAX implementations.

\subsection{Related Links}

\section{Installing the XML Toolkit}

Windows users should get the precompiled version at \url{XXX}; Mac
users will use the corresponding precompiled version at \url{XXX}.
Linux users may wish to use either the Debian package from \url{XXX},
or the RPM from \url{XXX}.  To compile from source on a \UNIX{} platform,
simply perform the following steps.

\begin{enumerate}
\item Get a copy of the source distribution from \url{http://www.python.org/topics/xml/download.html}.  Unpack it with the following command.

\begin{verbatim}
gzip -dc xml-package.tgz | tar -xvf -
\end{verbatim}

\item
 Run:
\begin{verbatim}
make -f Makefile.pre.in boot
\end{verbatim}

This creates the \file{Makefile} and \file{config.c} (producing various
other intermediate files in the process), incorporating the values for
\code{sys.prefix}, \code{sys.exec_prefix} and \code{sys.version} from
the installed Python binary.  For this to work, the Python interpreter
must be on your path.  If this fails, try

\begin{verbatim}
   make -f Makefile.pre.in Makefile VERSION=1.5 installdir=<prefix>
\end{verbatim}

where \samp{<prefix>} is the value of \samp{installdir} used when
installing Python.  You may also have to set
\samp{exec_installdir} to the value of \samp{exec_prefix}.

\item
 Once the Makefile has been constructed, just run \samp{make} to
 compile the C modules.  There's no test suite yet, but there will be
 one someday.

\item
To install the code, run \samp{make install}.
 The code will be installed under the \file{site-packages/} directory
 as a package named \file{xml/}.  

\end{enumerate}

If you have difficulty installing this software, send a problem report
to \email{xml-sig@python.org} describing the problem.

There are various demonstration programs in the \file{demo/} directory
of the source distribution.  You may wish to look at them next to get
an impression of what's possible with the XML tools, and as a source
of example code.

% package layout

\subsection{Related Links}

\begin{definitions}
\term{\url{http://www.python.org/topics/xml/}}
%
This is the starting point for Python-related XML topics; it is
updated to refer to all software, mailing lists, documentation, etc. 

\end{definitions}

\section{SAX: The Simple API for XML}
\label{SAX}

The Simple API for XML isn't a standard in the formal sense, but an
informal specification designed by David Megginson, with input from
many people on the xml-dev mailing list.  SAX defines an event-driven
interface for parsing XML.  To use SAX, you must create Python class
instances which implement a specified interface, and the parser will
then call various methods of those objects.

SAX is most suitable for purposes where you want to read through an
entire XML document from beginning to end, and perform some
computation, such as building a data structure representing a
document, or summarizing information in a document (computing an
average value of a certain element, for example).  It's not very
useful if you want to modify the document structure in some
complicated way that involves changing how elements are nested, though
it could be used if you simply wish to change element contents or
attributes.  For example, you would not want to re-order chapters in a
book using SAX, but you might want to change the contents of any
\element{name} elements with the attribute \attribute{lang} equal to
'greek' into Greek letters.

One advantage of SAX is speed and simplicity.  Let's say
you've defined a complicated DTD for listing comic books, and you wish
to scan through your collection and list everything written by Neil
Gaiman.  For this specialized task, there's no need to expend effort
examining elements for artists and editors and colourists, because
they're irrelevant to the search.  You can therefore write a handler
class that ignores all elements other than \element{writer}.

Another advantage is that you don't have the whole document resident
in memory at any one time, which matters if you are processing really
huge documents.

SAX defines four basic interfaces; a SAX-compliant XML parser can be
passed any objects that support these interfaces, and will call
various methods as data is processed.  Your task, therefore, is to
implement those interfaces that are relevant to your application.

The SAX interfaces are:

\begin{tableii}{c|p{4in}}{code}{Interface}{Purpose}

\lineii{DocumentHandler}{Called for general document events.  This
interface is the heart of SAX; its methods are called for the start of
the document, the start and end of elements, and for the characters of
data contained inside elements.
}

\lineii{DTDHandler}{Called to handle DTD events required for basic
parsing.  This means notation declarations (XML spec section XXX) and
unparsed entity declarations (XML spec section XXX).
}

\lineii{EntityResolver}{Called to resolve references to external
entities.  If your documents will have no external entity references,
you won't need to implement this interface. }

\lineii{ErrorHandler}{Called for error handling.  The parser will call
methods from this interface to report all warnings and errors.}

\end{tableii}

Python doesn't support the concept of interfaces, so the interfaces
listed above are implemented as Python classes.  The default method
implementations are defined to do nothing (the method body is just a
Python \code{pass} statement), so usually you can simply ignore methods
that aren't relevant to your application.  The one big exception is
the \class{ErrorHandler} interface; if you don't provide methods that
print a message or otherwise take some action, errors in the XML data
will be silently ignored.  This is almost certainly \emph{not} what
you want your application to do, so always implement at least the
\method{error()} and \method{fatalError()} methods.
\module{xml.sax.saxutils} provides an \class{ErrorPrinter} class which
sends error messages to standard error, and an \class{ErrorRaiser}
class which raises an exception for any warnings or errors.

Pseudo-code for using SAX looks something like this:
\begin{verbatim}
import sys
from xml.sax import saxlib

# Define your specialized handler classes
class docHandler(saxlib.DocumentHandler):
    ...

# Create an instance of the handler class
dh = docHandler()

# Create an XML parser
parser = ...

# Tell the parser to use your handler instance
parser.setDocumentHandler(dh)

# Parse the file; your handler's methods will get called
parser.parseFile(sys.stdin)
\end{verbatim}
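
For comparison, here is a runnable sketch of the same sequence of steps
written against the \module{xml.sax} package in the present-day Python
standard library.  This is an assumption made for illustration, not the
API of the package described here: in that library the document
handler is called \class{ContentHandler} and is registered with
\method{setContentHandler()}.  The \class{DocHandler} class and the
sample input are invented for the example.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler

class DocHandler(ContentHandler):
    """Record the name of every element as it starts."""
    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)

# Create an instance of the handler class
dh = DocHandler()

# Create an XML parser and tell it to use the handler instance
parser = xml.sax.make_parser()
parser.setContentHandler(dh)

# Parse a file-like object; the handler's methods are called as the
# parser works through the input
parser.parse(io.BytesIO(b"<doc><para>Hello</para><para/></doc>"))
print(dh.names)
```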

\subsection{Starting Out}

Following the earlier example, let's consider a simple XML format for
storing information about a comic book collection.  Here's a sample
document for a collection consisting of a single issue:

\begin{verbatim}
<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>
\end{verbatim}

An XML document must have a single root element; this is the
\samp{collection} element.  It has one child \element{comic} element
for each issue; the book's title and number are given as attributes of
the \element{comic} element, which can have one or more children
containing the issue's writer and artists.  There may be several
artists or writers for a single issue.

Let's start off with something simple: a document handler named
\class{FindIssue} that reports whether a given issue is in the
collection.

\begin{verbatim}
from xml.sax import saxlib

class FindIssue(saxlib.HandlerBase):
    def __init__(self, title, number):
        self.search_title, self.search_number = title, number
\end{verbatim}

The \class{HandlerBase} class inherits from all four interfaces:
\class{DocumentHandler}, \class{DTDHandler}, \class{EntityResolver},
and \class{ErrorHandler}.  This is what you should use if you
want to use one class for everything.  When you want separate classes
for each purpose, you can just subclass each interface individually.
Neither of the two approaches is always ``better'' than the other;
their suitability depends on what you're trying to do, and on what you
prefer.  
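
The single-class approach can be sketched with the present-day
standard library's \module{xml.sax.handler} module, which is an
assumption made for illustration rather than the \module{saxlib} module
described here.  The \class{CombinedHandler} name is hypothetical; it
plays the part that \class{HandlerBase} plays in this package, and the
same instance is registered once for each role.

```python
import io
import xml.sax
from xml.sax.handler import (ContentHandler, DTDHandler,
                             EntityResolver, ErrorHandler)

class CombinedHandler(ContentHandler, DTDHandler,
                      EntityResolver, ErrorHandler):
    """One class playing all four handler roles (hypothetical name)."""
    def __init__(self):
        super().__init__()
        self.elements = 0

    def startElement(self, name, attrs):
        # Count every element; all other events use the inherited defaults
        self.elements = self.elements + 1

handler = CombinedHandler()
parser = xml.sax.make_parser()

# The same instance is registered once per role
parser.setContentHandler(handler)
parser.setDTDHandler(handler)
parser.setEntityResolver(handler)
parser.setErrorHandler(handler)

parser.parse(io.BytesIO(b"<collection><comic/></collection>"))
print(handler.elements)
```

Using separate classes instead would simply mean passing a different
object to each of the four registration calls.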

Since this class is doing a search, an instance needs to know what to
search for.  The desired title and issue number are passed to the
\class{FindIssue} constructor, and stored as part of the instance.

Now let's look at the method which actually does all the work.
This simple task only requires looking at the attributes of a given
element, so only the \method{startElement} method is relevant.

\begin{verbatim}
    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return

        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            print title, '#'+str(number), 'found'
\end{verbatim}

The \method{startElement()} method is passed a string giving the name
of the element, and an instance containing the element's attributes.
The latter implements the \class{AttributeList} interface, which
includes most of the semantics of Python dictionaries.  Therefore, the 
function looks for \element{comic} elements, and compares the
specified \attribute{title} and \attribute{number} attributes to the
search values.  If they match, a message is printed out.

\method{startElement()} is called for every single element in the
document.  If you added \code{print 'Starting element:', name} to the
top of  \method{startElement()}, you would get the following output.

\begin{verbatim}
Starting element: collection
Starting element: comic
Starting element: writer
Starting element: penciller
Starting element: penciller
\end{verbatim}

To actually use the class, we need top-level code that creates 
instances of a parser and of \class{FindIssue}, associates them, and
then calls a parser method to process the input.

\begin{verbatim}
import sys
from xml.sax import saxexts

if __name__ == '__main__':
    # Create a parser
    parser = saxexts.make_parser()

    # Create the handler
    dh = FindIssue('Sandman', '62')

    # Tell the parser to use our handler
    parser.setDocumentHandler(dh)

    # Parse the input, read here from standard input
    parser.parseFile(sys.stdin)
\end{verbatim}

The \class{ParserFactory} class can automate the job of creating
parsers.  There are already several XML parsers available to Python,
and more might be added in future.  \file{xmllib.py} is included with
Python 1.5, so it's always available, but it's also not particularly
fast.  A faster version of \file{xmllib.py} is included in
\module{xml.parsers}.  The \module{pyexpat} module is faster still, so
it's obviously a preferred choice if it's available.
\class{ParserFactory}'s \method{make_parser} method determines
which parsers are available and chooses the fastest one, so you don't
have to know what the different parsers are, or how they differ. (You
can also tell \method{make_parser} to use a given parser, if you want
to use a specific one.)

Once you've created a parser instance, calling
\method{setDocumentHandler} tells the parser what to use as the handler.

If you run the above code with the sample XML document, it'll output
\code{Sandman \#62 found}.
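
Readers without the package installed may find a self-contained
variant useful.  The sketch below re-creates \class{FindIssue} on top
of \module{xml.sax} from the present-day standard library, which is an
assumption made for illustration; the \code{found} attribute is an
addition that lets the caller test the result instead of only printing
it.

```python
import xml.sax
from xml.sax.handler import ContentHandler

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""

class FindIssue(ContentHandler):
    def __init__(self, title, number):
        super().__init__()
        self.search_title, self.search_number = title, number
        self.found = False   # added for this sketch

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic':
            return
        # Attribute values are strings, so the number is '62', not 62
        title = attrs.get('title')
        number = attrs.get('number')
        if title == self.search_title and number == self.search_number:
            self.found = True
            print(title, '#' + number, 'found')

dh = FindIssue('Sandman', '62')
xml.sax.parseString(SAMPLE, dh)

# A search for an issue that isn't in the collection finds nothing
miss = FindIssue('Sandman', '63')
xml.sax.parseString(SAMPLE, miss)
```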

\subsection{Error Handling}

Now, try running the above code with this file as input:
\begin{verbatim}
<collection>
  &foo;
  <comic title="Sandman" number='62'>
</collection>
\end{verbatim}

The \code{\&foo;} entity is unknown, and the \element{comic} element
isn't closed (if it were empty, there would be a \samp{/} before the
closing \samp{>}).  Why did the file get processed without complaint?
Because the default code for the \class{ErrorHandler} interface does
nothing, and no different implementation was provided, so the errors
are silently ignored.

The \class{ErrorRaiser} class automatically raises an exception for
any error; you'll usually set an instance of this class as the error
handler.  Otherwise, you should provide your own version of the
\class{ErrorHandler} interface, and at minimum override the
\method{error()} and \method{fatalError()} methods.  The minimal
implementation for each method can be a single line.  The three methods
in the \class{ErrorHandler} interface (\method{warning}, \method{error},
and \method{fatalError}) are all passed a single argument, an
exception instance.  The exception will always be a subclass of
\exception{SAXException}, and calling \code{str()} on it will produce
a readable error message explaining the problem.

So, to re-implement a variant of \class{ErrorRaiser}, simply define
two of the three methods to raise the exception they're passed:

\begin{verbatim}
    def error(self, exception):
        raise exception
    def fatalError(self, exception):
        raise exception
\end{verbatim}

\method{warning()} might simply print the exception to \code{sys.stderr} 
and return without raising the exception.  Now the same incorrect XML
file will cause a traceback to be printed, with the error message
``xml.sax.saxlib.SAXException: reference to unknown entity''.  
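
The same pattern can be tried out with \module{xml.sax} from the
present-day standard library, which is an assumption made for
illustration.  In that library the default \class{ErrorHandler} is
believed to raise for errors already, so the explicit handler below is
there only to show the shape of the interface.  The
\class{StrictErrors} name is hypothetical, and the exact wording of the
error message depends on the underlying parser, so the sketch does not
rely on it.

```python
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler

BAD = b"""<collection>
  &foo;
  <comic title="Sandman" number='62'>
</collection>"""

class StrictErrors(ErrorHandler):
    """Raise on errors, remember warnings (hypothetical class name)."""
    def __init__(self):
        self.warnings = []

    def warning(self, exception):
        # Warnings are recorded but do not stop the parse
        self.warnings.append(str(exception))

    def error(self, exception):
        raise exception

    def fatalError(self, exception):
        raise exception

problem = None
try:
    xml.sax.parseString(BAD, ContentHandler(), StrictErrors())
except xml.sax.SAXException as exc:
    # str() on the exception gives a readable message
    problem = exc
    print('Parse failed:', exc)
```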

\subsection{Searching Element Content}

Let's tackle a slightly more complicated task, printing out all issues
written by a certain author.  This now requires looking at element
content, because the writer's name is inside a \element{writer}
element: \code{<writer>Peter Milligan</writer>}.

The search will be performed using the following algorithm:

\begin{enumerate}
\item 
The \method{startElement} method will be more complicated.  For
\element{comic} elements, the handler has to save the title and
number, in case this comic is later found to match the search
criterion.  For \element{writer} elements, it sets an
\code{inWriterContent} flag to true, and sets a \code{writerName}
attribute to the empty string.

\item Characters outside of XML tags must be processed.  When
\code{inWriterContent} is true, these characters must be added to the
\code{writerName} string.

\item When the \element{writer} element is finished, we've now
collected all of the element's content in the \code{writerName}
attribute, so we can check if the name matches the one we're searching 
for, and if so, print the information about this comic.  We must also
set \code{inWriterContent} back to false.
\end{enumerate}

Here's the first part of the code; this implements step 1.

\begin{verbatim}
from xml.sax import saxlib
import string

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return string.join( string.split(text), ' ')

class FindWriter(saxlib.HandlerBase):
    def __init__(self, search_name):
        # Save the name we're looking for
        self.search_name = normalize_whitespace( search_name )

        # Initialize the flag to false
        self.inWriterContent = 0

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            title = normalize_whitespace( attrs.get('title', "") )
            number = normalize_whitespace( attrs.get('number', "") )
            self.this_title = title
            self.this_number = number

        # If it's the start of a writer element, set flag
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""
\end{verbatim}

The \method{startElement()} method has been discussed previously.  Now
we have to look at how the content of elements is processed.  

The \function{normalize_whitespace()} function is important, and
you'll probably use it in your own code.  XML treats whitespace very
flexibly; you can include extra spaces or newlines wherever you like.
This means that you must normalize the whitespace before comparing
attribute values or element content; otherwise the comparison might
produce a wrong result due to the content of two elements having
different amounts of whitespace.
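
As a quick illustration of what normalization does, here is a sketch
of the same function written with string methods, as it would be in
present-day Python (an assumption; the version used in this HOWTO
relies on the \module{string} module).  The sample strings are invented
for the example.

```python
def normalize_whitespace(text):
    "Collapse each run of whitespace to one space and trim both ends"
    return ' '.join(text.split())

# Two spellings of the same name that differ only in whitespace
a = normalize_whitespace('  Neil \n   Gaiman\t')
b = normalize_whitespace('Neil Gaiman')
print(a == b)
```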

\begin{verbatim}
    def characters(self, ch, start, length):
        if self.inWriterContent:
            self.writerName = self.writerName + ch[start:start+length]
\end{verbatim}

The \method{characters()} method is called for characters that aren't
inside XML tags.  \var{ch} is a string of characters, and \var{start}
is the point in the string where the characters
start.  \var{length} is the length of the character data.  You should
not assume that \var{start} is equal to 0, or that all of \var{ch} is
the character data.  An XML parser could be implemented to read the
entire document into memory as a string, and then operate by indexing
into the string.   This would mean that \var{ch} would always contain
the entire document, and only the values of \var{start} and
\var{length} would be changed.  

You also shouldn't assume that all the characters are passed in a
single function call.  In the example above, there might be only one
call to \method{characters()} for the string \samp{Peter Milligan}, or
the parser might call \method{characters()} once for each character.
More realistically, if the content contains an entity reference, as in
\samp{Wagner \&amp; Seagle}, the parser might call the method three
times: once for \samp{Wagner\ }, once for \samp{\&}, represented by the
entity reference, and again for \samp{\ Seagle}.
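
The safe way to cope with this is to accumulate whatever arrives and
join it afterwards.  The sketch below does so using \module{xml.sax}
from the present-day standard library, which is an assumption made for
illustration; note that in that library \method{characters()} receives
only the data itself, with no \var{start} or \var{length} arguments.
The \class{CollectText} name is hypothetical, and the number of calls
is printed rather than assumed because it depends on the parser.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class CollectText(ContentHandler):
    """Gather character data from however many calls the parser makes."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Each call may deliver all of the text or only a piece of it
        self.chunks.append(content)

handler = CollectText()
xml.sax.parseString(b"<writer>Wagner &amp; Seagle</writer>", handler)

# Joining the pieces gives the full content, however it was split up
text = ''.join(handler.chunks)
print(len(handler.chunks), 'call(s) produced', repr(text))
```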

For step 2 of \class{FindWriter}, \method{characters()} only has to
check \code{inWriterContent}, and if it's true, add the characters to
the string being built up.

Finally, when the \element{writer} element ends, the entire name has
been collected, so we can compare it to the name we're searching for.

\begin{verbatim}
    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number
\end{verbatim}

This is an unrealistically simple comparison: whitespace is normalized,
but a difference in capitalization or punctuation will still defeat the
match.  It's good enough for an example.
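
Putting the three steps together, a complete and self-contained
variant of \class{FindWriter} might look like the sketch below.  It
assumes \module{xml.sax} from the present-day standard library rather
than the package described here, collects matches in a \code{matches}
list (an addition for this sketch) instead of printing them, and uses a
second comic whose title and writer are invented sample data.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return ' '.join(text.split())

class FindWriter(ContentHandler):
    def __init__(self, search_name):
        super().__init__()
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        self.matches = []   # added for this sketch

    def startElement(self, name, attrs):
        if name == 'comic':
            # Step 1: remember which comic we're inside
            self.this_title = normalize_whitespace(attrs.get('title', ''))
            self.this_number = normalize_whitespace(attrs.get('number', ''))
        elif name == 'writer':
            self.inWriterContent = True
            self.writerName = ''

    def characters(self, content):
        # Step 2: accumulate the writer's name piece by piece
        if self.inWriterContent:
            self.writerName = self.writerName + content

    def endElement(self, name):
        # Step 3: compare once the whole name has been seen
        if name == 'writer':
            self.inWriterContent = False
            if normalize_whitespace(self.writerName) == self.search_name:
                self.matches.append((self.this_title, self.this_number))

# The second comic is invented sample data
SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil
        Gaiman</writer>
  </comic>
  <comic title="Example Title" number='1'>
    <writer>Another Writer</writer>
  </comic>
</collection>"""

finder = FindWriter('Neil   Gaiman')
xml.sax.parseString(SAMPLE, finder)
print(finder.matches)
```

The writer's name in the sample is deliberately split across two lines,
and the search string contains extra spaces, to show that normalizing
both sides makes the comparison succeed.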

End tags can't have attributes on them, so there's no \var{attrs}
parameter.  Empty elements with attributes, such as \samp{<arc
name="Season of Mists"/>}, will result in a call to
\method{startElement()}, followed immediately by a call to \method{endElement()}.
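
The event order for an empty element can be checked with a small
logging handler.  This sketch assumes \module{xml.sax} from the
present-day standard library, and the \class{EventLog} name is
hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class EventLog(ContentHandler):
    """Record start and end events in the order they arrive."""
    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Copy the attributes into an ordinary dictionary
        self.events.append(('start', name, dict(attrs.items())))

    def endElement(self, name):
        # No attributes are passed for the end of an element
        self.events.append(('end', name))

log = EventLog()
xml.sax.parseString(b'<arc name="Season of Mists"/>', log)
print(log.events)
```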

XXX how are external entities handled?  Anything special need to be
done for them?

\subsection{Related Links}

\begin{definitions}
\term{\url{http://www.megginson.com/SAX/}}
%
The SAX home page.  This has the most recent copy of the
specification, and lists SAX implementations for various languages and
platforms.  At the moment it's somewhat Java-centric.

\end{definitions}

\section{DOM: The Document Object Model}
\label{DOM}

\emph{The Document Object Model is currently at first draft stage, and 
isn't even close to being a standard.  The Python DOM is therefore not 
yet documented.  If you want to look at the code and use it anyway,
feel free (report any bugs you find), but be aware that your code may
need to be changed for future DOM drafts.}

The Document Object Model specifies a tree-based representation for an
XML document.  A top-level Document instance is the root of the tree,
and has a single child which is the top-level Element instance; this
instance has child nodes representing the content and any
sub-elements.  These sub-element nodes can have further children, and
so forth.  Functions are defined which let you traverse the resulting
tree any way you like, access element and attribute values, insert and
delete nodes, and convert the tree back into XML.

The DOM is useful for modifying the tree; you can remove a node from
one place in the tree, and insert it somewhere else.  You can also
construct a DOM tree yourself, and convert it to XML; this is often a
more flexible way of producing XML output than simply writing
\code{<tag1>}...\code{</tag1>} to a file.
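
Since the Python DOM in this package is not yet documented, the
following sketch uses \module{xml.dom.minidom} from the present-day
standard library instead; that substitution is an assumption made for
illustration, and the calls shown may not match this package's DOM.
The two-chapter book is invented sample data.  The sketch moves a node,
builds a new one, and converts the tree back into XML.

```python
from xml.dom import minidom

# Hypothetical two-chapter book used only for this sketch
doc = minidom.parseString(
    "<book><chapter>One</chapter><chapter>Two</chapter></book>")
book = doc.documentElement

# Remove the first chapter from its place and re-insert it at the end
first = book.firstChild
book.removeChild(first)
book.appendChild(first)

# Build a new node from scratch and add it to the tree
extra = doc.createElement('chapter')
extra.appendChild(doc.createTextNode('Three'))
book.appendChild(extra)

# Convert the tree back into XML
result = book.toxml()
print(result)
```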

While the DOM doesn't require that the entire tree be resident in
memory at one time, the Python implementation currently keeps the
whole tree in RAM.  This means you may not have enough memory to
process very large documents, measuring tens or hundreds of megabytes.
It's possible to write a DOM implementation that stores most of the
tree on disk or in a database, and reads in new sections as they're
accessed, but this hasn't been done yet, and such implementations
often impose limitations on how the tree can be accessed.

%Explanations, sample code, ...

\subsection{Related Links}

\begin{definitions}
\term{\url{http://www.w3.org/DOM/}}
%
The World Wide Web Consortium's DOM page.

\end{definitions}

\section{xmlarch: Architectural Forms}

The \module{xmlarch} module contains an XML architectural forms
processor written in Python.  It allows you to process XML architectural
forms using any parser that uses the SAX interfaces, and it can process
several architectures in one parsing pass.  Architectural document
events for an architecture can even be broadcast to multiple
\class{DocumentHandler} instances; for example, you can have two
handlers for the RDF architecture, three for the XLink architecture and
perhaps one for the HyTime architecture.
   
The architecture processor uses the SAX \class{DocumentHandler} interface,
which means that you can register the architecture handler
(\class{ArchDocHandler}) with any SAX 1.0 compliant parser.
   
It currently does not process any meta document type definition
documents (meta-DTDs). When a DTD parser module is available the code
will be modified to use that in order to process meta-DTD information.
   
Please note that validating and well-formed parsers may report
different SAX events when parsing documents.

The \module{xmlarch} module contains six classes:
\class{ArchDocHandler}, \class{Architecture}, \class{ArchParseState},
\class{ArchException}, \class{AttributeParser} and \class{Normalizer}.

\begin{itemize}
     \item \class{ArchDocHandler} is a subclass of the \class{saxlib.DocumentHandler}
       interface. This is the class used for processing an architectural
       document.

     \item \class{Architecture} contains information about an architecture.

     \item \class{ArchParseState} holds information about an architecture's parse
       state when parsing a document.

     \item \class{AttributeParser} parses architecture use declaration PIs (attribute
       strings).

     \item \class{ArchException} holds information about an architectural exception
       thrown by an \class{ArchDocHandler} instance.

     \item \class{Normalizer} is a document handler that outputs ``normalized'' XML.
\end{itemize}

Using the xmlarch module usually means that you have to do the
following things:

\begin{itemize}
     \item Import the required SAX modules: saxexts, saxlib, saxutils.
     \item Import the xmlarch module.
     \item Create a SAX compliant parser object.
     \item Create an XML architectures processor handler.
     \item Register this handler with the parser.
     \item Add document handlers for the architectures you want to process.
     \item Register a default document handler with the architecture
       processor handler.
     \item Parse a document.
\end{itemize}
       
A simple example:

   Python code:
\begin{verbatim}
# Import needed modules
from xml.sax import saxexts, saxlib, saxutils
import sys, xmlarch

# Create architecture processor handler
arch_handler = xmlarch.ArchDocHandler()

# Create parser and register architecture processor with it
parser = saxexts.XMLParserFactory.make_parser()
parser.setDocumentHandler(arch_handler)

# Add a document handler to process the html architecture
arch_handler.addArchDocumentHandler("html", xmlarch.Normalizer(sys.stdout))

# Parse (and process) the document
parser.parse("simple.xml")
\end{verbatim}

A sample XML document:
\begin{verbatim}
<?xml version="1.0"?>
<?IS10744:arch name="html"?>
<doc>
<title html="h1">My first architectural document</title>
<author html="address">Geir Ove Gronmo, grove@infotek.no</author>
<para>This is the first paragraph in this document</para>
<para html="p">This is the second paragraph</para>
</doc>
\end{verbatim}

The result:

\begin{verbatim}
<html>
<h1>My first architectural document</h1>
<address>Geir Ove Gronmo, grove@infotek.no</address>

<p>This is the second paragraph</p>
</html>
\end{verbatim}

See also the files \file{simple.py} and \file{simple.xml} in the
\file{demo/arch} directory of the Python/XML distribution.
   
If you try to process the persons architecture in this document
instead you get the following output:
\begin{verbatim}
<persons>

<author>Geir Ove Gronmo</author><mentioned>Eliot Kimber</mentioned><mentioned>David Megginson</mentioned><mentioned>Lars Marius Garshol</mentioned>
</persons>
\end{verbatim}
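
The mapping performed in the simple example can be illustrated without
\module{xmlarch}. The following is a minimal sketch, not the module's
actual implementation: it uses the \module{xml.sax} package from the
current Python standard library, and the \class{ArchFilter} class and
its driver function are hypothetical names introduced here. It assumes
that the architecture name is passed in explicitly instead of being
read from the architecture use declaration PI, and that the document
element maps to an element named after the architecture. The sample
is a shortened version of the document shown earlier.

\begin{verbatim}
# Illustrative sketch only: uses the standard-library xml.sax package,
# not the xmlarch module described in this section.
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ArchFilter(xml.sax.ContentHandler):
    """Forward only the elements mapped to one architecture (hypothetical)."""

    def __init__(self, arch_name, downstream):
        super().__init__()
        self.arch_name = arch_name    # architecture to extract, e.g. "html"
        self.downstream = downstream  # handler receiving architectural events
        self.stack = []               # architectural name (or None) per open element

    def startElement(self, name, attrs):
        if not self.stack:
            # Assumption: the document element maps to an element
            # named after the architecture itself.
            arch_elem = self.arch_name
        else:
            # The attribute named after the architecture gives the form.
            arch_elem = attrs.get(self.arch_name)
        self.stack.append(arch_elem)
        if arch_elem is not None:
            self.downstream.startElement(arch_elem, {})

    def endElement(self, name):
        arch_elem = self.stack.pop()
        if arch_elem is not None:
            self.downstream.endElement(arch_elem)

    def characters(self, content):
        # Text is passed on only if the enclosing element is architectural.
        if self.stack and self.stack[-1] is not None:
            self.downstream.characters(content)


def process(xml_text, arch_name):
    """Return the architectural document for arch_name as a string."""
    out = io.StringIO()
    handler = ArchFilter(arch_name, XMLGenerator(out))
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return out.getvalue()


SAMPLE = """<?xml version="1.0"?>
<?IS10744:arch name="html"?>
<doc>
<title html="h1">A title</title>
<author html="address">An author</author>
<para>First paragraph</para>
<para html="p">Second paragraph</para>
</doc>
"""

print(process(SAMPLE, "html"))
\end{verbatim}

In this sketch the first \code{para} element carries no \code{html}
attribute, so neither the element nor its text reaches the downstream
handler, which is why the result in the simple example contains an
empty line where that paragraph would have been.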

A more complex example:

   Python code:
\begin{verbatim}
# Import needed modules
from xml.sax import saxexts, saxlib, saxutils
import sys, xmlarch

# create architecture processor handler
arch_handler = xmlarch.ArchDocHandler()

# Create parser and register architecture processor with it
parser = saxexts.XMLParserFactory.make_parser()
parser.setDocumentHandler(arch_handler)

# Add document handlers to process the html and biblio architectures
arch_handler.addArchDocumentHandler("html",
    xmlarch.Normalizer(open("html.out", "w")))
arch_handler.addArchDocumentHandler("biblio",
    saxutils.ESISDocHandler(open("biblio1.out", "w")))
arch_handler.addArchDocumentHandler("biblio",
    saxutils.Canonizer(open("biblio2.out", "w")))

# Register a default document handler that just passes through any
# incoming events
arch_handler.setDefaultDocumentHandler(xmlarch.Normalizer(sys.stdout))

# Parse (and process) the document
parser.parse("complex.xml")
\end{verbatim}

Because this produces a lot of output, I've not included the XML document
and the results. See instead the files \file{complex.py} and
\file{complex.xml} in the \file{demo/arch} directory of the Python/XML
distribution and try it yourself.
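
The idea of broadcasting the events of one architecture to several
handlers, as done for the biblio architecture above, can be sketched
independently of \module{xmlarch}. The following is only an
illustration under stated assumptions: it uses the \module{xml.sax}
package from the current Python standard library, and the
\class{Broadcaster} and \class{ElementCounter} classes are
hypothetical names introduced here, not part of the Python/XML
distribution.

\begin{verbatim}
# Illustrative sketch only: uses the standard-library xml.sax package,
# not the xmlarch module described in this section.
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class Broadcaster(xml.sax.ContentHandler):
    """Forward every event to all registered handlers (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.handlers = []

    def add_handler(self, handler):
        self.handlers.append(handler)

    def startElement(self, name, attrs):
        for handler in self.handlers:
            handler.startElement(name, attrs)

    def endElement(self, name):
        for handler in self.handlers:
            handler.endElement(name)

    def characters(self, content):
        for handler in self.handlers:
            handler.characters(content)


class ElementCounter(xml.sax.ContentHandler):
    """Count start tags, to show a second, unrelated consumer."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


out = io.StringIO()
counter = ElementCounter()

fanout = Broadcaster()
fanout.add_handler(XMLGenerator(out))  # first consumer: writes the XML out
fanout.add_handler(counter)            # second consumer: counts elements

xml.sax.parseString(
    b"<biblio><book>SGML</book><book>XML</book></biblio>", fanout)

print(out.getvalue())
print(counter.count)
\end{verbatim}

Both consumers see the same event stream from a single parsing pass,
which is the property the architecture processor provides when several
handlers are added for the same architecture.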
   
\subsection{Related Links}


\section{Glossary}

XML has given rise to a sea of acronyms and terms.  This section will
list the most significant terms, and sketch their relevance.

Many of the following definitions are taken from Lars Marius Garshol's
SGML glossary, at \url{http://www.stud.ifi.uio.no/\~larsga/download/diverse/sgmlglos.html}.

\begin{definitions}
\term{DOM (Document Object Model)}
%
The Document Object Model is intended to be a platform- and
language-neutral interface that will allow programs and scripts to
dynamically access and update the content, structure and style of
documents. Documents will be represented as tree structures which can
be traversed and modified.

\term{DTD (Document Type Definition)}
%
A Document Type Definition (nearly always called DTD) defines
an XML document type, complete with element types, entities
and an XML declaration.
       
In other words: a DTD completely describes one particular kind
of XML document, such as, for instance, HTML 3.2.
        
\term{SAX (Simple API for XML)}
%
SAX is a simple standardized API for XML parsers developed by the
contributors to the xml-dev mailing list. The interface is mostly
language-independent, as long as the language is object-oriented; the
first implementation was written for Java, but a Python implementation
is also available.  SAX is supported by many XML parsers.
          
\term{XML (eXtensible Markup Language)}
%
XML is an SGML application profile specialized for use on the
Web; its own standards for linking and stylesheets are under development.
          
%XML-Data

\term{XSL (eXtensible Style Language)}
%
XSL is a proposal for a stylesheet language for XML, which
enables browsers to lay out XML documents in an attractive
manner, and also provides a way to convert XML documents to
HTML.
\end{definitions}
          
%\section{Related Links}
%
%This section collects all
%the links from the preceding sections.

\end{document}