                              Python/XML HOWTO
     _________________________________________________________________
   
                   The Python/XML Special Interest Group
                                      
                             xml-sig@python.org
                       (edited by akuchling@acm.org)
                                      
  Abstract:
  
   XML is the eXtensible Markup Language, a subset of SGML, intended to
   allow the creation and processing of application-specific markup
   languages. Python makes an excellent language for processing XML data.
   This document is a tutorial for the Python/XML package. It assumes
   you're already familiar with the structure and terminology of XML.
   
   This is a draft document; 'XXX' in the text indicates that something
   has to be filled in later, or rewritten, or verified, or something.
   
Contents

     * Contents
     * 1. Introduction to XML
          + 1.1 Related Links
     * 2. Installing the XML Toolkit
          + 2.1 Related Links
     * 3. SAX: The Simple API for XML
          + 3.1 Starting Out
          + 3.2 Error Handling
          + 3.3 Searching Element Content
          + 3.4 Related Links
     * 4. DOM: The Document Object Model
          + 4.1 Getting A DOM Tree
          + 4.2 Manipulating The Tree
          + 4.3 Walking Over The Entire Tree
          + 4.4 Building A Document
          + 4.5 Processing HTML
          + 4.6 Related Links
     * 5. Glossary
       
                            1. Introduction to XML
                                       
   XML, the eXtensible Markup Language, is a simplified dialect of SGML,
   the Standard Generalized Markup Language. XML is intended to be
   reasonably simple to implement and use, and is already being used for
   specifying markup languages for various new standards: MathML for
   expressing mathematical equations, Synchronized Multimedia Integration
   Language for multimedia presentations, and so forth.
   
   SGML and XML represent a document by tagging the document's various
   components with their function, or meaning. For example, an academic
   paper contains several parts: it has a title, one or more authors, an
   abstract, the actual text of the paper, a list of references, and so
   forth. A markup language for writing such papers would therefore have
   tags for indicating what the contents of the abstract are, what the
   title is, and so forth. This should not be confused with the physical
   details of how the document is actually printed on paper. The abstract
   might be printed with narrow margins in a smaller font than the rest
   of the document, but the markup usually won't be concerned with
   details such as this; other software will translate from the markup
   language to a typesetting language such as TeX, and will handle the
   details.
   
   A markup language specified using XML looks a lot like HTML; a
   document consists of a single element, which contains sub-elements,
   which can have further sub-elements inside them. Elements are
   indicated by tags in the text. Tags are always inside angle brackets
   < >. There are two forms of elements. An element can contain content
   between opening and closing tags, as in <name>Euryale</name>, which is
   a name element containing the data "Euryale". This content may be text
   data, other XML elements, or a mixture of both. Elements can also be
   empty, in which case they contain nothing, and are represented as a
   single tag ended with a slash, as in <stop/>, which is an empty stop
   element. Unlike HTML, XML element names are case-sensitive; stop and
   Stop are two different element types.
   
   Opening and empty tags can also contain attributes, which specify
   values associated with an element. For example, in text such as <name
   lang='greek'>Herakles</name>, the name element has a lang attribute
   with the value "greek". This contrasts with <name
   lang='latin'>Hercules</name>, where the attribute's value is "latin".
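   
   As a minimal sketch of the element and attribute forms described
   above, the following snippet parses a small document and inspects
   it. It assumes a Python 3 interpreter and the xml.dom.minidom module
   from the standard library rather than the PyXML package; the names
   wrapper element is made up for the illustration.
   
```python
from xml.dom.minidom import parseString

# A hypothetical document: a single root element holding two name
# elements with content and one empty stop element.
doc = parseString(
    "<names>"
    "<name lang='greek'>Herakles</name>"
    "<name lang='latin'>Hercules</name>"
    "<stop/>"
    "</names>"
)

# Map each lang attribute value to the text content of its element.
names = doc.getElementsByTagName("name")
by_lang = {n.getAttribute("lang"): n.firstChild.data for n in names}
print(by_lang)

# An empty element has no child nodes at all.
stop = doc.getElementsByTagName("stop")[0]
print(stop.hasChildNodes())

# Element names are case-sensitive, so "Name" matches nothing here.
print(len(doc.getElementsByTagName("Name")))
```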
   
   A given XML language is specified with a Document Type Definition, or
   DTD. The DTD declares the element names that are allowed, and how
   elements can be nested inside each other. The DTD also specifies the
   attributes that can be provided for each element, their default
   values, and whether they can be omitted. To take an example from
   HTML, the LI element, representing an entry in a list, can only
   occur inside certain elements which represent lists, such as OL or UL.
   A validating parser can be given a DTD and a document, and either
   verify that the document is legal according to the DTD's rules or
   determine that one or more rules have been violated.
   
   Applications that process XML can be classed into two types. The
   simplest class is an application that only handles one particular
   markup language. For example, a chemistry program may only need to
   process Chemical Markup Language, but not MathML. This application can
   therefore be written specifically for a single DTD, and doesn't need
   to be capable of handling multiple markup languages. This type is
   simpler to write, and can easily be implemented with the available
   Python software.
   
   The second type of application is less common, and has to be able to
   handle any markup language you throw at it. An example might be a
   smart XML editor that helps you to write XML that conforms to a
   selected DTD; it might do so by not letting you enter an element where
   it would be illegal, or by suggesting elements that can be placed at
   the current cursor location. Such an application needs to handle any
   possible XML-defined markup, and therefore must be able to obtain a
   data structure embodying the DTD in use. XXX This type of application
   can't currently be implemented in Python without difficulty (XXX but
   wait and see if a DTD module is included...)
   
   For the full details of XML's syntax, the one definitive source is the
   XML 1.0 specification, available on the Web at
   http://www.w3.org/TR/xml-spec.html. However, like all specifications,
   it's quite formal and isn't intended to be a friendly introduction or
   a tutorial. The annotated version of the standard, at
   http://www.xml.com/xml/pub/axml/axmlintro.html, is quite helpful in
   clarifying the specification's intent. There are also various informal
   tutorials and books available to introduce you to XML.
   
   The rest of this HOWTO will assume that you're familiar with the
   relevant terminology. Most sections will use XML terms such as element
   and attribute; section 4 on the Document Object Model will assume that
   you've read the relevant Working Draft, and are familiar with things
   like Iterators and Nodes. Section 3 does not require that you have
   experience with the Java SAX implementations.
   
1.1 Related Links

                         2. Installing the XML Toolkit
                                       
   Windows users should get the precompiled version at
   http://sourceforge.net/projects/pyxml; Mac users will use the
   corresponding precompiled version at XXX. Linux users may wish to use
   either the Debian package from XXX, or the RPM from
   http://sourceforge.net/projects/pyxml. To compile from source on a
   Unix platform, simply perform the following steps.
   
   1.
          If you are using Python 1.5, you need to install the
          distutils first, which are available from
          http://www.python.org/sigs/distutils-sig. Python 1.6 and later
          already include the distutils, so you can skip this step.
   2.
          Get a copy of the source distribution from
          http://sourceforge.net/projects/pyxml. Unpack it with the
          following command.
          
gzip -dc xml-package.tgz | tar -xvf -

   3.
   Run:
   
python setup.py install

   This operation requires a C compiler, the same one that was used to
   build Python itself. On a Unix system, it may also require superuser
   permissions. setup.py supports a number of different commands and
   options; invoke setup.py without any arguments to obtain help.
   
   If you have difficulty installing this software, send a problem report
   to <xml-sig@python.org> describing the problem, or submit a bug report
   at http://sourceforge.net/projects/pyxml.
   
   There are various demonstration programs in the demo/ directory of the
   source distribution. You may wish to look at them next to get an
   impression of what's possible with the XML tools, and as a source of
   example code.
   
2.1 Related Links

   http://www.python.org/topics/xml/
          This is the starting point for Python-related XML topics; it is
          updated to refer to all software, mailing lists, documentation,
          etc.
          
                        3. SAX: The Simple API for XML
                                       
   The Simple API for XML isn't a standard in the formal sense, but an
   informal specification designed by David Megginson, with input from
   many people on the xml-dev mailing list. SAX defines an event-driven
   interface for parsing XML. To use SAX, you must create Python class
   instances which implement a specified interface, and the parser will
   then call various methods of those objects.
   
   This HOWTO describes version 2 of SAX (also referred to as SAX2).
   Earlier versions of this text explained SAX1, which is now primarily
   of historical interest.
   
   SAX is most suitable for purposes where you want to read through an
   entire XML document from beginning to end, and perform some
   computation, such as building a data structure representing a
   document, or summarizing information in a document (computing an
   average value of a certain element, for example). It's not very useful
   if you want to modify the document structure in some complicated way
   that involves changing how elements are nested, though it could be
   used if you simply wish to change element contents or attributes. For
   example, you would not want to re-order chapters in a book using SAX,
   but you might want to change the contents of any name elements with
   the attribute lang equal to 'greek' into Greek letters.
   
   One advantage of SAX is speed and simplicity. Let's say you've defined
   a complicated DTD for listing comic books, and you wish to scan
   through your collection and list everything written by Neil Gaiman.
   For this specialized task, there's no need to expend effort examining
   elements for artists and editors and colourists, because they're
   irrelevant to the search. You can therefore write a handler class
   which ignores all elements other than writer.
   
   Another advantage is that you don't have the whole document resident
   in memory at any one time, which matters if you are processing really
   huge documents.
   
   SAX defines four basic interfaces; a SAX-compliant XML parser can be
   passed any objects that support these interfaces, and will call
   various methods as data is processed. Your task, therefore, is to
   implement those interfaces that are relevant to your application.
   
   The SAX interfaces are:
   
   ContentHandler
          Called for general document events. This interface is the
          heart of SAX; its methods are called for the start of the
          document, the start and end of elements, and for the
          characters of data contained inside elements.
          
   DTDHandler
          Called to handle DTD events required for basic parsing. This
          means notation declarations (XML spec section 4.7) and
          unparsed entity declarations (XML spec section 4).
          
   EntityResolver
          Called to resolve references to external entities. If your
          documents will have no external entity references, you won't
          need to implement this interface.
          
   ErrorHandler
          Called for error handling. The parser will call methods from
          this interface to report all warnings and errors.
          
   Python doesn't support the concept of interfaces, so the interfaces
   listed above are implemented as Python classes. The default method
   implementations are defined to do nothing--the method body is just a
   Python pass statement--so usually you can simply ignore methods that
   aren't relevant to your application.
   
   Pseudo-code for using SAX looks something like this:
   
# Define your specialized handler classes
import sys
from xml.sax import ContentHandler, ...

class docHandler(ContentHandler):
    ...

# Create an instance of the handler class
dh = docHandler()

# Create an XML parser
parser = ...

# Tell the parser to use your handler instance
parser.setContentHandler(dh)

# Parse the file; your handler's methods will get called
parser.parse(sys.stdin)
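
   A runnable variant of this outline might look as follows. It is only
   a sketch under some assumptions: it uses the xml.sax package from the
   Python 3 standard library rather than PyXML, it reads from an
   in-memory byte stream instead of sys.stdin, and the DocHandler class
   with its seen list is invented for the illustration.
   
```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class DocHandler(ContentHandler):
    """Record the name of every element as it starts."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append(name)


# Create the handler and the parser, then associate them
dh = DocHandler()
parser = make_parser()
parser.setContentHandler(dh)

# Parse from an in-memory byte stream so the sketch is self-contained
parser.parse(io.BytesIO(b"<a><b/><c/></a>"))
print(dh.seen)
```

   Only startElement is overridden; every other ContentHandler method
   keeps its do-nothing default.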

3.1 Starting Out

   Following the earlier example, let's consider a simple XML format for
   storing information about a comic book collection. Here's a sample
   document for a collection consisting of a single issue:
   
<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>

   An XML document must have a single root element; this is the
   "collection" element. It has one child comic element for each issue;
   the book's title and number are given as attributes of the comic
   element, which can have one or more children containing the issue's
   writer and artists. There may be several artists or writers for a
   single issue.
   
   Let's start off with something simple: a document handler named
   FindIssue that reports whether a given issue is in the collection.
   
from xml.sax import saxutils

class FindIssue(saxutils.DefaultHandler):
    def __init__(self, title, number):
        self.search_title, self.search_number = title, number

   The DefaultHandler class inherits from all four interfaces:
   ContentHandler, DTDHandler, EntityResolver, and ErrorHandler. This is
   what you should use if you want to use one class for everything. When
   you want separate classes for each purpose, or if you want to
   implement only a single interface, you can just subclass each
   interface individually. Neither of the two approaches is always
   ``better'' than the other; their suitability depends on what you're
   trying to do, and on what you prefer.
   
   Since this class is doing a search, an instance needs to know what to
   search for. The desired title and issue number are passed to the
   FindIssue constructor, and stored as part of the instance.
   
   Now let's look at the function which actually does all the work. This
   simple task only requires looking at the attributes of a given
   element, so only the startElement method is relevant.
   
    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != 'comic': return

        # Look for the title and number attributes (see text)
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if title == self.search_title and number == self.search_number:
            print title, '#'+str(number), 'found'

   The startElement() method is passed a string giving the name of the
   element, and an instance containing the element's attributes. The
   latter implements the SAX2 Attributes interface, which includes most
   of the semantics of Python dictionaries. Therefore, the function looks
   for comic elements, and compares the specified title and number
   attributes to the search values. If they match, a message is printed
   out.
   
   startElement() is called for every single element in the document. If
   you added print 'Starting element:', name to the top of
   startElement(), you would get the following output.
   
Starting element: collection
Starting element: comic
Starting element: writer
Starting element: penciller
Starting element: penciller

   To actually use the class, we need top-level code that creates
   instances of a parser and of FindIssue, associates them, and then
   calls a parser method to process the input.
   
import sys
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces

if __name__ == '__main__':
    # Create a parser
    parser = make_parser()
    # Tell the parser we are not interested in XML namespaces
    parser.setFeature(feature_namespaces, 0)

    # Create the handler
    dh = FindIssue('Sandman', '62')

    # Tell the parser to use our handler
    parser.setContentHandler(dh)

    # Parse the input, read here from standard input
    parser.parse(sys.stdin)

   The make_parser function automates the job of creating parsers. There
   are already several XML parsers available to Python, and more might be
   added in future. xmllib.py is included with Python 1.5, so it's always
   available, but it's also not particularly fast. A faster version of
   xmllib.py is included in xml.parsers. The xml.parsers.expat module is
   faster still, so it's the preferred choice if it's available.
   make_parser determines which parsers are available and chooses the
   fastest one, so you don't have to know what the different parsers are,
   or how they differ. (You can also tell make_parser to try a list of
   parsers, if you want to use a specific one.)
   
   SAX2 supports XML namespaces. When namespace processing is active,
   parsers call startElementNS instead of startElement. Since our
   content handler does not implement the namespace-aware methods, we
   request that namespace processing be deactivated. The default for
   this setting varies from parser to parser, so you should always set
   it explicitly, unless your handlers implement both sets of methods.
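   
   For comparison, here is a sketch of a handler that does implement the
   namespace-aware method, with namespace processing switched on. It
   assumes the xml.sax package of the Python 3 standard library; the
   NSHandler class and the http://example.org/comics namespace URI are
   hypothetical.
   
```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces


class NSHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self.elements = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) tuple
        self.elements.append(name)


parser = make_parser()
# Ask for namespace processing, so startElementNS is called
parser.setFeature(feature_namespaces, True)
handler = NSHandler()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(b'<c:comic xmlns:c="http://example.org/comics"/>'))
print(handler.elements)
```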
   
   Once you've created a parser instance, calling setContentHandler tells
   the parser what to use as the handler.
   
   If you run the above code with the sample XML document, it'll output
   Sandman #62 found.
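   
   The pieces above can be assembled into one self-contained script.
   The sketch below is an adaptation, not the code shipped with PyXML:
   it assumes Python 3, subclasses ContentHandler from the standard
   library's xml.sax.handler in place of saxutils.DefaultHandler, feeds
   the sample document from a byte string through parseString, and adds
   a found flag that is not part of the original class.
   
```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler

SAMPLE = b"""<collection>
  <comic title="Sandman" number='62'>
    <writer>Neil Gaiman</writer>
    <penciller pages='1-9,18-24'>Glyn Dillon</penciller>
    <penciller pages="10-17">Charles Vess</penciller>
  </comic>
</collection>"""


class FindIssue(ContentHandler):
    def __init__(self, title, number):
        super().__init__()
        self.search_title, self.search_number = title, number
        self.found = False  # added for the sketch

    def startElement(self, name, attrs):
        # If it's not a comic element, ignore it
        if name != "comic":
            return
        # Attribute values are strings, so the number is compared as '62'
        if (attrs.get("title") == self.search_title
                and attrs.get("number") == self.search_number):
            self.found = True
            print(self.search_title, "#" + self.search_number, "found")


handler = FindIssue("Sandman", "62")
parseString(SAMPLE, handler)
```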
   
3.2 Error Handling

   Now, try running the above code with this file as input:
   
<collection>
  &foo;
  <comic title="Sandman" number='62'>
</collection>

   The &foo; entity is unknown, and the comic element isn't closed (if it
   were empty, there would be a "/" before the closing ">"). As a result,
   you get a SAXParseException, e.g.
   
xml.sax._exceptions.SAXParseException: undefined entity at None:2:2

   The default code for the ErrorHandler interface automatically raises
   an exception for any error; if that is what you want in case of an
   error, you don't need to change the error handler. Otherwise, you
   should provide your own version of the ErrorHandler interface, and at
   minimum override the error() and fatalError() methods. The minimal
   implementation for each method can be a single line. The methods in
   the ErrorHandler interface--warning, error, and fatalError--are all
   passed a single argument, an exception instance. The exception will
   always be a subclass of SAXException, and calling str() on it will
   produce a readable error message explaining the problem.
   
   So, to re-implement a variant of ErrorRaiser, simply define one of the
   three methods to print the exception it is passed:
   
    def error(self, exception):
        import sys
        sys.stderr.write("%s\n" % exception)

   With this definition, non-fatal errors will result in an error
   message, whereas fatal errors will continue to produce a traceback.
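   
   A fuller error handler could collect the reports instead of printing
   them. The sketch below assumes the xml.sax package of the Python 3
   standard library with its default expat-based parser; the
   CollectingErrorHandler class and its messages list are invented for
   the illustration. Its fatalError method re-raises the exception, so
   parsing still stops at the first fatal error.
   
```python
from xml.sax import parseString, SAXParseException
from xml.sax.handler import ContentHandler, ErrorHandler

BROKEN = b"""<collection>
  &foo;
  <comic title="Sandman" number='62'>
</collection>"""


class CollectingErrorHandler(ErrorHandler):
    def __init__(self):
        self.messages = []

    def warning(self, exception):
        self.messages.append("warning: %s" % exception)

    def error(self, exception):
        self.messages.append("error: %s" % exception)

    def fatalError(self, exception):
        # Record the problem, then stop parsing by re-raising it
        self.messages.append("fatal: %s" % exception)
        raise exception


errors = CollectingErrorHandler()
try:
    parseString(BROKEN, ContentHandler(), errors)
except SAXParseException as exc:
    print("parsing stopped:", exc.getMessage())
print(errors.messages)
```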
   
3.3 Searching Element Content

   Let's tackle a slightly more complicated task, printing out all issues
   written by a certain author. This now requires looking at element
   content, because the writer's name is inside a writer element:
   <writer>Peter Milligan</writer>.
   
   The search will be performed using the following algorithm:
   
   1.
          The startElement method will be more complicated. For comic
          elements, the handler has to save the title and number, in case
          this comic is later found to match the search criterion. For
          writer elements, it sets an inWriterContent flag to true, and
          sets a writerName attribute to the empty string.
   2.
          Characters outside of XML tags must be processed. When
          inWriterContent is true, these characters must be added to the
          writerName string.
   3.
          When the writer element is finished, we've now collected all of
          the element's content in the writerName attribute, so we can
          check if the name matches the one we're searching for, and if
          so, print the information about this comic. We must also set
          inWriterContent back to false.
          
   Here's the first part of the code; this implements step 1.
   
from xml.sax import ContentHandler
import string

def normalize_whitespace(text):
    "Remove redundant whitespace from a string"
    return string.join(string.split(text), ' ')

class FindWriter(ContentHandler):
    def __init__(self, search_name):
        # Save the name we're looking for
        self.search_name = normalize_whitespace(search_name)

        # Initialize the flag to false
        self.inWriterContent = 0

    def startElement(self, name, attrs):
        # If it's a comic element, save the title and issue
        if name == 'comic':
            title = normalize_whitespace(attrs.get('title', ""))
            number = normalize_whitespace(attrs.get('number', ""))
            self.this_title = title
            self.this_number = number

        # If it's the start of a writer element, set flag
        elif name == 'writer':
            self.inWriterContent = 1
            self.writerName = ""

   The startElement() method has been discussed previously. Now we have
   to look at how the content of elements is processed.
   
   The normalize_whitespace() function is important, and you'll probably
   use it in your own code. XML treats whitespace very flexibly; you can
   include extra spaces or newlines wherever you like. This means that
   you must normalize the whitespace before comparing attribute values or
   element content; otherwise the comparison might produce a wrong
   result due to the content of two elements having different amounts of
   whitespace.
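   
   On a current Python the same normalization can be written with string
   methods instead of the string module. This is a sketch assuming
   Python 3; the sample input is made up.
   
```python
def normalize_whitespace(text):
    "Collapse runs of whitespace and strip both ends of a string"
    # split() with no argument splits on any run of whitespace and
    # discards leading and trailing whitespace
    return " ".join(text.split())


print(normalize_whitespace("  Neil \n   Gaiman "))
```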
   
    def characters(self, ch):
        if self.inWriterContent:
            self.writerName = self.writerName + ch

   The characters() method is called for characters that aren't inside
   XML tags. ch is a string of characters. It is not necessarily a byte
   string; parsers may also provide a buffer object that is a slice of
   the full document, or they may pass Unicode objects (as the expat
   parser does in Python 2.0).
   
   You also shouldn't assume that all the characters are passed in a
   single function call. In the example above, there might be only one
   call to characters() for the string "Peter Milligan", or it might call
   characters() once for each character. More realistically, if the
   content contains an entity reference, as in "Wagner &amp; Seagle", the
   parser might call the method three times; once for "Wagner ", once for
   "&", represented by the entity reference, and again for " Seagle".
   
   For step 2 of FindWriter, characters() only has to check
   inWriterContent, and if it's true, add the characters to the string
   being built up.
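   
   To see how a particular parser delivers character data, the chunks
   can simply be recorded. The sketch below assumes the xml.sax package
   of the Python 3 standard library; ChunkCollector is a hypothetical
   helper. The number of calls depends on the parser, which is why the
   handler joins the pieces rather than relying on a single call.
   
```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class ChunkCollector(ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Each call delivers one piece of the element's text
        self.chunks.append(content)


collector = ChunkCollector()
parseString(b"<writer>Wagner &amp; Seagle</writer>", collector)
print(len(collector.chunks), "call(s) to characters()")
print("".join(collector.chunks))
```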
   
   Finally, when the writer element ends, the entire name has been
   collected, so we can compare it to the name we're searching for.
   
    def endElement(self, name):
        if name == 'writer':
            self.inWriterContent = 0
            self.writerName = normalize_whitespace(self.writerName)
            if self.search_name == self.writerName:
                print 'Found:', self.this_title, self.this_number

   To avoid being confused by differing whitespace, the
   normalize_whitespace() function is called. This can be done because we
   know that leading and trailing whitespace are insignificant for this
   element, in this DTD.
   
   End tags can't have attributes on them, so there's no attrs parameter.
   Empty elements with attributes, such as "<arc name="Season of
   Mists"/>", will result in a call to startElement(), followed
   immediately by a call to endElement().
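   
   Put together, the three steps form the complete handler sketched
   below. It is an adaptation under stated assumptions: Python 3, the
   standard library's xml.sax instead of PyXML, and a matches list that
   collects results instead of printing them. The two-issue collection
   is sample data invented for the test run.
   
```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


def normalize_whitespace(text):
    return " ".join(text.split())


class FindWriter(ContentHandler):
    def __init__(self, search_name):
        super().__init__()
        self.search_name = normalize_whitespace(search_name)
        self.inWriterContent = False
        self.matches = []  # added for the sketch

    def startElement(self, name, attrs):
        # Step 1: remember the current comic, or start a writer name
        if name == "comic":
            self.this_title = normalize_whitespace(attrs.get("title", ""))
            self.this_number = normalize_whitespace(attrs.get("number", ""))
        elif name == "writer":
            self.inWriterContent = True
            self.writerName = ""

    def characters(self, content):
        # Step 2: accumulate text while inside a writer element
        if self.inWriterContent:
            self.writerName += content

    def endElement(self, name):
        # Step 3: the name is complete, so compare it
        if name == "writer":
            self.inWriterContent = False
            if normalize_whitespace(self.writerName) == self.search_name:
                self.matches.append((self.this_title, self.this_number))


# Hypothetical sample data; note the line break inside the first name
COLLECTION = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil
       Gaiman</writer>
  </comic>
  <comic title="Shade, the Changing Man" number="7">
    <writer>Peter Milligan</writer>
  </comic>
</collection>"""

finder = FindWriter("Neil Gaiman")
parseString(COLLECTION, finder)
print(finder.matches)
```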
   
   XXX how are external entities handled? Anything special need to be
   done for them?
   
3.4 Related Links

   http://www.megginson.com/SAX/
          The SAX home page. This has the most recent copy of the
          specification, and lists SAX implementations for various
          languages and platforms. At the moment it's somewhat
          Java-centric.
          
                       4. DOM: The Document Object Model
                                       
   The Document Object Model specifies a tree-based representation for an
   XML document. A top-level Document instance is the root of the tree,
   and has a single child which is the top-level Element instance; this
   Element has child nodes representing the content and any
   sub-elements, which may have further children, and so forth. Functions
   are defined which let you traverse the resulting tree any way you
   like, access element and attribute values, insert and delete nodes,
   and convert the tree back into XML.
   
   The DOM is useful for modifying XML documents, because you can create
   a DOM tree, modify it by adding new nodes and moving subtrees around,
   and then produce a new XML document as output. You can also construct
   a DOM tree yourself, and convert it to XML; this is often a more
   flexible way of producing XML output than simply writing
   <tag1>...</tag1> to a file.
   
   While the DOM doesn't require that the entire tree be resident in
   memory at one time, the Python DOM implementation currently does keep
   the whole tree in RAM. It's possible to write an implementation that
   stores most of the tree on disk or in a database, and reads in new
   sections as they're accessed, but this hasn't been done yet. This
   means you may not have enough memory to process very large documents
   as a DOM tree. A SAX handler, on the other hand, can potentially churn
   through amounts of data far larger than the available RAM.
   
4.1 Getting A DOM Tree

   The easiest way to get a DOM tree is to have it built for you. PyXML
   offers two alternative implementations of the DOM, xml.dom.minidom and
   4DOM. xml.dom.minidom is included in Python 2. It is a minimalistic
   implementation, which means it does not provide all interfaces and
   operations required by the DOM standard. 4DOM (XXX reference) is a
   complete implementation of DOM Level 2 (which is currently work in
   progress), so we will use that in the examples.
   
   One of the modules in the xml.dom package is xml.dom.ext.reader.Sax2,
   which provides the functions FromXmlStream, FromXml, FromXmlFile, and
   FromXmlUrl, which construct a DOM tree from their input (a
   file-like object, a string, a file name, and a URL, respectively).
   They all return a DOM Document object.
   
import sys
from xml.dom.ext.reader.Sax2 import FromXmlStream
from xml.dom.ext import PrettyPrint

# parse the document
doc = FromXmlStream(sys.stdin)
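
   Where 4DOM is not installed, xml.dom.minidom can build a tree in much
   the same way. The sketch below assumes Python 3 and its standard
   library; parseString plays roughly the role of FromXml here, and the
   one-comic document is made up.
   
```python
from xml.dom.minidom import parseString

# Parse a string into a DOM Document object
doc = parseString("<collection><comic title='Sandman'/></collection>")

# The Document node is the root of the tree; its documentElement
# attribute holds the top-level Element
root = doc.documentElement
print(doc.nodeName)
print(root.tagName)

# Convert the subtree back into XML text
print(root.toxml())
```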

4.2 Manipulating The Tree

   This HOWTO can't be a complete introduction to the Document Object
   Model, because there are lots of interfaces and lots of methods.
   Luckily, the DOM Recommendation is quite a readable document, so I'd
   recommend that you read it to get a complete picture of the available
   interfaces; this will only be a partial overview.
   
   The Document Object Model represents an XML document as a tree of
   nodes, each represented by an instance of some subclass of the Node
   class. Some subclasses of Node are Element, Text, and Comment.
   
   We'll use a single example document throughout this section. Here's
   the sample:
   
<?xml version="1.0" encoding="iso-8859-1"?>
<xbel>
  <?processing instruction?>
  <desc>No description</desc>
  <folder>
    <title>XML bookmarks</title>
    <bookmark href="http://www.python.org/sigs/xml-sig/" >
      <title>SIG for XML Processing in Python</title>
    </bookmark>
  </folder>
</xbel>

   Converted to a DOM tree, this document could produce the following
   tree:
   
Element xbel None
   Text #text '  \012  '
   ProcessingInstruction processing 'instruction'
   Text #text '\012  '
   Element desc None
      Text #text 'No description'
   Text #text '\012  '
   Element folder None
      Text #text '\012    '
      Element title None
         Text #text 'XML bookmarks'
      Text #text '\012    '
      Element bookmark None
         Text #text '\012      '
         Element title None
            Text #text 'SIG for XML Processing in Python'
         Text #text '\012    '
      Text #text '\012  '
   Text #text '\012'

   This isn't the only possible tree, because different parsers may
   differ in how they generate Text nodes; any of the Text nodes in the
   above tree might be split into multiple nodes.
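
   A listing in this style can be produced with a short recursive
   function. The sketch below assumes xml.dom.minidom rather than
   4DOM; dump and TYPE_NAMES are hypothetical names chosen for the
   example, the input is a shortened version of the sample document,
   and Python prints the newline as \n rather than \012.

```python
from xml.dom import minidom, Node

# Shortened version of the sample document
SAMPLE = """<xbel>
  <?processing instruction?>
  <desc>No description</desc>
</xbel>"""

# Readable names for the numeric nodeType codes used in this example
TYPE_NAMES = {
    Node.ELEMENT_NODE: "Element",
    Node.TEXT_NODE: "Text",
    Node.PROCESSING_INSTRUCTION_NODE: "ProcessingInstruction",
    Node.COMMENT_NODE: "Comment",
}

def dump(node, depth=0, lines=None):
    """Collect one line per node, indented three spaces per level."""
    if lines is None:
        lines = []
    kind = TYPE_NAMES.get(node.nodeType, "Node")
    lines.append("%s%s %s %r" % ("   " * depth, kind,
                                 node.nodeName, node.nodeValue))
    for child in node.childNodes:
        dump(child, depth + 1, lines)
    return lines

doc = minidom.parseString(SAMPLE)
listing = dump(doc.documentElement)
print("\n".join(listing))
```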
   
  4.2.1 The Node class
  
   We'll start by considering the basic Node class. All the other DOM
   nodes -- Document, Element, Text, and so forth -- are subclasses of
   Node. It's possible to perform many tasks using just the interface
   provided by Node.
   
   XXX table of attributes and methods. In the IDL notation of the DOM
   Recommendation, the Node interface declares the following
   attributes and methods:
   
readonly attribute DOMString       nodeName;
         attribute DOMString       nodeValue;
                                   // raises(DOMException) on setting
                                   // raises(DOMException) on retrieval
readonly attribute unsigned short  nodeType;
readonly attribute Node            parentNode;
readonly attribute NodeList        childNodes;
readonly attribute Node            firstChild;
readonly attribute Node            lastChild;
readonly attribute Node            previousSibling;
readonly attribute Node            nextSibling;
readonly attribute NamedNodeMap    attributes;
readonly attribute Document        ownerDocument;

Node    insertBefore(in Node newChild, in Node refChild)
                                   raises(DOMException);
Node    replaceChild(in Node newChild, in Node oldChild)
                                   raises(DOMException);
Node    removeChild(in Node oldChild)
                                   raises(DOMException);
Node    appendChild(in Node newChild)
                                   raises(DOMException);
boolean hasChildNodes();
Node    cloneNode(in boolean deep);
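
   A few of these attributes and methods are enough to navigate and
   rearrange a tree. The following is a minimal sketch that assumes
   xml.dom.minidom as the implementation; the small document and the
   variable names are made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<folder><title>XML bookmarks</title><bookmark/></folder>"
)
folder = doc.documentElement

# Navigation attributes
title = folder.firstChild        # the <title> element
bookmark = folder.lastChild      # the <bookmark> element
siblings_linked = title.nextSibling is bookmark

# Create a new element and place it in front of <bookmark>
desc = doc.createElement("desc")
desc.appendChild(doc.createTextNode("No description"))
folder.insertBefore(desc, bookmark)

# removeChild() returns the node it detached from the tree
removed = folder.removeChild(title)

# cloneNode(True) copies the node together with its descendants
copy = desc.cloneNode(True)

names = [child.nodeName for child in folder.childNodes]
print(names)
```

   Note that createElement and createTextNode belong to the Document
   interface, not to Node; they are used here only to obtain a node to
   insert.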
   
  4.2.2 Document, Element, and Text nodes
  
   The base of the entire tree is the Document node. Its documentElement
   attribute contains the Element node for the root element. The Document
   node may have additional children, such as ProcessingInstruction
   nodes; the complete list of children XXX.
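
   The relationship between the Document node and the root element can
   be inspected directly. This sketch again assumes xml.dom.minidom;
   the input document, including its xml-stylesheet processing
   instruction, is invented for the example.

```python
from xml.dom import minidom, Node

# Invented input with a processing instruction before the root element
SAMPLE = ("<?xml-stylesheet href='style.css'?>"
          "<xbel><desc>No description</desc></xbel>")
doc = minidom.parseString(SAMPLE)

# documentElement is the Element node for the root element
root = doc.documentElement
is_document = doc.nodeType == Node.DOCUMENT_NODE

# The Document node's children include the processing instruction
# as well as the root element
top_level = [child.nodeName for child in doc.childNodes]
print(top_level)

# Character data is held in Text nodes below the Element nodes
desc_text = root.firstChild.firstChild.data
print(desc_text)
```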
   
4.3 Walking Over The Entire Tree

   The xml.dom package also includes various helper classes for common
   tasks such as walking over trees.
   
   The Walker class
   
   Introduction to the walker class
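
   A tree walk can also be written by hand using only the Node
   interface. The sketch below is not the Walker class from the
   xml.dom package: walk is a hypothetical helper, and the example
   assumes xml.dom.minidom as the DOM implementation.

```python
from xml.dom import minidom, Node

def walk(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        for descendant in walk(child):
            yield descendant

doc = minidom.parseString(
    "<folder><title>XML bookmarks</title>"
    "<bookmark><title>SIG for XML Processing in Python</title></bookmark>"
    "</folder>"
)

# Collect the text of every <title> element, at any depth
titles = [
    node.firstChild.data
    for node in walk(doc)
    if node.nodeType == Node.ELEMENT_NODE and node.tagName == "title"
]
print(titles)
```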
   
4.4 Building A Document

   Intro to builder
   
4.5 Processing HTML

   Intro to HTML builder
   
4.6 Related Links

   http://www.w3.org/DOM/
          The World Wide Web Consortium's DOM page.
          
   http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001/
          The DOM Level 1 Recommendation. Unlike most standards, this one
          is actually pretty readable, particularly if you're only
          interested in the Core XML interfaces.
          
                                  5. Glossary
                                       
   XML has given rise to a sea of acronyms and terms. This section will
   list the most significant terms, and sketch their relevance.
   
   Many of the following definitions are taken from Lars Marius Garshol's
   SGML glossary, at
   http://www.stud.ifi.uio.no/larsga/download/diverse/sgmlglos.html.
   
   DOM (Document Object Model)
          The Document Object Model is intended to be a platform- and
          language-neutral interface that will allow programs and scripts
          to dynamically access and update the content, structure and
          style of documents. Documents will be represented as tree
          structures which can be traversed and modified.
          
   DTD (Document Type Definition)
          A Document Type Definition (nearly always called DTD) defines
          an XML document type, complete with element types, entities and
          an XML declaration. In other words: a DTD completely describes
          one particular kind of XML document, such as, for instance,
          HTML 3.2.
          
   SAX (Simple API for XML)
          SAX is a simple standardized API for XML parsers developed by
          the contributors to the xml-dev mailing list. The interface is
          mostly language-independent, as long as the language is
          object-oriented; the first implementation was written for Java,
          but a Python implementation is also available. SAX is supported
          by many XML parsers.
          
   XML (eXtensible Markup Language)
          XML is an SGML application profile specialized for use on the
          web and has its own standards for linking and stylesheets under
          development.
          
   XSL (eXtensible Style Language)
          XSL is a proposal for a stylesheet language for XML, which
          enables browsers to lay out XML documents in an attractive
          manner, and also provides a way to convert XML documents to
          HTML.
          
                            About this document ...
                                       
   Python/XML HOWTO
   
   This document was generated using the LaTeX2HTML translator.
   
   LaTeX2HTML is Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos,
   Computer Based Learning Unit, University of Leeds, and Copyright ©
   1997, 1998, Ross Moore, Mathematics Department, Macquarie University,
   Sydney.
   
   The application of LaTeX2HTML to the Python documentation has been
   heavily tailored by Fred L. Drake, Jr. Original navigation icons were
   contributed by Christopher Petrilli.