                                  pullparser

   A simple "pull API" for HTML parsing, modelled on Perl's
   HTML::TokeParser. Many straightforward HTML parsing tasks are easier
   this way than with the callback-based HTMLParser module.
   pullparser.PullParser is a subclass of HTMLParser.HTMLParser.
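
   To make the "pull" idea concrete: with HTMLParser the parser calls
   your handler methods, whereas with a pull API your code asks for the
   next token when it wants one. The following is only a minimal sketch
   of that idea, not pullparser's implementation. It assumes Python 3,
   where the HTMLParser module is named html.parser, and it buffers the
   whole document up front for simplicity. The names Token,
   TokenCollector and pull_tokens are hypothetical.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical token record, loosely modelled on the type/attrs
# attributes used in the examples on this page.
Token = namedtuple("Token", "type data attrs")


class TokenCollector(HTMLParser):
    """Turn HTMLParser's push-style callbacks into a list of tokens."""

    def __init__(self):
        super().__init__()
        self.collected = []

    def handle_starttag(self, tag, attrs):
        self.collected.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.collected.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self.collected.append(Token("data", data, None))


def pull_tokens(html):
    """Return an iterator the caller can pull tokens from on demand."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return iter(collector.collected)


tokens = pull_tokens('<p>See <a href="http://example.com/">the site</a>.</p>')
for token in tokens:
    if token.type == "starttag" and token.data == "a":
        url = dict(token.attrs).get("href", "-")
        text = next(tokens).data  # pull the token that follows the <a> tag
        print("%s\t%s" % (url, text))
```

   The loop body decides when to consume the next token, which is what
   makes small extraction tasks read as straight-line code rather than
   as state spread across several handler methods.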

   Examples:

   This program extracts all links from a document. It will print one
   line for each link, containing the URL and the textual description
   between the <a>...</a> tags:
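import pullparser, sys
f = open(sys.argv[1])
p = pullparser.PullParser(f)
for token in p.tags("a"):
    # tags("a") yields both <a> and </a>; only start tags carry the link
    if token.type == "endtag": continue
    url = dict(token.attrs).get("href", "-")
    # collect the text up to the closing </a>, with whitespace collapsed
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)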

   This program extracts the <title> from the document:
import pullparser, sys
f = open(sys.argv[1])
p = pullparser.PullParser(f)
# move ahead to the next title tag
if p.get_tag("title"):
    # read the text up to the next tag, with whitespace collapsed
    title = p.get_compressed_text()
    print "Title: %s" % title

   Thanks to Gisle Aas, who wrote HTML::TokeParser.

Download

   All documentation (including this web page) is included in the
   distribution.

   Stable release.
     * [2]pullparser-0.1.0.tar.gz
     * [3]pullparser-0.1.0.zip
     * [4]Change Log (included in distribution)
     * [5]Older versions.

   For installation instructions, see the INSTALL file included in the
   distribution.

Subversion

   The [6]Subversion (SVN) trunk is
   [7]http://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check
   out the source:
svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser

See also

   [8]Beautiful Soup is widely recommended, and I recommend it over
   pullparser for new web scraping code. It is more robust and flexible
   than this module.

FAQs

     * Which version of Python do I need?
       2.2.1 or above.
     * Which license?
       pullparser is dual-licensed: you may pick either the [10]BSD
       license, or the [11]ZPL 2.1 (both are included in the
       distribution).
     * Why does it fail to parse my HTML?
       Because the HTMLParser module is fussy. Try
       pullparser.TolerantPullParser, which uses the sgmllib module
       instead. Note that with this class, self-closing tags (<foo/>)
       show up as 'starttag' tokens, not 'startendtag' tokens. This is
       a limitation of sgmllib.
     * Why don't I see the tokens I expect?
          + Are there missing end-tags in your HTML? (Maybe this will
            improve in future.)
          + Element names passed to methods such as
            PullParser.get_token() must be given in lower case - maybe
            you forgot that? (Element names in the HTML can be any case,
            of course.)
          + HTMLParser.HTMLParser isn't very robust. It would be fairly
            easy to rebase (perhaps optionally) on the other standard
            library HTML parsing module, sgmllib.SGMLParser (which is
            really an HTML parser, not a full SGML parser, despite the
            name). I'm not going to do that, though.
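
   Two of the points above can be checked directly against the standard
   library parser: tag names are reported in lower case whatever the
   case in the markup, and a self-closing tag is reported as a
   'startendtag' event. This is a small sketch that assumes Python 3,
   where the HTMLParser module is named html.parser. sgmllib is not
   available there, so the TolerantPullParser behaviour is not shown.
   The class name EventLogger is hypothetical and is not part of
   pullparser.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record (event type, tag name) pairs as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        self.events.append(("startendtag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


logger = EventLogger()
logger.feed('<A HREF="x.html">link</A><BR/>')
logger.close()
for event in logger.events:
    print(event)
```

   The upper-case <A> and <BR/> in the input come back as "a" and "br",
   which is why names passed to the pullparser methods need to be lower
   case in order to match.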

   I prefer questions and comments to be sent to the [12]mailing list
   rather than direct to me.

   [13]John J. Lee, April 2006.

References

   1. http://sourceforge.net/
   2. http://wwwsearch.sourceforge.net/pullparser/src/pullparser-0.1.0.tar.gz
   3. http://wwwsearch.sourceforge.net/pullparser/src/pullparser-0.1.0.zip
   4. http://wwwsearch.sourceforge.net/pullparser/src/ChangeLog.txt
   5. http://wwwsearch.sourceforge.net/pullparser/src/
   6. http://subversion.tigris.org/
   7. http://codespeak.net/svn/wwwsearch/pullparser/trunk#egg=pullparser-dev
   8. http://www.crummy.com/software/BeautifulSoup/
   9. http://www.crummy.com/software/BeautifulSoup/
  10. http://www.opensource.org/licenses/bsd-license.php
  11. http://www.zope.org/Resources/ZPL
  12. http://lists.sourceforge.net/lists/listinfo/wwwsearch-general
  13. mailto:jjl@pobox.com
  14. http://wwwsearch.sourceforge.net/
  15. http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html
  16. http://wwwsearch.sourceforge.net/mechanize/
  17. http://wwwsearch.sourceforge.net/ClientCookie/
  18. http://wwwsearch.sourceforge.net/ClientCookie/doc.html
  19. http://wwwsearch.sourceforge.net/ClientForm/
  20. http://wwwsearch.sourceforge.net/DOMForm/
  21. http://wwwsearch.sourceforge.net/python-spidermonkey/
  22. http://wwwsearch.sourceforge.net/ClientTable/
  23. http://wwwsearch.sourceforge.net/bits/urllib2_152.py
  24. http://wwwsearch.sourceforge.net/bits/urllib_152.py
  25. http://wwwsearch.sourceforge.net/pullparser/#download
  26. http://wwwsearch.sourceforge.net/pullparser/#faq