pullparser
A simple "pull API" for HTML parsing, modelled on Perl's
HTML::TokeParser. Many simple HTML parsing tasks are easier this way
than with the callback-based HTMLParser module.
pullparser.PullParser is a subclass of HTMLParser.HTMLParser.
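To illustrate what "pull" means here, the following is a minimal sketch written against Python 3's html.parser, not pullparser's own implementation. The names Token, TokenCollector, iter_tokens and extract_links are hypothetical. For simplicity the sketch buffers every token before yielding any, which a real pull parser need not do.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical token record; pullparser's own token class may differ.
Token = namedtuple("Token", "type data attrs")


class TokenCollector(HTMLParser):
    """Turns HTMLParser's push-style callbacks into a list of tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(Token("starttag", tag, attrs))

    def handle_endtag(self, tag):
        self.tokens.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self.tokens.append(Token("data", data, None))


def iter_tokens(html):
    """Yield tokens one at a time so that the caller drives the loop."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    yield from collector.tokens


def extract_links(html):
    """Return (href, text) pairs by pulling tokens up to each </a>."""
    links = []
    tokens = iter_tokens(html)
    for token in tokens:
        if token.type == "starttag" and token.data == "a":
            href = dict(token.attrs).get("href", "-")
            parts = []
            for inner in tokens:  # keep pulling from the same iterator
                if inner.type == "endtag" and inner.data == "a":
                    break
                if inner.type == "data":
                    parts.append(inner.data)
            # Collapse whitespace runs in the link text.
            links.append((href, " ".join("".join(parts).split())))
    return links
```

The caller asks for the next token when it wants one, so the code that handles an element can sit in one place instead of being spread over several callback methods.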
Examples:
This program extracts all links from a document. It will print one
line for each link, containing the URL and the textual description
between the <a>...</a> tags:
import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
for token in p.tags("a"):
    if token.type == "endtag": continue
    url = dict(token.attrs).get("href", "-")
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)
This program extracts the <title> from the document:
import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
if p.get_tag("title"):
    title = p.get_compressed_text()
    print "Title: %s" % title
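Both examples call get_compressed_text(). The sketch below assumes that "compressed" means each run of whitespace is collapsed to a single space and the ends are trimmed. That reading is an assumption based on the method name, and the helper name compress_whitespace is hypothetical (it is not part of pullparser).

```python
import re


def compress_whitespace(text):
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()
```

Under that assumption, a title spread over several indented lines in the HTML source would come back as a single line of text.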
Thanks to Gisle Aas, who wrote HTML::TokeParser.
Download
All documentation (including this web page) is included in the
distribution.
Stable release.
* [2]pullparser-0.1.0.tar.gz
* [3]pullparser-0.1.0.zip
* [4]Change Log (included in distribution)
* [5]Older versions.
For installation instructions, see the INSTALL file included in the
distribution.
Subversion
The [6]Subversion (SVN) trunk is
[7]http://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check
out the source:
svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser
See also
[8]Beautiful Soup is widely recommended and is more robust than this
module. I recommend [9]Beautiful Soup over pullparser for new web
scraping code, since it is also more flexible.
FAQs
* Which version of Python do I need?
2.2.1 or above.
* Which license?
pullparser is dual-licensed: you may pick either the [10]BSD
license, or the [11]ZPL 2.1 (both are included in the
distribution).
* Why does it fail to parse my HTML?
Because module HTMLParser is fussy. Try
pullparser.TolerantPullParser, which uses module sgmllib
instead. Note that self-closing tags (for example, <br/>) will show
up as 'starttag' tags, not 'startendtag' tags, if you use this
class; this is a limitation of module sgmllib.
* Why don't I see the tokens I expect?
+ Are there missing end-tags in your HTML? (Maybe this will
improve in future.)
+ Element names passed to methods such as
PullParser.get_token() must be given in lower case - maybe
you forgot that? (Element names in the HTML can be any case,
of course.)
+ HTMLParser.HTMLParser isn't very robust. It would be fairly
easy to (perhaps optionally) rebase on the other standard
library HTML parsing module, sgmllib.SGMLParser (which is
really an HTML parser, not a full SGML parser, despite the
name). I'm not going to do that, though.
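Two of the points above (lower-case element names, and 'starttag' versus 'startendtag') can be seen in the underlying standard library parser. This is a small sketch using Python 3's html.parser, the successor of the HTMLParser module. It does not use pullparser or sgmllib, and the names TagLogger and tag_events are hypothetical.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records which callback fires for each tag and the name it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style self-closing tags such as <br/>.
        self.events.append(("startendtag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


def tag_events(html):
    """Return the (event type, tag name) pairs seen while parsing html."""
    logger = TagLogger()
    logger.feed(html)
    logger.close()
    return logger.events
```

The parser hands tag names to the callbacks in lower case whatever case the HTML uses, which is why element names have to be given in lower case. A parser that does not distinguish self-closing tags would report <br/> as a plain start tag.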
I prefer questions and comments to be sent to the [12]mailing list
rather than direct to me.
[13]John J. Lee, April 2006.
References
1. http://sourceforge.net/
2. http://wwwsearch.sourceforge.net/pullparser/src/pullparser-0.1.0.tar.gz
3. http://wwwsearch.sourceforge.net/pullparser/src/pullparser-0.1.0.zip
4. http://wwwsearch.sourceforge.net/pullparser/src/ChangeLog.txt
5. http://wwwsearch.sourceforge.net/pullparser/src/
6. http://subversion.tigris.org/
7. http://codespeak.net/svn/wwwsearch/pullparser/trunk#egg=pullparser-dev
8. http://www.crummy.com/software/BeautifulSoup/
9. http://www.crummy.com/software/BeautifulSoup/
10. http://www.opensource.org/licenses/bsd-license.php
11. http://www.zope.org/Resources/ZPL
12. http://lists.sourceforge.net/lists/listinfo/wwwsearch-general
13. mailto:jjl@pobox.com
14. http://wwwsearch.sourceforge.net/
15. http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html
16. http://wwwsearch.sourceforge.net/mechanize/
17. http://wwwsearch.sourceforge.net/ClientCookie/
18. http://wwwsearch.sourceforge.net/ClientCookie/doc.html
19. http://wwwsearch.sourceforge.net/ClientForm/
20. http://wwwsearch.sourceforge.net/DOMForm/
21. http://wwwsearch.sourceforge.net/python-spidermonkey/
22. http://wwwsearch.sourceforge.net/ClientTable/
23. http://wwwsearch.sourceforge.net/bits/urllib2_152.py
24. http://wwwsearch.sourceforge.net/bits/urllib_152.py
25. http://wwwsearch.sourceforge.net/pullparser/#download
26. http://wwwsearch.sourceforge.net/pullparser/#faq