pullparser

A simple "pull API" for HTML parsing, after Perl's HTML::TokeParser. Many simple HTML parsing tasks are simpler this way than with the HTMLParser module. pullparser.PullParser is a subclass of HTMLParser.HTMLParser.

Examples:

This program extracts all links from a document. It prints one line for each link, containing the URL and the textual description between the <a>...</a> tags:

    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    for token in p.tags("a"):
        # p.tags("a") yields both start and end tags; skip the end tags
        if token.type == "endtag":
            continue
        url = dict(token.attrs).get("href", "-")
        # collect the link text up to the matching </a>
        text = p.get_compressed_text(endat=("endtag", "a"))
        print "%s\t%s" % (url, text)

This program extracts the <title> from the document:

    import pullparser, sys

    f = file(sys.argv[1])
    p = pullparser.PullParser(f)
    if p.get_tag("title"):
        title = p.get_compressed_text()
        print "Title: %s" % title

Thanks to Gisle Aas, who wrote HTML::TokeParser.

Download

All documentation (including this web page) is included in the distribution.

Stable release.

  * [2]pullparser-0.1.0.tar.gz
  * [3]pullparser-0.1.0.zip
  * [4]Change Log (included in distribution)
  * [5]Older versions.

For installation instructions, see the INSTALL file included in the distribution.

Subversion

The [6]Subversion (SVN) trunk is [7]http://codespeak.net/svn/wwwsearch/pullparser/trunk, so to check out the source:

    svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser

See also

[8]Beautiful Soup is widely recommended, and I recommend it over pullparser for new web scraping code: it is more robust and flexible than this module.

FAQs

  * Which version of Python do I need?
    2.2.1 or above.
  * Which license?
    pullparser is dual-licensed: you may pick either the [10]BSD license, or the [11]ZPL 2.1 (both are included in the distribution).
  * Why does it fail to parse my HTML?
    Because module HTMLParser is fussy. Try pullparser.TolerantPullParser instead, which uses module sgmllib.
    Note that self-closing tags (<foo/>) will show up as 'starttag' tags, not 'startendtag' tags, if you use this class. This is a limitation of module sgmllib.
  * Why don't I see the tokens I expect?
      + Are there missing end-tags in your HTML? (Maybe this will improve in future.)
      + Element names passed to methods such as PullParser.get_token() must be given in lower case. Maybe you forgot that? (Element names in the HTML can be any case, of course.)
      + HTMLParser.HTMLParser isn't very robust. It would be fairly easy to rebase (perhaps optionally) on the other standard library HTML parsing module, sgmllib.SGMLParser (which is really an HTML parser, not a full SGML parser, despite the name). I'm not going to do that, though.

I prefer questions and comments to be sent to the [12]mailing list rather than direct to me.

[13]John J. Lee, April 2006.

References

   1. http://sourceforge.net/
   2. http://wwwsearch.sourceforge.net/pullparser/src/pullparser-0.1.0.tar.gz
   3. http://wwwsearch.sourceforge.net/pullparser/src/pullparser-0.1.0.zip
   4. http://wwwsearch.sourceforge.net/pullparser/src/ChangeLog.txt
   5. http://wwwsearch.sourceforge.net/pullparser/src/
   6. http://subversion.tigris.org/
   7. http://codespeak.net/svn/wwwsearch/pullparser/trunk#egg=pullparser-dev
   8. http://www.crummy.com/software/BeautifulSoup/
   9. http://www.crummy.com/software/BeautifulSoup/
  10. http://www.opensource.org/licenses/bsd-license.php
  11. http://www.zope.org/Resources/ZPL
  12. http://lists.sourceforge.net/lists/listinfo/wwwsearch-general
  13. mailto:jjl@pobox.com
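
Appendix: a sketch of the pull style

As a rough illustration of the "pull API" idea described at the top of this page, and of the token types and lower-case element names mentioned in the FAQs, here is a minimal self-contained sketch. It is not pullparser's implementation: it targets Python 3's standard html.parser module (the successor to the HTMLParser module), and the names MiniPullParser and Token are hypothetical, chosen only for this example. Only the method names get_tag and get_compressed_text are borrowed from the examples above, and their behaviour here is a simplification.

```python
from collections import namedtuple
from html.parser import HTMLParser

# Hypothetical token type for this sketch; pullparser's own token class may differ.
Token = namedtuple("Token", "type data attrs")


class MiniPullParser(HTMLParser):
    """Toy pull-style wrapper: tokenize up front, then hand out tokens on request."""

    def __init__(self, html):
        super().__init__(convert_charrefs=True)
        self._tokens = []
        self._pos = 0
        self.feed(html)
        self.close()

    # Push-style HTMLParser callbacks: each one just records a token.
    def handle_starttag(self, tag, attrs):
        self._tokens.append(Token("starttag", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self._tokens.append(Token("startendtag", tag, attrs))

    def handle_endtag(self, tag):
        self._tokens.append(Token("endtag", tag, None))

    def handle_data(self, data):
        self._tokens.append(Token("data", data, None))

    # Pull-style methods: the caller asks for the next thing it wants.
    def get_tag(self, *names):
        """Return the next tag token named in `names` (any tag if empty), or None."""
        while self._pos < len(self._tokens):
            tok = self._tokens[self._pos]
            self._pos += 1
            if tok.type != "data" and (not names or tok.data in names):
                return tok
        return None

    def get_compressed_text(self, endat=None):
        """Collect text up to the next tag, or up to the (type, name) pair in
        `endat` if given, and collapse runs of whitespace."""
        parts = []
        while self._pos < len(self._tokens):
            tok = self._tokens[self._pos]
            if tok.type == "data":
                parts.append(tok.data)
            elif endat is None or (tok.type, tok.data) == endat:
                break
            self._pos += 1
        return " ".join("".join(parts).split())


# Element names are upper case in the HTML but must be asked for in lower case.
doc = '<HTML><title> My   page </title><p><a href="/x">First <b>link</b></a><br/></p></HTML>'
p = MiniPullParser(doc)

title = p.get_compressed_text() if p.get_tag("title") else None

links = []
while True:
    tok = p.get_tag("a")
    if tok is None:
        break
    if tok.type == "endtag":
        continue
    url = dict(tok.attrs).get("href", "-")
    links.append((url, p.get_compressed_text(endat=("endtag", "a"))))

# With html.parser, <br/> is reported as a 'startendtag' token.
self_closing = [t.data for t in p._tokens if t.type == "startendtag"]
```

The design point is that the calling code drives the loop and asks for the next tag or run of text it cares about, instead of spreading its logic across handler callbacks.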