<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
@# This file is processed by EmPy to colorize Python source code
@# http://wwwsearch.sf.net/bits/colorize.py
@{from colorize import colorize}
@{import time}
@{import release}
@{last_modified = release.svn_id_to_time("$Id: README.html.in 25584 2006-04-08 18:27:21Z jjlee $")}
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="author" content="John J. Lee &lt;jjl@@pobox.com&gt;">
<meta name="date" content="@(time.strftime("%Y-%m-%d", last_modified))">
<meta name="keywords" content="Python,HTML,token,parser,pull,API,web,client,client-side">
<title>pullparser</title>
<style type="text/css" media="screen">@@import "../styles/style.css";</style>
<base href="http://wwwsearch.sourceforge.net/pullparser/">
</head>
<body>
<div id="sf"><a href="http://sourceforge.net">
<img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2"
width="125" height="37" alt="SourceForge.net Logo"></a></div>
<!--<img src="../images/sflogo.png"-->
<h1>pullparser</h1>
<div id="Content">
<p>A simple "pull API" for HTML parsing, after Perl's
<code>HTML::TokeParser</code>. Many simple HTML parsing tasks are
easier in this style, where your code asks for the next token, than with
the callback-based <code>HTMLParser</code> module.
<code>pullparser.PullParser</code> is a subclass of
<code>HTMLParser.HTMLParser</code>.
<p>Examples:
<p>This program extracts all links from a document. It prints one line for
each link, containing the URL and the textual description between the
<code>&lt;a&gt;...&lt;/a&gt;</code> tags:
@{colorize(r"""
import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
# tags("a") yields both start and end tags; only start tags carry the href
for token in p.tags("a"):
    if token.type == "endtag": continue
    url = dict(token.attrs).get("href", "-")
    # collect the link text up to the closing </a>
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)
""")}
<p>This program extracts the <code>&lt;title&gt;</code> from the document:
@{colorize(r"""
import pullparser, sys
f = file(sys.argv[1])
p = pullparser.PullParser(f)
# skip ahead to the first <title> tag, if there is one
if p.get_tag("title"):
    title = p.get_compressed_text()
    print "Title: %s" % title
""")}
<p>Thanks to Gisle Aas, who wrote <code>HTML::TokeParser</code>.
<a name="download"></a>
<h2>Download</h2>
<p>All documentation (including this web page) is included in the distribution.
<p><em>Stable release.</em>
<ul>
@{version = "0.1.0"}
<li><a href="./src/pullparser-@(version).tar.gz">pullparser-@(version).tar.gz</a>
<li><a href="./src/pullparser-@(version).zip">pullparser-@(version).zip</a>
<li><a href="./src/ChangeLog.txt">Change Log</a> (included in distribution)
<li><a href="./src/">Older versions.</a>
</ul>
<p>For installation instructions, see the INSTALL file included in the
distribution.
<h2>Subversion</h2>
<p>The <a href="http://subversion.tigris.org/">Subversion (SVN)</a> trunk is <a href="http://codespeak.net/svn/wwwsearch/pullparser/trunk#egg=pullparser-dev">http://codespeak.net/svn/wwwsearch/pullparser/trunk</a>, so to check out the source:
<pre>
svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser
</pre>
<h2>See also</h2>
<p>I recommend <a
href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> over
pullparser for new web scraping code. It is widely recommended, and is more
robust and flexible than this module.
<a name="faq"></a>
<h2>FAQs</h2>
<ul>
<li>Which version of Python do I need?
<p>2.2.1 or above.
<li>Which license?
<p>pullparser is dual-licensed: you may pick either the
<a href="http://www.opensource.org/licenses/bsd-license.php">BSD license</a>,
or the <a href="http://www.zope.org/Resources/ZPL">ZPL 2.1</a> (both are
included in the distribution).
<li>Why does it fail to parse my HTML?
<p>Because module <code>HTMLParser</code> is fussy. Try
<code>pullparser.TolerantPullParser</code>, which uses module
<code>sgmllib</code> instead. Note that with this class, self-closing tags
(<code>&lt;foo/&gt;</code>) show up as 'starttag' tags, not 'startendtag'
tags; this is a limitation of module <code>sgmllib</code>.
<li>Why don't I see the tokens I expect?
<p>
<ul>
<li>Are there missing end-tags in your HTML? (Maybe this will improve in
future.)
<li>Element names passed to methods such as
<code>PullParser.get_token()</code> must be given in lower case. Maybe you
forgot that? (Element names <em>in the HTML</em> can be any case, of
course.)
<li><code>HTMLParser.HTMLParser</code> isn't very robust.
<code>pullparser.TolerantPullParser</code> uses the other standard library
HTML parsing module, <code>sgmllib</code>, instead (its
<code>SGMLParser</code> class is really an HTML parser, not a full SGML
parser, despite the name).
</ul>
</ul>
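<p>The lower-case rule and the distinction between 'starttag' and
'startendtag' tokens can be seen in the underlying callback parser. The sketch
below uses Python 3's <code>html.parser</code> (the successor of the Python 2
<code>HTMLParser</code> module) rather than pullparser itself, and
<code>EventLogger</code> is a hypothetical helper. It assumes, as seems likely,
that pullparser inherits this behaviour from its base class:

```python
# Python 3 sketch using the standard library's html.parser.
# EventLogger is a hypothetical helper, not part of pullparser.
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record which callback fires for each tag, and with what name."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        self.events.append(("startendtag", tag))


logger = EventLogger()
logger.feed('<A HREF="x">link</A><BR/>')
logger.close()
print(logger.events)
```

<p>The parser reports tag names in lower case whatever their case in the
markup, so a lookup by an upper-case name would never match.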
<p>Please send questions and comments to the <a
href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general">
mailing list</a> rather than directly to me.
<p><a href="mailto:jjl@@pobox.com">John J. Lee</a>,
@(time.strftime("%B %Y", last_modified)).
<hr>
</div>
<div id="Menu">
@(release.navbar('pullparser'))
<br>
<a href="./#download">Download</a><br>
<a href="./#faq">FAQs</a><br>
</div>
</body>
</html>