<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">
@# This file is processed by EmPy to colorize Python source code
@# http://wwwsearch.sf.net/bits/colorize.py
@{from colorize import colorize}
@{import time}
@{import release}
@{last_modified = release.svn_id_to_time("$Id: README.html.in 25584 2006-04-08 18:27:21Z jjlee $")}
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <meta name="author" content="John J. Lee &lt;jjl@@pobox.com&gt;">
  <meta name="date" content="@(time.strftime("%Y-%m-%d", last_modified))">
  <meta name="keywords" content="Python,HTML,token,parser,pull,API,web,client,client-side">
  <title>pullparser</title>
  <style type="text/css" media="screen">@@import "../styles/style.css";</style>
  <base href="http://wwwsearch.sourceforge.net/pullparser/">
</head>
<body>

<div id="sf"><a href="http://sourceforge.net">
<img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2"
 width="125" height="37" alt="SourceForge.net Logo"></a></div>
<!--<img src="../images/sflogo.png"-->

<h1>pullparser</h1>

<div id="Content">

<p>pullparser provides a simple "pull API" for HTML parsing, modelled on
Perl's <code>HTML::TokeParser</code>.  Many simple HTML parsing tasks are
easier with this API, where your code asks for the next token, than with the
callbacks of the <code>HTMLParser</code> module.
<code>pullparser.PullParser</code> is a subclass of
<code>HTMLParser.HTMLParser</code>.

<p>Examples:

<p>This program extracts all links from a document.  It will print one line for
each link, containing the URL and the textual description between the
<code>&lt;a&gt;...&lt;/a&gt;</code> tags:

@{colorize(r"""
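import sys

import pullparser

# Parse the HTML file named on the command line.
f = open(sys.argv[1])
p = pullparser.PullParser(f)
for token in p.tags("a"):
    # tags() yields both start and end tags; only start tags carry href.
    if token.type == "endtag":
        continue
    url = dict(token.attrs).get("href", "-")
    # Collect the link text up to the closing a tag, whitespace collapsed.
    text = p.get_compressed_text(endat=("endtag", "a"))
    print "%s\t%s" % (url, text)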
""")}

<p>This program extracts the <code>&lt;title&gt;</code> from the document:

@{colorize(r"""
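import sys

import pullparser

f = open(sys.argv[1])
p = pullparser.PullParser(f)
# get_tag() skips ahead to the next title tag.
if p.get_tag("title"):
    # With no endat argument, text is read up to the next tag.
    title = p.get_compressed_text()
    print "Title: %s" % title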
""")}

<p>Thanks to Gisle Aas, who wrote <code>HTML::TokeParser</code>.


<a name="download"></a>
<h2>Download</h2>
<p>All documentation (including this web page) is included in the distribution.

<p><em>Stable release.</em>

<ul>
@{version = "0.1.0"}
<li><a href="./src/pullparser-@(version).tar.gz">pullparser-@(version).tar.gz</a>
<li><a href="./src/pullparser-@(version).zip">pullparser-@(version).zip</a>
<li><a href="./src/ChangeLog.txt">Change Log</a> (included in distribution)
<li><a href="./src/">Older versions.</a>
</ul>

<p>For installation instructions, see the INSTALL file included in the
distribution.


<h2>Subversion</h2>

<p>The <a href="http://subversion.tigris.org/">Subversion (SVN)</a> trunk is <a href="http://codespeak.net/svn/wwwsearch/pullparser/trunk#egg=pullparser-dev">http://codespeak.net/svn/wwwsearch/pullparser/trunk</a>, so to check out the source:

<pre>
svn co http://codespeak.net/svn/wwwsearch/pullparser/trunk pullparser
</pre>


<h2>See also</h2>

<p>I recommend <a
href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> over
pullparser for new web scraping code.  It is widely recommended, and is more
robust and flexible than this module.

<a name="faq"></a>
<h2>FAQs</h2>
<ul>
  <li>Which version of Python do I need?
  <p>2.2.1 or above.
  <li>Which license?
  <p>pullparser is dual-licensed: you may pick either the
     <a href="http://www.opensource.org/licenses/bsd-license.php">BSD license</a>,
     or the <a href="http://www.zope.org/Resources/ZPL">ZPL 2.1</a> (both are
     included in the distribution).
  <li>Why does it fail to parse my HTML?
    <p>Because module <code>HTMLParser</code> is fussy.  Try
    <code>pullparser.TolerantPullParser</code>, which uses module
    <code>sgmllib</code> instead.  Note that with this class, self-closing
    tags (<code>&lt;foo/&gt;</code>) show up as <code>'starttag'</code>
    tokens, not <code>'startendtag'</code> tokens; this is a limitation of
    module <code>sgmllib</code>.
  <li>Why don't I see the tokens I expect?
  <p>
  <ul>
    <li>Are there missing end-tags in your HTML?  (Maybe this will improve in
        future.)
    <li>Element names passed to methods such as
        <code>PullParser.get_token()</code> must be given in lower case.
        Maybe you forgot that?  (Element names <em>in the HTML</em> can be
        any case, of course.)
    <li><code>HTMLParser.HTMLParser</code> isn't very robust.  If that is
        the cause, try <code>pullparser.TolerantPullParser</code>, which is
        based on the other standard library HTML parsing module,
        <code>sgmllib.SGMLParser</code> (which is really an HTML parser, not
        a full SGML parser, despite the name).
  </ul>
</ul>
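
<p>Two of the points above (lower-case element names, and
<code>'startendtag'</code> for self-closing tags) come from the underlying
standard library parser.  The sketch below shows that behaviour directly.
It assumes Python 3, where the module is named <code>html.parser</code>;
<code>sgmllib</code> was removed in Python 3, so only the
<code>HTMLParser</code> side is shown.  <code>TagLogger</code> and
<code>log_tags</code> are hypothetical names for this illustration, not part
of pullparser.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record (token type, tag name) for every tag callback."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(("starttag", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br/>.
        self.seen.append(("startendtag", tag))

    def handle_endtag(self, tag):
        self.seen.append(("endtag", tag))


def log_tags(html):
    """Return the list of (token type, tag name) pairs found in html."""
    logger = TagLogger()
    logger.feed(html)
    logger.close()
    return logger.seen


# Tag names arrive lower-cased whatever their case in the HTML.
print(log_tags('<A HREF="x">link</A><BR/>'))
```

<p>Because the names are already lower case by the time they become tokens,
a name given in upper case can never match.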

<p>I prefer questions and comments to be sent to the <a
href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general">
mailing list</a> rather than direct to me.

<p><a href="mailto:jjl@@pobox.com">John J. Lee</a>,
@(time.strftime("%B %Y", last_modified)).

<hr>

</div>

<div id="Menu">

@(release.navbar('pullparser'))

<br>

<a href="./#download">Download</a><br>
<a href="./#faq">FAQs</a><br>

</div>


</body>
</html>