<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">




<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <meta name="author" content="John J. Lee &lt;jjl@pobox.com&gt;">
  <meta name="date" content="2006-01-05">
  <meta name="keywords" content="FAQ,cookie,HTTP,HTML,form,table,Python,web,client,client-side,testing,sniffer,https,script,embedded">
  <title>Python web-client programming general FAQs</title>
  <style type="text/css" media="screen">@import "../styles/style.css";</style>
  <base href="http://wwwsearch.sourceforge.net/bits/clientx.html">
</head>
<body>

<div id="sf"><a href="http://sourceforge.net">
<img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2"
 width="125" height="37" alt="SourceForge.net Logo"></a></div>
<!--<img src="../images/sflogo.png"-->

<h1>Python web-client programming general FAQs</h1>

<div id="Content">
<ul>
  <li>Is there any example code?
     <p>There's (still!) a bit of a shortage of example code for ClientCookie
     and ClientForm &amp; co., because the code I've written tends either to
     require access to restricted-access sites or to be proprietary (and the
     same goes for other people's code).
  <li>HTTPS on Windows?
     <p>Use this <a href="http://pypgsql.sourceforge.net/misc/python22-win32-ssl.zip">
     _socket.pyd</a>, or use Python 2.3.
  <li>I want to see what my web browser is doing, but standard network sniffers
     like <a href="http://www.ethereal.com/">ethereal</a> or netcat (nc) don't
     work for HTTPS.  How do I sniff HTTPS traffic?
  <p>Three good options:
  <ul>
    <li>Mozilla plugin: <a href="http://livehttpheaders.mozdev.org/">
     livehttpheaders</a>.
    <li><a href="http://www.blunck.info/iehttpheaders.html">ieHTTPHeaders</a>
     does the same for MSIE.
    <li>Use <a href="http://lynx.browser.org/">lynx</a> <code>-trace</code>,
     and filter out the junk with a script.
  </ul>
  <p>I'm told you can also use a proxy like <a
  href="http://www.proxomitron.info/">proxomitron</a> (never tried it
  myself).  There's also a commercial <a href="http://www.simtec.ltd.uk/">MSIE
  plugin</a>.
  <li>Embedded script is messing up my web-scraping.  What do I do?
     <p>It is possible to embed script in HTML pages (sandwiched between
     <code>&lt;SCRIPT&gt;here&lt;/SCRIPT&gt;</code> tags, and in
     <code>javascript:</code> URLs) - JavaScript / ECMAScript, VBScript, or
     even Python.  These scripts can do all sorts of things, including causing
     cookies to be set in a browser, submitting or filling in parts of forms in
     response to user actions, changing link colours as the mouse moves over a
     link, etc.

     <p>If you come across this in a page you want to automate, you
     have four options.  Here they are, roughly in order of simplicity.

     <ul>
       <li>Simply figure out what the embedded script is doing and emulate it
       in your Python code: for example, by manually adding cookies to your
       <code>CookieJar</code> instance, calling methods on
       <code>HTMLForm</code>s, calling <code>urlopen</code>, etc.

       <li>Dump ClientCookie and ClientForm and automate a browser instead.
       For example use MS Internet Explorer via its COM automation interfaces, using
       the <a href="http://starship.python.net/crew/mhammond/">Python for
       Windows extensions</a>, aka pywin32, aka win32all (e.g.
       <a href="http://vsbabu.org/mt/archives/2003/06/13/ie_automation.html">simple
       function</a>, <a href="http://pamie.sourceforge.net/">pamie</a>;
       <a href="http://www.oreilly.com/catalog/pythonwin32/chapter/ch12.html">
       pywin32 chapter from the O'Reilly book</a>) or
       <a href="http://starship.python.net/crew/theller/ctypes/">ctypes</a>
       (<a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305273">
       example</a>: may be out of date, since <code>ctypes</code>' COM support is
       still evolving).
       <a href="http://www.brunningonline.net/simon/blog/archives/winGuiAuto.py.html">This</a>
       kind of thing may also come in useful on Windows for cases where the
       automation API is lacking.
       <a href="http://ftp.acc.umu.se/pub/GNOME/sources/pyphany/">pyphany</a>
       is a binding to the <a href="http://www.gnome.org/projects/epiphany/">
       epiphany web browser</a>, allowing both plugins and automation code to be
       written in Python.
       (XXX: Mozilla automation &amp; XPCOM / PyXPCOM, Konqueror &amp; DCOP / KParts / PyKDE.)

       <li>Use Java's <a href="http://httpunit.sourceforge.net/">httpunit</a> from
       Jython, since it knows some JavaScript.
       <li>Get ambitious and automatically delegate the work to an appropriate
       interpreter (Mozilla's JavaScript interpreter, for instance).  This
       approach is the one taken by <a href="../DOMForm">DOMForm</a> (the
       JavaScript support is &quot;very alpha&quot;, though!).
     </ul>
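     <p>As a rough illustration of the first option, here is a minimal sketch
     of adding a cookie by hand, the way a page script might have set it.  It
     assumes Python 3's standard-library <code>http.cookiejar</code> (the
     current name of <code>cookielib</code>) rather than ClientCookie itself;
     the cookie name, value, host and the <code>make_cookie</code> helper are
     all hypothetical.

```python
import urllib.request
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, path="/"):
    """Hypothetical helper: build a Netscape-style session cookie by hand."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Emulate a script that would have set this cookie in a real browser.
jar = CookieJar()
jar.set_cookie(make_cookie("seen_intro", "1", "www.example.com"))

# An opener built around the jar sends the cookie on matching requests.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Check, without touching the network, which Cookie header would be sent.
request = urllib.request.Request("http://www.example.com/start")
jar.add_cookie_header(request)
print(request.get_header("Cookie"))
```

     <p>Working out which cookies the script really sets is still up to you;
     the sketch only shows where they go once you know.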
  <li>Misc links
     <ul>
       <li><a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful
       Soup</a> is a widely recommended HTML-parsing module.
       <li><a href="http://linux.duke.edu/projects/urlgrabber/">urlgrabber</a>
       contains useful stuff like persistent connections, mirroring and
       throttling, and it looks like most or all of it is well-integrated with
       <code>urllib2</code> (originally part of the yum package manager, but
       now becoming a separate project).
       <li>Another Java thing: <a href="http://maxq.tigris.org/">maxq</a>,
       which provides a proxy to aid automatic generation of functional tests
       written in Jython using the standard library unittest module (PyUnit)
       and the &quot;Jakarta Commons&quot; HttpClient library.
       <li>A useful set of Zope-oriented links on <a
       href="http://viii.dclxvi.org/bookmarks/tech/zope/test">tools for testing
       web applications</a>.
       <li>O'Reilly book: <a href="">Spidering Hacks</a>.  Very Perl-oriented.
       <li>Useful
       <a href="http://chrispederick.com/work/webdeveloper/"> Firefox extension
       </a> which, amongst other things, can display HTML form information and
       HTML table structure (thanks to Erno Kuusela for this link).
       <li>
       <a href="http://www.openqa.org/selenium/">Selenium</a>: In-browser web
       functional testing.
       <li><a href="http://www.opensourcetesting.org/functional.php">Open
       source functional testing tools</a>.  A nice list.
       <li><a href="http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html">
       A HOWTO on web scraping</a> from Dave Kuhlman.
     </ul>
  <li>Will any of this code make its way into the Python standard library?

  <p>The request / response processing extensions to <code>urllib2</code> from
     ClientCookie have been merged into <code>urllib2</code> for Python 2.4.
     The cookie processing has been added, as module <code>cookielib</code>.
     Eventually, I'll submit patches to get the http-equiv, refresh, and
     robots.txt code in there too, and maybe <code>mechanize.UserAgent</code>
     as well (but <em>not</em> <code>mechanize.Browser</code>).  The rest
     probably won't go in.
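  <p>For a feel of what the merged pieces look like in use, here is a minimal
     sketch, assuming Python 3, where <code>urllib2</code> and
     <code>cookielib</code> live on as <code>urllib.request</code> and
     <code>http.cookiejar</code>.  The <code>RequestLogger</code> class and
     the URL are hypothetical; the point is only that a handler with an
     <code>http_request</code> method gets to see each request before it is
     sent.

```python
import urllib.request
from http.cookiejar import CookieJar


class RequestLogger(urllib.request.BaseHandler):
    """Hypothetical processor: remembers the URL of each outgoing request."""

    def __init__(self):
        self.seen = []

    def http_request(self, request):
        # Called by the opener before an http request is sent.
        self.seen.append(request.full_url)
        return request

    # Treat https requests the same way.
    https_request = http_request


jar = CookieJar()
logger = RequestLogger()

# Cookie handling and the custom processor are plugged into one opener.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar), logger)

# Fetching a page would then look like this (needs network access):
# response = opener.open("http://www.example.com/")
```

  <p>Cookies returned by the server end up in <code>jar</code> and are sent
     back automatically on later requests made through the same opener.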

</ul>

<p>I prefer questions and comments to be sent to the <a
href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general">
mailing list</a> rather than direct to me.

<p><a href="mailto:jjl@pobox.com">John J. Lee</a>,
January 2006.

</div> <!--id="Content"-->

<div id="Menu">

<a href="..">Home</a><br>
<br>
<span class="thispage">General FAQs</span><br>
<br>
<a href="../mechanize/">mechanize</a><br>
<a href="../pullparser/">pullparser</a><br>
<a href="../ClientCookie/">ClientCookie</a><br>
<a href="../ClientCookie/doc.html"><span class="subpage">ClientCookie docs</span></a><br>
<a href="../ClientForm/">ClientForm</a><br>
<br>
<a href="../DOMForm/">DOMForm</a><br>
<a href="../python-spidermonkey/">python-spidermonkey</a><br>
<a href="../ClientTable/">ClientTable</a><br>
<a href="../bits/urllib2_152.py">1.5.2 urllib2.py</a><br>
<a href="../bits/urllib_152.py">1.5.2 urllib.py</a><br>

<br>

</div> <!--id="Menu"-->

</body>
</html>