<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3.org/TR/html4/strict.dtd">

<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <meta name="author" content="John J. Lee &lt;jjl@pobox.com&gt;">
  <meta name="date" content="2006-05-21">
  <title>mechanize documentation</title>
  <style type="text/css" media="screen">@import "../styles/style.css";</style>
  
</head>
<body>

<div id="sf"><a href="http://sourceforge.net">
<img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2"
 width="125" height="37" alt="SourceForge.net Logo"></a></div>

<h1>mechanize handlers</h1>

<div id="Content">

<p class="docwarning">This documentation is in need of reorganisation!</p>

<p>This page is the old ClientCookie documentation.  It deals with operation on
the level of urllib2 Handler objects, and also with adding headers, debugging,
and cookie handling.  Documentation for the higher-level browser-style
interface is <a href="./mechanize">elsewhere</a>.


<a name="examples"></a>
<h2>Examples</h2>

<pre>
<span class="pykw">import</span> mechanize
response = mechanize.urlopen(<span class="pystr">"http://foo.bar.com/"</span>)</pre>


<p>This function behaves identically to <code>urllib2.urlopen()</code>, except
that it deals with cookies automatically.

<p>Here is a more complicated example, involving <code>Request</code> objects
(useful if you want to pass <code>Request</code>s around, add headers to them,
etc.):

<pre>
<span class="pykw">import</span> mechanize
request = mechanize.Request(<span class="pystr">"http://www.acme.com/"</span>)
<span class="pycmt"># note we're using the urlopen from mechanize, not urllib2
</span>response = mechanize.urlopen(request)
<span class="pycmt"># let's say this next request requires a cookie that was set in response
</span>request2 = mechanize.Request(<span class="pystr">"http://www.acme.com/flying_machines.html"</span>)
response2 = mechanize.urlopen(request2)

<span class="pykw">print</span> response2.geturl()
<span class="pykw">print</span> response2.info()  <span class="pycmt"># headers</span>
<span class="pykw">print</span> response2.read()  <span class="pycmt"># body (readline and readlines work too)</span></pre>


<p>(The above example would also work with <code>urllib2.Request</code>
objects, since <code>mechanize.HTTPRequestUpgradeProcessor</code> knows about
that class, but avoid doing so if you can: it's an obscure hack that exists
for compatibility purposes only.)

<p>In these examples, the workings are hidden inside the
<code>mechanize.urlopen()</code> function, which is an extension of
<code>urllib2.urlopen()</code>.  Redirects, proxies and cookies are handled
automatically by this function (note that you may need a bit of configuration
to get your proxies correctly set up: see the <code>urllib2</code> documentation).

<p>Cookie processing (etc.) is handled by processor objects, which are an
extension of <code>urllib2</code>'s handlers: <code>HTTPCookieProcessor</code>,
<code>HTTPRefererProcessor</code>, <code>SeekableProcessor</code> etc.  They
are used like any other handler.  There is quite a bit of other
<code>urllib2</code>-workalike code, too.  Note: This duplication has gone away
in Python 2.4, since 2.4's <code>urllib2</code> contains the processor
extensions from mechanize, so you can simply use mechanize's processor
classes directly with 2.4's <code>urllib2</code>; also, mechanize's cookie
functionality is included in Python 2.4 as module <code>cookielib</code> and
<code>urllib2.HTTPCookieProcessor</code>.

<p>There is also a <code>urlretrieve()</code> function, which works like
<code>urllib.urlretrieve()</code>.
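
<p>For example, here's a minimal sketch of using it (the URL and local
filename are just illustrative; as with <code>urllib.urlretrieve()</code>, the
return value is the filename and the response headers):

<pre>
<span class="pykw">import</span> mechanize
filename, headers = mechanize.urlretrieve(<span class="pystr">"http://www.example.com/"</span>,
                                          <span class="pystr">"example.html"</span>)
<span class="pykw">print</span> filename
<span class="pykw">print</span> headers</pre>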

<p>An example at a slightly lower level shows more clearly how the module
processes cookies:

<pre>
<span class="pycmt"># Don't copy this blindly!  You probably want to follow the examples
</span><span class="pycmt"># above, not this one.
</span><span class="pykw">import</span> mechanize

<span class="pycmt"># Build an opener that *doesn't* automatically call .add_cookie_header()
</span><span class="pycmt"># and .extract_cookies(), so we can do it manually without interference.
</span><span class="pykw">class</span> NullCookieProcessor(mechanize.HTTPCookieProcessor):
    <span class="pykw">def</span> http_request(self, request): <span class="pykw">return</span> request
    <span class="pykw">def</span> http_response(self, request, response): <span class="pykw">return</span> response
opener = mechanize.build_opener(NullCookieProcessor)

request = mechanize.Request(<span class="pystr">"http://www.acme.com/"</span>)
response = mechanize.urlopen(request)
cj = mechanize.CookieJar()
cj.extract_cookies(response, request)
<span class="pycmt"># let's say this next request requires a cookie that was set in response
</span>request2 = mechanize.Request(<span class="pystr">"http://www.acme.com/flying_machines.html"</span>)
cj.add_cookie_header(request2)
response2 = mechanize.urlopen(request2)</pre>


<p>The <code>CookieJar</code> class does all the work.  There are essentially
two operations: <code>.extract_cookies()</code> extracts HTTP cookies from
<code>Set-Cookie</code> (the original <a
href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape cookie
standard</a>) and <code>Set-Cookie2</code> (<a
href="http://www.ietf.org/rfc/rfc2965.txt">RFC 2965</a>) headers from a
response if and only if they should be set given the request, and
<code>.add_cookie_header()</code> adds <code>Cookie</code> headers if and only
if they are appropriate for a particular HTTP request.  Incoming cookies are
checked for acceptability based on the host name, etc.  Cookies are only set on
outgoing requests if they match the request's host name, path, etc.

<p><strong>Note that if you're using <code>mechanize.urlopen()</code> (or if
you're using <code>mechanize.HTTPCookieProcessor</code> by some other
means), you don't need to call <code>.extract_cookies()</code> or
<code>.add_cookie_header()</code> yourself</strong>.  If, on the other hand,
you don't want to use <code>urllib2</code>, you will need to use this pair of
methods.  You can make your own <code>request</code> and <code>response</code>
objects, which must support the interfaces described in the docstrings of
<code>.extract_cookies()</code> and <code>.add_cookie_header()</code>.

<p>There are also some <code>CookieJar</code> subclasses which can store
cookies in files and databases.  <code>FileCookieJar</code> is the abstract
class for <code>CookieJar</code>s that can store cookies in disk files.
<code>LWPCookieJar</code> saves cookies in a format compatible with the
libwww-perl library.  This class is convenient if you want to store cookies in
a human-readable file:

<pre>
<span class="pykw">import</span> mechanize
cj = mechanize.LWPCookieJar()
cj.revert(<span class="pystr">"cookie3.txt"</span>)
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
r = opener.open(<span class="pystr">"http://foobar.com/"</span>)
cj.save(<span class="pystr">"cookie3.txt"</span>)</pre>


<p>The <code>.revert()</code> method discards all existing cookies held by the
<code>CookieJar</code> (it won't lose any existing cookies if the load fails).
The <code>.load()</code> method, on the other hand, adds the loaded cookies to
existing cookies held in the <code>CookieJar</code> (old cookies are kept
unless overwritten by newly loaded ones).
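
<p>A minimal sketch of the difference (the filename is illustrative):

<pre>
<span class="pykw">import</span> mechanize
cj = mechanize.LWPCookieJar()
cj.load(<span class="pystr">"cookies.lwp"</span>)    <span class="pycmt"># adds the file's cookies to those already held</span>
cj.revert(<span class="pystr">"cookies.lwp"</span>)  <span class="pycmt"># replaces the jar's cookies with the file's</span></pre>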

<p><code>MozillaCookieJar</code> can load and save to the
Mozilla/Netscape/lynx-compatible <code>'cookies.txt'</code> format.  This
format loses some information (unusual and nonstandard cookie attributes such
as comment, and also information specific to RFC 2965 cookies).  The subclass
<code>MSIECookieJar</code> can load (but not save, yet) from Microsoft Internet
Explorer's cookie files (on Windows).  <code>BSDDBCookieJar</code> (NOT FULLY
TESTED!) saves to a BSDDB database using the standard library's
<code>bsddb</code> module.  There's an unfinished <code>MSIEDBCookieJar</code>,
which uses (reads and writes) the Windows MSIE cookie database directly, rather
than storing copies of cookies as <code>MSIECookieJar</code> does.

<h2>Important note</h2>

<p>Only use names you can import directly from the <code>mechanize</code>
package, and that don't start with a single underscore.  Everything else is
subject to change or disappearance without notice.

<a name="browsers"></a>
<h2>Cooperating with Mozilla/Netscape, lynx and Internet Explorer</h2>

<p>The subclass <code>MozillaCookieJar</code> differs from
<code>CookieJar</code> only in storing cookies using a different,
Mozilla/Netscape-compatible, file format.  The lynx browser also uses this
format.  This file format can't store RFC 2965 cookies, so they are downgraded
to Netscape cookies on saving.  <code>LWPCookieJar</code> itself uses a
libwww-perl specific format (`Set-Cookie3') - see the example above.  Python
and your browser should be able to share a cookies file (note that the file
location here will differ on non-unix OSes):

<p><strong>WARNING:</strong> you may want to backup your browser's cookies file
if you use <code>MozillaCookieJar</code> to save cookies.  I <em>think</em> it
works, but there have been bugs in the past!

<pre>
<span class="pykw">import</span> os, mechanize
cookies = mechanize.MozillaCookieJar()
cookies.load(os.path.join(os.environ[<span class="pystr">"HOME"</span>], <span class="pystr">".netscape/cookies.txt"</span>))
<span class="pycmt"># see also the save and revert methods</span></pre>


<p>Note that cookies saved while Mozilla is running will get clobbered by
Mozilla - see <code>MozillaCookieJar.__doc__</code>.

<p><code>MSIECookieJar</code> does the same for Microsoft Internet Explorer
(MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this
format.  In future, the Windows API calls might be used to load and save
(though the index has to be read directly, since there is no API for that,
AFAIK; there's also an unfinished <code>MSIEDBCookieJar</code>, which uses
(reads and writes) the Windows MSIE cookie database directly, rather than
storing copies of cookies as <code>MSIECookieJar</code> does).

<pre>
<span class="pykw">import</span> mechanize
cj = mechanize.MSIECookieJar(delayload=True)
cj.load_from_registry()  <span class="pycmt"># finds cookie index file from registry</span></pre>


<p>A true <code>delayload</code> argument speeds things up.

<p>On Windows 9x (win 95, win 98, win ME), you need to supply a username to the
<code>.load_from_registry()</code> method:

<pre>
cj.load_from_registry(username=<span class="pystr">"jbloggs"</span>)</pre>


<p>Konqueror/Safari and Opera use different file formats, which aren't yet
supported.

<a name="file"></a>
<h2>Saving cookies in a file</h2>

<p>If you have no need to co-operate with a browser, the most convenient way to
save cookies on disk between sessions in human-readable form is to use
<code>LWPCookieJar</code>.  This class uses a libwww-perl specific format
(`Set-Cookie3').  Unlike <code>MozillaCookieJar</code>, this file format
doesn't lose information.
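
<p>For example, a minimal sketch of keeping cookies between sessions (the
filename is illustrative; see also the note about <code>ignore_discard</code>
in the debugging section below):

<pre>
<span class="pykw">import</span> mechanize
cj = mechanize.LWPCookieJar()
<span class="pykw">try</span>:
    cj.load(<span class="pystr">"cookies.lwp"</span>, ignore_discard=True)
<span class="pykw">except</span> IOError:
    <span class="pykw">pass</span>  <span class="pycmt"># no cookies saved yet</span>
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
r = opener.open(<span class="pystr">"http://www.example.com/"</span>)
cj.save(<span class="pystr">"cookies.lwp"</span>, ignore_discard=True)</pre>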

<a name="cookiejar"></a>
<h2>Using your own CookieJar instance</h2>

<p>You might want to do this to <a href="./doc.html#browsers">use your
browser's cookies</a>, to customize <code>CookieJar</code>'s behaviour by
passing constructor arguments, or to be able to get at the cookies it will hold
(for example, for saving cookies between sessions and for debugging).

<p>If you're using the higher-level <code>urllib2</code>-like interface
(<code>urlopen()</code>, etc), you'll have to let it know what
<code>CookieJar</code> it should use:

<pre>
<span class="pykw">import</span> mechanize
cookies = mechanize.CookieJar()
<span class="pycmt"># build_opener() adds standard handlers (such as HTTPHandler and
</span><span class="pycmt"># HTTPCookieProcessor) by default.  The cookie processor we supply
</span><span class="pycmt"># will replace the default one.
</span>opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

r = opener.open(<span class="pystr">"http://acme.com/"</span>)  <span class="pycmt"># GET</span>
data = <span class="pystr">"spam=1&amp;eggs=2"</span>  <span class="pycmt"># URL-encoded POST data</span>
r = opener.open(<span class="pystr">"http://acme.com/"</span>, data)  <span class="pycmt"># POST</span></pre>


<p>The <code>urlopen()</code> function uses a global
<code>OpenerDirector</code> instance to do its work, so if you want to use
<code>urlopen()</code> with your own <code>CookieJar</code>, install the
<code>OpenerDirector</code> you built with <code>build_opener()</code> using
the <code>mechanize.install_opener()</code> function, then proceed as usual:

<pre>
mechanize.install_opener(opener)
r = mechanize.urlopen(<span class="pystr">"http://www.acme.com/"</span>)</pre>


<p>Of course, everyone using <code>urlopen</code> is using the same global
<code>CookieJar</code> instance!

<a name="policy"></a>

<p>You can set a policy object (must satisfy the interface defined by
<code>mechanize.CookiePolicy</code>), which determines which cookies are
allowed to be set and returned.  Use the policy argument to the
<code>CookieJar</code> constructor, or use the .set_policy() method.  The
default implementation has some useful switches:

<pre>
<span class="pykw">from</span> mechanize <span class="pykw">import</span> CookieJar, DefaultCookiePolicy as Policy
cookies = CookieJar()
<span class="pycmt"># turn on RFC 2965 cookies, be more strict about domains when setting and
</span><span class="pycmt"># returning Netscape cookies, and block some domains from setting cookies
</span><span class="pycmt"># or having them returned (read the DefaultCookiePolicy docstring for the
</span><span class="pycmt"># domain matching rules here)
</span>policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict,
                blocked_domains=[<span class="pystr">"ads.net"</span>, <span class="pystr">".ads.net"</span>])
cookies.set_policy(policy)</pre>
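
<p>If the switches aren't enough, you can subclass
<code>DefaultCookiePolicy</code> and override its methods.  Here is a minimal
sketch (the class name and the rule it implements are made up for
illustration):

<pre>
<span class="pykw">import</span> mechanize
<span class="pykw">class</span> SessionOnlyPolicy(mechanize.DefaultCookiePolicy):
    <span class="pycmt"># refuse to store any cookie that carries an expiry date</span>
    <span class="pykw">def</span> set_ok(self, cookie, request):
        <span class="pykw">if</span> <span class="pykw">not</span> mechanize.DefaultCookiePolicy.set_ok(self, cookie, request):
            <span class="pykw">return</span> False
        <span class="pykw">return</span> cookie.expires <span class="pykw">is</span> None
cj = mechanize.CookieJar(policy=SessionOnlyPolicy())</pre>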



<a name="extras"></a>
<h2>Optional extras: robots.txt, HTTP-EQUIV, Refresh, Referer and seekable responses</h2>

<p>These are implemented as processor classes.  Processors are an extension of
<code>urllib2</code>'s handlers (now a standard part of urllib2 in Python 2.4):
you just pass them to <code>build_opener()</code> (example code below).

<dl>

<dt><code>HTTPRobotRulesProcessor</code>

<dd><p>WWW Robots (also called wanderers or spiders) are programs that traverse
many pages in the World Wide Web by recursively retrieving linked pages.  This
kind of program can place significant loads on web servers, so there is a <a
href="http://www.robotstxt.org/wc/norobots.html">standard</a> for a <code>
robots.txt</code> file by which web site operators can request robots to keep
out of their site, or out of particular areas of it.  This processor uses the
standard Python library's <code>robotparser</code> module.  It raises
<code>mechanize.RobotExclusionError</code> (subclass of
<code>urllib2.HTTPError</code>) if an attempt is made to open a URL prohibited
by <code>robots.txt</code>.  XXX ATM, this makes use of code in the
<code>robotparser</code> module that uses <code>urllib</code> - this will
likely change in future to use <code>urllib2</code>.

<dt><code>HTTPEquivProcessor</code>

<dd><p>The <code>&lt;META HTTP-EQUIV&gt;</code> tag is a way of including data
in HTML to be treated as if it were part of the HTTP headers.  mechanize can
automatically read these tags and add the <code>HTTP-EQUIV</code> headers to
the response object's real HTTP headers.  The HTML is left unchanged.

<dt><code>HTTPRefreshProcessor</code>

<dd><p>The <code>Refresh</code> HTTP header is a non-standard header which is
widely used.  It requests that the user-agent follow a URL after a specified
time delay.  mechanize can treat these headers (which may have been set in
<code>&lt;META HTTP-EQUIV&gt;</code> tags) as if they were 302 redirections.
Exactly when and how <code>Refresh</code> headers are handled is configurable
using the constructor arguments.

<dt><code>SeekableProcessor</code>

<dd><p>This makes mechanize's response objects <code>seek()</code>able.
Seeking is done lazily (ie. the response object only reads from the socket as
necessary, rather than slurping in all the data before the response is returned
to you).

<dt><code>HTTPRefererProcessor</code>

<dd><p>The <code>Referer</code> HTTP header lets the server know which URL
you've just visited.  Some servers use this header as state information, and
don't like it if this is not present.  It's a chore to add this header by hand
every time you make a request.  This adds it automatically.
<strong>NOTE</strong>: this only makes sense if you use each processor for a
single chain of HTTP requests (so, for example, if you use a single
HTTPRefererProcessor to fetch a series of URLs extracted from a single page,
<strong>this will break</strong>).  The <a href="../mechanize/">mechanize</a>
package does this properly.</p></dd>

<pre>
<span class="pykw">import</span> mechanize
cookies = mechanize.CookieJar()

opener = mechanize.build_opener(mechanize.HTTPRefererProcessor,
                                mechanize.HTTPEquivProcessor,
                                mechanize.HTTPRefreshProcessor,
                                mechanize.SeekableProcessor)
opener.open(<span class="pystr">"http://www.rhubarb.com/"</span>)</pre>


</dl>


<a name="requests"></a>
<h2>Confusing fact about headers and Requests</h2>

<p>mechanize automatically upgrades <code>urllib2.Request</code> objects to
<code>mechanize.Request</code>, as a backwards-compatibility hack.  This
means that you won't see any headers that are added to Request objects by
handlers unless you use <code>mechanize.Request</code> in the first place.
Sorry about that.


<a name="headers"></a>
<h2>Adding headers</h2>

<p>Adding headers is done like so:

<pre>
<span class="pykw">import</span> mechanize, urllib2
req = urllib2.Request(<span class="pystr">"http://foobar.com/"</span>)
req.add_header(<span class="pystr">"Referer"</span>, <span class="pystr">"http://wwwsearch.sourceforge.net/mechanize/"</span>)
r = mechanize.urlopen(req)</pre>


<p>You can also use the headers argument to the <code>urllib2.Request</code>
constructor.

<p><code>urllib2</code> (in fact, mechanize takes over this task from
<code>urllib2</code>) adds some headers to <code>Request</code> objects
automatically - see the next section for details.


<h2>Changing the automatically-added headers (User-Agent)</h2>

<p><code>OpenerDirector</code> automatically adds a <code>User-Agent</code>
header to every <code>Request</code>.

<p>To change this and/or add similar headers, use your own
<code>OpenerDirector</code>:

<pre>
<span class="pykw">import</span> mechanize
cookies = mechanize.CookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
opener.addheaders = [(<span class="pystr">"User-agent"</span>, <span class="pystr">"Mozilla/5.0 (compatible; MyProgram/0.1)"</span>),
                     (<span class="pystr">"From"</span>, <span class="pystr">"responsible.person@example.com"</span>)]</pre>


<p>Again, to use <code>urlopen()</code>, install your
<code>OpenerDirector</code> globally:

<pre>
mechanize.install_opener(opener)
r = mechanize.urlopen(<span class="pystr">"http://acme.com/"</span>)</pre>


<p>Also, a few standard headers (<code>Content-Length</code>,
<code>Content-Type</code> and <code>Host</code>) are added when the
<code>Request</code> is passed to <code>urlopen()</code> (or
<code>OpenerDirector.open()</code>).  mechanize explicitly adds these (and
<code>User-Agent</code>) to the <code>Request</code> object, unlike versions of
<code>urllib2</code> before Python 2.4 (but <strong>note</strong> that
Content-Length is an exception to this rule: it is sent, but not explicitly
added to the <code>Request</code>'s headers; this is due to a bug in
<code>httplib</code> in Python 2.3 and earlier).  You shouldn't need to change
these headers, but since this is done by <code>AbstractHTTPHandler</code>, you
can change the way it works by passing a subclass of that handler to
<code>build_opener()</code> (or, as always, by constructing an opener yourself
and calling .add_handler()).
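
<p>For example, a minimal sketch of the idea (the subclass name and the hook
used here are my own illustration; check your version's docstrings for the
exact methods available):

<pre>
<span class="pykw">import</span> mechanize
<span class="pykw">class</span> MyHTTPHandler(mechanize.HTTPHandler):
    <span class="pykw">def</span> http_request(self, request):
        <span class="pycmt"># let the base class add Host, Content-Type, etc., then adjust</span>
        <span class="pycmt"># the request however you like before it is sent</span>
        request = mechanize.HTTPHandler.http_request(self, request)
        <span class="pykw">return</span> request
opener = mechanize.build_opener(MyHTTPHandler)</pre>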


<a name="unverifiable"></a>
<h2>Initiating unverifiable transactions</h2>

<p>This section is only of interest for correct handling of third-party HTTP
cookies.  See <a href="./doc.html#standards">below</a> for an explanation of
'third-party'.

<p>First, some terminology.

<p>An <em>unverifiable request</em> (defined fully by RFC 2965) is one whose
URL the user did not have the option to approve.  For example, a transaction is
unverifiable if the request is for an image in an HTML document, and the user
had no option to approve the fetching of the image from a particular URL.

<p>The <em>request-host of the origin transaction</em> (defined fully by RFC
2965) is the host name or IP address of the original request that was initiated
by the user.  For example, if the request is for an image in an HTML document,
this is the request-host of the request for the page containing the image.

<p><strong>mechanize knows that redirected transactions are unverifiable,
and will handle that on its own (ie. you don't need to think about the origin
request-host or verifiability yourself).</strong>

<p>If you want to initiate an unverifiable transaction yourself (which you
should if, for example, you're downloading the images from a page, and 'the
user' hasn't explicitly OKed those URLs):

<ol>

  <li>If you're using a <code>urllib2.Request</code> from Python 2.3 or
  earlier, set the <code>unverifiable</code> and <code>origin_req_host</code>
  attributes on your <code>Request</code> instance:

<pre>
request.unverifiable = True
request.origin_req_host = <span class="pystr">"www.example.com"</span></pre>


  <li>If you're using a <code>urllib2.Request</code> from Python 2.4 or later,
  or you're using a <code>mechanize.Request</code>, use the
  <code>unverifiable</code> and <code>origin_req_host</code> arguments to the
  constructor:

<pre>
request = Request(origin_req_host=<span class="pystr">"www.example.com"</span>, unverifiable=True)</pre>


</ol>


<a name="rfc2965"></a>
<h2>RFC 2965 handling</h2>

<p>RFC 2965 handling is switched off by default, because few browsers implement
it, so the RFC 2965 protocol is essentially never seen on the internet.  To
switch it on, see <a href="./doc.html#policy">here</a>.
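
<p>For example, a minimal sketch of switching it on when constructing a
<code>CookieJar</code>:

<pre>
<span class="pykw">import</span> mechanize
cj = mechanize.CookieJar(mechanize.DefaultCookiePolicy(rfc2965=True))</pre>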


<a name="debugging"></a>
<h2>Debugging</h2>

<!--XXX move as much as poss. to General page-->

<p>First, a few common problems.  The most frequent mistake people seem to make
is to use <code>mechanize.urlopen()</code>, <em>and</em> the
<code>.extract_cookies()</code> and <code>.add_cookie_header()</code> methods
on a <code>CookieJar</code> object themselves.  If you use <code>mechanize.urlopen()</code>
(or <code>OpenerDirector.open()</code>), the module handles extraction and
adding of cookies by itself, so you should not call
<code>.extract_cookies()</code> or <code>.add_cookie_header()</code>.

<p>Are you sure the server is sending you any cookies in the first place?
Maybe the server is keeping track of state in some other way
(<code>HIDDEN</code> HTML form entries (possibly in a separate page referenced
by a frame), URL-encoded session keys, IP address, HTTP <code>Referer</code>
headers)?  Perhaps some embedded script in the HTML is setting cookies (see
below)?  Maybe you messed up your request, and the server is sending you some
standard failure page (even if the page doesn't appear to indicate any
failure).  Sometimes, a server wants particular headers set to the values it
expects, or it won't play nicely.  The most frequent offenders here are the
<code>Referer</code> [<em>sic</em>] and / or <code>User-Agent</code> HTTP
headers (<a href="./doc.html#headers">see above</a> for how to set these).  The
<code>User-Agent</code> header may need to be set to a value like that of a
popular browser.  The <code>Referer</code> header may need to be set to the URL
that the server expects you to have followed a link from.  Occasionally, it may
even be that operators deliberately configure a server to insist on precisely
the headers that the popular browsers (MS Internet Explorer, Mozilla/Netscape,
Opera, Konqueror/Safari) generate, but remember that incompetence (possibly on
your part) is more probable than deliberate sabotage (and if a site owner is
that keen to stop robots, you probably shouldn't be scraping it anyway).

<p>When you <code>.save()</code> to or
<code>.load()</code>/<code>.revert()</code> from a file, single-session cookies
will expire unless you explicitly request otherwise with the
<code>ignore_discard</code> argument.  This may be your problem if you find
cookies are going away after saving and loading.

<pre>
<span class="pykw">import</span> mechanize
cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)
r = mechanize.urlopen(<span class="pystr">"http://foobar.com/"</span>)
cj.save(<span class="pystr">"/some/file"</span>, ignore_discard=True, ignore_expires=True)</pre>


<p>If none of the advice above solves your problem quickly, try comparing the
headers and data that you are sending out with those that a browser emits.
Often this will give you the clue you need.  Of course, you'll want to check
that the browser is able to do manually what you're trying to achieve
programmatically before minutely examining the headers.  Make sure that what you
do manually is <em>exactly</em> the same as what you're trying to do from
Python - you may simply be hitting a server bug that only gets revealed if you
view pages in a particular order, for example.  In order to see what your
browser is sending to the server (even if HTTPS is in use), see <a
href="../clientx.html">the General FAQ page</a>.  If nothing is obviously wrong
with the requests your program is sending and you're out of ideas, you can try
the last resort of good old brute-force binary-search debugging.  Temporarily
switch to sending your HTTP requests by hand (with <code>httplib</code>).  Start by copying
Netscape/Mozilla or IE slavishly (apart from session IDs, etc., of course),
then begin the tedious process of mutating your headers and data until they
match what your higher-level code was sending.  This will at least reliably
find your problem.
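
<p>For example, a minimal sketch of sending a request by hand with
<code>httplib</code> (the host, path and header values are illustrative):

<pre>
<span class="pykw">import</span> httplib
conn = httplib.HTTPConnection(<span class="pystr">"www.example.com"</span>)
conn.request(<span class="pystr">"GET"</span>, <span class="pystr">"/"</span>,
             headers={<span class="pystr">"User-Agent"</span>: <span class="pystr">"Mozilla/5.0 (compatible; MyProgram/0.1)"</span>,
                      <span class="pystr">"Referer"</span>: <span class="pystr">"http://www.example.com/"</span>})
r = conn.getresponse()
<span class="pykw">print</span> r.status, r.reason
<span class="pykw">print</span> r.read()</pre>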

<p>You can turn on display of HTTP headers:

<pre>
<span class="pykw">import</span> mechanize
hh = mechanize.HTTPHandler()  <span class="pycmt"># you might want HTTPSHandler, too</span>
hh.set_http_debuglevel(1)
opener = mechanize.build_opener(hh)
response = opener.open(url)</pre>


<p>Alternatively, you can examine your individual request and response objects
to see what's going on.  Note, though, that mechanize upgrades
urllib2.Request objects to mechanize.Request, so you won't see any headers
that are added to requests by handlers unless you use mechanize.Request in
the first place.  mechanize's responses can be made <code>.seek()</code>able
using <code>SeekableProcessor</code>.  It's often useful to use the
<code>.seek()</code> method like this during debugging:

<pre>
...
response = mechanize.urlopen(<span class="pystr">"http://spam.eggs.org/"</span>)
<span class="pykw">print</span> response.read()
response.seek(0)
<span class="pycmt"># rest of code continues as if you'd never .read() the response
</span>...</pre>


<p>Also, note <code>HTTPRedirectDebugProcessor</code> (which prints information
about redirections) and <code>HTTPResponseDebugProcessor</code> (which prints
out all response bodies, including those that are read during redirections).
<strong>NOTE</strong>: as well as having these processors in your
<code>OpenerDirector</code> (for example, by passing them to
<code>build_opener()</code>) you have to turn on logging at the
<code>INFO</code> level or lower in order to see any output.
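
<p>For example (a minimal sketch):

<pre>
<span class="pykw">import</span> mechanize
opener = mechanize.build_opener(mechanize.HTTPRedirectDebugProcessor,
                                mechanize.HTTPResponseDebugProcessor)
response = opener.open(<span class="pystr">"http://www.example.com/"</span>)</pre>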

<p>If you would like to see what is going on in mechanize's tiny mind, do
this:

<pre>
<span class="pykw">import</span> sys, logging
<span class="pycmt"># logging.DEBUG covers masses of debugging information,
</span><span class="pycmt"># logging.INFO just shows the output from HTTPRedirectDebugProcessor,
</span>logger = logging.getLogger(<span class="pystr">"mechanize"</span>)
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)</pre>


<p>The <code>DEBUG</code> level (as opposed to the <code>INFO</code> level) can
actually be quite useful, as it explains why particular cookies are accepted or
rejected and why they are or are not returned.

<p>One final thing to note is that there are some catch-all bare
<code>except:</code> statements in the module, which are there to handle
unexpected bad input without crashing your program.  If this happens, it's a
bug in mechanize, so please mail me the warning text.


<a name="script"></a>
<h2>Embedded script that sets cookies</h2>

<p>It is possible to embed script in HTML pages (sandwiched between
<code>&lt;SCRIPT&gt;here&lt;/SCRIPT&gt;</code> tags, and in
<code>javascript:</code> URLs) - JavaScript / ECMAScript, VBScript, or even
Python - that causes cookies to be set in a browser.  See the <a
href="../bits/clientx.html">General FAQs</a> page for what to do about this.


<a name="dates"></a>
<h2>Parsing HTTP date strings</h2>

<p>A function named <code>str2time</code> is provided by the package,
which may be useful for parsing dates in HTTP headers.
<code>str2time</code> is intended to be liberal, since HTTP date/time
formats are poorly standardised in practice.  There is no need to use this
function in normal operations: <code>CookieJar</code> instances keep track
of cookie lifetimes automatically.  This function will stay around in some
form, though the supported date/time formats may change.
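
<p>For example (a minimal sketch):

<pre>
<span class="pykw">import</span> mechanize
<span class="pycmt"># returns seconds since the epoch, or None if the string isn't understood</span>
<span class="pykw">print</span> mechanize.str2time(<span class="pystr">"Wed, 09 Feb 1994 22:23:32 GMT"</span>)</pre>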


<a name="badhtml"></a>
<h2>Dealing with bad HTML</h2>

<p>XXX Intro

<p>XXX Test me

<pre><span class="pykw">import</span> copy
<span class="pykw">import</span> mechanize
<span class="pykw">class</span> CommentCleanProcessor(mechanize.BaseProcessor):
      <span class="pykw">def</span> http_response(self, request, response):
          <span class="pykw">if</span> <span class="pykw">not</span> hasattr(response, <span class="pystr">"seek"</span>):
              response = mechanize.response_seek_wrapper(response)
          response.seek(0)
          new_response = copy.copy(response)
          new_response.set_data(
              re.sub(<span class="pystr">"&lt;!-([^-]*)-&gt;"</span>, <span class="pystr">"&lt;!--\1--&gt;"</span>, response.read()))
          <span class="pykw">return</span> new_response
      https_response = http_response</pre>


<p>XXX TidyProcessor: mxTidy?  tidylib?  tidy?


<a name="standards"></a>
<h2>Note about cookie standards</h2>

<p>The various cookie standards and their history form a case study of the
terrible things that can happen to a protocol.  The long-suffering David
Kristol has written a <a
href="http://arxiv.org/abs/cs.SE/0105018">paper</a> about it, if you
want to know the gory details.

<p>Here is a summary.

<p>The <a href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape
protocol</a> (cookie_spec.html) is still the only standard supported by most
browsers (including Internet Explorer and Netscape).  Be aware that
cookie_spec.html is not, and never was, actually followed to the letter (or
anything close) by anyone (including Netscape, IE and mechanize): the
Netscape protocol standard is really defined by the behaviour of Netscape (and
now IE).  Netscape cookies are also known as V0 cookies, to distinguish them
from RFC 2109 or RFC 2965 cookies, which have a version cookie-attribute with a
value of 1.

<p><a href="http://www.ietf.org/rfcs/rfc2109.txt">RFC 2109</a> was introduced
to fix some problems identified with the Netscape protocol, while still keeping
the same HTTP headers (<code>Cookie</code> and <code>Set-Cookie</code>).  The
most prominent of these problems is the 'third-party' cookie issue, which was
an accidental feature of the Netscape protocol.  When one visits www.bland.org,
one doesn't expect to get a cookie from www.lurid.com, a site one has never
visited.  Depending on browser configuration, this can still happen, because
the unreconstructed Netscape protocol is happy to accept cookies from, say, an
image in a webpage (www.bland.org) that's included by linking to an
advertiser's server (www.lurid.com).  This kind of event, where your browser
talks to a server that you haven't explicitly okayed by some means, is what the
RFCs call an 'unverifiable transaction'.  In addition to the potential for
embarrassment caused by the presence of lurid.com's cookies on one's machine,
this may also be used to track your movements on the web, because advertising
agencies like doubleclick.net place ads on many sites.  RFC 2109 tried to
change this by requiring cookies to be turned off during unverifiable
transactions with third-party servers - unless the user explicitly asks them to
be turned on.  This clashed with the business model of advertisers like
doubleclick.net, who had started to take advantage of the third-party cookies
'bug'.  Since the browser vendors were more interested in the advertisers'
concerns than those of the browser users, this arguably doomed both RFC 2109
and its successor, RFC 2965, from the start.  Other problems than the
third-party cookie issue were also fixed by 2109.  However, even ignoring the
advertising issue, 2109 was stillborn, because Internet Explorer and Netscape
behaved differently in response to its extended <code>Set-Cookie</code>
headers.  This was not really RFC 2109's fault: it worked the way it did to
keep compatibility with the Netscape protocol as implemented by Netscape.
Microsoft Internet Explorer (MSIE) was very new when the standard was designed,
but was starting to be very popular when the standard was finalised.  XXX P3P,
and MSIE & Mozilla options

<p>XXX Apparently MSIE implements bits of RFC 2109 - but not very compliantly
(surprise).  Presumably other browsers do too, as a result.  mechanize
already does allow Netscape cookies to have <code>max-age</code> and
<code>port</code> cookie-attributes, and as far as I know that's the extent of
the support present in MSIE.  I haven't tested, though!

<p><a href="http://www.ietf.org/rfcs/rfc2965.txt">RFC 2965</a> attempted to fix
the compatibility problem by introducing two new headers,
<code>Set-Cookie2</code> and <code>Cookie2</code>.  Unlike the
<code>Cookie</code> header, <code>Cookie2</code> does <em>not</em> carry
cookies to the server - rather, it simply advertises to the server that RFC
2965 is understood.  <code>Set-Cookie2</code> <em>does</em> carry cookies, from
server to client: the new header means that both IE and Netscape completely
ignore these cookies.  This prevents breakage, but introduces a chicken-egg
problem that means 2965 may never be widely adopted, especially since Microsoft
shows no interest in it.  XXX Rumour has it that the European Union is unhappy
with P3P, and might introduce legislation that requires something better,
forming a gap that RFC 2965 might fill - any truth in this?  Opera is the only
browser I know of that supports the standard.  On the server side, Apache's
<code>mod_usertrack</code> supports it.  One confusing point to note about RFC
2965 is that it uses the same value (1) of the Version attribute in HTTP
headers as does RFC 2109.

<p>Most recently, it was discovered that RFC 2965 does not fully take account
of issues arising when 2965 and Netscape cookies coexist, and errata were
discussed on the W3C http-state mailing list, but the list traffic died and it
seems RFC 2965 is dead as an internet protocol (but still a useful basis for
implementing the de-facto standards, and perhaps as an intranet protocol).

<p>Because Netscape cookies are so poorly specified, the general philosophy
of the module's Netscape cookie implementation is to start with RFC 2965
and open holes where required for Netscape protocol-compatibility.  RFC
2965 cookies are <em>always</em> treated as RFC 2965 requires, of course!


<a name="faq_pre"></a>
<h2>FAQs - pre install</h2>
<ul>
  <li>Doesn't the standard Python library module, <code>Cookie</code>, do
     this?
  <p>No: Cookie.py does the server end of the job.  It doesn't know when to
     accept cookies from a server or when to pass them back.
  <li>Is urllib2.py required?
  <p>No.  You probably want it, though.
  <li>Where can I find out more about the HTTP cookie protocol?
  <p>There is more than one protocol, in fact (see the <a href="./doc.html">docs</a>
     for a brief explanation of the history):
  <ul>
    <li>The original <a href="http://www.netscape.com/newsref/std/cookie_spec.html">
        Netscape cookie protocol</a> - the standard still in use today, in
        theory (in reality, the protocol implemented by all the major browsers
        only bears a passing resemblance to the protocol sketched out in this
        document).
    <li><a href="http://www.ietf.org/rfcs/rfc2109.txt">RFC 2109</a> - obsoleted
        by RFC 2965.
     <li><a href="http://www.ietf.org/rfcs/rfc2965.txt">RFC 2965</a> - the
        Netscape protocol with the bugs fixed (not widely used - the Netscape
        protocol still dominates, and seems likely to remain dominant
        indefinitely, at least on the Internet).
        <a href="http://www.ietf.org/rfcs/rfc2964.txt">RFC 2964</a> discusses use
        of the protocol.
        <a href="http://kristol.org/cookie/errata.html">Errata</a> to RFC 2965
        are currently being discussed on the
        <a href="http://lists.bell-labs.com/mailman/listinfo/http-state">
        http-state mailing list</a> (update: list traffic died months ago and
        hasn't revived).
    <li>A <a href="http://doi.acm.org/10.1145/502152.502153">paper</a> by David
        Kristol setting out the history of the cookie standards in exhausting
        detail.
    <li>HTTP cookies <a href="http://www.cookiecentral.com/">FAQ</a>.
  </ul>
  <li>Which protocols does ClientCookie support?
     <p>Netscape and RFC 2965.  RFC 2965 handling is switched off by default.
  <li>What about RFC 2109?
     <p>RFC 2109 cookies are currently parsed as Netscape cookies, and treated
     by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled,
     or as Netscape cookies otherwise.  RFC 2109 is officially obsoleted by RFC
     2965.  Browsers do use a few RFC 2109 features in their Netscape cookie
     implementations (<code>port</code> and <code>max-age</code>), and
     ClientCookie knows about that, too.
</ul>


<a name="faq_use"></a>
<h2>FAQs - usage</h2>
<ul>
  <li>Why don't I have any cookies?
  <p>Read the <a href="./doc.html#debugging">debugging section</a> of this page.
  <li>My response claims to be empty, but I know it's not!
  <p>Did you call <code>response.read()</code> (e.g., in a debug statement),
     then forget that all the data has already been read?  In that case, you
     may want to use <code>SeekableProcessor</code>.
  <li>How do I download only part of a response body?
  <p>Just call <code>.read()</code> or <code>.readline()</code> methods on your
     response object as many times as you need.  The <code>.seek()</code> method
     (which will only be there if you're using <code>SeekableProcessor</code>)
     still works, because <code>SeekableProcessor</code>'s response objects
     cache read data.
  <li>What's the difference between the <code>.load()</code> and
      <code>.revert()</code> methods of <code>CookieJar</code>?
  <p><code>.load()</code> <em>appends</em> cookies from a file.
     <code>.revert()</code> discards all existing cookies held by the
     <code>CookieJar</code> first (but it won't lose any existing cookies if
     the loading fails).
  <li>Is it threadsafe?
  <p>No.  <em>Tested</em> patches welcome.  Clarification: As far as I know,
     it's perfectly possible to use mechanize in threaded code, but it
     provides no synchronisation: you have to provide that yourself.
  <li>How do I do &lt;X&gt;?
  <p>The module docstrings are worth reading if you want to do something
     unusual.
  <li>What's this &quot;processor&quot; business about?  I knew
      <code>urllib2</code> used &quot;handlers&quot;, but not these
      &quot;processors&quot;.
  <p>This Python library <a href="http://www.python.org/sf/852995">patch</a>
     contains an explanation.  Processors are now a standard part of urllib2
     in Python 2.4.
  <li>How do I use it without urllib2.py?
  <p><pre>
<span class="pykw">from</span> mechanize <span class="pykw">import</span> CookieJar
<span class="pykw">print</span> CookieJar.extract_cookies.__doc__
<span class="pykw">print</span> CookieJar.add_cookie_header.__doc__</pre>

</ul>

<p>I prefer questions and comments to be sent to the <a
href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general">
mailing list</a> rather than direct to me.

<p><a href="mailto:jjl@pobox.com">John J. Lee</a>,
May 2006.

<hr>

</div>

<div id="Menu">

<a href="..">Home</a><br>
<br>
<a href="../bits/GeneralFAQ.html">General FAQs</a><br>
<br>
<a href="../mechanize/">mechanize</a><br>
<span class="thispage"><span class="subpage">mechanize docs</span></span><br>
<a href="../ClientForm/">ClientForm</a><br>
<br>
<a href="../ClientCookie/">ClientCookie</a><br>
<span class="thispage"><span class="subpage">ClientCookie docs</span></span><br>
<a href="../pullparser/">pullparser</a><br>
<a href="../DOMForm/">DOMForm</a><br>
<a href="../python-spidermonkey/">python-spidermonkey</a><br>
<a href="../ClientTable/">ClientTable</a><br>
<a href="../bits/urllib2_152.py">1.5.2 urllib2.py</a><br>
<a href="../bits/urllib_152.py">1.5.2 urllib.py</a><br>

<br>

<a href="./doc.html#examples">Examples</a><br>
<a href="./doc.html#browsers">Mozilla & MSIE</a><br>
<a href="./doc.html#file">Cookies in a file</a><br>
<a href="./doc.html#cookiejar">Using a <code>CookieJar</code></a><br>
<a href="./doc.html#extras">Processors</a><br>
<a href="./doc.html#requests">Request confusion</a><br>
<a href="./doc.html#headers">Adding headers</a><br>
<a href="./doc.html#unverifiable">Verifiability</a><br>
<a href="./doc.html#rfc2965">RFC 2965</a><br>
<a href="./doc.html#debugging">Debugging</a><br>
<a href="./doc.html#script">Embedded scripts</a><br>
<a href="./doc.html#dates">HTTP date parsing</a><br>
<a href="./doc.html#standards">Standards</a><br>
<a href="./doc.html#faq_use">FAQs - usage</a><br>

</div>

</body>

</html>