Scraping
=========

..
  >>> from pyquery import PyQuery as pq

PyQuery can load an HTML document from a URL::

  >>> pq(your_url)
  [<html>]

By default it uses Python's urllib.

If `requests`_ is installed, PyQuery uses it instead. This allows you to pass most `requests`_ parameters::

  >>> pq(your_url, headers={'user-agent': 'pyquery'})
  [<html>]

  >>> pq(your_url, {'q': 'foo'}, method='post', verify=True)
  [<html>]


Timeout
-------

The default timeout is 60 seconds. You can change it by setting the ``timeout`` parameter, which is forwarded to the underlying urllib or requests library.

Session
-------

When using the requests library, you can instantiate a ``Session`` object, which keeps state between HTTP calls (for example, to keep cookies). Pass it as the ``session`` parameter to make PyQuery use this session object.

.. _requests: http://docs.python-requests.org/en/latest/