File: README.rst

=======
Protego
=======

.. image:: https://img.shields.io/pypi/pyversions/protego.svg
   :target: https://pypi.python.org/pypi/protego
   :alt: Supported Python Versions

.. image:: https://github.com/scrapy/protego/actions/workflows/tests-ubuntu.yml/badge.svg
   :target: https://github.com/scrapy/protego/actions/workflows/tests-ubuntu.yml
   :alt: CI

Protego is a pure-Python ``robots.txt`` parser that supports modern conventions
such as wildcard matching, length-based rule precedence, ``Crawl-delay``,
``Request-rate``, ``Sitemap`` and ``Host``.


Install
=======

To install Protego, simply use pip:

.. code-block:: none

    pip install protego


Usage
=====

.. code-block:: pycon

   >>> from protego import Protego
   >>> robotstxt = """
   ... User-agent: *
   ... Disallow: /
   ... Allow: /about
   ... Allow: /account
   ... Disallow: /account/contact$
   ... Disallow: /account/*/profile
   ... Crawl-delay: 4
   ... Request-rate: 10/1m                 # 10 requests every 1 minute
   ... 
   ... Sitemap: http://example.com/sitemap-index.xml
   ... Host: http://example.co.in
   ... """
   >>> rp = Protego.parse(robotstxt)
   >>> rp.can_fetch("http://example.com/profiles", "mybot")
   False
   >>> rp.can_fetch("http://example.com/about", "mybot")
   True
   >>> rp.can_fetch("http://example.com/account", "mybot")
   True
   >>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
   False
   >>> rp.can_fetch("http://example.com/account/contact", "mybot")
   False
   >>> rp.crawl_delay("mybot")
   4.0
   >>> rp.request_rate("mybot")
   RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
   >>> list(rp.sitemaps)
   ['http://example.com/sitemap-index.xml']
   >>> rp.preferred_host
   'http://example.co.in'


Using Protego with Requests_:

.. code-block:: pycon

   >>> from protego import Protego
   >>> import requests
   >>> r = requests.get("https://google.com/robots.txt")
   >>> rp = Protego.parse(r.text)
   >>> rp.can_fetch("https://google.com/search", "mybot")
   False
   >>> rp.can_fetch("https://google.com/search/about", "mybot")
   True
   >>> list(rp.sitemaps)
   ['https://www.google.com/sitemap.xml']

.. _Requests: https://3.python-requests.org/


Comparison
==========

The following table compares Protego to the most popular ``robots.txt`` parsers
implemented in Python or featuring Python bindings:

+----------------------------+---------+-----------------+--------+---------------------------+
|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+============================+=========+=================+========+===========================+
| Implementation language    | Python  | Python          | C++    | Python                    |
+----------------------------+---------+-----------------+--------+---------------------------+
| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Length-based precedence`_ | ✓       |                 | ✓      |                           |
+----------------------------+---------+-----------------+--------+---------------------------+
| Performance_               |         | +40%            | +1300% | -25%                      |
+----------------------------+---------+-----------------+--------+---------------------------+

.. _Google: https://developers.google.com/search/reference/robots_txt
.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
.. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values


API Reference
=============

Class ``protego.Protego``:

Properties
----------

*   ``sitemaps`` {``list_iterator``} An iterator over the sitemap URLs specified
    in ``robots.txt``.

*   ``preferred_host`` {string} Preferred host specified in ``robots.txt``.


Methods
-------

*   ``parse(robotstxt_body)`` Parse the given ``robots.txt`` body and return a
    new instance of ``protego.Protego``.

*   ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch
    the URL, otherwise return ``False``.

*   ``crawl_delay(user_agent)`` Return the crawl delay specified for the user
    agent as a float. If nothing is specified, return ``None``.

*   ``request_rate(user_agent)`` Return the request rate specified for the user
    agent as a named tuple ``RequestRate(requests, seconds, start_time,
    end_time)``. If nothing is specified, return ``None``.

*   ``visit_time(user_agent)`` Return the visit time specified for the user
    agent as a named tuple ``VisitTime(start_time, end_time)``. If nothing is
    specified, return ``None``. See the example below.
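

The following sketch exercises these methods on a small, made-up ``robots.txt``
body. The ``Visit-time`` value uses the conventional ``HHMM-HHMM`` form, and
``mybot`` is just an illustrative user agent:

.. code-block:: pycon

   >>> from protego import Protego
   >>> robotstxt = """
   ... User-agent: *
   ... Disallow: /private
   ... Crawl-delay: 2
   ... Visit-time: 0200-0630               # allowed visit window, HHMM-HHMM
   ... """
   >>> rp = Protego.parse(robotstxt)
   >>> rp.can_fetch("http://example.com/private/data", "mybot")
   False
   >>> rp.crawl_delay("mybot")
   2.0
   >>> rp.request_rate("mybot") is None    # no Request-rate rule in this robots.txt
   True
   >>> rp.visit_time("mybot") is not None  # a VisitTime(start_time, end_time) tuple
   True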