File: PKG-INFO

Metadata-Version: 2.4
Name: Protego
Version: 0.5.0
Summary: Pure-Python robots.txt parser with support for modern conventions
Project-URL: Homepage, https://github.com/scrapy/protego
Project-URL: Source, https://github.com/scrapy/protego
Project-URL: Tracker, https://github.com/scrapy/protego/issues
Project-URL: Release notes, https://github.com/scrapy/protego/blob/master/CHANGELOG.rst
Author-email: Anubhav Patel <anubhavp28@gmail.com>
License-Expression: BSD-3-Clause
License-File: LICENSE
Keywords: parser,rep,robots,robots.txt
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/x-rst

=======
Protego
=======

.. image:: https://img.shields.io/pypi/pyversions/protego.svg
   :target: https://pypi.python.org/pypi/protego
   :alt: Supported Python Versions

.. image:: https://github.com/scrapy/protego/actions/workflows/tests-ubuntu.yml/badge.svg
   :target: https://github.com/scrapy/protego/actions/workflows/tests-ubuntu.yml
   :alt: CI

Protego is a pure-Python ``robots.txt`` parser with support for modern
conventions.


Install
=======

To install Protego, use pip:

.. code-block:: none

    pip install protego
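
If you want to confirm which version was installed, one option is the standard
library's ``importlib.metadata``, available on every Python version Protego
supports:

.. code-block:: none

    python -c "from importlib.metadata import version; print(version('Protego'))"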


Usage
=====

.. code-block:: pycon

   >>> from protego import Protego
   >>> robotstxt = """
   ... User-agent: *
   ... Disallow: /
   ... Allow: /about
   ... Allow: /account
   ... Disallow: /account/contact$
   ... Disallow: /account/*/profile
   ... Crawl-delay: 4
   ... Request-rate: 10/1m                 # 10 requests every 1 minute
   ... 
   ... Sitemap: http://example.com/sitemap-index.xml
   ... Host: http://example.co.in
   ... """
   >>> rp = Protego.parse(robotstxt)
   >>> rp.can_fetch("http://example.com/profiles", "mybot")
   False
   >>> rp.can_fetch("http://example.com/about", "mybot")
   True
   >>> rp.can_fetch("http://example.com/account", "mybot")
   True
   >>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
   False
   >>> rp.can_fetch("http://example.com/account/contact", "mybot")
   False
   >>> rp.crawl_delay("mybot")
   4.0
   >>> rp.request_rate("mybot")
   RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
   >>> list(rp.sitemaps)
   ['http://example.com/sitemap-index.xml']
   >>> rp.preferred_host
   'http://example.co.in'


Using Protego with Requests_:

.. code-block:: pycon

   >>> from protego import Protego
   >>> import requests
   >>> r = requests.get("https://google.com/robots.txt")
   >>> rp = Protego.parse(r.text)
   >>> rp.can_fetch("https://google.com/search", "mybot")
   False
   >>> rp.can_fetch("https://google.com/search/about", "mybot")
   True
   >>> list(rp.sitemaps)
   ['https://www.google.com/sitemap.xml']

.. _Requests: https://3.python-requests.org/


Comparison
==========

The following table compares Protego to the most popular ``robots.txt`` parsers
implemented in Python or featuring Python bindings:

+----------------------------+---------+-----------------+--------+---------------------------+
|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+============================+=========+=================+========+===========================+
| Implementation language    | Python  | Python          | C++    | Python                    |
+----------------------------+---------+-----------------+--------+---------------------------+
| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Length-based precedence`_ | ✓       |                 | ✓      |                           |
+----------------------------+---------+-----------------+--------+---------------------------+
| Performance_               |         | +40%            | +1300% | -25%                      |
+----------------------------+---------+-----------------+--------+---------------------------+

.. _Google: https://developers.google.com/search/reference/robots_txt
.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
.. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values
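
To illustrate wildcard support and length-based precedence concretely, here is
a small sketch (the rules, URLs, and user agent are made up for the example;
the expected results follow the Google specification that Protego implements):

.. code-block:: pycon

   >>> from protego import Protego
   >>> rp = Protego.parse("""
   ... User-agent: *
   ... Disallow: /*.pdf$
   ... Allow: /downloads/
   ... Disallow: /downloads/private
   ... """)
   >>> rp.can_fetch("https://example.com/report.pdf", "mybot")
   False
   >>> rp.can_fetch("https://example.com/downloads/report", "mybot")
   True
   >>> rp.can_fetch("https://example.com/downloads/private/report", "mybot")
   False

The last two calls differ only in which rule matches more characters:
``/downloads/private`` is longer than ``/downloads/``, so the disallow rule
wins for the second URL. Under Martijn Koster's 1996 draft, by contrast, the
first matching rule wins, so parsers that follow it can disagree on such
inputs.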


API Reference
=============

Class ``protego.Protego``:

Properties
----------

*   ``sitemaps`` {``list_iterator``} An iterator over the sitemap URLs
    specified in ``robots.txt``.

*   ``preferred_host`` {string} Preferred host specified in ``robots.txt``.


Methods
-------

*   ``parse(robotstxt_body)`` Class method; parse the given ``robots.txt``
    body (a string) and return a new ``protego.Protego`` instance.

*   ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch
    the URL, otherwise return ``False``.

*   ``crawl_delay(user_agent)`` Return the crawl delay specified for the user
    agent as a float. If nothing is specified, return ``None``.

*   ``request_rate(user_agent)`` Return the request rate specified for the user
    agent as a named tuple ``RequestRate(requests, seconds, start_time,
    end_time)``. If nothing is specified, return ``None``.

*   ``visit_time(user_agent)`` Return the visit time specified for the user 
    agent as a named tuple ``VisitTime(start_time, end_time)``. 
    If nothing is specified, return ``None``.
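
``visit_time`` is the only method above not demonstrated in the Usage section.
Below is a minimal sketch; the ``Visit-time`` directive format (``HHMM-HHMM``)
follows the common extended-standard convention, and the exact output shown is
illustrative rather than authoritative:

.. code-block:: pycon

   >>> from protego import Protego
   >>> rp = Protego.parse("""
   ... User-agent: *
   ... Visit-time: 0200-0630
   ... """)
   >>> rp.visit_time("mybot")  # illustrative output
   VisitTime(start_time=datetime.time(2, 0), end_time=datetime.time(6, 30))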