Metadata-Version: 2.2
Name: Protego
Version: 0.4.0
Summary: Pure-Python robots.txt parser with support for modern conventions
Home-page: https://github.com/scrapy/protego
Author: Anubhav Patel
Author-email: anubhavp28@gmail.com
License: BSD
Keywords: robots.txt,parser,robots,rep
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.9
Description-Content-Type: text/x-rst
License-File: LICENSE
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: requires-python
Dynamic: summary

=======
Protego
=======

.. image:: https://img.shields.io/pypi/pyversions/protego.svg
   :target: https://pypi.python.org/pypi/protego
   :alt: Supported Python Versions

.. image:: https://github.com/scrapy/protego/workflows/CI/badge.svg
   :target: https://github.com/scrapy/protego/actions?query=workflow%3ACI
   :alt: CI

Protego is a pure-Python ``robots.txt`` parser with support for modern
conventions.

Install
=======

To install Protego, simply use pip:

.. code-block:: none

    pip install protego

Usage
=====

.. code-block:: pycon

    >>> from protego import Protego
    >>> robotstxt = """
    ... User-agent: *
    ... Disallow: /
    ... Allow: /about
    ... Allow: /account
    ... Disallow: /account/contact$
    ... Disallow: /account/*/profile
    ... Crawl-delay: 4
    ... Request-rate: 10/1m  # 10 requests every 1 minute
    ...
    ... Sitemap: http://example.com/sitemap-index.xml
    ... Host: http://example.co.in
    ... """
    >>> rp = Protego.parse(robotstxt)
    >>> rp.can_fetch("http://example.com/profiles", "mybot")
    False
    >>> rp.can_fetch("http://example.com/about", "mybot")
    True
    >>> rp.can_fetch("http://example.com/account", "mybot")
    True
    >>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
    False
    >>> rp.can_fetch("http://example.com/account/contact", "mybot")
    False
    >>> rp.crawl_delay("mybot")
    4.0
    >>> rp.request_rate("mybot")
    RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
    >>> list(rp.sitemaps)
    ['http://example.com/sitemap-index.xml']
    >>> rp.preferred_host
    'http://example.co.in'

Using Protego with Requests_:

.. code-block:: pycon

    >>> from protego import Protego
    >>> import requests
    >>> r = requests.get("https://google.com/robots.txt")
    >>> rp = Protego.parse(r.text)
    >>> rp.can_fetch("https://google.com/search", "mybot")
    False
    >>> rp.can_fetch("https://google.com/search/about", "mybot")
    True
    >>> list(rp.sitemaps)
    ['https://www.google.com/sitemap.xml']

.. _Requests: https://3.python-requests.org/
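
If you prefer not to depend on Requests, the same flow works with the
standard library alone (a minimal sketch; the target URL is an arbitrary
example and error handling for unreachable or missing ``robots.txt`` files
is omitted):

.. code-block:: python

    from urllib.request import urlopen

    from protego import Protego

    # Fetch a live robots.txt and parse it with Protego.
    with urlopen("https://www.python.org/robots.txt") as response:
        rp = Protego.parse(response.read().decode("utf-8"))

    print(rp.can_fetch("https://www.python.org/", "mybot"))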

Comparison
==========

The following table compares Protego to the most popular ``robots.txt`` parsers
implemented in Python or featuring Python bindings:

+----------------------------+---------+-----------------+--------+---------------------------+
|                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
+============================+=========+=================+========+===========================+
| Implementation language    | Python  | Python          | C++    | Python                    |
+----------------------------+---------+-----------------+--------+---------------------------+
| Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
+----------------------------+---------+-----------------+--------+---------------------------+
| `Length-based precedence`_ | ✓       |                 | ✓      |                           |
+----------------------------+---------+-----------------+--------+---------------------------+
| Performance_               |         | +40%            | +1300% | -25%                      |
+----------------------------+---------+-----------------+--------+---------------------------+

.. _Google: https://developers.google.com/search/reference/robots_txt
.. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
.. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
.. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
.. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values
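
The linked Performance_ post describes how these figures were measured. As a
rough way to produce comparable numbers on your own machine, here is a minimal
sketch using the standard-library ``timeit`` module (the ``robots.txt`` body
and the iteration count are arbitrary choices for illustration, not the
original benchmark setup):

.. code-block:: python

    import timeit

    from protego import Protego

    # Arbitrary robots.txt body for illustration; the linked benchmark
    # used real-world robots.txt files instead.
    ROBOTSTXT = """
    User-agent: *
    Disallow: /admin
    Allow: /admin/public
    Crawl-delay: 2
    """

    def parse_and_query():
        rp = Protego.parse(ROBOTSTXT)
        return rp.can_fetch("https://example.com/admin/public/page", "mybot")

    # Time 10,000 parse+query cycles; run the same loop against another
    # parser to derive a relative percentage like those in the table.
    print(timeit.timeit(parse_and_query, number=10_000))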

API Reference
=============

Class ``protego.Protego``:

Properties
----------

* ``sitemaps`` {``list_iterator``} A list of sitemaps specified in
  ``robots.txt``.

* ``preferred_host`` {string} Preferred host specified in ``robots.txt``.

Methods
-------

* ``parse(robotstxt_body)`` Parse ``robots.txt`` and return a new instance of
  ``protego.Protego``.

* ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch
  the URL, otherwise return ``False``.

* ``crawl_delay(user_agent)`` Return the crawl delay specified for the user
  agent as a float. If nothing is specified, return ``None``.

* ``request_rate(user_agent)`` Return the request rate specified for the user
  agent as a named tuple ``RequestRate(requests, seconds, start_time,
  end_time)``. If nothing is specified, return ``None``.

* ``visit_time(user_agent)`` Return the visit time specified for the user
  agent as a named tuple ``VisitTime(start_time, end_time)``.
  If nothing is specified, return ``None``.
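
For example (a short sketch; the ``Visit-time: 0200-0630`` line and the exact
reprs shown are illustrative assumptions rather than verified doctest output):

.. code-block:: pycon

    >>> from protego import Protego
    >>> rp = Protego.parse("User-agent: *\nDisallow: /private\nVisit-time: 0200-0630")
    >>> rp.can_fetch("https://example.com/private/x", "mybot")
    False
    >>> rp.crawl_delay("mybot") is None  # no Crawl-delay rule present
    True
    >>> rp.request_rate("mybot") is None  # no Request-rate rule present
    True
    >>> rp.visit_time("mybot")  # approximate repr
    VisitTime(start_time=datetime.time(2, 0), end_time=datetime.time(6, 30))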