File: PKG-INFO

Metadata-Version: 2.1
Name: Protego
Version: 0.2.1
Summary: Pure-Python robots.txt parser with support for modern conventions
Home-page: https://github.com/scrapy/protego
Author: Anubhav Patel
Author-email: anubhavp28@gmail.com
License: BSD
Description: =======
        Protego
        =======
        
        .. image:: https://img.shields.io/pypi/pyversions/protego.svg
           :target: https://pypi.python.org/pypi/protego
           :alt: Supported Python Versions
        
        .. image:: https://img.shields.io/travis/scrapy/protego/master.svg
           :target: https://travis-ci.org/scrapy/protego
           :alt: Build Status
        
        Protego is a pure-Python ``robots.txt`` parser with support for modern
        conventions.
        
        
        Install
        =======
        
        To install Protego, simply use pip:
        
        .. code-block:: none
        
            pip install protego
        
        
        Usage
        =====
        
        >>> from protego import Protego
        >>> robotstxt = """
        ... User-agent: *
        ... Disallow: /
        ... Allow: /about
        ... Allow: /account
        ... Disallow: /account/contact$
        ... Disallow: /account/*/profile
        ... Crawl-delay: 4
        ... Request-rate: 10/1m                 # 10 requests every 1 minute
        ... 
        ... Sitemap: http://example.com/sitemap-index.xml
        ... Host: http://example.co.in
        ... """
        >>> rp = Protego.parse(robotstxt)
        >>> rp.can_fetch("http://example.com/profiles", "mybot")
        False
        >>> rp.can_fetch("http://example.com/about", "mybot")
        True
        >>> rp.can_fetch("http://example.com/account", "mybot")
        True
        >>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
        False
        >>> rp.can_fetch("http://example.com/account/contact", "mybot")
        False
        >>> rp.crawl_delay("mybot")
        4.0
        >>> rp.request_rate("mybot")
        RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
        >>> list(rp.sitemaps)
        ['http://example.com/sitemap-index.xml']
        >>> rp.preferred_host
        'http://example.co.in'
        
        Using Protego with Requests_:
        
        >>> from protego import Protego
        >>> import requests
        >>> r = requests.get("https://google.com/robots.txt")
        >>> rp = Protego.parse(r.text)
        >>> rp.can_fetch("https://google.com/search", "mybot")
        False
        >>> rp.can_fetch("https://google.com/search/about", "mybot")
        True
        >>> list(rp.sitemaps)
        ['https://www.google.com/sitemap.xml']
        
        .. _Requests: https://3.python-requests.org/
        
        
        Comparison
        ==========
        
        The following table compares Protego to the most popular ``robots.txt`` parsers
        implemented in Python or featuring Python bindings:
        
        +----------------------------+---------+-----------------+--------+---------------------------+
        |                            | Protego | RobotFileParser | Reppy  | Robotexclusionrulesparser |
        +============================+=========+=================+========+===========================+
        | Implementation language    | Python  | Python          | C++    | Python                    |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | Reference specification    | Google_ | `Martijn Koster’s 1996 draft`_                       |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | `Wildcard support`_        | ✓       |                 | ✓      | ✓                         |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | `Length-based precedence`_ | ✓       |                 | ✓      |                           |
        +----------------------------+---------+-----------------+--------+---------------------------+
        | Performance_               |         | +40%            | +1300% | -25%                      |
        +----------------------------+---------+-----------------+--------+---------------------------+
        
        .. _Google: https://developers.google.com/search/reference/robots_txt
        .. _Length-based precedence: https://developers.google.com/search/reference/robots_txt#order-of-precedence-for-group-member-lines
        .. _Martijn Koster’s 1996 draft: https://www.robotstxt.org/norobots-rfc.txt
        .. _Performance: https://anubhavp28.github.io/gsoc-weekly-checkin-12/
        .. _Wildcard support: https://developers.google.com/search/reference/robots_txt#url-matching-based-on-path-values
        
        
        API Reference
        =============
        
        Class ``protego.Protego``:
        
        Properties
        ----------
        
        *   ``sitemaps`` {``list_iterator``} An iterator over the sitemap URLs
            specified in ``robots.txt``.
        
        *   ``preferred_host`` {string} Preferred host specified in ``robots.txt``.
        
        
        Methods
        -------
        
        *   ``parse(robotstxt_body)`` Parse ``robots.txt`` and return a new instance of
            ``protego.Protego``.
        
        *   ``can_fetch(url, user_agent)`` Return ``True`` if the user agent can fetch
            the URL, otherwise return ``False``.
        
        *   ``crawl_delay(user_agent)`` Return the crawl delay specified for the user
            agent as a float. If nothing is specified, return ``None``.
        
        *   ``request_rate(user_agent)`` Return the request rate specified for the user
            agent as a named tuple ``RequestRate(requests, seconds, start_time,
            end_time)``. If nothing is specified, return ``None``.
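        
        A minimal end-to-end sketch tying these methods together (the
        ``robots.txt`` body, URL and user agent below are made up for
        illustration, and the fallback applied when ``crawl_delay`` returns
        ``None`` is a choice of this example, not part of the library):
        
        .. code-block:: python
        
            from protego import Protego
        
            # Hypothetical robots.txt body; any string works here.
            body = """
            User-agent: *
            Disallow: /private
            Crawl-delay: 2
            """
        
            rp = Protego.parse(body)
        
            if rp.can_fetch("https://example.com/private/page", "mybot"):
                print("allowed")
            else:
                print("disallowed")
        
            # crawl_delay and request_rate return None when robots.txt
            # says nothing, so fall back to values of our own choosing.
            delay = rp.crawl_delay("mybot") or 1.0
            rate = rp.request_rate("mybot")
            if rate is not None:
                print("%d requests per %d seconds" % (rate.requests, rate.seconds))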
        
Keywords: robots.txt,parser,robots,rep
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*
Description-Content-Type: text/x-rst