File: README.md

# Protego

![build-badge](https://api.travis-ci.com/scrapy/protego.svg?branch=master)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)

## Overview
Protego is a pure-Python `robots.txt` parser with support for modern conventions.

## Requirements
* Python 2.7 or Python 3.5+
* Works on Linux, Windows, macOS, and BSD

## Install

To install Protego, use pip:

```shell
pip install protego
```

## Usage

```python
>>> from protego import Protego
>>> robotstxt = """
... User-agent: *
... Disallow: /
... Allow: /about
... Allow: /account
... Disallow: /account/contact$
... Disallow: /account/*/profile
... Crawl-delay: 4
... Request-rate: 10/1m                 # 10 requests every 1 minute
... 
... Sitemap: http://example.com/sitemap-index.xml
... Host: http://example.co.in
... """
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
>>> rp.can_fetch("http://example.com/about", "mybot")
True
>>> rp.can_fetch("http://example.com/account", "mybot")
True
>>> rp.can_fetch("http://example.com/account/myuser/profile", "mybot")
False
>>> rp.can_fetch("http://example.com/account/contact", "mybot")
False
>>> rp.crawl_delay("mybot")
4.0
>>> rp.request_rate("mybot")
RequestRate(requests=10, seconds=60, start_time=None, end_time=None)
>>> list(rp.sitemaps)
['http://example.com/sitemap-index.xml']
>>> rp.preferred_host
'http://example.co.in'
```

Using Protego with [Requests](https://3.python-requests.org/):

```python
>>> from protego import Protego
>>> import requests
>>> r = requests.get("https://google.com/robots.txt")
>>> rp = Protego.parse(r.text)
>>> rp.can_fetch("https://google.com/search", "mybot")
False
>>> rp.can_fetch("https://google.com/search/about", "mybot")
True
>>> list(rp.sitemaps)
['https://www.google.com/sitemap.xml']
```
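Protego only needs the `robots.txt` body as a string, so any HTTP client works. The sketch below does the same thing with the standard library's `urllib.request`; the URL and the `"mybot"` user agent are placeholders, not values the library requires.

```python
from urllib.request import urlopen

from protego import Protego

# Fetch robots.txt with the standard library instead of Requests.
# The URL and "mybot" user agent below are placeholder values.
with urlopen("https://example.com/robots.txt") as response:
    robotstxt_body = response.read().decode("utf-8")

rp = Protego.parse(robotstxt_body)
print(rp.can_fetch("https://example.com/some/page", "mybot"))
```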

## Documentation

Class `protego.Protego`:
    
### Properties

* `sitemaps` {`list_iterator`} An iterator over the sitemap URLs specified in `robots.txt`.
* `preferred_host` {string} Preferred host specified in `robots.txt`.

### Methods

* `parse(robotstxt_body)` Class method that parses a `robots.txt` body and returns a new `protego.Protego` instance.
* `can_fetch(url, user_agent)` Return `True` if the user agent can fetch the URL, otherwise return `False`.
* `crawl_delay(user_agent)` Return the crawl delay specified for the user agent as a float. If nothing is specified, return `None`.
* `request_rate(user_agent)` Return the request rate specified for the user agent as a named tuple `RequestRate(requests, seconds, start_time, end_time)`. If nothing is specified, return `None`.
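
Both `crawl_delay` and `request_rate` return `None` when the corresponding directive is absent, so callers should handle that case before throttling. A minimal sketch, assuming a hypothetical `"mybot"` user agent and a fallback delay chosen by the caller:

```python
import time

from protego import Protego

# A robots.txt body with no Crawl-delay or Request-rate directives.
rp = Protego.parse("User-agent: *\nDisallow: /private\n")

delay = rp.crawl_delay("mybot")   # None: no Crawl-delay was specified
rate = rp.request_rate("mybot")   # None: no Request-rate was specified

# Fall back to a delay of our own choosing when the site sets none.
time.sleep(delay if delay is not None else 1.0)
```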