File: autothrottle.rst

package info (click to toggle)
python-scrapy 1.5.1-1%2Bdeb10u1
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 4,404 kB
  • sloc: python: 25,793; xml: 199; makefile: 95; sh: 33
file content (165 lines) | stat: -rw-r--r-- 5,745 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
.. _topics-autothrottle:

======================
AutoThrottle extension
======================

This is an extension for automatically throttling crawling speed based on load
of both the Scrapy server and the website you are crawling.

Design goals
============

1. be nicer to sites instead of using default download delay of zero
2. automatically adjust scrapy to the optimum crawling speed, so the user
   doesn't have to tune the download delays to find the optimum one.
   The user only needs to specify the maximum concurrent requests
   it allows, and the extension does the rest.

.. _autothrottle-algorithm:

How it works
============

AutoThrottle extension adjusts download delays dynamically to make spider send
:setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` concurrent requests on average
to each remote website.

It uses download latency to compute the delays. The main idea is the
following: if a server needs ``latency`` seconds to respond, a client
should send a request each ``latency/N`` seconds to have ``N`` requests
processed in parallel.

Instead of adjusting the delays one can just set a small fixed
download delay and impose hard limits on concurrency using
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP` options. It will provide a similar
effect, but there are some important differences:

* because the download delay is small there will be occasional bursts
  of requests;
* often non-200 (error) responses can be returned faster than regular
  responses, so with a small download delay and a hard concurrency limit
  crawler will be sending requests to server faster when server starts to
  return errors. But this is an opposite of what crawler should do - in case
  of errors it makes more sense to slow down: these errors may be caused by
  the high request rate.

AutoThrottle doesn't have these issues.

Throttling algorithm
====================

AutoThrottle algorithm adjusts download delays based on the following rules:

1. spiders always start with a download delay of
   :setting:`AUTOTHROTTLE_START_DELAY`;
2. when a response is received, the target download delay is calculated as
   ``latency / N`` where ``latency`` is a latency of the response,
   and ``N`` is :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`.
3. download delay for next requests is set to the average of previous
   download delay and the target download delay;
4. latencies of non-200 responses are not allowed to decrease the delay;
5. download delay can't become less than :setting:`DOWNLOAD_DELAY` or greater
   than :setting:`AUTOTHROTTLE_MAX_DELAY`

.. note:: The AutoThrottle extension honours the standard Scrapy settings for
   concurrency and delay. This means that it will respect
   :setting:`CONCURRENT_REQUESTS_PER_DOMAIN` and
   :setting:`CONCURRENT_REQUESTS_PER_IP` options and
   never set a download delay lower than :setting:`DOWNLOAD_DELAY`.

.. _download-latency:

In Scrapy, the download latency is measured as the time elapsed between
establishing the TCP connection and receiving the HTTP headers.

Note that these latencies are very hard to measure accurately in a cooperative
multitasking environment because Scrapy may be busy processing a spider
callback, for example, and unable to attend downloads. However, these latencies
should still give a reasonable estimate of how busy Scrapy (and ultimately, the
server) is, and this extension builds on that premise.

Settings
========

The settings used to control the AutoThrottle extension are:

* :setting:`AUTOTHROTTLE_ENABLED`
* :setting:`AUTOTHROTTLE_START_DELAY`
* :setting:`AUTOTHROTTLE_MAX_DELAY`
* :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY`
* :setting:`AUTOTHROTTLE_DEBUG`
* :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
* :setting:`CONCURRENT_REQUESTS_PER_IP`
* :setting:`DOWNLOAD_DELAY`

For more information see :ref:`autothrottle-algorithm`.

.. setting:: AUTOTHROTTLE_ENABLED

AUTOTHROTTLE_ENABLED
~~~~~~~~~~~~~~~~~~~~

Default: ``False``

Enables the AutoThrottle extension.

.. setting:: AUTOTHROTTLE_START_DELAY

AUTOTHROTTLE_START_DELAY
~~~~~~~~~~~~~~~~~~~~~~~~

Default: ``5.0``

The initial download delay (in seconds).

.. setting:: AUTOTHROTTLE_MAX_DELAY

AUTOTHROTTLE_MAX_DELAY
~~~~~~~~~~~~~~~~~~~~~~

Default: ``60.0``

The maximum download delay (in seconds) to be set in case of high latencies.

.. setting:: AUTOTHROTTLE_TARGET_CONCURRENCY

AUTOTHROTTLE_TARGET_CONCURRENCY
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. versionadded:: 1.1

Default: ``1.0``

Average number of requests Scrapy should be sending in parallel to remote
websites.

By default, AutoThrottle adjusts the delay to send a single
concurrent request to each of the remote websites. Set this option to
a higher value (e.g. ``2.0``) to increase the throughput and the load on remote
servers. A lower ``AUTOTHROTTLE_TARGET_CONCURRENCY`` value
(e.g. ``0.5``) makes the crawler more conservative and polite.

Note that :setting:`CONCURRENT_REQUESTS_PER_DOMAIN`
and :setting:`CONCURRENT_REQUESTS_PER_IP` options are still respected
when AutoThrottle extension is enabled. This means that if
``AUTOTHROTTLE_TARGET_CONCURRENCY`` is set to a value higher than
:setting:`CONCURRENT_REQUESTS_PER_DOMAIN` or
:setting:`CONCURRENT_REQUESTS_PER_IP`, the crawler won't reach this number
of concurrent requests.

At every given time point Scrapy can be sending more or less concurrent
requests than ``AUTOTHROTTLE_TARGET_CONCURRENCY``; it is a suggested
value the crawler tries to approach, not a hard limit.

.. setting:: AUTOTHROTTLE_DEBUG

AUTOTHROTTLE_DEBUG
~~~~~~~~~~~~~~~~~~

Default: ``False``

Enable AutoThrottle debug mode which will display stats on every response
received, so you can see how the throttling parameters are being adjusted in
real time.