File: practices.rst

.. _topics-practices:

================
Common Practices
================

This section documents common practices when using Scrapy. It covers topics
that span many areas and don't fall neatly into any other specific section.

.. skip: start

.. _run-from-script:

Run Scrapy from a script
========================

You can use the :ref:`API <topics-api>` to run Scrapy from a script, instead of
the typical way of running Scrapy via ``scrapy crawl``.

Remember that Scrapy is built on top of the Twisted
asynchronous networking library, so you need to run it inside the Twisted reactor.

The first utility you can use to run your spiders is
:class:`scrapy.crawler.AsyncCrawlerProcess` or
:class:`scrapy.crawler.CrawlerProcess`. These classes will start a Twisted
reactor for you, configuring the logging and setting shutdown handlers. These
classes are the ones used by all Scrapy commands. They have similar
functionality, differing in their asynchronous API style:
:class:`~scrapy.crawler.AsyncCrawlerProcess` returns coroutines from its
asynchronous methods while :class:`~scrapy.crawler.CrawlerProcess` returns
:class:`~twisted.internet.defer.Deferred` objects.

Here's an example showing how to run a single spider with
:class:`~scrapy.crawler.AsyncCrawlerProcess`.

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerProcess


    class MySpider(scrapy.Spider):
        # Your spider definition
        ...


    process = AsyncCrawlerProcess(
        settings={
            "FEEDS": {
                "items.json": {"format": "json"},
            },
        }
    )

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

You can define :ref:`settings <topics-settings>` within the dictionary passed
to :class:`~scrapy.crawler.AsyncCrawlerProcess`. Make sure to check the
:class:`~scrapy.crawler.AsyncCrawlerProcess`
documentation to get acquainted with its usage details.
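
For comparison, here is a minimal sketch of the same run using
:class:`~scrapy.crawler.CrawlerProcess`, assuming the same ``MySpider`` as
above. Since ``start()`` blocks in both cases, the difference in asynchronous
API style only shows when you work with the objects returned by methods such
as ``crawl()``:

.. code-block:: python

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(
        settings={
            "FEEDS": {
                "items.json": {"format": "json"},
            },
        }
    )

    # crawl() returns a Deferred here (the Deferred-based API style described
    # above); its return value can be ignored when start() is used.
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished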

If you are inside a Scrapy project, there are some additional helpers you can
use to import those components within the project. You can automatically import
your spiders by passing their name to
:class:`~scrapy.crawler.AsyncCrawlerProcess`, and use
:func:`scrapy.utils.project.get_project_settings` to get a
:class:`~scrapy.settings.Settings` instance with your project settings.

What follows is a working example of how to do that, using the `testspiders`_
project as an example.

.. code-block:: python

    from scrapy.crawler import AsyncCrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = AsyncCrawlerProcess(get_project_settings())

    # 'followall' is the name of one of the spiders of the project.
    process.crawl("followall", domain="scrapy.org")
    process.start()  # the script will block here until the crawling is finished

There's another Scrapy utility that provides more control over the crawling
process: :class:`scrapy.crawler.AsyncCrawlerRunner` or
:class:`scrapy.crawler.CrawlerRunner`. These classes are thin wrappers
that encapsulate some simple helpers to run multiple crawlers, but they won't
start or interfere with existing reactors in any way. Just like
:class:`scrapy.crawler.AsyncCrawlerProcess` and
:class:`scrapy.crawler.CrawlerProcess`, they differ in their asynchronous API
style.

When using these classes, the reactor should be explicitly run after scheduling
your spiders. It's recommended that you use
:class:`~scrapy.crawler.AsyncCrawlerRunner` or
:class:`~scrapy.crawler.CrawlerRunner` instead of
:class:`~scrapy.crawler.AsyncCrawlerProcess` or
:class:`~scrapy.crawler.CrawlerProcess` if your application is already using
Twisted and you want to run Scrapy in the same reactor.

If you want to stop the reactor or run any other code right after the spider
finishes, you can do that after the task returned from
:meth:`AsyncCrawlerRunner.crawl() <scrapy.crawler.AsyncCrawlerRunner.crawl>`
completes (or after the Deferred returned from :meth:`CrawlerRunner.crawl()
<scrapy.crawler.CrawlerRunner.crawl>` fires). In the simplest case you can also
use :func:`twisted.internet.task.react` to start and stop the reactor, though
it may be easier to just use :class:`~scrapy.crawler.AsyncCrawlerProcess` or
:class:`~scrapy.crawler.CrawlerProcess` instead.

Here's an example of using :class:`~scrapy.crawler.AsyncCrawlerRunner` together
with simple reactor management code:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerRunner
    from scrapy.utils.defer import deferred_f_from_coro_f
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react


    class MySpider(scrapy.Spider):
        # Your spider definition
        ...


    async def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = AsyncCrawlerRunner()
        await runner.crawl(MySpider)  # completes when the spider finishes


    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    react(deferred_f_from_coro_f(crawl))

Same example but using :class:`~scrapy.crawler.CrawlerRunner` and a
different reactor (:class:`~scrapy.crawler.AsyncCrawlerRunner` only works
with :class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor`):

.. code-block:: python

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react


    class MySpider(scrapy.Spider):
        custom_settings = {
            "TWISTED_REACTOR": "twisted.internet.epollreactor.EPollReactor",
        }
        # Your spider definition
        ...


    def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = CrawlerRunner()
        d = runner.crawl(MySpider)
        return d  # this Deferred fires when the spider finishes


    install_reactor("twisted.internet.epollreactor.EPollReactor")
    react(crawl)
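
If you would rather manage the reactor lifecycle yourself instead of using
:func:`twisted.internet.task.react`, something along the following lines should
also work. This is a minimal sketch: it installs
:class:`~twisted.internet.asyncioreactor.AsyncioSelectorReactor` (which must
match your :setting:`TWISTED_REACTOR` setting) and stops the reactor from a
callback attached to the Deferred returned by ``crawl()``:

.. code-block:: python

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor

    # Install the reactor before twisted.internet.reactor is imported anywhere.
    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

    from twisted.internet import reactor  # the reactor installed above


    class MySpider(scrapy.Spider):
        # Your spider definition
        ...


    configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
    runner = CrawlerRunner()
    d = runner.crawl(MySpider)
    d.addBoth(lambda _: reactor.stop())  # stop the reactor when the spider finishes
    reactor.run()  # the script will block here until the crawling is finished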

.. seealso:: :doc:`twisted:core/howto/reactor-basics`

.. _run-multiple-spiders:

Running multiple spiders in the same process
============================================

By default, Scrapy runs a single spider per process when you run ``scrapy
crawl``. However, Scrapy supports running multiple spiders per process using
the :ref:`internal API <topics-api>`.

Here is an example that runs multiple spiders simultaneously:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerProcess
    from scrapy.utils.project import get_project_settings


    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...


    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...


    settings = get_project_settings()
    process = AsyncCrawlerProcess(settings)
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished

Same example using :class:`~scrapy.crawler.AsyncCrawlerRunner`:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerRunner
    from scrapy.utils.defer import deferred_f_from_coro_f
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react


    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...


    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...


    async def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = AsyncCrawlerRunner()
        runner.crawl(MySpider1)
        runner.crawl(MySpider2)
        await runner.join()  # completes when both spiders finish


    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    react(deferred_f_from_coro_f(crawl))


Same example but running the spiders sequentially, awaiting each crawl so that
one spider finishes before the next one starts:

.. code-block:: python

    import scrapy
    from scrapy.crawler import AsyncCrawlerRunner
    from scrapy.utils.defer import deferred_f_from_coro_f
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor
    from twisted.internet.task import react


    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...


    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...


    async def crawl(_):
        configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
        runner = AsyncCrawlerRunner()
        await runner.crawl(MySpider1)
        await runner.crawl(MySpider2)


    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")
    react(deferred_f_from_coro_f(crawl))
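
For completeness, here is a Deferred-based sketch of the parallel variant using
:class:`~scrapy.crawler.CrawlerRunner` and manual reactor management, with the
same reactor caveats as in :ref:`run-from-script`:

.. code-block:: python

    import scrapy
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.reactor import install_reactor

    # Install the reactor before twisted.internet.reactor is imported anywhere.
    install_reactor("twisted.internet.asyncioreactor.AsyncioSelectorReactor")

    from twisted.internet import reactor  # the reactor installed above


    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...


    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...


    configure_logging({"LOG_FORMAT": "%(levelname)s: %(message)s"})
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()  # this Deferred fires when both spiders finish
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until all crawling jobs are finished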

.. note:: When running multiple spiders in the same process, :ref:`reactor
    settings <reactor-settings>` should not have a different value per spider.
    Also, :ref:`pre-crawler settings <pre-crawler-settings>` cannot be defined
    per spider.

.. seealso:: :ref:`run-from-script`.

.. skip: end

.. _distributed-crawls:

Distributed crawls
==================

Scrapy doesn't provide any built-in facility for running crawls in a distributed
(multi-server) manner. However, there are some ways to distribute crawls, which
vary depending on how you plan to distribute them.

If you have many spiders, the obvious way to distribute the load is to set up
many Scrapyd instances and distribute spider runs among those.

If you instead want to run a single (big) spider across many machines, what
you usually do is partition the URLs to crawl and send them to each separate
spider. Here is a concrete example:

First, you prepare the list of URLs to crawl and split it across separate URL
list files::

    http://somedomain.com/urls-to-crawl/spider1/part1.list
    http://somedomain.com/urls-to-crawl/spider1/part2.list
    http://somedomain.com/urls-to-crawl/spider1/part3.list

Then you fire a spider run on 3 different Scrapyd servers. The spider would
receive a spider argument ``part`` with the number of the partition to crawl (a
minimal sketch of such a spider follows the commands below)::

    curl http://scrapy1.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=1
    curl http://scrapy2.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=2
    curl http://scrapy3.mycompany.com:6800/schedule.json -d project=myproject -d spider=spider1 -d part=3
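
Here is a minimal sketch of such a spider. It only illustrates the partitioning
approach: the class body, domain and URL pattern are hypothetical placeholders
that match the listing above, not part of the Scrapy API:

.. code-block:: python

    import scrapy


    class Spider1(scrapy.Spider):
        name = "spider1"

        def __init__(self, part=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.part = part

        async def start(self):
            # start() yields the initial requests; on Scrapy versions older
            # than 2.13, use start_requests() instead.
            # Download this partition's URL list, then crawl every URL in it.
            yield scrapy.Request(
                f"http://somedomain.com/urls-to-crawl/spider1/part{self.part}.list",
                callback=self.parse_url_list,
            )

        def parse_url_list(self, response):
            for url in response.text.splitlines():
                if url.strip():
                    yield scrapy.Request(url.strip(), callback=self.parse)

        def parse(self, response):
            # Your parsing logic
            ...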

.. _bans:

Avoiding getting banned
=======================

Some websites implement certain measures to prevent bots from crawling them,
with varying degrees of sophistication. Getting around those measures can be
difficult and tricky, and may sometimes require special infrastructure. Please
consider contacting `commercial support`_ if in doubt.

Here are some tips to keep in mind when dealing with these kinds of sites:

* rotate your user agent from a pool of well-known ones used by browsers
  (Google around to get a list of them)
* disable cookies (see :setting:`COOKIES_ENABLED`), as some sites may use
  cookies to spot bot behaviour
* use download delays (2 seconds or higher; see the :setting:`DOWNLOAD_DELAY`
  setting), as in the settings sketch after this list
* if possible, use `Common Crawl`_ to fetch pages, instead of hitting the sites
  directly
* use a pool of rotating IPs. For example, the free `Tor project`_ or paid
  services like `ProxyMesh`_. An open source alternative is `scrapoxy`_, a
  super proxy that you can attach your own proxies to.
* use a ban avoidance service, such as `Zyte API`_, which provides a `Scrapy
  plugin <https://github.com/scrapy-plugins/scrapy-zyte-api>`__ and additional
  features, like `AI web scraping <https://www.zyte.com/ai-web-scraping/>`__
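
As a starting point, the cookie and delay tips above might translate into
project settings along these lines; the values are illustrative defaults to
tune per site, not recommendations:

.. code-block:: python

    # settings.py (illustrative values only)

    COOKIES_ENABLED = False  # some sites use cookies to spot bot behaviour

    DOWNLOAD_DELAY = 2  # seconds to wait between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True  # vary the actual delay to look less uniform

    # A single static user agent; rotating between several well-known browser
    # user agents typically requires a downloader middleware or a plugin.
    USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"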

If you are still unable to prevent your bot getting banned, consider contacting
`commercial support`_.

.. _Tor project: https://www.torproject.org/
.. _commercial support: https://scrapy.org/support/
.. _ProxyMesh: https://proxymesh.com/
.. _Common Crawl: https://commoncrawl.org/
.. _testspiders: https://github.com/scrapinghub/testspiders
.. _scrapoxy: https://scrapoxy.io/
.. _Zyte API: https://docs.zyte.com/zyte-api/get-started.html