File: sep-019.rst

=======  ===================
SEP      19
Title    Per-spider settings
Author   Pablo Hoffman, Nicolás Ramirez, Julia Medina
Created  2013-03-07
Status   Final (implemented with minor variations)
=======  ===================

======================================================
SEP-019: Per-spider settings and Crawl Process Cleanup
======================================================

This is a proposal to add support for overriding settings on a per-spider
basis in a consistent way, while taking the chance to refactor the settings
population and the whole crawl workflow.

In short, you will be able to override settings (on a per-spider basis) by
implementing a class method in your spider:

.. code-block:: python

    class MySpider(Spider):
        @classmethod
        def custom_settings(cls):
            return {
                "DOWNLOAD_DELAY": 5.0,
                "RETRY_ENABLED": False,
            }


Proposed changes
================

- a new ``custom_settings`` class method will be added to spiders, to give them
  a chance to override settings *before* they're used to instantiate the crawler
- spider managers will keep the functionality of loading spider classes
  (through a new ``load`` method that returns a spider class given its name),
  but spider initialization will be delegated to crawlers (through a new
  ``from_crawler`` class method in spiders that gives them direct access to the
  crawler)
- the spider manager will be stripped out of the ``Crawler`` class, which will
  no longer need it
- the ``SPIDER_MODULES`` and ``SPIDER_MANAGER_CLASS`` settings will be removed
  and replaced by entries in ``scrapy.cfg``, so spider managers won't need the
  project settings to configure themselves
- ``CrawlerProcess`` will be removed, since crawlers will be created
  independently with a required spider class and an optional ``SettingsReader``
  instance
- the ``Settings`` class will be split into two classes, ``SettingsLoader`` and
  ``SettingsReader``, and a new concept of "setting priority" will be added


Settings
========

The ``Settings`` class will be split into two classes: ``SettingsLoader`` and
``SettingsReader``. The former will be used to collect the different levels of
settings across the project, and the latter will be a frozen version of the
already loaded settings and the preferred way to access them. This will avoid
the current possible misconception that you can change settings after they
have been populated. There will be a new concept of settings priorities, and
``settings.overrides`` will be deprecated in favor of explicitly loading
settings with priorities, which makes the outcome of overriding settings
independent of the order in which they are set.

Because of this, ``CrawlerSettings`` (with its ``overrides``,
``settings_module`` and ``defaults``) will be removed, but its interface could
be maintained for backward compatibility in ``SettingsReader`` (on
``SettingsLoader``, an overrides dictionary and settings with priorities don't
fit together in a consistent implementation). Maintaining these attributes and
their functionality is not advisable, though, since it breaks the read-only
nature of the class.

With the new per-spider settings, there's a need for a helper function that
takes a spider and returns a ``SettingsReader`` instance populated with the
default, project and given spider's settings. The motive behind this is that
``get_project_settings`` can't continue to be used to get a settings instance
for crawler usage when using the API directly, as the project is no longer the
only source of settings. ``get_project_settings`` will become an internal
function because of that.

SettingsLoader
--------------

``SettingsLoader`` is going to populate settings at startup; it'll then be
converted to a ``SettingsReader`` instance and discarded afterwards.

It is supposed to be write-only, but many previously loaded settings need to
be accessed before freezing them. For example, the ``COMMANDS_MODULE`` setting
allows loading additional command default settings. Another example is that
the ``LOG_*`` settings need to be read early, because errors must be logged
while the settings are still being loaded. ``ScrapyCommands`` may also be
configured based upon current settings, as users can plug in custom commands.
These are some of the reasons that suggest that this class needs read-write
access.

- Will have a ``set(name, value, priority)`` method to register a setting with
  a given priority. A ``setdict(dict, priority)`` method may come in handy for
  loading the project's and per-spider settings.

- Will have the current ``Settings`` getter methods (``get``, ``getint``,
  ``getfloat``, ``getdict``, etc.), for the reasons given above.

- Will have a ``freeze`` method that returns an instance of
  ``SettingsReader`` with a copy of the current state of the settings (already
  prioritized). A minimal sketch of the class follows this list.
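
The sketch below illustrates the intended behavior; the internal attribute
names (``_values``, ``_priorities``) are assumptions for illustration, and
``SettingsReader`` is the class sketched in the next section.

.. code-block:: python

    class SettingsLoader:
        """Write-mostly container used while settings are being populated."""

        def __init__(self):
            self._values = {}      # setting name -> value
            self._priorities = {}  # setting name -> priority it was set with

        def set(self, name, value, priority):
            # Keep the value only if its priority is at least as high as the
            # stored one; equal priorities resolve by call order (last wins).
            if priority >= self._priorities.get(name, -1):
                self._values[name] = value
                self._priorities[name] = priority

        def setdict(self, values, priority):
            for name, value in values.items():
                self.set(name, value, priority)

        def get(self, name, default=None):
            return self._values.get(name, default)

        def getint(self, name, default=0):
            return int(self.get(name, default))

        def freeze(self):
            # Return a read-only copy of the already prioritized settings.
            return SettingsReader(self._values)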

SettingsReader
--------------

It's intended to be the class used by the core, extensions, and every other
component that uses settings without modifying them. Because there are objects
that legitimately change settings, such as ``ScrapyCommands``, the use cases
of each settings class need to be comprehensively documented.

New crawlers will be created with an instance of this class (the one returned
by the ``freeze`` method of the already populated ``SettingsLoader``), because
they are not expected to alter the settings.

It'll be read-only, keeping the same getter methods as the current
``Settings`` class (``get``, ``getint``, ``getfloat``, ``getdict``, etc.).
There could be a ``set`` method that throws a descriptive error for debugging
and compatibility purposes, preventing its inadvertent usage.
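
A matching sketch of ``SettingsReader``, under the same assumptions as the
``SettingsLoader`` sketch above:

.. code-block:: python

    class SettingsReader:
        """Frozen, read-only view of settings, as returned by SettingsLoader.freeze()."""

        def __init__(self, values):
            self._values = dict(values)

        def get(self, name, default=None):
            return self._values.get(name, default)

        def getint(self, name, default=0):
            return int(self.get(name, default))

        def getbool(self, name, default=False):
            value = self.get(name, default)
            if isinstance(value, str):
                return value.strip().lower() in ("1", "true", "yes")
            return bool(value)

        def set(self, name, value, priority=None):
            # Fail loudly instead of silently accepting writes after freezing.
            raise TypeError(
                "SettingsReader is immutable; settings can only be changed "
                "on SettingsLoader, before freezing"
            )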

Setting priorities
------------------

There will be 5 setting priorities used by default:

- 0: global defaults (those in ``scrapy.settings.default_settings``)
- 10: per-command defaults (for example, shell runs with ``KEEP_ALIVE=True``)
- 20: project settings (those in ``settings.py``)
- 30: per-spider settings (those returned by ``Spider.custom_settings`` class method)
- 40: command line arguments (those passed in the command line)

There are a couple of issues here:

- The ``SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE`` and ``SCRAPY_{settings}``
  environment variables need to be deprecated. They can be kept for now, with
  a new or an existing priority.

- We could have different priorities for settings given with the ``-s``
  argument and for other named arguments on the command line (for example,
  ``-s LOG_ENABLE=False --loglevel=ERROR`` will set ``LOG_ENABLE`` to True,
  because named options are applied later in the current implementation).
  However, since the processing of command line options is done in one place,
  we could leave them with the same priority and depend on the order of the
  ``set`` calls just for this case (see the usage example after this list).
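
As a usage example of the sketches above, priority resolution is independent
of the order of the ``set`` calls, except when two calls share the same
priority:

.. code-block:: python

    settings = SettingsLoader()
    settings.set("DOWNLOAD_DELAY", 0, priority=0)      # global default
    settings.set("DOWNLOAD_DELAY", 5.0, priority=30)   # per-spider setting
    settings.set("DOWNLOAD_DELAY", 1.0, priority=20)   # project setting: ignored, lower priority
    settings.set("RETRY_ENABLED", True, priority=40)   # command line, first call
    settings.set("RETRY_ENABLED", False, priority=40)  # same priority: last call wins

    frozen = settings.freeze()
    assert frozen.get("DOWNLOAD_DELAY") == 5.0
    assert frozen.get("RETRY_ENABLED") is False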

Deprecated code
---------------

The ``scrapy.conf.settings`` singleton is a deprecated part of the settings
loading implementation. It could be maintained as it is, but the singleton
should implement the new ``SettingsReader`` interface in order to keep
working.


Spider manager
==============

Currently, the spider manager is part of the crawler, which creates a cyclic
dependency between settings and spiders, and it shouldn't belong there.
Spiders should be loaded outside and passed to the crawler object, which will
require a spider class to be instantiated.

The new spider manager will not have access to the settings (they won't be
loaded yet), so it will use ``scrapy.cfg`` to configure itself.

The ``scrapy.cfg`` would look like this::

    [settings]
    default = myproject.settings

    [spiders]
    manager = scrapy.spidermanager.SpiderManager
    modules = myproject.spiders

- ``manager`` replaces the ``SPIDER_MANAGER_CLASS`` setting and, if omitted,
  defaults to ``scrapy.spidermanager.SpiderManager``
- ``modules`` replaces the ``SPIDER_MODULES`` setting and will be required
  (see the sketch after this list for how these entries could be read)
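
For illustration, the ``[spiders]`` section could be read roughly as follows;
``read_spiders_section`` is a hypothetical helper name, and splitting
``modules`` on commas is an assumption about how a list of modules would be
written in the file:

.. code-block:: python

    import configparser

    from scrapy.utils.misc import load_object


    def read_spiders_section(cfg_path="scrapy.cfg"):
        """Return (spider manager class, spider modules) from scrapy.cfg."""
        parser = configparser.ConfigParser()
        parser.read(cfg_path)
        manager_path = parser.get(
            "spiders",
            "manager",
            fallback="scrapy.spidermanager.SpiderManager",
        )
        modules = parser.get("spiders", "modules")  # required, no fallback
        return load_object(manager_path), [m.strip() for m in modules.split(",")]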

These ideas translate into the following changes to the ``SpiderManager``
class:

- ``__init__(spider_modules)`` -> ``__init__()``. ``spider_modules`` will be
  looked up in ``scrapy.cfg``.

- ``create('spider_name', **spider_kwargs)`` -> ``load('spider_name')``. This
  will return a spider class, not an instance. It's basically a dictionary
  lookup on ``self._spiders``.

- All remaining functions should be deprecated or removed accordingly, since a
  crawler reference is no longer needed.

- A new helper, ``get_spider_manager_class_from_scrapycfg``, will be added in
  ``scrapy/utils/spidermanager.py``. A sketch of the resulting class follows
  this list.
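
A sketch of the trimmed-down ``SpiderManager`` under these assumptions
(``walk_modules`` and ``iter_spider_classes`` are existing Scrapy utilities;
``read_spiders_section`` is the hypothetical helper sketched above):

.. code-block:: python

    from scrapy.utils.misc import walk_modules
    from scrapy.utils.spider import iter_spider_classes


    class SpiderManager:
        def __init__(self):
            # spider_modules now comes from scrapy.cfg instead of the settings.
            _, spider_modules = read_spiders_section()
            self._spiders = {}
            for module_path in spider_modules:
                for module in walk_modules(module_path):
                    for spidercls in iter_spider_classes(module):
                        self._spiders[spidercls.name] = spidercls

        def load(self, spider_name):
            # Return the spider class (not an instance).
            return self._spiders[spider_name]

        def list(self):
            return list(self._spiders)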


Spiders
=======

A new class method, ``custom_settings``, is proposed; it can be used to
override project and default settings before they're used to instantiate the
crawler:

.. code-block:: python

    class MySpider(Spider):
        @classmethod
        def custom_settings(cls):
            return {
                "DOWNLOAD_DELAY": 5.0,
                "RETRY_ENABLED": False,
            }

This will only involve a ``set`` call with the corresponding priority when
populating ``SettingsLoader``.

As part of the API changes, a new ``from_crawler`` class method will be added
to spiders to give them a chance to access settings, stats, or the crawler
core components themselves. This should be the new way to create a spider from
now on (instead of instantiating it directly, as is currently done).
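
A minimal sketch of how the base ``Spider`` class could implement
``from_crawler``; the attributes stored on the spider (``crawler``,
``settings``) are assumptions for illustration:

.. code-block:: python

    class Spider:
        name = None

        def __init__(self, name=None, **kwargs):
            if name is not None:
                self.name = name
            self.__dict__.update(kwargs)

        @classmethod
        def custom_settings(cls):
            # No per-spider overrides by default.
            return {}

        @classmethod
        def from_crawler(cls, crawler, **spider_kwargs):
            # Build the spider and keep a reference to the crawler, so the
            # spider can reach settings, stats and other core components.
            spider = cls(**spider_kwargs)
            spider.crawler = crawler
            spider.settings = crawler.settings
            return spider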


Scrapy commands
===============

As already stated, ``ScrapyCommands`` modify the settings, so they need a
reference to a ``SettingsLoader`` instance in order to do so.

The present ``process_options`` implementations in the base and other commands
read and override settings. These overrides should be changed to ``set`` calls
with the allocated priority, as illustrated below.
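
For illustration, a command could translate its options into prioritized
``set`` calls roughly like this (a sketch assuming the command holds a
``SettingsLoader`` as ``self.settings``; priority values follow the table
above):

.. code-block:: python

    class ScrapyCommand:
        def process_options(self, args, opts):
            # Settings passed with -s NAME=VALUE get command line priority.
            for name, value in (s.split("=", 1) for s in opts.set or []):
                self.settings.set(name.strip(), value.strip(), priority=40)

            # Named options are shortcuts for specific settings and, for
            # simplicity, share the same priority; call order decides ties.
            if opts.loglevel:
                self.settings.set("LOG_LEVEL", opts.loglevel, priority=40)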

Each command with a custom ``run`` method should be modified to reflect the
newly refactored API (particularly the ``crawl`` command).


CrawlerProcess
==============

``CrawlerProcess`` should be removed because the Scrapy ``crawl`` command no
longer supports running multiple spiders. The preferred way of doing that is
using the API manually, instantiating a separate ``Crawler`` for each spider,
so ``CrawlerProcess`` has lost its utility.

This change is not directly related to the project (it's not focused on
settings, but it fits in the API clean-up task); still, it's a great
opportunity to consider it since we're changing the crawl startup flow.

This class will be deleted and its attributes and methods will be merged into
``Crawler``. To that effect, these are the specific merges and removals (a
sketch of the resulting ``Crawler`` follows the list):

- ``self.crawlers`` doesn't make sense in this new setup; each reference will
  be replaced with ``self``.

- ``create_crawler`` will become the ``__init__`` of ``Crawler``.

- ``_start_crawler`` will be merged into ``Crawler.start``.

- ``start`` will be merged into ``Crawler.crawl``, but the latter will need an
  extra boolean parameter, ``start_reactor`` (default: ``True``), to crawl
  with or without starting the Twisted reactor (this is required in
  ``commands.shell`` in order to start the reactor in another thread).
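
A rough sketch of the merged ``Crawler`` under these assumptions; component
setup and engine start-up are elided, and the method bodies are simplified
placeholders rather than the actual implementation:

.. code-block:: python

    from twisted.internet import defer, reactor


    class Crawler:
        def __init__(self, spidercls, settings):
            # Formerly CrawlerProcess.create_crawler: a crawler is now built
            # around a single spider class and a frozen SettingsReader.
            self.spidercls = spidercls
            self.settings = settings
            self.configure()

        def configure(self):
            # Set up extensions, middlewares and the engine from
            # self.settings (elided).
            pass

        def crawl(self, start_reactor=True, **spider_kwargs):
            # Spiders are created through from_crawler, with crawler access.
            self.spider = self.spidercls.from_crawler(self, **spider_kwargs)
            d = self.start()
            if start_reactor:
                # commands.shell passes start_reactor=False and starts the
                # reactor in another thread instead.
                reactor.run()
            return d

        def start(self):
            # Formerly CrawlerProcess._start_crawler: start the engine and
            # begin crawling self.spider (elided).
            return defer.succeed(None)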


Startup process
===============

This summarizes the current and the newly proposed mechanisms for starting up
a Scrapy crawler. Imports and non-representative functions are omitted for
brevity.


Current (old) startup process
-----------------------------

::

    # execute in cmdline

    # loads settings.py, returns CrawlerSettings(settings_module)
    settings = get_project_settings()
    settings.defaults.update(cmd.default_settings)

    cmd.crawler_process = CrawlerProcess(settings)
    cmd.run # (In a _run_print_help call)

        # Command.run in commands/crawl.py

        crawler = self.crawler_process.create_crawler()
        spider = crawler.spiders.create(spider_name, **spider_kwargs)
        crawler.crawl(spider)
        self.crawler_process.start() # starts crawling spider

            # CrawlerProcess._start_crawler in crawler.py

            crawler.configure()

Proposed (new) startup process
------------------------------

::

    # execute in cmdline

    smcls = get_spider_manager_class_from_scrapycfg()
    sm = smcls() # loads spiders from module defined in scrapy.cfg
    spidercls = sm.load(spider_name) # returns spider class, not instance

    settings = get_project_settings() # loads settings.py
    settings.setdict(cmd.default_settings, priority=40)

    settings.setdict(spidercls.custom_settings(), priority=30)

    settings = settings.freeze()
    cmd.crawler = Crawler(spidercls, settings=settings)

        # Crawler.__init__ in crawler.py

        self.configure()

    cmd.run # (In a _run_print_help call)

        # Command.run in commands/crawl.py

        self.crawler.crawl(**spider_kwargs)

            # Crawler.crawl in crawler.py

            spider = self.spidercls.from_crawler(self, **spider_kwargs)
        # starts crawling spider