File: sep-016.rst

=======  =============================
SEP      16
Title    Leg Spider
Author   Insophia Team
Created  2010-06-03
Status   Superseded by :doc:`sep-018`
=======  =============================

===================
SEP-016: Leg Spider
===================

This SEP introduces a new kind of spider, called ``LegSpider``, which provides
modular functionality that can be plugged into different spiders.

Rationale
=========

The purpose of Leg Spiders is to define an architecture for building spiders
from smaller, well-tested components (aka Legs) that can be combined to
achieve the desired functionality. These reusable components will benefit all
Scrapy users by forming a repository of legs that can be shared among
different spiders and projects. Some of them will come bundled with Scrapy.

Legs can themselves be combined with sub-legs, in a hierarchical fashion.
Legs are also spiders in their own right, hence the name "Leg Spider".

``LegSpider`` API
=================

A ``LegSpider`` is a ``BaseSpider`` subclass that adds the following attributes and methods:

- ``legs``
   - legs composing this spider
- ``process_response(response)``
   - Process a (downloaded) response and return a list of requests and items
- ``process_request(request)``
   - Process a request after it has been extracted and before returning it from
     the spider
- ``process_item(item)``
   - Process an item after it has been extracted and before returning it from
     the spider
- ``set_spider(spider)``
   - Sets the main spider associated with this Leg Spider, which is often
     used to configure the Leg Spider behavior.
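
As a rough illustration of these hooks, a custom leg could look like the
following sketch (``TagItemsLeg`` and the ``source_spider`` field are
hypothetical names made up for this example, and assume the item declares
such a field):

::

   #!python
   class TagItemsLeg(LegSpider):

       def process_item(self, item):
           # annotate every item extracted by the spider this leg is plugged into
           item['source_spider'] = self.spider.name
           return item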

How Leg Spiders work
====================

1. Each Leg Spider has zero or more Leg Spiders associated with it. When a
   response arrives, the Leg Spider processes it with its own
   ``process_response`` method and also with the ``process_response`` method of
   all its "sub leg spiders". Finally, the output of all of them is combined to
   produce the final aggregated output.
2. Each element of the aggregated output of ``process_response`` is processed
   with either ``process_item`` or ``process_request`` before being returned
   from the spider. As with ``process_response``, each item/request is
   processed with the ``process_item``/``process_request`` methods of all the
   leg spiders composing the spider, and also with those of the spider itself,
   as sketched below.
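
The following sketch summarizes this flow for a single response
(``handle_response`` is a hypothetical helper used only for illustration;
``BaseItem`` is ``scrapy.item.BaseItem``). The actual mechanism is shown in
the proof-of-concept implementation at the end of this document:

::

   #!python
   from scrapy.item import BaseItem


   def handle_response(spider, response):
       # step 1: aggregate the process_response output of the spider and its legs
       results = []
       for leg in [spider] + list(spider.legs):
           results.extend(leg.process_response(response) or [])

       # step 2: run each element through every leg's process_item or
       # process_request hook before it leaves the spider
       for result in results:
           hook = 'process_item' if isinstance(result, BaseItem) else 'process_request'
           for leg in [spider] + list(spider.legs):
               result = getattr(leg, hook)(result)
           yield result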

Leg Spider examples
===================

Regex (HTML) Link Extractor
---------------------------

A typical application of Leg Spiders is to build link extractors. For example:

::

   #!python
   class RegexHtmlLinkExtractor(LegSpider):

       def process_response(self, response):
           if isinstance(response, HtmlResponse):
               allowed_regexes = self.spider.url_regexes_to_follow
               # extract urls to follow using allowed_regexes
               return [Request(x) for x in urls_to_follow]
           return []

   class MySpider(LegSpider):

       legs = [RegexHtmlLinkExtractor()]
       url_regexes_to_follow = ['/product.php?.*']

       def process_response(self, response):
           # parse response and extract items
           return items

RSS2 link extractor
-------------------

This is a Leg Spider that can be used for following links from RSS2 feeds.

::

   #!python
   class Rss2LinkExtractor(LegSpider):

       def process_response(self, response):
           if response.headers.get('Content-type') == 'application/rss+xml':
               xs = XmlXPathSelector(response)
               urls = xs.select("//item/link/text()").extract()
               return [Request(x) for x in urls]
           return []

Callback dispatcher based on rules
----------------------------------

Another example could be to build a callback dispatcher based on rules:

::

   #!python
   class CallbackRules(LegSpider):

       def __init__(self, *a, **kw):
           super(CallbackRules, self).__init__(*a, **kw)
           self._rules = {}
           for regex, method_name in self.spider.callback_rules.items():
               r = re.compile(regex)
               m = getattr(self.spider, method_name, None)
               if m:
                   self._rules[r] = m

       def process_response(self, response):
           for regex, method in self._rules.items():
               m = regex.search(response.url)
               if m:
                   return method(response)
           return []

   class MySpider(LegSpider):

       legs = [CallbackRules()]
       callback_rules = {
           '/product.php.*': 'parse_product',
           '/category.php.*': 'parse_category',
       }

       def parse_product(self, response):
           # parse response and populate item
           return item

URL Canonicalizers
------------------

Another example could be a leg for building URL canonicalizers:

::

   #!python
   class CanonicalizeUrl(LegSpider):

       def process_request(self, request):
           curl = canonicalize_url(request.url, rules=self.spider.canonicalization_rules)
           return request.replace(url=curl)

   class MySpider(LegSpider):

       legs = [CanonicalizeUrl()]
       canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

       # ...

Setting item identifier
-----------------------

Another example could be a leg that sets a unique identifier on items, based
on certain fields:

::

   #!python
   class ItemIdSetter(LegSpider):

       def process_item(self, item):
           id_field = self.spider.id_field
           id_fields_to_hash = self.spider.id_fields_to_hash
           item[id_field] = make_hash_based_on_fields(item, id_fields_to_hash)
           return item

   class MySpider(LegSpider):

       legs = [ItemIdSetter()]
       id_field = 'guid'
       id_fields_to_hash = ['supplier_name', 'supplier_id']

       def process_response(self, response):
           # extract item from response
           return item

Combining multiple leg spiders
------------------------------

Here's an example that combines functionality from multiple leg spiders:

::

   #!python
   class MySpider(LegSpider):

       legs = [RegexHtmlLinkExtractor(), CallbackRules(), CanonicalizeUrl(), ItemIdSetter()]

       url_regexes_to_follow = ['/product.php?.*']

       callback_rules = {
           '/product.php.*': 'parse_product',
           '/category.php.*': 'parse_category',
       }

       canonicalization_rules = ['sort-query-args', 'normalize-percent-encoding', ...]

       id_field = 'guid'
       id_fields_to_hash = ['supplier_name', 'supplier_id']

       def parse_product(self, response):
           # extract item from response
           return item

       def parse_category(self, response):
           # extract item from response
           return item

Leg Spiders vs Spider middlewares
=================================

A common question is when one should use Leg Spiders and when to use Spider
middlewares. Leg Spiders are meant to implement spider-specific functionality,
such as link extraction with custom rules per spider. Spider middlewares, on
the other hand, are meant to implement global functionality.

When not to use Leg Spiders
===========================

Leg Spiders are not a silver bullet to implement all kinds of spiders, so it's
important to keep in mind their scope and limitations, such as:

- Leg Spiders can't filter duplicate requests, since they don't have access to
  all requests at the same time. This functionality should be implemented in a
  spider or scheduler middleware.
- Leg Spiders are meant to be used for spiders whose behavior (requests & items
  to extract) depends only on the current page and not previously crawled pages
  (aka. "context-free spiders"). If your spider has some custom logic with
  chained downloads (for example, multi-page items) then Leg Spiders may not be
  a good fit.

``LegSpider`` proof-of-concept implementation
=============================================

Here's a proof-of-concept implementation of ``LegSpider``:

::

   #!python
   from scrapy.http import Request
   from scrapy.item import BaseItem
   from scrapy.spider import BaseSpider
   from scrapy.utils.spider import iterate_spider_output


   class LegSpider(BaseSpider):
       """A spider made of legs"""

       legs = []

       def __init__(self, *args, **kwargs):
           super(LegSpider, self).__init__(*args, **kwargs)
           # the spider itself acts as its own first leg
           self._legs = [self] + self.legs[:]
           for leg in self._legs:
               leg.set_spider(self)

       def parse(self, response):
           # aggregate the output of all legs, then post-process each element
           res = self._process_response(response)
           for r in res:
               if isinstance(r, BaseItem):
                   yield self._process_item(r)
               else:
                   yield self._process_request(r)

       def process_response(self, response):
           return []

       def process_request(self, request):
           return request

       def process_item(self, item):
           return item

       def set_spider(self, spider):
           self.spider = spider

       def _process_response(self, response):
           # combine the process_response output of every leg (including self)
           res = []
           for leg in self._legs:
               res.extend(iterate_spider_output(leg.process_response(response)))
           return res

       def _process_request(self, request):
           # run the request through every leg's process_request hook
           for leg in self._legs:
               request = leg.process_request(request)
           return request

       def _process_item(self, item):
           # run the item through every leg's process_item hook
           for leg in self._legs:
               item = leg.process_item(item)
           return item
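
As a usage sketch, a project spider built on this implementation would only
declare its legs and their per-spider configuration (this example reuses the
``ItemIdSetter`` leg from above; the spider name, URLs and field names are
made up):

::

   #!python
   class ExampleSpider(LegSpider):

       name = 'example.com'
       start_urls = ['http://www.example.com/']

       legs = [ItemIdSetter()]
       id_field = 'guid'
       id_fields_to_hash = ['supplier_name', 'supplier_id']

       def process_response(self, response):
           # extract and return items and/or requests from this response
           return []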