=======  ==============================
SEP      12
Title    Spider name
Author   Ismael Carnales, Pablo Hoffman
Created  2009-12-01
Updated  2010-03-23
Status   Final
=======  ==============================

====================
SEP-012: Spider name
====================

Spiders are currently referenced by their ``domain_name`` attribute. This SEP
proposes adding a ``name`` attribute to spiders and using it as their
identifier.

Current limitations and flaws
=============================

1. You can't create two spiders that scrape the same domain (without using
   workarounds like assigning an arbitrary ``domain_name`` and putting the
   real domains in the ``extra_domain_names`` attribute).
2. For spiders with multiple domains, you have to specify them in two different
   places: ``domain_name`` and ``extra_domain_names``.

Proposed changes
================

1. Add a ``name`` attribute to spiders and use it as their unique identifier.
2. Merge the ``domain_name`` and ``extra_domain_names`` attributes into a single
   list, ``allowed_domains``.
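
As a rough illustration of the second change, the sketch below folds the two
legacy attributes into one list. ``merge_domains`` and both spider classes are
hypothetical names invented for this example; they are not part of Scrapy, and
the de-duplication rule is an assumption.

```python
def merge_domains(domain_name, extra_domain_names=()):
    """Combine the legacy attributes into one ordered list without duplicates."""
    merged = []
    for domain in [domain_name, *extra_domain_names]:
        # Skip empty values and domains that were already listed.
        if domain and domain not in merged:
            merged.append(domain)
    return merged


class LegacySpider:
    # Old style: domains are declared in two different places.
    domain_name = "example.com"
    extra_domain_names = ["example.org", "example.com"]


class RenamedSpider:
    # Proposed style: an identifier plus a single list of domains.
    name = "example"
    allowed_domains = merge_domains(
        LegacySpider.domain_name, LegacySpider.extra_domain_names
    )
```

With this shape, the identifier no longer has to be a domain, so it can be
chosen freely.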

Implications of the changes
===========================

General
-------

In general, all references to ``spider.domain_name`` will be replaced by
``spider.name``.
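
To make the effect concrete, here is a minimal sketch of a lookup table keyed
by ``name``. ``SpiderRegistry`` and the two spider classes are hypothetical and
only illustrate the idea; the SEP does not prescribe how spiders are stored or
located.

```python
class SpiderRegistry:
    """Toy registry that identifies spiders by their ``name`` attribute."""

    def __init__(self):
        self._spiders = {}

    def register(self, spider_cls):
        name = spider_cls.name
        # The name is the unique identifier, so duplicates are rejected.
        if name in self._spiders:
            raise ValueError(f"duplicate spider name: {name!r}")
        self._spiders[name] = spider_cls
        return spider_cls

    def get(self, name):
        return self._spiders[name]

    def names(self):
        return sorted(self._spiders)


registry = SpiderRegistry()


@registry.register
class GoogleSearchSpider:
    name = "google-search"
    allowed_domains = ["google.com"]


@registry.register
class GoogleNewsSpider:
    name = "google-news"
    allowed_domains = ["google.com"]
```

Both spiders target the same domain, which the ``domain_name`` identifier did
not allow without workarounds.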

OffsiteMiddleware
-----------------

``OffsiteMiddleware`` will use ``spider.allowed_domains`` to determine the
domain names of a spider.
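
The check this implies might look like the sketch below. It is not the actual
``OffsiteMiddleware`` code: ``is_allowed`` is a hypothetical helper, and
treating subdomains as on-site is an assumption that the SEP does not state.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        # Assumption: "www.google.com" counts as part of "google.com".
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )
```

A request to ``http://www.google.com/search`` would pass for a spider with
``allowed_domains = ["google.com"]``, while ``http://notgoogle.com/`` would not.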

scrapy-ctl.py
-------------

crawl
~~~~~

The new syntax for the ``crawl`` command will be:

::

   crawl [options] <spider|url> ...

If you provide a URL, the command will try to find the spider that processes
it. If no spider or more than one spider is found, it will raise an error. To
crawl in those cases, you must specify the spider with the ``--spider``
option.
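
The lookup rule can be sketched as follows. The SEP does not say how a spider
is matched to a URL, so matching against ``allowed_domains`` is an assumption,
and ``find_spider_for_url``, ``SpiderLookupError`` and the sample spiders are
hypothetical names. The ``forced_name`` argument stands in for the value of the
``--spider`` option.

```python
from urllib.parse import urlparse


class SpiderLookupError(Exception):
    """Raised when a URL maps to no spider or to more than one (hypothetical)."""


def find_spider_for_url(url, spiders, forced_name=None):
    # An explicit name bypasses the URL lookup entirely.
    if forced_name is not None:
        for spider in spiders:
            if spider.name == forced_name:
                return spider
        raise SpiderLookupError(f"no spider named {forced_name!r}")

    host = (urlparse(url).hostname or "").lower()
    matches = [
        spider
        for spider in spiders
        if any(
            host == domain or host.endswith("." + domain)
            for domain in getattr(spider, "allowed_domains", [])
        )
    ]
    # Zero matches and ambiguous matches are both errors.
    if len(matches) != 1:
        raise SpiderLookupError(f"{len(matches)} spiders found for {url!r}")
    return matches[0]


class GoogleSpider:
    name = "google"
    allowed_domains = ["google.com"]


class GoogleNewsSpider:
    name = "google-news"
    allowed_domains = ["google.com"]


class ExampleSpider:
    name = "example"
    allowed_domains = ["example.com"]


SPIDERS = [GoogleSpider, GoogleNewsSpider, ExampleSpider]
```

Here a URL on ``example.com`` resolves to a single spider, while a URL on
``google.com`` is ambiguous and needs ``forced_name`` to pick one.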

genspider
~~~~~~~~~

The new signature for ``genspider`` will be:

::

   genspider [options] <name> <domain>

Example:

::

   $ scrapy-ctl genspider google google.com

   $ ls project/spiders/
   project/spiders/google.py

   $ cat project/spiders/google.py

.. code-block:: python

   class GooglecomSpider(BaseSpider):
       name = "google"
       allowed_domains = ["google.com"]

.. note:: ``spider.allowed_domains`` becomes optional, as only ``OffsiteMiddleware`` uses it.