=======  ================
SEP      20
Title    Bulk Item Loader
Author   Steven Almeroth
Created  2012-02-24
Status   Draft
=======  ================

.. note:: This SEP has been migrated from the Wiki.

=========================
SEP-020: Bulk Item Loader
=========================

Introduction
============

Just as Item Loaders "provide a convenient mechanism for populating scraped
Items", the Bulk Item Loader provides a convenient mechanism for populating
Item Loaders.

Rationale
=========

Certain markup patterns lend themselves well to automated parsing. For example,
the ``<table>`` tag maps naturally onto a database table: the embedded ``<tr>``
elements denote the rows, and the ``<td>`` elements nested inside them denote
the individual fields.

One pattern that is particularly well suited for auto-populating an Item Loader
is the `definition list <https://www.w3.org/TR/html401/struct/lists.html#h-10.3>`_::

    <div class="geeks">
        <dl>
            <dt> hacker
            <dd> a clever programmer

            <dt> nerd
            <dd> technically bright but socially inept person
        </dl>
    </div>

Within the ``<dl>`` each ``<dt>`` would contain the Field *name*
and the following ``<dd>`` would contain the Field *value*.
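
The pairing can be sketched outside Scrapy with only the standard library.
This is an illustration under stated assumptions, not part of the proposal:
``DefinitionListParser``, ``parse_definition_list`` and ``SAMPLE`` are
hypothetical names, and the markup is the example above.

.. code-block:: python

    from html.parser import HTMLParser

    SAMPLE = """
    <div class="geeks">
        <dl>
            <dt> hacker
            <dd> a clever programmer

            <dt> nerd
            <dd> technically bright but socially inept person
        </dl>
    </div>
    """


    class DefinitionListParser(HTMLParser):
        """Collect the text of each <dt> and of the <dd> that follows it."""

        def __init__(self):
            super().__init__()
            self.pairs = []  # [name, value] lists in document order
            self._target = None  # 0 while inside <dt>, 1 while inside <dd>

        def handle_starttag(self, tag, attrs):
            if tag == "dt":
                # A new term starts a new name/value pair
                self.pairs.append(["", ""])
                self._target = 0
            elif tag == "dd" and self.pairs:
                self._target = 1

        def handle_endtag(self, tag):
            # End tags are optional for <dt>/<dd>, so </dl> also stops collection
            if tag in ("dt", "dd", "dl"):
                self._target = None

        def handle_data(self, data):
            if self._target is not None:
                self.pairs[-1][self._target] += data


    def parse_definition_list(html):
        """Return a dict mapping each stripped <dt> text to its <dd> text."""
        parser = DefinitionListParser()
        parser.feed(html)
        parser.close()
        return {name.strip(): value.strip() for name, value in parser.pairs}


    print(parse_definition_list(SAMPLE))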

How it works
============

Without a bulk loader, a programmer has to hardcode every entry that is needed.
With the bulk loader, only a seed point is required: an XPath that locates the
enclosing pattern, such as the ``<dl>`` above.

Before
------

.. code-block:: python

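    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import Compose, TakeFirst

    # Every field needs its own hardcoded add_xpath() call
    xpath = '//div[@class="geeks"]/dl/dt[contains(text(),"%s")]/following-sibling::dd[1]//text()'
    gl = XPathItemLoader(response=response, item=dict())
    gl.default_output_processor = Compose(TakeFirst(), lambda v: v.strip())
    gl.add_xpath("hacker", xpath % "hacker")
    gl.add_xpath("nerd", xpath % "nerd")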

After
-----

.. code-block:: python

    bil = BulkItemLoader(response=response)
    bil.parse_dl('//div[@class="geeks"]/dl')

Code Proposal
=============

This is a working code sample that covers just the basics.

.. code-block:: python

    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import MapCompose


    class BulkItemLoader(XPathItemLoader):
        """Item loader based on specified pattern recognition"""

        default_item_class = dict
        base_xpath = "//body"
        ignore = ()

        def _get_label(self, entity):
            """Pull the text label out of selected markup

            :param entity: Found markup
            :type entity: Selector
            """
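            label = " ".join(entity.xpath(".//text()").extract())
            # Replace non-ASCII characters with XML character references
            label = label.encode("ascii", "xmlcharrefreplace") if label else ""
            # Use replace(), not strip(): strip("&#160;") would remove any of
            # the characters "&", "#", "1", "6", "0", ";" from both ends
            label = label.replace("&#160;", " ")
            # Trim whitespace first so that a colon followed by a space is removed
            label = label.strip().strip(":").strip()
            return label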

        def _get_entities(self, xpath):
            """Retrieve the list of selectors for a given sub-pattern

            :param xpath: The xpath to select
            :type xpath: String
            :return: The list of selectors
            :rtype: list
            """
            return self.selector.xpath(self.base_xpath + xpath)

        def parse_dl(self, xpath="//dl"):
            """Look for the specified definition list pattern and store all found
            values for the enclosed terms and descriptions.

            :param xpath: The xpath to select
            :type xpath: String
            """
            for term in self._get_entities(xpath + "/dt"):
                label = self._get_label(term)
                if label and label not in self.ignore:
                    value = term.xpath("following-sibling::dd[1]//text()")
                    if value:
                        self.add_value(
                            label, value.extract(), MapCompose(lambda v: v.strip())
                        )
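
One detail worth care when normalizing labels, as ``_get_label`` does:
``str.strip()`` treats its argument as a set of characters rather than as a
substring, and it only touches the ends of the string. The sketch below shows
the difference on plain strings; the ``normalize_label`` helper and the sample
labels are hypothetical and not part of the proposal.

.. code-block:: python

    def normalize_label(label):
        """Remove non-breaking-space references and surrounding colons."""
        label = label.replace("&#160;", " ")
        return label.strip().strip(":").strip()


    # strip() with an argument removes any of those characters from both ends,
    # so the leading digits of the label are lost as well
    print(repr("100 g&#160;".strip("&#160;")))  # ' g'

    # replace() removes only the exact substring
    print(repr(normalize_label("100 g&#160;")))  # '100 g'

    # Stripping whitespace first lets the trailing colon be removed
    print(repr(normalize_label("Name: ")))  # 'Name'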

Example Spider
==============

This spider uses the bulk loader above.

Spider code
-----------

.. code-block:: python

    from pprint import pprint

    from scrapy.spider import BaseSpider
    from scrapy.contrib.loader.bulk import BulkItemLoader


    class W3cSpider(BaseSpider):
        name = "w3c"
        allowed_domains = ["w3.org"]
        start_urls = ("http://www.w3.org/TR/html401/struct/lists.html",)

        def parse(self, response):
            # Seed the loader with an XPath for the definition list to parse
            el = BulkItemLoader(response=response)
            el.parse_dl("//dl[2]")
            item = el.load_item()

            pprint(item)

Log Output
----------

::

    2012-11-19 14:21:22-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapy-loader)
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats, HttpCacheMiddleware
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled item pipelines:
    2012-11-19 14:21:22-0600 [w3c] INFO: Spider opened
    2012-11-19 14:21:22-0600 [w3c] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-11-19 14:21:22-0600 [w3c] DEBUG: Crawled (200) <GET http://www.w3.org/TR/html401/struct/lists.html> (referer: None) ['cached']
    {'Notes': [u'The recipe may be improved by adding raisins.'],
     'The ingredients': [u'',
                         u'100 g. flour',
                         u'',
                         u'10 g. sugar',
                         u'',
                         u'1 cup water',
                         u'',
                         u'2 eggs',
                         u'',
                         u'salt, pepper',
                         u''],
     'The procedure': [u'',
                       u'Mix dry ingredients thoroughly.',
                       u'',
                       u'Pour in wet ingredients.',
                       u'',
                       u'Mix for 10 minutes.',
                       u'',
                       u'Bake for one hour at 300 degrees.',
                       u'']}
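
The empty strings in the output above appear to come from text nodes that held
only whitespace before stripping. A minimal sketch of dropping them in plain
Python follows; the ``clean_values`` helper and the sample input are
hypothetical and only approximate the extracted values.

.. code-block:: python

    def clean_values(values):
        """Strip each text node and drop those that were whitespace only."""
        stripped = (value.strip() for value in values)
        return [value for value in stripped if value]


    # Hypothetical raw text nodes, similar in shape to the ones extracted above
    raw = ["\n", "100 g. flour", "\n", " 10 g. sugar ", "\n"]
    print(clean_values(raw))  # ['100 g. flour', '10 g. sugar']

Inside a loader, a similar effect might be achieved by having the function
passed to ``MapCompose`` return ``None`` for blank values, since ``MapCompose``
skips ``None`` results.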

Notes
=====

Other parsers could also be dropped in, such as:

* ``parse_table()`` with column designations for key and value,
* ``parse_ul()`` with a key/value separator designation,
* ``parse_ol()`` with a key/value separator designation,
* ``parse()`` with a designation for key/value tags.
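
As an illustration of the first idea, here is a standalone sketch of a table
parser with column designations, using only the standard library. The
``TableRowParser`` class, the ``parse_table`` signature and its ``key_col`` and
``value_col`` parameters are assumptions made for this example; the proposal
does not define them.

.. code-block:: python

    from html.parser import HTMLParser


    class TableRowParser(HTMLParser):
        """Collect the cell text of every <tr> as a list of strings."""

        def __init__(self):
            super().__init__()
            self.rows = []
            self._in_cell = False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self.rows.append([])
            elif tag in ("td", "th") and self.rows:
                # Start a new, empty cell in the current row
                self.rows[-1].append("")
                self._in_cell = True

        def handle_endtag(self, tag):
            if tag in ("td", "th", "tr", "table"):
                self._in_cell = False

        def handle_data(self, data):
            if self._in_cell:
                self.rows[-1][-1] += data


    def parse_table(html, key_col=0, value_col=1):
        """Map the text of column key_col to the text of column value_col."""
        parser = TableRowParser()
        parser.feed(html)
        parser.close()
        fields = {}
        for row in parser.rows:
            # Skip rows that are too short to hold both designated columns
            if len(row) > max(key_col, value_col):
                key = row[key_col].strip().strip(":").strip()
                if key:
                    fields[key] = row[value_col].strip()
        return fields


    html = (
        "<table>"
        "<tr><td>Name:</td><td> Ada </td></tr>"
        "<tr><td>Born</td><td>1815</td><td>London</td></tr>"
        "</table>"
    )
    print(parse_table(html))  # {'Name': 'Ada', 'Born': '1815'}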

This touches on the subject of *embedded intelligence*: with a little
bootstrapping to indicate what goes where, a general parser could go out and
grab all of the patterns above.