File: sep-020.rst

package info (click to toggle)
python-scrapy 2.14.2-1
links: PTS, VCS
area: main
in suites: forky, sid
size: 6,332 kB
sloc: python: 55,629; xml: 199; makefile: 25; sh: 7
file content (207 lines) | stat: -rw-r--r-- 7,167 bytes
parent folder | download | duplicates (2)
=======  ================
SEP      20
Title    Bulk Item Loader
Author   Steven Almeroth
Created  2012-02-24
Status   Draft
=======  ================

.. note: this SEP has been migrated from the Wiki.

=========================
SEP-020: Bulk Item Loader
=========================

Introduction
============

Just as Item Loaders "provide a convenient mechanism for populating scraped
Items", the Bulk Item Loader provides a convenient mechanism for populating
Item Loaders.

Rationale
=========

There are certain markup patterns that lend themselves quite nicely to
automated parsing, for example the ``<table>`` tag outlines such a pattern
for populating a database table with the embedded ``<tr>`` elements denoting
the rows and the further embedded ``<td>`` elements denoting the individual
fields.

One pattern that is particularly well suited for auto-populating an Item Loader
is the `definition list <https://www.w3.org/TR/html401/struct/lists.html#h-10.3>`_::

    <div class="geeks">
        <dl>
            <dt> hacker
            <dd> a clever programmer

            <dt> nerd
            <dd> technically bright but socially inept person
        </dl>
    </div>

Within the ``<dl>`` each ``<dt>`` would contain the Field *name*
and the following ``<dd>`` would contain the Field *value*.

How it works
============

Without a bulk loader a programmer needs to specifically hardcode all the entries
that are needed.  With the bulk loader on the other hand, just a seed point is
required.

Before
------

.. code-block:: python

    xpath = '//div[@class="geeks"]/dl/dt[contains(text(),"%s")]/following-sibling::dd[1]//text()'
    gl = XPathItemLoader(response=response, item=dict())
    gl.default_output_processor = Compose(TakeFirst(), lambda v: v.strip())
    gl.add_xpath("hacker", xpath % "hacker")
    gl.add_xpath("nerd", xpath % "nerd")

After
-----

.. code-block:: python

    bil = BulkItemLoader(response=response)
    bil.parse_dl('//div[@class="geeks"]/dl')

Code Proposal
=============

This is a working code sample that covers just the basics.

.. code-block:: python

    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import MapCompose


    class BulkItemLoader(XPathItemLoader):
        """Item loader based on specified pattern recognition"""

        default_item_class = dict
        base_xpath = "//body"
        ignore = ()

        def _get_label(self, entity):
            """Pull the text label out of selected markup

            :param entity: Found markup
            :type entity: Selector
            """
            label = " ".join(entity.xpath(".//text()").extract())
            label = label.encode("ascii", "xmlcharrefreplace") if label else ""
            label = label.strip("&#160;") if "&#160;" in label else label
            label = label.strip(":") if ":" in label else label
            label = label.strip()
            return label

        def _get_entities(self, xpath):
            """Retrieve the list of selectors for a given sub-pattern

            :param xpath: The xpath to select
            :type xpath: String
            :return: The list of selectors
            :rtype: list
            """
            return self.selector.xpath(self.base_xpath + xpath)

        def parse_dl(self, xpath="//dl"):
            """Look for the specified definition list pattern and store all found
            values for the enclosed terms and descriptions.

            :param xpath: The xpath to select
            :type xpath: String
            """
            for term in self._get_entities(xpath + "/dt"):
                label = self._get_label(term)
                if label and label not in self.ignore:
                    value = term.xpath("following-sibling::dd[1]//text()")
                    if value:
                        self.add_value(
                            label, value.extract(), MapCompose(lambda v: v.strip())
                        )

Example Spider
==============

This spider uses the bulk loader above.

Spider code
-----------

.. code-block:: python

    from scrapy.spider import BaseSpider
    from scrapy.contrib.loader.bulk import BulkItemLoader


    class W3cSpider(BaseSpider):
        name = "w3c"
        allowed_domains = ["w3.org"]
        start_urls = ("http://www.w3.org/TR/html401/struct/lists.html",)

        def parse(self, response):
            el = BulkItemLoader(response=response)
            el.parse_dl("//dl[2]")
            item = el.load_item()

            from pprint import pprint

            pprint(item)

Log Output
----------

::

    2012-11-19 14:21:22-0600 [scrapy] INFO: Scrapy 0.17.0 started (bot: scrapy-loader)
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats, HttpCacheMiddleware
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Enabled item pipelines:
    2012-11-19 14:21:22-0600 [w3c] INFO: Spider opened
    2012-11-19 14:21:22-0600 [w3c] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-11-19 14:21:22-0600 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-11-19 14:21:22-0600 [w3c] DEBUG: Crawled (200) <GET http://www.w3.org/TR/html401/struct/lists.html> (referer: None) ['cached']
    {'Notes': [u'The recipe may be improved by adding raisins.'],
     'The ingredients': [u'',
                         u'100 g. flour',
                         u'',
                         u'10 g. sugar',
                         u'',
                         u'1 cup water',
                         u'',
                         u'2 eggs',
                         u'',
                         u'salt, pepper',
                         u''],
     'The procedure': [u'',
                       u'Mix dry ingredients thoroughly.',
                       u'',
                       u'Pour in wet ingredients.',
                       u'',
                       u'Mix for 10 minutes.',
                       u'',
                       u'Bake for one hour at 300 degrees.',
                       u'']}

Notes
=====

Other parsers can also be dropped in such as:

* ``parse_table ()`` with column designations for key and value,
* ``parse_ul ()`` with a key/value separator designation,
* ``parse_ol ()`` with a key/value separator designation,
* ``parse ()`` with a designation for key/value tags.

Actually this touches on the subject of *embedded intelligence* as it would
be possible, with a little bootstrapping for what goes where, for a general
parser to just go out and grab all of the above.