.. _topics-items:
=====
Items
=====
.. module:: scrapy.item
:synopsis: Item and Field classes
The main goal of scraping is to extract structured data from unstructured
sources, typically web pages. :ref:`Spiders <topics-spiders>` may return the
extracted data as `items`, Python objects that define key-value pairs.
Scrapy supports :ref:`multiple types of items <item-types>`. When you create an
item, you may use whichever type of item you want. When you write code that
receives an item, your code should :ref:`work for any item type
<supporting-item-types>`.
.. _item-types:
Item Types
==========
Scrapy supports the following types of items, via the `itemadapter`_ library:
:ref:`dictionaries <dict-items>`, :ref:`Item objects <item-objects>`,
:ref:`dataclass objects <dataclass-items>`, and :ref:`attrs objects <attrs-items>`.
.. _itemadapter: https://github.com/scrapy/itemadapter
.. _dict-items:
Dictionaries
------------
As an item type, :class:`dict` is convenient and familiar.
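For example, a spider callback can yield plain dicts as items. A minimal
sketch (the site and CSS selectors are only illustrative):

.. code-block:: python

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com"]

        def parse(self, response):
            # every dict yielded from a callback is treated as an item
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }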
.. _item-objects:
Item objects
------------
:class:`Item` provides a :class:`dict`-like API plus additional features that
make it the most feature-complete item type:
.. autoclass:: scrapy.Item
:members: copy, deepcopy, fields
:undoc-members:
:class:`Item` objects replicate the standard :class:`dict` API, including
its ``__init__`` method.
:class:`Item` allows defining field names, so that:
- :class:`KeyError` is raised when using undefined field names (i.e.
prevents typos going unnoticed)
- :ref:`Item exporters <topics-exporters>` can export all fields by
default even if the first scraped object does not have values for all
of them
:class:`Item` also allows defining field metadata, which can be used to
:ref:`customize serialization <topics-exporters-field-serialization>`.
:mod:`trackref` tracks :class:`Item` objects to help find memory leaks
(see :ref:`topics-leaks-trackrefs`).
Example:
.. code-block:: python
from scrapy.item import Item, Field
class CustomItem(Item):
one_field = Field()
another_field = Field()
.. _dataclass-items:
Dataclass objects
-----------------
.. versionadded:: 2.2
:func:`~dataclasses.dataclass` allows defining item classes with field names,
so that :ref:`item exporters <topics-exporters>` can export all fields by
default even if the first scraped object does not have values for all of them.
Additionally, ``dataclass`` items also allow you to:
* define the type and default value of each defined field.
* define custom field metadata through :func:`dataclasses.field`, which can be used to
  :ref:`customize serialization <topics-exporters-field-serialization>`; see the second example below.
Example:
.. code-block:: python
from dataclasses import dataclass
@dataclass
class CustomItem:
one_field: str
another_field: int
.. note:: Field types are not enforced at run time.
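For example, a sketch combining both points above: typed fields with default
values, and field metadata passed through :func:`dataclasses.field`. The
``serializer`` key follows the same convention used for ``Item`` fields;
any key that your own components understand works just as well:

.. code-block:: python

    from dataclasses import dataclass, field


    @dataclass
    class CustomItem:
        one_field: str = "default value"
        another_field: int = 0
        # metadata declared here is exposed to components such as item exporters
        last_updated: str = field(default="", metadata={"serializer": str})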
.. _attrs-items:
attr.s objects
--------------
.. versionadded:: 2.2
:func:`attr.s` allows defining item classes with field names,
so that :ref:`item exporters <topics-exporters>` can export all fields by
default even if the first scraped object does not have values for all of them.
Additionally, ``attr.s`` items also allow you to:
* define the type and default value of each defined field.
* define custom field :ref:`metadata <attrs:metadata>`, which can be used to
  :ref:`customize serialization <topics-exporters-field-serialization>`; see the second example below.
In order to use this type, the :doc:`attrs package <attrs:index>` needs to be installed.
Example:
.. code-block:: python
import attr
@attr.s
class CustomItem:
one_field = attr.ib()
another_field = attr.ib()
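Defaults and metadata work the same way through ``attr.ib()``. A brief sketch
(the ``serializer`` key is, again, just one possible metadata key):

.. code-block:: python

    import attr


    @attr.s
    class CustomItem:
        one_field = attr.ib(default="default value")
        # metadata declared on the attribute is exposed to other components
        another_field = attr.ib(default=0, metadata={"serializer": str})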
Working with Item objects
=========================
.. _topics-items-declaring:
Declaring Item subclasses
-------------------------
Item subclasses are declared using a simple class definition syntax and
:class:`Field` objects. Here is an example:
.. code-block:: python
import scrapy
class Product(scrapy.Item):
name = scrapy.Field()
price = scrapy.Field()
stock = scrapy.Field()
tags = scrapy.Field()
last_updated = scrapy.Field(serializer=str)
.. note:: Those familiar with `Django`_ will notice that Scrapy Items are
declared similarly to `Django Models`_, except that Scrapy Items are much
simpler as there is no concept of different field types.
.. _Django: https://www.djangoproject.com/
.. _Django Models: https://docs.djangoproject.com/en/dev/topics/db/models/
.. _topics-items-fields:
Declaring fields
----------------
:class:`Field` objects are used to specify metadata for each field. For
example, the serializer function for the ``last_updated`` field illustrated in
the example above.
You can specify any kind of metadata for each field. There is no restriction on
the values accepted by :class:`Field` objects. For this same
reason, there is no reference list of all available metadata keys. Each key
defined in :class:`Field` objects could be used by a different component, and
only those components know about it. You can also define and use any other
:class:`Field` key in your project, for your own needs. The main goal of
:class:`Field` objects is to provide a way to define all field metadata in one
place. Typically, those components whose behaviour depends on each field use
certain field keys to configure that behaviour. You must refer to their
documentation to see which metadata keys are used by each component.
It's important to note that the :class:`Field` objects used to declare the item
do not stay assigned as class attributes. Instead, they can be accessed through
the :attr:`~scrapy.Item.fields` attribute.
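For example, with the ``Product`` class declared above:

.. code-block:: pycon

    >>> Product.fields["last_updated"]
    {'serializer': <class 'str'>}
    >>> Product.fields["name"]  # no metadata was declared for this field
    {}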
.. autoclass:: scrapy.Field
The :class:`Field` class is just an alias to the built-in :class:`dict` class and
doesn't provide any extra functionality or attributes. In other words,
:class:`Field` objects are plain-old Python dicts. A separate class is used
to support the :ref:`item declaration syntax <topics-items-declaring>`
based on class attributes.
.. note:: Field metadata can also be declared for ``dataclass`` and ``attrs``
items. Please refer to the documentation for `dataclasses.field`_ and
`attr.ib`_ for additional information.
.. _dataclasses.field: https://docs.python.org/3/library/dataclasses.html#dataclasses.field
.. _attr.ib: https://www.attrs.org/en/stable/api-attr.html#attr.ib
Working with Item objects
-------------------------
Here are some examples of common tasks performed with items, using the
``Product`` item :ref:`declared above <topics-items-declaring>`. You will
notice the API is very similar to the :class:`dict` API.
Creating items
''''''''''''''
.. code-block:: pycon
>>> product = Product(name="Desktop PC", price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)
Getting field values
''''''''''''''''''''
.. code-block:: pycon
>>> product["name"]
Desktop PC
>>> product.get("name")
Desktop PC
>>> product["price"]
1000
>>> product["last_updated"]
Traceback (most recent call last):
...
KeyError: 'last_updated'
>>> product.get("last_updated", "not set")
not set
>>> product["lala"] # getting unknown field
Traceback (most recent call last):
...
KeyError: 'lala'
>>> product.get("lala", "unknown field")
'unknown field'
>>> "name" in product # is name field populated?
True
>>> "last_updated" in product # is last_updated populated?
False
>>> "last_updated" in product.fields # is last_updated a declared field?
True
>>> "lala" in product.fields # is lala a declared field?
False
Setting field values
''''''''''''''''''''
.. code-block:: pycon
>>> product["last_updated"] = "today"
>>> product["last_updated"]
today
>>> product["lala"] = "test" # setting unknown field
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Accessing all populated values
''''''''''''''''''''''''''''''
To access all populated values, just use the typical :class:`dict` API:
.. code-block:: pycon
    >>> list(product.keys())
    ['name', 'price']
    >>> list(product.items())
    [('name', 'Desktop PC'), ('price', 1000)]
.. _copying-items:
Copying items
'''''''''''''
To copy an item, you must first decide whether you want a shallow copy or a
deep copy.
If your item contains :term:`mutable` values like lists or dictionaries,
a shallow copy will keep references to the same mutable values across all
different copies.
For example, if you have an item with a list of tags, and you create a shallow
copy of that item, both the original item and the copy have the same list of
tags. Adding a tag to the list of one of the items will add the tag to the
other item as well.
If that is not the desired behavior, use a deep copy instead.
See :mod:`copy` for more information.
To create a shallow copy of an item, you can either call
:meth:`~scrapy.Item.copy` on an existing item
(``product2 = product.copy()``) or instantiate your item class from an existing
item (``product2 = Product(product)``).
To create a deep copy, call :meth:`~scrapy.Item.deepcopy` instead
(``product2 = product.deepcopy()``).
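For example, using the ``tags`` field of the ``Product`` item declared earlier:

.. code-block:: pycon

    >>> product = Product(name="Desktop PC", tags=["pc"])
    >>> shallow = product.copy()
    >>> deep = product.deepcopy()
    >>> product["tags"].append("desktop")
    >>> shallow["tags"]  # the shallow copy shares the same list object
    ['pc', 'desktop']
    >>> deep["tags"]  # the deep copy has its own list
    ['pc']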
Other common tasks
''''''''''''''''''
Creating dicts from items:
.. code-block:: pycon
    >>> dict(product)  # create a dict from all populated values
    {'name': 'Desktop PC', 'price': 1000}
Creating items from dicts:

.. code-block:: pycon

    >>> Product({"name": "Laptop PC", "price": 1500})
    Product(name='Laptop PC', price=1500)
>>> Product({"name": "Laptop PC", "lala": 1500}) # warning: unknown field in dict
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
Extending Item subclasses
-------------------------
You can extend Items (to add more fields or to change some metadata for some
fields) by declaring a subclass of your original Item.
For example:
.. code-block:: python
class DiscountedProduct(Product):
discount_percent = scrapy.Field(serializer=str)
discount_expiration_date = scrapy.Field()
You can also extend field metadata by using the previous field metadata and
appending more values, or changing existing values, like this:
.. code-block:: python
class SpecificProduct(Product):
name = scrapy.Field(Product.fields["name"], serializer=my_serializer)
That adds (or replaces) the ``serializer`` metadata key for the ``name`` field,
keeping all the previously existing metadata values.
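For instance, extending the ``last_updated`` field of ``Product``, which
already carries ``serializer=str``, with an additional, purely illustrative
``format`` key keeps the existing serializer:

.. code-block:: pycon

    >>> import scrapy
    >>> class SpecificProduct(Product):
    ...     last_updated = scrapy.Field(Product.fields["last_updated"], format="%Y-%m-%d")
    ...
    >>> SpecificProduct.fields["last_updated"]
    {'serializer': <class 'str'>, 'format': '%Y-%m-%d'}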
.. _supporting-item-types:
Supporting All Item Types
=========================
In code that receives an item, such as methods of :ref:`item pipelines
<topics-item-pipeline>` or :ref:`spider middlewares
<topics-spider-middleware>`, it is a good practice to use the
:class:`~itemadapter.ItemAdapter` class to write code that works for any
supported item type.
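For example, an :ref:`item pipeline <topics-item-pipeline>` that handles any
supported item type might look like this minimal sketch (the ``price`` field
and the multiplier are placeholders):

.. code-block:: python

    from itemadapter import ItemAdapter


    class AddVatPipeline:
        def process_item(self, item, spider):
            # ItemAdapter provides a common, dict-like interface over
            # dicts, Item subclasses, dataclass and attrs objects
            adapter = ItemAdapter(item)
            if adapter.get("price") is not None:
                adapter["price"] = adapter["price"] * 1.15  # illustrative VAT rate
            return item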
Other classes related to items
==============================
.. autoclass:: ItemMeta