File: tutorial.rst

.. _intro-tutorial:

===============
Scrapy Tutorial
===============

In this tutorial, we'll assume that Scrapy is already installed on your system.
If that's not the case, see :ref:`intro-install`.

We are going to scrape `quotes.toscrape.com <https://quotes.toscrape.com/>`_, a website
that lists quotes from famous authors.

This tutorial will walk you through these tasks:

1. Creating a new Scrapy project
2. Writing a :ref:`spider <topics-spiders>` to crawl a site and extract data
3. Exporting the scraped data using the command line
4. Changing the spider to recursively follow links
5. Using spider arguments

Scrapy is written in Python_. The more you learn about Python, the more you
can get out of Scrapy.

If you're already familiar with other languages and want to learn Python quickly, the
`Python Tutorial`_ is a good resource.

If you're new to programming and want to start with Python, the following books
may be useful to you:

* `Automate the Boring Stuff With Python`_

* `How To Think Like a Computer Scientist`_

* `Learn Python 3 The Hard Way`_

You can also take a look at `this list of Python resources for non-programmers`_,
as well as the `suggested resources in the learnpython-subreddit`_.

.. _Python: https://www.python.org/
.. _this list of Python resources for non-programmers: https://wiki.python.org/moin/BeginnersGuide/NonProgrammers
.. _Python Tutorial: https://docs.python.org/3/tutorial
.. _Automate the Boring Stuff With Python: https://automatetheboringstuff.com/
.. _How To Think Like a Computer Scientist: http://openbookproject.net/thinkcs/python/english3e/
.. _Learn Python 3 The Hard Way: https://learnpythonthehardway.org/python3/
.. _suggested resources in the learnpython-subreddit: https://www.reddit.com/r/learnpython/wiki/index#wiki_new_to_python.3F


Creating a project
==================

Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you'd like to store your code and run::

    scrapy startproject tutorial

This will create a ``tutorial`` directory with the following contents::

    tutorial/
        scrapy.cfg            # deploy configuration file

        tutorial/             # project's Python module, you'll import your code from here
            __init__.py

            items.py          # project items definition file

            middlewares.py    # project middlewares file

            pipelines.py      # project pipelines file

            settings.py       # project settings file

            spiders/          # a directory where you'll later put your spiders
                __init__.py


Our first Spider
================

Spiders are classes that you define and that Scrapy uses to scrape information from a website
(or a group of websites). They must subclass :class:`~scrapy.Spider` and define the initial
requests to be made, and optionally, how to follow links in pages and parse the downloaded
page content to extract data.

This is the code for our first Spider. Save it in a file named
``quotes_spider.py`` under the ``tutorial/spiders`` directory in your project:

.. code-block:: python

    from pathlib import Path

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        async def start(self):
            urls = [
                "https://quotes.toscrape.com/page/1/",
                "https://quotes.toscrape.com/page/2/",
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = f"quotes-{page}.html"
            Path(filename).write_bytes(response.body)
            self.log(f"Saved file {filename}")


As you can see, our Spider subclasses :class:`scrapy.Spider <scrapy.Spider>`
and defines some attributes and methods:

* :attr:`~scrapy.Spider.name`: identifies the Spider. It must be
  unique within a project, that is, you can't set the same name for different
  Spiders.

* :meth:`~scrapy.Spider.start`: must be an asynchronous generator that
  yields requests (and, optionally, items) for the spider to start crawling.
  Subsequent requests will be generated successively from these initial
  requests.

* :meth:`~scrapy.Spider.parse`: a method that will be called to handle
  the response downloaded for each of the requests made. The response parameter
  is an instance of :class:`~scrapy.http.TextResponse` that holds
  the page content and has further helpful methods to handle it.

  The :meth:`~scrapy.Spider.parse` method usually parses the response, extracting
  the scraped data as dicts and also finding new URLs to
  follow and creating new requests (:class:`~scrapy.Request`) from them.

How to run our spider
---------------------

To put our spider to work, go to the project's top level directory and run::

   scrapy crawl quotes

This command runs the spider named ``quotes`` that we've just added, which
will send some requests to the ``quotes.toscrape.com`` domain. You will get an output
similar to this::

    ... (omitted for brevity)
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
    2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
    2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/2/> (referer: None)
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
    2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
    2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
    ...

Now, check the files in the current directory. You should notice that two new
files have been created: *quotes-1.html* and *quotes-2.html*, with the content
for the respective URLs, as our ``parse`` method instructs.

.. note:: If you are wondering why we haven't parsed the HTML yet, hold
  on, we will cover that soon.


What just happened under the hood?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Scrapy sends the first :class:`scrapy.Request <scrapy.Request>` objects yielded
by the :meth:`~scrapy.Spider.start` spider method. Upon receiving a
response for each one, Scrapy calls the callback method associated with the
request (in this case, the ``parse`` method) with a
:class:`~scrapy.http.Response` object.


A shortcut to the ``start`` method
----------------------------------

Instead of implementing a :meth:`~scrapy.Spider.start` method that yields
:class:`~scrapy.Request` objects from URLs, you can define a
:attr:`~scrapy.Spider.start_urls` class attribute with a list of URLs. This
list will then be used by the default implementation of
:meth:`~scrapy.Spider.start` to create the initial requests for your
spider.

.. code-block:: python

    from pathlib import Path

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = f"quotes-{page}.html"
            Path(filename).write_bytes(response.body)

The :meth:`~scrapy.Spider.parse` method will be called to handle each
of the requests for those URLs, even though we haven't explicitly told Scrapy
to do so. This happens because :meth:`~scrapy.Spider.parse` is Scrapy's
default callback method, which is called for requests without an explicitly
assigned callback.
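
For instance, inside a callback or the ``start`` method, the two requests in
this quick sketch would be handled the same way (the page-3 URL is just an
illustrative example):

.. code-block:: python

    # The first request relies on Scrapy's default callback (parse);
    # the second names the same callback explicitly.
    yield scrapy.Request("https://quotes.toscrape.com/page/3/")
    yield scrapy.Request("https://quotes.toscrape.com/page/3/", callback=self.parse)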


Extracting data
---------------

The best way to learn how to extract data with Scrapy is by trying selectors
using the :ref:`Scrapy shell <topics-shell>`. Run::

    scrapy shell 'https://quotes.toscrape.com/page/1/'

.. note::

   Remember to always enclose URLs in quotes when running the Scrapy shell from
   the command line, otherwise URLs containing arguments (e.g. the ``&``
   character) will not work.

   On Windows, use double quotes instead::

       scrapy shell "https://quotes.toscrape.com/page/1/"

You will see something like::

    [ ... Scrapy log here ... ]
    2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
    [s]   item       {}
    [s]   request    <GET https://quotes.toscrape.com/page/1/>
    [s]   response   <200 https://quotes.toscrape.com/page/1/>
    [s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
    [s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser

Using the shell, you can try selecting elements using `CSS`_ with the response
object:

.. invisible-code-block: python

    response = load_response('https://quotes.toscrape.com/page/1/', 'quotes1.html')

.. code-block:: pycon

    >>> response.css("title")
    [<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running ``response.css('title')`` is a list-like object called
:class:`~scrapy.selector.SelectorList`, which represents a list of
:class:`~scrapy.Selector` objects that wrap around XML/HTML elements
and allow you to run further queries to refine the selection or extract the
data.

To extract the text from the title above, you can do:

.. code-block:: pycon

    >>> response.css("title::text").getall()
    ['Quotes to Scrape']

There are two things to note here: one is that we've added ``::text`` to the
CSS query, to mean that we want to select only the text directly inside the
``<title>`` element. If we don't specify ``::text``, we'd get the full title
element, including its tags:

.. code-block:: pycon

    >>> response.css("title").getall()
    ['<title>Quotes to Scrape</title>']

The other thing is that the result of calling ``.getall()`` is a list: it is
possible that a selector returns more than one result, so we extract them all.
When you know you just want the first result, as in this case, you can do:

.. code-block:: pycon

    >>> response.css("title::text").get()
    'Quotes to Scrape'

As an alternative, you could've written:

.. code-block:: pycon

    >>> response.css("title::text")[0].get()
    'Quotes to Scrape'

Accessing an index on a :class:`~scrapy.selector.SelectorList` instance will
raise an :exc:`IndexError` exception if there are no results:

.. code-block:: pycon

    >>> response.css("noelement")[0].get()
    Traceback (most recent call last):
    ...
    IndexError: list index out of range

You might want to use ``.get()`` directly on the
:class:`~scrapy.selector.SelectorList` instance instead, which returns ``None``
if there are no results:

.. code-block:: pycon

    >>> response.css("noelement").get()

There's a lesson here: for most scraping code, you want it to be resilient to
errors due to things not being found on a page, so that even if some parts fail
to be scraped, you can at least get **some** data.
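
You can also provide a fallback value of your own through the ``default``
argument:

.. code-block:: pycon

    >>> response.css("noelement").get(default="not-found")
    'not-found'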

Besides the :meth:`~scrapy.selector.SelectorList.getall` and
:meth:`~scrapy.selector.SelectorList.get` methods, you can also use
the :meth:`~scrapy.selector.SelectorList.re` method to extract using
:doc:`regular expressions <library/re>`:

.. code-block:: pycon

    >>> response.css("title::text").re(r"Quotes.*")
    ['Quotes to Scrape']
    >>> response.css("title::text").re(r"Q\w+")
    ['Quotes']
    >>> response.css("title::text").re(r"(\w+) to (\w+)")
    ['Quotes', 'Scrape']

In order to find the proper CSS selectors to use, you might find it useful to open
the response page from the shell in your web browser using ``view(response)``.
You can use your browser's developer tools to inspect the HTML and come up
with a selector (see :ref:`topics-developer-tools`).

`Selector Gadget`_ is also a nice tool for quickly finding the CSS selector of
visually selected elements, and it works in many browsers.

.. _Selector Gadget: https://selectorgadget.com/


XPath: a brief intro
^^^^^^^^^^^^^^^^^^^^

Besides `CSS`_, Scrapy selectors also support using `XPath`_ expressions:

.. code-block:: pycon

    >>> response.xpath("//title")
    [<Selector query='//title' data='<title>Quotes to Scrape</title>'>]
    >>> response.xpath("//title/text()").get()
    'Quotes to Scrape'

XPath expressions are very powerful, and are the foundation of Scrapy
Selectors. In fact, CSS selectors are converted to XPath under the hood. You
can see that if you closely read the text representation of the selector
objects in the shell.

While perhaps not as popular as CSS selectors, XPath expressions offer more
power because, besides navigating the structure, they can also look at the
content. Using XPath, you're able to select things like: *the link
that contains the text "Next Page"*. This makes XPath very fitting for the task
of scraping, and we encourage you to learn XPath even if you already know how to
construct CSS selectors; it will make scraping much easier.
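
For instance, here is a quick sketch of selecting a link by the text it
contains; the exact expression depends on the markup you are targeting:

.. code-block:: python

    # Grab the href of a link whose text contains "Next"; something that
    # CSS selectors alone cannot express.
    next_href = response.xpath('//a[contains(text(), "Next")]/@href').get()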

We won't cover much of XPath here, but you can read more about :ref:`using XPath
with Scrapy Selectors here <topics-selectors>`. To learn more about XPath, we
recommend `this tutorial to learn XPath through examples
<http://zvon.org/comp/r/tut-XPath_1.html>`_, and `this tutorial to learn "how
to think in XPath" <http://plasmasturm.org/log/xpath101/>`_.

.. _XPath: https://www.w3.org/TR/xpath-10/
.. _CSS: https://www.w3.org/TR/selectors

Extracting quotes and authors
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now that you know a bit about selection and extraction, let's complete our
spider by writing the code to extract the quotes from the web page.

Each quote in https://quotes.toscrape.com is represented by HTML elements that look
like this:

.. code-block:: html

    <div class="quote">
        <span class="text">“The world as we have created it is a process of our
        thinking. It cannot be changed without changing our thinking.”</span>
        <span>
            by <small class="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

Let's open up the Scrapy shell and play a bit to find out how to extract the
data we want::

    scrapy shell 'https://quotes.toscrape.com'

We get a list of selectors for the quote HTML elements with:

.. code-block:: pycon

    >>> response.css("div.quote")
    [<Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    <Selector query="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>,
    ...]

Each of the selectors returned by the query above allows us to run further
queries over its sub-elements. Let's assign the first selector to a
variable, so that we can run our CSS selectors directly on a particular quote:

.. code-block:: pycon

    >>> quote = response.css("div.quote")[0]

Now, let's extract the ``text``, ``author`` and ``tags`` from that quote
using the ``quote`` object we just created:

.. code-block:: pycon

    >>> text = quote.css("span.text::text").get()
    >>> text
    '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
    >>> author = quote.css("small.author::text").get()
    >>> author
    'Albert Einstein'

Given that the tags are a list of strings, we can use the ``.getall()`` method
to get all of them:

.. code-block:: pycon

    >>> tags = quote.css("div.tags a.tag::text").getall()
    >>> tags
    ['change', 'deep-thoughts', 'thinking', 'world']

.. invisible-code-block: python

  from sys import version_info

Having figured out how to extract each bit, we can now iterate over all the
quote elements and put them together into a Python dictionary:

.. code-block:: pycon

    >>> for quote in response.css("div.quote"):
    ...     text = quote.css("span.text::text").get()
    ...     author = quote.css("small.author::text").get()
    ...     tags = quote.css("div.tags a.tag::text").getall()
    ...     print(dict(text=text, author=author, tags=tags))
    ...
    {'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
    {'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
    ...

Extracting data in our spider
-----------------------------

Let's get back to our spider. Until now, it hasn't extracted any data in
particular, just saved the whole HTML page to a local file. Let's integrate the
extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data
extracted from the page. To do that, we use the ``yield`` Python keyword
in the callback, as you can see below:

.. code-block:: python

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

To run this spider, exit the scrapy shell by entering::

    quit()

Then, run::

   scrapy crawl quotes

Now, it should output the extracted data with the log::

    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
    {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
    2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
    {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}


.. _storing-data:

Storing the scraped data
========================

The simplest way to store the scraped data is by using :ref:`Feed exports
<topics-feed-exports>`, with the following command::

    scrapy crawl quotes -O quotes.json

That will generate a ``quotes.json`` file containing all scraped items,
serialized in `JSON`_.

The ``-O`` command-line switch overwrites any existing file; use ``-o`` instead
to append new content to any existing file. However, appending to a JSON file
makes the file contents invalid JSON. When appending to a file, consider
using a different serialization format, such as `JSON Lines`_::

    scrapy crawl quotes -o quotes.jsonl

The `JSON Lines`_ format is useful because it's stream-like, so you can easily
append new records to it. It doesn't have the same problem as JSON when you run
the command twice. Also, as each record is a separate line, you can process big
files without having to fit everything in memory, and there are tools like
`JQ`_ to help do that at the command line.
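
For example, assuming you've exported ``quotes.jsonl`` as above, a short
Python sketch can process it one record at a time:

.. code-block:: python

    import json

    # Each line of a JSON Lines file is a standalone JSON document, so the
    # file can be read record by record without loading it all into memory.
    with open("quotes.jsonl", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            print(record["author"])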

In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an :ref:`Item Pipeline <topics-item-pipeline>`. A placeholder file
for Item Pipelines was set up for you when the project was created, in
``tutorial/pipelines.py``. You don't need to implement any item pipelines,
though, if you just want to store the scraped items.
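
To give a flavour, an item pipeline is just a class with a ``process_item``
method. The hypothetical pipeline below drops quotes that are missing an
author; it would still need to be enabled through the :setting:`ITEM_PIPELINES`
setting:

.. code-block:: python

    from scrapy.exceptions import DropItem


    class RequireAuthorPipeline:
        def process_item(self, item, spider):
            # Hypothetical example: discard any scraped item that has no author.
            if not item.get("author"):
                raise DropItem("Missing author")
            return item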

.. _JSON Lines: https://jsonlines.org
.. _JQ: https://stedolan.github.io/jq


Following links
===============

Let's say that, instead of just scraping the first two pages
of https://quotes.toscrape.com, you want quotes from all the pages on the website.

Now that you know how to extract data from pages, let's see how to follow links
from them.

The first thing to do is extract the link to the page we want to follow.  Examining
our page, we can see there is a link to the next page with the following
markup:

.. code-block:: html

    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
    </ul>

We can try extracting it in the shell:

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the attribute ``href``. For that,
Scrapy supports a CSS extension that lets you select the attribute contents,
like this:

.. code-block:: pycon

    >>> response.css("li.next a::attr(href)").get()
    '/page/2/'

There is also an ``attrib`` property available
(see :ref:`selecting-attributes` for more):

.. code-block:: pycon

    >>> response.css("li.next a").attrib["href"]
    '/page/2/'

Now let's see our spider, modified to recursively follow the link to the next
page, extracting data from it:

.. code-block:: python

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)


Now, after extracting the data, the ``parse()`` method looks for the link to
the next page, builds a full absolute URL using the
:meth:`~scrapy.http.Response.urljoin` method (since the links can be
relative) and yields a new request to the next page, registering itself as the
callback to handle the data extraction for the next page and to keep the
crawl going through all the pages.

What you see here is Scrapy's mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.

Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page it's
visiting.

In our example, it creates a sort of loop, following all the links to the next page
until it doesn't find one -- handy for crawling blogs, forums and other sites with
pagination.


.. _response-follow-example:

A shortcut for creating Requests
--------------------------------

As a shortcut for creating Request objects you can use
:meth:`response.follow <scrapy.http.TextResponse.follow>`:

.. code-block:: python

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            "https://quotes.toscrape.com/page/1/",
        ]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("span small::text").get(),
                    "tags": quote.css("div.tags a.tag::text").getall(),
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Unlike ``scrapy.Request``, ``response.follow`` supports relative URLs directly,
so there is no need to call ``urljoin``. Note that ``response.follow`` just
returns a Request instance; you still have to yield this Request.

.. skip: start

You can also pass a selector to ``response.follow`` instead of a string;
this selector should extract the necessary attributes:

.. code-block:: python

    for href in response.css("ul.pager a::attr(href)"):
        yield response.follow(href, callback=self.parse)

For ``<a>`` elements there is a shortcut: ``response.follow`` uses their href
attribute automatically. So the code can be shortened further:

.. code-block:: python

    for a in response.css("ul.pager a"):
        yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use
:meth:`response.follow_all <scrapy.http.TextResponse.follow_all>` instead:

.. code-block:: python

    anchors = response.css("ul.pager a")
    yield from response.follow_all(anchors, callback=self.parse)

or, shortening it further:

.. code-block:: python

    yield from response.follow_all(css="ul.pager a", callback=self.parse)

.. skip: end


More examples and patterns
--------------------------

Here is another spider that illustrates callbacks and following links,
this time for scraping author information:

.. code-block:: python

    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = "author"

        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            author_page_links = response.css(".author + a")
            yield from response.follow_all(author_page_links, self.parse_author)

            pagination_links = response.css("li.next a")
            yield from response.follow_all(pagination_links, self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).get(default="").strip()

            yield {
                "name": extract_with_css("h3.author-title::text"),
                "birthdate": extract_with_css(".author-born-date::text"),
                "bio": extract_with_css(".author-description::text"),
            }

This spider will start from the main page; it will follow all the links to the
author pages, calling the ``parse_author`` callback for each of them, and also
the pagination links with the ``parse`` callback, as we saw before.

Here we're passing callbacks to
:meth:`response.follow_all <scrapy.http.TextResponse.follow_all>` as positional
arguments to make the code shorter; it also works for
:class:`~scrapy.Request`.

The ``parse_author`` callback defines a helper function to extract and clean up
the data from a CSS query, and it yields a Python dict with the author data.

Another interesting thing this spider demonstrates is that, even if there are
many quotes from the same author, we don't need to worry about visiting the
same author page multiple times. By default, Scrapy filters out duplicated
requests to URLs already visited, avoiding the problem of hitting servers too
much because of a programming mistake. This can be configured in the
:setting:`DUPEFILTER_CLASS` setting.
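
If you ever do need to re-fetch a URL, individual requests can opt out of this
filtering with the ``dont_filter`` argument. A minimal sketch, where ``url``
and ``self.parse_author`` stand in for whatever request you need to repeat:

.. code-block:: python

    # dont_filter=True tells Scrapy not to deduplicate this particular
    # request, even if the same URL has been seen before.
    yield scrapy.Request(url, callback=self.parse_author, dont_filter=True)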

Hopefully by now you have a good understanding of how to use the mechanism
of following links and callbacks with Scrapy.

As yet another example spider that leverages the mechanism of following links,
check out the :class:`~scrapy.spiders.CrawlSpider` class, a generic
spider that implements a small rules engine that you can build your own
crawlers on top of.

Also, a common pattern is to build an item with data from more than one page,
using a :ref:`trick to pass additional data to the callbacks
<topics-request-response-ref-request-callback-arguments>`.
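
As a rough sketch of that pattern (the spider name and selectors below are
illustrative, not part of the tutorial project), a spider could start an item
on the quote page and finish it on the author page, carrying the partial item
in the ``cb_kwargs`` request argument:

.. code-block:: python

    import scrapy


    class QuoteAuthorSpider(scrapy.Spider):
        name = "quote_author"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                item = {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
                author_href = quote.css("span a::attr(href)").get()
                if author_href is not None:
                    # Carry the partially built item to the next callback via
                    # cb_kwargs; dont_filter is needed because several quotes
                    # can share the same author page.
                    yield response.follow(
                        author_href,
                        self.parse_author,
                        cb_kwargs={"item": item},
                        dont_filter=True,
                    )

        def parse_author(self, response, item):
            # Complete the item with data from the author page and yield it.
            item["birthdate"] = response.css(".author-born-date::text").get()
            yield item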


Using spider arguments
======================

You can provide command line arguments to your spiders by using the ``-a``
option when running them::

    scrapy crawl quotes -O quotes-humor.json -a tag=humor

These arguments are passed to the Spider's ``__init__`` method and become
spider attributes by default.

In this example, the value provided for the ``tag`` argument will be available
via ``self.tag``. You can use this to make your spider fetch only quotes
with a specific tag, building the URL based on the argument:

.. code-block:: python

    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        async def start(self):
            url = "https://quotes.toscrape.com/"
            tag = getattr(self, "tag", None)
            if tag is not None:
                url = url + "tag/" + tag
            yield scrapy.Request(url, self.parse)

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }

            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, self.parse)


If you pass the ``tag=humor`` argument to this spider, you'll notice that it
will only visit URLs from the ``humor`` tag, such as
``https://quotes.toscrape.com/tag/humor``.

You can :ref:`learn more about handling spider arguments here <spiderargs>`.

Next steps
==========

This tutorial covered only the basics of Scrapy, but there are a lot of other
features not mentioned here. Check the :ref:`topics-whatelse` section in the
:ref:`intro-overview` chapter for a quick overview of the most important ones.

You can continue from the section :ref:`section-basics` to learn more about the
command-line tool, spiders, selectors and other things the tutorial hasn't
covered, like modeling the scraped data. If you'd prefer to play with an example
project, check the :ref:`intro-examples` section.

.. _JSON: https://en.wikipedia.org/wiki/JSON