.. _overview:

========
Overview
========

A good web scraping framework helps keep your code maintainable by, among
other things, enabling and encouraging `separation of concerns`_.

.. _separation of concerns: https://en.wikipedia.org/wiki/Separation_of_concerns

For example, Scrapy_ lets you implement different aspects of web scraping, like
ban avoidance or data delivery, as separate components.

.. _Scrapy: https://scrapy.org/

However, there are two core aspects of web scraping that can be hard to
decouple: *crawling*, i.e. visiting URLs, and *parsing*, i.e. extracting data.

web-poet lets you :ref:`write data extraction code <page-objects>` that:

-   Makes your web scraping code easier to maintain, since your data extraction
    and crawling code are no longer intertwined and can be maintained
    separately.

-   Can be reused with different versions of your crawling code, i.e. with
    different crawling strategies.

-   Can be executed independently of your crawling code, enabling easier
    debugging and easier automated testing.

-   Can be used with any Python web scraping framework or library that
    implements the :ref:`web-poet specification <spec>`, either directly
    or through a third-party plugin. See :ref:`frameworks`.
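
As a rough illustration of the separation described above, the following
sketch uses only the Python standard library. It does not use the web-poet
API: ``parse_page``, ``crawl`` and the ``fetch`` callable are hypothetical
names chosen for this example. The point is only that extraction code that
takes HTML as input can be run and tested without any crawling code.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List


class _TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_page(url: str, html: str) -> Dict[str, str]:
    """Parsing only (hypothetical helper): build a record from a URL and its HTML."""
    parser = _TitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip()}


def crawl(urls: Iterable[str], fetch: Callable[[str], str]) -> List[Dict[str, str]]:
    """Crawling only (hypothetical helper): visit URLs and delegate extraction."""
    return [parse_page(url, fetch(url)) for url in urls]


# The parsing code runs against a hard-coded HTML string, with no crawler involved.
record = parse_page("https://example.com", "<html><title> Hello </title></html>")

# The crawling code runs against a fake fetcher instead of a real HTTP client.
fake_pages = {"https://example.com/a": "<title>A</title>"}
records = crawl(fake_pages, fake_pages.__getitem__)
```

Because ``parse_page`` never performs a request, the same function could be
paired with a different crawling strategy by passing another ``fetch``
callable or another list of URLs.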

To learn more about why and how web-poet came to be, see :ref:`from-ground-up`.