.. _overview:

========
Overview
========

A good web scraping framework helps to keep your code maintainable by, among
other things, enabling and encouraging `separation of concerns`_.

.. _separation of concerns: https://en.wikipedia.org/wiki/Separation_of_concerns
For example, Scrapy_ lets you implement different aspects of web scraping,
such as ban avoidance or data delivery, as separate components.

.. _Scrapy: https://scrapy.org/
However, two core aspects of web scraping can be hard to decouple:
*crawling*, i.e. visiting URLs, and *parsing*, i.e. extracting data.

web-poet lets you :ref:`write data extraction code <page-objects>` that:

- Makes your web scraping code easier to maintain, since your data extraction
  and crawling code are no longer intertwined and can be maintained
  separately.

- Can be reused with different versions of your crawling code, i.e. with
  different crawling strategies.

- Can be executed independently of your crawling code, enabling easier
  debugging and easier automated testing.

- Can be used with any Python web scraping framework or library that
  implements the :ref:`web-poet specification <spec>`, either directly
  or through a third-party plugin. See :ref:`frameworks`.
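
To illustrate the idea behind this decoupling (this is a toy sketch of the
principle, not the actual web-poet API — the ``ProductPage`` class, the naive
regex extraction, and the ``to_item`` method name are illustrative assumptions
here), parsing logic can live in a class that only receives page content, with
no knowledge of how that content was fetched:

```python
import re

# Illustrative only, not the web-poet API: all extraction logic lives in
# a class that takes HTML as input, independent of the crawling code.
class ProductPage:
    def __init__(self, html: str):
        self.html = html

    def to_item(self) -> dict:
        # Hypothetical extraction; a real page object would use a proper
        # HTML parser rather than a regex.
        match = re.search(r"<h1>(.*?)</h1>", self.html)
        return {"name": match.group(1) if match else None}

# The same parsing code works whether the HTML came from a live crawl,
# a cached response, or a test fixture:
page = ProductPage("<html><h1>Chocolate</h1></html>")
print(page.to_item())  # {'name': 'Chocolate'}
```

Because the class never fetches anything itself, it can be unit-tested against
saved HTML fixtures and reused unchanged across different crawling strategies.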

To learn more about why and how web-poet came to be, see :ref:`from-ground-up`.