File: case-study-web-crawler.rst

Case Study: Web Crawler
=======================

:doc:`DiskCache <index>` version 2.7 added a couple of persistent data
structures. Let's see how they're useful with a case study in crawling the
web. Easy enough, right? We'll start with code to retrieve urls:

    >>> from time import sleep
    >>> def get(url):
    ...     "Get data for url."
    ...     sleep(url / 1000.0)
    ...     return str(url)

No, we're not actually crawling the web. Our urls are numbers and we'll simply
go to sleep to simulate downloading a web page.

    >>> get(20)
    '20'

Once we download some data, we'll need to parse it and extract the links.

    >>> from random import randrange, seed
    >>> def parse(data):
    ...     "Parse data and return list of links."
    ...     seed(int(data))
    ...     count = randrange(1, 10)
    ...     return [randrange(100) for _ in range(count)]

Again, we're not really parsing data. We're just returning a list of one to
nine integers, each between zero and ninety-nine. In our imaginary web, urls
are just integers.

    >>> parse('20')
    [68, 76, 90, 25, 63, 90, 87, 57, 16]

Alright, this is a pretty basic pattern. The ``get`` function returns data and
the ``parse`` function returns a list of more urls to get. We can use the deque
data type from the standard library's ``collections`` module to crawl our web.

    >>> from collections import deque
    >>> def crawl():
    ...     urls = deque([0])
    ...     results = dict()
    ...
    ...     while True:
    ...         try:
    ...             url = urls.popleft()
    ...         except IndexError:
    ...             break
    ...
    ...         if url in results:
    ...             continue
    ...
    ...         data = get(url)
    ...
    ...         for link in parse(data):
    ...             urls.append(link)
    ...
    ...         results[url] = data
    ...
    ...     print('Results: %s' % len(results))

We're doing a breadth-first crawl of the web. Our initial seed is zero and we
use it to initialize our queue. All the results are stored in a dictionary
mapping url to data. We then iterate by repeatedly popping the first url from
our queue. If we've already visited the url, we continue to the next one;
otherwise we get the corresponding data and parse it. The parsed links are
appended to our queue. Finally, we store the data in our results dictionary.

    >>> crawl()
    Results: 99

The results of our current code are ephemeral. All results are lost once the
program terminates. To make the results persistent, we can use :doc:`DiskCache
<index>` data structures and store the results in the local file
system. :doc:`DiskCache <index>` provides both :class:`Deque <diskcache.Deque>`
and :class:`Index <diskcache.Index>` data structures which can replace our urls
and results variables.

    >>> from diskcache import Deque, Index
    >>> def crawl():
    ...     urls = Deque([0], 'data/urls')
    ...     results = Index('data/results')
    ...
    ...     while True:
    ...         try:
    ...             url = urls.popleft()
    ...         except IndexError:
    ...             break
    ...
    ...         if url in results:
    ...             continue
    ...
    ...         data = get(url)
    ...
    ...         for link in parse(data):
    ...             urls.append(link)
    ...
    ...         results[url] = data
    ...
    ...     print('Results: %s' % len(results))

Look familiar? Only three lines changed. The import at the top changed so now
we're using ``diskcache`` rather than the ``collections`` module. Then, when we
initialize the urls and results objects, we pass relative paths to directories
where we want the data stored. Again, let's try it out:

    >>> crawl()
    Results: 99

Our results are now persistent. We can initialize our results index outside of
the crawl function and query it.

    >>> results = Index('data/results')
    >>> len(results)
    99
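
Because :class:`Index <diskcache.Index>` objects support the mapping
interface, individual results can also be looked up by url. For instance, the
data stored for our seed url:

    >>> results[0]
    '0'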

As an added benefit, our code also now works in parallel.

    >>> results.clear()
    >>> from multiprocessing import Process
    >>> processes = [Process(target=crawl) for _ in range(4)]
    >>> for process in processes:
    ...     process.start()
    >>> for process in processes:
    ...     process.join()
    >>> len(results)
    99

Each of the processes uses the same deque and index to crawl our web. Work is
automatically divided among the processes as they pop urls from the queue. If
this were run as a script then multiple Python processes could be started and
stopped as desired.
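
For example, a minimal entry-point script might look like the sketch below.
The ``crawler`` module name is hypothetical; it assumes the ``get``,
``parse``, and ``crawl`` functions above have been saved to ``crawler.py``.
Several copies of the script could then be launched from separate terminals,
each pulling urls from the shared deque::

    # crawl.py -- hypothetical entry point for running the crawler as a script.
    # Start several copies in separate terminals; they coordinate through the
    # shared Deque and Index directories on disk.
    from crawler import crawl

    if __name__ == '__main__':
        crawl()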

Interesting, no? Three simple changes and our code goes from ephemeral and
single-process to persistent and multi-process. Nothing truly new has happened
here but the API is convenient and that makes a huge difference. We're also no
longer constrained by memory. :doc:`DiskCache <index>` makes efficient use of
your disk and you can customize how much memory is used. By default the maximum
memory consumption of deque and index objects is only a few dozen
megabytes. Now our simple script can efficiently process terabytes of data.
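
If the defaults don't fit your workload, the settings of the underlying
:class:`Cache <diskcache.Cache>` can be tuned. The snippet below is only a
sketch; it assumes the SQLite-related settings (``sqlite_cache_size``,
``sqlite_mmap_size``) and the ``fromcache`` constructors described in the
DiskCache tutorial and API reference::

    from diskcache import Cache, Index

    # Shrink SQLite's page cache (in pages) and memory map (in bytes) for the
    # results index; Deque.fromcache works the same way for the url queue.
    cache = Cache('data/results', sqlite_cache_size=2 ** 11,
                  sqlite_mmap_size=2 ** 24)
    results = Index.fromcache(cache)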

Go forth and build and share!