Case Study: Web Crawler
=======================
:doc:`DiskCache <index>` version 2.7 added a couple of persistent data
structures. Let's see how they're useful with a case study in crawling the
web. Easy enough, right? We'll start with code to retrieve urls:
>>> from time import sleep
>>> def get(url):
...     "Get data for url."
...     sleep(url / 1000.0)
...     return str(url)
No, we're not actually crawling the web. Our urls are numbers and we'll simply
go to sleep to simulate downloading a web page.
>>> get(20)
'20'
Once we download some data, we'll need to parse it and extract the links.
>>> from random import randrange, seed
>>> def parse(data):
...     "Parse data and return list of links."
...     seed(int(data))
...     count = randrange(1, 10)
...     return [randrange(100) for _ in range(count)]
Again, we're not really parsing data. We're just returning a list of one to
nine integers between zero and ninety-nine. In our imaginary web, urls are just
integers.
>>> parse('20')
[68, 76, 90, 25, 63, 90, 87, 57, 16]
Alright, this is a pretty basic pattern. The ``get`` function returns data and
the ``parse`` function returns a list of more data to go get. We can use the
``deque`` data type from the standard library's ``collections`` module to
crawl our web.
>>> from collections import deque
>>> def crawl():
...     urls = deque([0])
...     results = dict()
...
...     while True:
...         try:
...             url = urls.popleft()
...         except IndexError:
...             break
...
...         if url in results:
...             continue
...
...         data = get(url)
...
...         for link in parse(data):
...             urls.append(link)
...
...         results[url] = data
...
...     print('Results: %s' % len(results))
We're doing a breadth-first search crawl of the web. Our initial seed is zero
and we use it to initialize our queue. All the results are stored in a
dictionary mapping url to data. We then iterate by repeatedly popping the first
url from our queue. If we've already visited the url, we skip it; otherwise we
get the corresponding data and parse it. The parsed links are appended to our
queue. Finally, we store the data in our results dictionary.
>>> crawl()
Results: 99
The results of our current code are ephemeral. All results are lost once the
program terminates. To make the results persistent, we can use :doc:`DiskCache
<index>` data structures and store the results in the local file
system. :doc:`DiskCache <index>` provides both :class:`Deque <diskcache.Deque>`
and :class:`Index <diskcache.Index>` data structures which can replace our urls
and results variables.
>>> from diskcache import Deque, Index
>>> def crawl():
...     urls = Deque([0], 'data/urls')
...     results = Index('data/results')
...
...     while True:
...         try:
...             url = urls.popleft()
...         except IndexError:
...             break
...
...         if url in results:
...             continue
...
...         data = get(url)
...
...         for link in parse(data):
...             urls.append(link)
...
...         results[url] = data
...
...     print('Results: %s' % len(results))
Look familiar? Only three lines changed. The import at the top changed so now
we're using ``diskcache`` rather than the ``collections`` module. Then, when we
initialize the urls and results objects, we pass relative paths to directories
where we want the data stored. Again, let's try it out:
>>> crawl()
Results: 99
Our results are now persistent. We can initialize our results index outside of
the crawl function and query it.
>>> results = Index('data/results')
>>> len(results)
99
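Since :class:`Index <diskcache.Index>` acts like an ordinary mutable mapping,
we can also look up individual pages. The seed url, zero, is always crawled,
so we know what to expect from it:
>>> 0 in results
True
>>> results[0]
'0'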
As an added benefit, our code now also works in parallel.
>>> results.clear()
>>> from multiprocessing import Process
>>> processes = [Process(target=crawl) for _ in range(4)]
>>> for process in processes:
...     process.start()
>>> for process in processes:
...     process.join()
>>> len(results)
99
Each of the processes uses the same deque and index to crawl our web. Work is
automatically divided among the processes as they pop urls from the queue. If
this were run as a script then multiple Python processes could be started and
stopped as desired.
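As a rough sketch, a script version might look like the following. The file
name ``crawl.py`` and the worker count are assumptions for illustration, and
the ``get``, ``parse`` and ``crawl`` functions (with their imports) are the
ones defined above, assumed to live in the same file::

    # crawl.py -- hypothetical script wrapping the crawl() function above.
    from multiprocessing import Process

    from diskcache import Deque, Index

    # ... get(), parse() and crawl() as defined above ...

    if __name__ == '__main__':
        # Each worker pops urls from the same on-disk deque and writes to
        # the same on-disk index, so the work is divided automatically.
        workers = [Process(target=crawl) for _ in range(4)]

        for worker in workers:
            worker.start()

        for worker in workers:
            worker.join()

        print('Crawled %s pages' % len(Index('data/results')))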
Interesting, no? Three simple changes and our code goes from ephemeral and
single-process to persistent and multi-process. Nothing truly new has happened
here but the API is convenient and that makes a huge difference. We're also no
longer constrained by memory. :doc:`DiskCache <index>` makes efficient use of
your disk and you can customize how much memory is used. By default the maximum
memory consumption of deque and index objects is only a few dozen
megabytes. Now our simple script can efficiently process terabytes of data.
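As a minimal sketch of that kind of tuning, a :class:`Cache <diskcache.Cache>`
can be created with explicit settings and then wrapped by the persistent data
structures. The values below are illustrative only, and binding via
``fromcache`` assumes a diskcache release that provides those constructors:
>>> from diskcache import Cache, Index
>>> cache = Cache('data/results', size_limit=2 ** 30,
...               sqlite_cache_size=2 ** 11, sqlite_mmap_size=2 ** 25)
>>> results = Index.fromcache(cache)
Here ``size_limit`` caps the on-disk size of the cache while the two SQLite
settings reduce the per-connection memory footprint.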
Go forth and build and share!