Parallelism
===========

PyToolz tries to support other parallel processing libraries.  It does this
by ensuring easy serialization of ``toolz`` functions and providing
architecture-agnostic parallel algorithms.

In practice ``toolz`` is developed against ``multiprocessing`` and
``ipyparallel``.


Serialization
-------------

Multiprocessing and distributed computing require the transmission of
functions between different processes or computers.  This is done by
serializing the function into a stream of bytes, sending those bytes over a
wire, and deserializing them back into a function on the other side.  To the
extent possible, PyToolz functions are compatible with the standard
serialization library ``pickle``.
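
For example, a curried ``toolz`` function survives a round trip through
``pickle``.  A minimal sketch (the ``inc`` helper is ours, defined only for
illustration):

.. code::

    import pickle

    from toolz.curried import map

    def inc(x):
        return x + 1

    # A partially applied, curried toolz function pickles and unpickles cleanly
    pmap = pickle.loads(pickle.dumps(map(inc)))
    assert list(pmap([1, 2, 3])) == [2, 3, 4]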

The ``pickle`` library often fails for complex functions including lambdas,
closures, and class methods.  When this occurs we recommend the alternative
serialization library ``dill``.
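
A sketch of the failure and the fix, assuming ``dill`` is installed (the
``stem`` lambda here is a toy example):

.. code::

    import pickle

    import dill

    stem = lambda word: word.lower().strip(".,!?")

    try:
        pickle.dumps(stem)       # pickle cannot locate a lambda by name
    except pickle.PicklingError:
        pass

    blob = dill.dumps(stem)      # dill serializes the function by value
    assert dill.loads(blob)("Hello!") == "hello"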


Example with parallel map
-------------------------

Most parallel processing tasks can be significantly accelerated using only a
parallel map operation.  A number of high-quality parallel map operations
exist in other libraries, notably ``multiprocessing``, ``ipyparallel``, and
thread pools via ``multiprocessing.pool.ThreadPool`` (if your operation is
not processor bound).
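
For instance, a thread-backed map is a drop-in replacement for the builtin
``map`` when tasks spend most of their time waiting on I/O.  A minimal
sketch:

.. code::

    from multiprocessing.pool import ThreadPool

    # Threads share one interpreter, so this helps I/O-bound work,
    # not processor-bound work
    with ThreadPool(4) as pool:
        lengths = pool.map(len, ['one', 'two', 'three'])

    assert lengths == [3, 3, 5]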

In the example below we extend our word-counting solution with a parallel
map.  We show how one can progress in development from sequential, to
multiprocessing, to distributed computation, all with the same domain code.


.. code::

    from toolz.curried import map
    from toolz import frequencies, compose, concat, merge_with

    def stem(word):
        """ Stem word to primitive form

        >>> stem("Hello!")
        'hello'
        """
        return word.lower().rstrip(",.!)-*_?:;$'-\"").lstrip("-*'\"(_$'")


    wordcount = compose(frequencies, map(stem), concat, map(str.split), open)

    if __name__ == '__main__':
        # Filenames for thousands of books from which we'd like to count words
        filenames = ['Book_%d.txt' % i for i in range(10000)]

        # Start with sequential map for development
        # pmap = map

        # Advance to Multiprocessing map for heavy computation on single machine
        # from multiprocessing import Pool
        # p = Pool(8)
        # pmap = p.map

        # Finish with distributed parallel map for big data
        from ipyparallel import Client
        p = Client()[:]
        pmap = p.map_sync

        total = merge_with(sum, pmap(wordcount, filenames))

This smooth transition is possible because

1.  The ``map`` abstraction is a simple function call and so can be replaced.
    By contrast, this transformation would be difficult if we had written our code with a
    for loop or list comprehension.
2.  The operation ``wordcount`` is separate from the parallel solution.
3.  The task is embarrassingly parallel, needing only a very simple parallel
    strategy.  Fortunately this is the common case.


Parallel Algorithms
-------------------

PyToolz does not implement parallel processing systems.  It does, however,
provide parallel algorithms that can extend existing parallel systems.  Our
general solution is to build algorithms that operate around a user-supplied
parallel map function.

In particular we provide a parallel ``fold`` in ``toolz.sandbox.parallel.fold``.
This fold can work equally well with ``multiprocessing.Pool.map``,
``multiprocessing.pool.ThreadPool.map``, or ``ipyparallel``'s ``map_async``.
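
For example, the parallel fold accepts the map function as a keyword
argument.  A minimal sketch with ``multiprocessing`` (the pool size and
chunksize are arbitrary choices for illustration):

.. code::

    from multiprocessing import Pool
    from operator import add

    from toolz.sandbox.parallel import fold

    if __name__ == '__main__':
        with Pool(4) as pool:
            # Each chunk is reduced with ``add`` in a worker process, then
            # the partial results are combined; the operator must be
            # associative for this to be correct
            total = fold(add, range(100000), 0, map=pool.map, chunksize=1000)

        assert total == sum(range(100000))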