.. include:: header.txt

==============
 Introduction
==============

Threads, processes and the GIL
==============================

To run more than one piece of code at the same time on the same
computer, one has the choice of either using multiple processes or
multiple threads.

Although a program can be made up of multiple processes, these
processes are in effect completely independent of one another:
different processes are not able to cooperate with one another unless
one sets up some means of communication between them (such as by using
sockets).  If a lot of data must be transferred between processes then
this can be inefficient.
     
On the other hand, multiple threads within a single process are
intimately connected: they share their data but often can interfere
badly with one another.  It is often argued that the only way to make
multithreaded programming "easy" is to avoid relying on any shared
state and for the threads to only communicate by passing messages to
each other.

CPython has a *Global Interpreter Lock* (GIL) which in many ways makes
threading easier than it is in most languages by making sure that only
one thread can manipulate the interpreter's objects at a time.  As a
result, it is often safe to let multiple threads access data without
using any additional locking as one would need to in a language such
as C.

One downside of the GIL is that on multi-processor (or multi-core)
systems a multithreaded Python program can only make use of one
processor at a time.  This is a problem that can be overcome by using
multiple processes instead.

Python gives little direct support for writing programs using multiple
processes.  This package allows one to write multi-process programs
using much the same API that one uses for writing threaded programs.
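
As a point of comparison, a threaded program built on `threading.Thread`
follows this pattern (a minimal illustrative sketch; the function and
argument names are arbitrary):

```python
from threading import Thread

def f(name):
    print('hello ' + name)

t = Thread(target=f, args=('bob',))   # same constructor signature the package mirrors
t.start()                             # begin running f in a new thread
t.join()                              # wait for the thread to finish
```

The sections below show the same pattern with processes in place of
threads.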


Forking and spawning
====================

There are two ways of creating a new process in Python:

* The current process can *fork* a new child process by using the
  `os.fork()` function.  This effectively creates an identical copy
  of the current process which is now able to go off and perform some
  task set by the parent process.  This means that the child process
  inherits *copies* of all variables that the parent process had.

  However, `os.fork()` is not available on every platform: in
  particular Windows does not support it.

* Alternatively, the current process can spawn a completely new Python
  interpreter by using the `subprocess` module or one of the
  `os.spawn*()` functions.

  Getting this new interpreter into a fit state to perform the task
  set for it by its parent process is, however, a bit of a challenge.

The `processing` package uses `os.fork()` if it is available since
it makes life a lot simpler.  Forking the process is also more
efficient in terms of memory usage and the time needed to create the
new process.
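
A minimal sketch of the fork route, using the bare `os.fork()` call
(POSIX systems only; the variable `x` here is just an illustration of
the copy-on-fork behaviour described above):

```python
import os

x = 10                      # the child inherits a *copy* of this variable

pid = os.fork()             # returns 0 in the child, the child's PID in the parent
if pid == 0:
    x = x + 1               # modifies only the child's copy
    os._exit(0)             # exit the child without running cleanup handlers
else:
    os.waitpid(pid, 0)      # parent waits for the child to finish
    assert x == 10          # the parent's copy is untouched
```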


The Process class
=================

In the `processing` package processes are spawned by creating a
`Process` object and then calling its `start()` method.
`processing.Process` follows the API of `threading.Thread`.  A
trivial example of a multiprocess program is ::

   from processing import Process     

   def f(name):
       print 'hello', name

   if __name__ == '__main__':
       p = Process(target=f, args=('bob',))
       p.start()
       p.join()

Here the function `f` is run in a child process. 

For an explanation of why (on Windows) the `if __name__ == '__main__'`
part is necessary see `Programming guidelines
<programming-guidelines.html>`_.


Exchanging objects between processes
====================================

`processing` supports two types of communication channel between
processes:

**Queues**:
    The function `Queue()` returns a near clone of `Queue.Queue`
    -- see the Python standard documentation.  For example ::

        from processing import Process, Queue

        def f(q):
            q.put([42, None, 'hello'])

        if __name__ == '__main__':
            q = Queue()
            p = Process(target=f, args=(q,))
            p.start()
            print q.get()    # prints "[42, None, 'hello']"
            p.join()

    Queues are thread and process safe.  See `Queues
    <processing-ref.html#pipes-and-queues>`_.

**Pipes**:
    The `Pipe()` function returns a pair of connection objects
    connected by a pipe which by default is duplex (two-way).  For
    example ::

        from processing import Process, Pipe

        def f(conn):
            conn.send([42, None, 'hello'])
            conn.close()

        if __name__ == '__main__':
            parent_conn, child_conn = Pipe()
            p = Process(target=f, args=(child_conn,))
            p.start()
            print parent_conn.recv()   # prints "[42, None, 'hello']"
            p.join()

    The two connection objects returned by `Pipe()` represent the two
    ends of the pipe.  Each connection object has `send()` and
    `recv()` methods (among others).  Note that data in a pipe may
    become corrupted if two processes (or threads) try to read from or
    write to the *same* end of the pipe at the same time.  Of course
    there is no risk of corruption from processes using different ends
    of the pipe at the same time.  See `Pipes
    <processing-ref.html#pipes-and-queues>`_.
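
As a quick sketch of the duplex behaviour, here shown with the
standard library's `multiprocessing` (the modern descendant of
`processing`) and with both ends driven from a single process for
brevity:

```python
from multiprocessing import Pipe   # stdlib descendant of processing.Pipe

a, b = Pipe()            # duplex by default: both ends can send and receive
a.send('ping')
print(b.recv())          # prints "ping"
b.send('pong')           # the reverse direction works too
print(a.recv())          # prints "pong"
a.close()
b.close()
```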


Synchronization between processes
=================================

`processing` contains equivalents of all the synchronization
primitives from `threading`.  For instance one can use a lock to
ensure that only one process prints to standard output at a time::
       
       from processing import Process, Lock
       
       def f(l, i):
           l.acquire()
           print 'hello world', i
           l.release()
           
       if __name__ == '__main__':
           lock = Lock()
           
           for num in range(10):
               Process(target=f, args=(lock, num)).start()
               
Without using the lock, output from the different processes is liable
to get all mixed up.


Sharing state between processes
===============================

As mentioned above, when doing concurrent programming it is usually
best to avoid using shared state as far as possible.  This is
particularly true when using multiple processes.

However, if you really do need to use some shared data then
`processing` provides a couple of ways of doing so.

**Shared memory**: 
   Data can be stored in a shared memory map using `Value` or `Array`.
   For example the following code ::
      
       from processing import Process, Value, Array

       def f(n, a):
           n.value = 3.1415927
           for i in range(len(a)):
               a[i] = -a[i]

       if __name__ == '__main__':
           num = Value('d', 0.0)
           arr = Array('i', range(10))
           
           p = Process(target=f, args=(num, arr))
           p.start()
           p.join()
           
           print num.value
           print arr[:]
           
   will print ::
        
       3.1415927
       [0, -1, -2, -3, -4, -5, -6, -7, -8, -9]

   The `'d'` and `'i'` arguments used when creating `num` and `arr`
   are typecodes of the kind used by the `array` module: `'d'`
   indicates a double precision float and `'i'` indicates a signed
   integer.  These shared objects will be process and thread safe.

   For more flexibility in using shared memory one can use the
   `processing.sharedctypes` module which supports the creation of
   arbitrary `ctypes objects allocated from shared memory
   <sharedctypes.html>`_.

**Server process**:
   A manager object returned by `Manager()` controls a server process
   which holds Python objects and allows other processes to manipulate
   them using proxies.

   A manager returned by `Manager()` will support types `list`,
   `dict`, `Namespace`, `Lock`, `RLock`, `Semaphore`,
   `BoundedSemaphore`, `Condition`, `Event`, `Queue`, `Value`
   and `Array`.  For example::

       from processing import Process, Manager

       def f(d, l):
           d[1] = '1'
           d['2'] = 2
           d[0.25] = None
           l.reverse()

       if __name__ == '__main__':
           manager = Manager()

           d = manager.dict()
           l = manager.list(range(10))

           p = Process(target=f, args=(d, l))
           p.start()
           p.join()

           print d
           print l

   will print ::

       {0.25: None, 1: '1', '2': 2}
       [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

   Creating managers which support other types is not hard --- see
   `Customized managers <manager-objects.html#customized-managers>`_.
   
   Server process managers are more flexible than using shared memory
   objects because they can be made to support arbitrary object types.
   Also, a single manager can be shared by processes on different
   computers over a network.  They are, however, slower than using
   shared memory.  See `Server process managers
   <manager-objects.html#server-process-managers>`_.


Using a pool of workers
=======================

The `Pool()` function returns an object representing a pool of worker
processes.  It has methods which allow tasks to be offloaded to the
worker processes in a few different ways.

For example::

    from processing import Pool

    def f(x):
        return x*x

    if __name__ == '__main__':
        pool = Pool(processes=4)              # start 4 worker processes
        result = pool.applyAsync(f, [10])     # evaluate "f(10)" asynchronously
        print result.get(timeout=1)           # prints "100" unless your computer is *very* slow
        print pool.map(f, range(10))          # prints "[0, 1, 4,..., 81]"

See `Process pools <pool-objects.html>`_.


Speed
=====

The following benchmarks were performed on a single-core 2.5 GHz
Pentium 4 laptop running Windows XP and Ubuntu Linux 6.10 --- see
`benchmarks.py <../examples/benchmarks.py>`_.


*Number of 256 byte string objects passed between processes/threads per sec*:

================================== ========== ==================
Connection type                    Windows    Linux
================================== ========== ==================
Queue.Queue                         49,000    17,000-50,000 [1]_
processing.Queue                    22,000    21,000
Queue managed by server              6,900     6,500
processing.Pipe                     52,000    57,000
================================== ========== ==================

.. [1] For some reason the performance of `Queue.Queue` is very
       variable on Linux.


*Number of acquires/releases of a lock per sec*:

============================== ========== ==========
Lock type                       Windows    Linux
============================== ========== ==========
threading.Lock                   850,000    560,000
processing.Lock                  420,000    510,000
Lock managed by server            10,000      8,400
threading.RLock                   93,000     76,000
processing.RLock                 420,000    500,000
RLock managed by server            8,800      7,400
============================== ========== ==========


*Number of interleaved waits/notifies per sec on a
condition variable by two processes*:

============================== ========== ==========
 Condition type                  Windows    Linux
============================== ========== ==========
threading.Condition             27,000     31,000
processing.Condition            26,000     25,000
Condition managed by server      6,600      6,000
============================== ========== ==========


*Number of integers retrieved from a sequence per sec*:

============================== ========== ==========
Sequence type                   Windows    Linux
============================== ========== ==========
list                           6,400,000  5,100,000
unsynchronized shared array    3,900,000  3,100,000
synchronized shared array        200,000    220,000
list managed by server            20,000     17,000
============================== ========== ==========

.. _Prev: index.html
.. _Up: index.html
.. _Next: processing-ref.html