Overview
========
Dask Bag implements operations like ``map``, ``filter``, ``fold``, and
``groupby`` on collections of Python objects. It does this in parallel with a
small memory footprint using Python iterators. It is similar to a parallel
version of PyToolz_ or a Pythonic version of the `PySpark RDD`_.
.. _PyToolz: https://toolz.readthedocs.io/en/latest/
.. _`PySpark RDD`: http://spark.apache.org/docs/latest/api/python/pyspark.html
Design
------
Dask bags coordinate many Python lists or Iterators, each of which forms a
partition of a larger collection.
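A minimal sketch of this partitioned structure (element values and partition count here are illustrative):

```python
import dask.bag as db

# Six elements split across three partitions of two elements each
b = db.from_sequence(range(6), npartitions=3)

b.npartitions                      # 3
b.map(lambda x: x * 2).compute()   # [0, 2, 4, 6, 8, 10]
```

Each partition is processed independently, which is what makes the parallel execution described below possible.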
Common Uses
-----------
Dask bags are often used to parallelize simple computations on unstructured or
semi-structured data like text data, log files, JSON records, or user defined
Python objects.
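For instance, a common pattern is to parse a collection of JSON lines and reduce over a field. The sketch below uses an in-memory sequence so it is self-contained; a real workflow would typically start from ``db.read_text`` on a set of files instead:

```python
import json
import dask.bag as db

# Illustrative JSON-lines records; in practice these would come from files
lines = ['{"name": "Alice", "amount": 100}',
         '{"name": "Bob", "amount": 200}',
         '{"name": "Alice", "amount": 300}']

records = db.from_sequence(lines, npartitions=2).map(json.loads)
records.map(lambda r: r['amount']).sum().compute()   # 600
```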
Execution
---------
Executing computations on bags provides two benefits:

1. Parallel: data is split up, allowing multiple cores or machines to execute
   in parallel
2. Iterating: data is processed lazily, allowing smooth execution of
   larger-than-memory data, even on a single machine within a single partition
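For example, a chain of bag operations only builds up a lazy graph; no work happens until ``.compute()`` is called (values here are illustrative):

```python
import dask.bag as db

b = db.from_sequence(range(10), npartitions=2)
squares = b.map(lambda x: x ** 2)              # lazy: no work happens yet
evens = squares.filter(lambda x: x % 2 == 0)   # still lazy
evens.compute()                                # [0, 4, 16, 36, 64]
```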
Default scheduler
~~~~~~~~~~~~~~~~~
By default, ``dask.bag`` uses ``dask.multiprocessing`` for computation. As a
benefit, Dask bypasses the GIL_ and uses multiple cores on pure Python objects.
As a drawback, Dask Bag doesn't perform well on computations that include a
great deal of inter-worker communication. For common operations this is rarely
an issue as most Dask Bag workflows are embarrassingly parallel or result in
reductions with little data moving between workers.
.. _GIL: https://docs.python.org/3/glossary.html#term-gil
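The scheduler can also be chosen explicitly per call. The ``scheduler=`` keyword shown below is the modern Dask spelling and may differ in older releases:

```python
import dask.bag as db

b = db.from_sequence(range(100), npartitions=4)
total = b.sum()

total.compute(scheduler='processes')    # multiple processes; bypasses the GIL
total.compute(scheduler='threads')      # threads; lighter weight but GIL-bound
total.compute(scheduler='synchronous')  # single-threaded; handy for debugging
```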
Shuffle
~~~~~~~
Some operations, like ``groupby``, require substantial inter-worker
communication. On a single machine, Dask uses partd_ to perform efficient,
parallel, spill-to-disk shuffles. When working in a cluster, Dask uses a
task-based shuffle.
These shuffle operations are expensive and better handled by projects like
``dask.dataframe``. It is best to use ``dask.bag`` to clean and process data,
then transform it into an array or DataFrame before embarking on the more
complex operations that require shuffle steps.
.. _partd: https://github.com/mrocklin/partd
Known Limitations
-----------------
Bags provide very general computation (any Python function). This generality
comes at a cost. Bags have the following known limitations:
1. By default, they rely on the multiprocessing scheduler, which has its own
set of known limitations (see :doc:`shared`)
2. Bags are immutable, so you cannot change individual elements
3. Bag operations tend to be slower than array/DataFrame computations in the
same way that standard Python containers tend to be slower than NumPy
arrays and Pandas DataFrames
4. Bag's ``groupby`` is slow. You should try to use Bag's ``foldby`` if
   possible; using ``foldby`` requires more thought, though
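To illustrate the ``foldby`` recommendation from point 4, here is a sketch that sums values per key with a per-partition fold followed by a combine step (the data is illustrative):

```python
import dask.bag as db

b = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=3)

# binop folds elements within each partition; combine merges partition results
result = dict(b.foldby(key=lambda x: x % 2,           # group by parity
                       binop=lambda acc, x: acc + x,  # fold within a partition
                       initial=0,
                       combine=lambda a, b: a + b,    # merge partition totals
                       combine_initial=0).compute())
result   # {1: 9, 0: 12}
```

Unlike ``groupby``, this never materializes whole groups, which is why it avoids the expensive shuffle.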
Name
----
*Bag* is the mathematical name for an unordered collection that allows
repeats. It is a friendly synonym for multiset_. A bag, or multiset, is a
generalization of a set that, unlike a set, allows multiple instances of its
elements:
* ``list``: *ordered* collection *with repeats*, ``[1, 2, 3, 2]``
* ``set``: *unordered* collection *without repeats*, ``{1, 2, 3}``
* ``bag``: *unordered* collection *with repeats*, ``{1, 2, 2, 3}``
So, a bag is like a list, but it doesn't guarantee an ordering among elements.
There can be repeated elements, but you cannot ask for the *i*-th element.
.. _multiset: http://en.wikipedia.org/wiki/Bag_(mathematics)
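These semantics are visible directly on a small bag (values are illustrative):

```python
import dask.bag as db

b = db.from_sequence([1, 2, 3, 2], npartitions=2)

sorted(b.frequencies().compute())   # [(1, 1), (2, 2), (3, 1)] -- repeats kept
sorted(b.distinct().compute())      # [1, 2, 3]
# b[0] would fail: a bag has no notion of "the i-th element"
```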