.. _graph_manipulation:

Advanced graph manipulation
===========================
Some computations on Dask collections use more memory than necessary; in the worst
case, an entire Dask DataFrame is held in memory at once. This can happen when Dask's
scheduler does not automatically delay the computation of nodes in a task graph, so
that their output occupies memory for prolonged periods of time, or when recomputing
nodes would be much cheaper than holding their output in memory.

This page highlights a set of graph manipulation utilities which can be used to help
avoid these scenarios. In particular, the utilities described below rewrite the
underlying Dask graph for Dask collections, producing equivalent collections with
different sets of keys.
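
As a minimal sketch of what "equivalent collections with different sets of keys"
means, the snippet below applies :func:`~dask.graph_manipulation.clone` to a small
array. The array size, the chunking, and the variable names are illustrative
assumptions, chosen only so that the two results can be compared exactly.

```python
import dask.array as da
from dask.graph_manipulation import clone

# A small, deterministic array (illustrative; any Dask collection would do).
x = da.arange(12, chunks=4)

# clone() rebuilds the graph of x under freshly generated keys.
x_clone = clone(x)

# Keys that appear in both graphs; expected to be empty.
shared_keys = set(x.__dask_graph__()) & set(x_clone.__dask_graph__())

# The values produced by the two collections; expected to be identical.
same_values = bool((x.compute() == x_clone.compute()).all())

print(shared_keys, same_values)
```

Because no key is shared, the scheduler treats ``x`` and ``x_clone`` as unrelated
computations and will not reuse the chunks of one for the other.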

Consider the following example:

.. code-block:: python

   >>> import dask.array as da
   >>> x = da.random.default_rng().normal(size=500_000_000, chunks=100_000)
   >>> x_mean = x.mean()
   >>> y = (x - x_mean).max().compute()

The above example computes the maximum of a normally distributed sample after
subtracting its mean. Computing ``x_mean`` requires loading every chunk of ``x`` into
memory. However, since ``x`` is needed again later to compute ``y``, its chunks are not
released after the reduction, and the entire ``x`` array ends up being kept in memory.
Here ``x`` holds 500 million float64 values, about 4 GB; for larger Dask Arrays this
can exceed the available memory.

To avoid keeping the entire ``x`` array in memory, the last line can be replaced
with:

.. code-block:: python

   >>> from dask.graph_manipulation import bind
   >>> xb = bind(x, x_mean)
   >>> y = (xb - x_mean).max().compute()

Here we use :func:`~dask.graph_manipulation.bind` to create a new Dask Array, ``xb``,
which produces exactly the same output as ``x``, but whose underlying Dask graph has
different keys than ``x``, and will only be computed after ``x_mean`` has been
calculated.

As a result, the chunks of ``x`` are computed and immediately reduced, one by one, by
``mean``; they are then recomputed and immediately pipelined into the subtraction
followed by the reduction with ``max``. Peak memory usage is much smaller because the
full ``x`` array is never held in memory at once. The tradeoff is a longer compute
time, since ``x`` is computed twice.
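
The properties described above can be checked on a small scale. The following is a
sketch under stated assumptions: it swaps the random input for a tiny deterministic
array (``da.arange``) so that the result is reproducible, and it inspects the graph
through the collection protocol methods ``__dask_keys__`` and ``__dask_graph__``. It
does not measure memory usage.

```python
import dask.array as da
from dask.graph_manipulation import bind

# Small deterministic stand-in for the large random array used above.
x = da.arange(1_000, chunks=100, dtype="float64")
x_mean = x.mean()

# xb yields the same values as x, but under new keys that depend on x_mean.
xb = bind(x, x_mean)
y = (xb - x_mean).max().compute()

# The chunk keys of x and xb do not overlap, so nothing is shared between them.
keys_disjoint = set(x.__dask_keys__()).isdisjoint(xb.__dask_keys__())

# The graph of xb contains the task that produces x_mean.
mean_key = x_mean.__dask_keys__()[0]
depends_on_mean = mean_key in xb.__dask_graph__()

print(y, keys_disjoint, depends_on_mean)
```

Since the keys differ, the scheduler cannot reuse the chunks computed for ``x_mean``
when evaluating ``xb``; this is what forces the second computation of ``x``.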


API
---

.. currentmodule:: dask.graph_manipulation

.. autosummary::

   checkpoint
   wait_on
   bind
   clone


Definitions
~~~~~~~~~~~

.. autofunction:: checkpoint
.. autofunction:: wait_on
.. autofunction:: bind
.. autofunction:: clone