Working with Collections
========================

Often we want to do a bit of custom work with ``dask.delayed`` (for example,
for complex data ingest), then leverage the algorithms in ``dask.array`` or
``dask.dataframe``, and then switch back to custom work. To this end, all
collections support ``from_delayed`` functions and ``to_delayed`` methods.

As an example, consider the case where we store tabular data in a custom
format not known to Dask DataFrame. This format is naturally broken apart
into pieces, and we have a function that reads one piece into a pandas
DataFrame. We use ``dask.delayed`` to lazily read these files into pandas
DataFrames, use ``dd.from_delayed`` to wrap these pieces up into a single
Dask DataFrame, use the complex algorithms within the DataFrame (groupby,
join, etc.), and then switch back to ``dask.delayed`` to save our results
back to the custom format:

.. code-block:: python

   import dask.dataframe as dd
   from dask.delayed import delayed
   from my_custom_library import load, save

   filenames = ...

   dfs = [delayed(load)(fn) for fn in filenames]
   df = dd.from_delayed(dfs)

   df = ...  # do work with dask.dataframe

   dfs = df.to_delayed()
   writes = [delayed(save)(df, fn) for df, fn in zip(dfs, filenames)]

   dd.compute(*writes)

Data science is often complex, and ``dask.delayed`` provides a release valve
for users to manage this complexity on their own, and to solve the last-mile
problem for custom formats and complex situations.
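
The same round trip works for the other collections. As a minimal, runnable
sketch of the array case, the snippet below wraps lazy pieces with
``da.from_delayed`` and concatenates them into a single Dask array (here
``load_chunk`` is a hypothetical stand-in for custom ingest logic):

.. code-block:: python

   import numpy as np
   import dask.array as da
   from dask import delayed

   # Hypothetical loader standing in for custom ingest logic:
   # piece i yields the integers 5*i .. 5*i + 4
   def load_chunk(i):
       return np.arange(5) + i * 5

   # Wrap each lazy piece as a Delayed, then declare its shape and dtype
   # so Dask can treat it as one chunk of an array
   chunks = [
       da.from_delayed(delayed(load_chunk)(i), shape=(5,), dtype=int)
       for i in range(4)
   ]
   x = da.concatenate(chunks)  # one Dask array built from delayed pieces

   total = x.sum().compute()   # 0 + 1 + ... + 19 == 190

Each piece's ``shape`` and ``dtype`` must be known up front; if they are not,
staying in ``dask.delayed`` (or using ``dask.dataframe``, which needs only
column metadata) is usually the better fit.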