Best Practices
==============

It is easy to get started with Dask arrays, but using them *well* does require
some experience.  This page contains suggestions for best practices, and
includes solutions to common problems.

Use NumPy
---------

If your data fits comfortably in RAM and you are not performance bound, then
using NumPy might be the right choice.  Dask adds another layer of complexity
which may get in the way.

If you are just looking for speedups rather than scalability then you may want
to consider a project like `Numba <https://numba.pydata.org>`_.


Select a good chunk size
------------------------

A common performance problem among Dask Array users is that they have chosen a
chunk size that is either too small (leading to lots of overhead) or poorly
aligned with their data (leading to inefficient reading).

While optimal sizes and shapes are highly problem specific, it is rare to see
good chunk sizes below 100 MB.  If you are dealing with float64 data then this
is around ``(4000, 4000)`` for a 2D array or ``(100, 400, 400)`` for a 3D
array.

You want chunks that are large enough to reduce the number of chunks Dask has
to keep track of (which adds scheduling overhead), but small enough that many
of them can fit in memory at once.  Dask will often hold roughly twice as many
chunks in memory as there are active threads.
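
For example, a minimal sketch with a synthetic array (the sizes here are only
illustrative) showing chunks of roughly the recommended size:

.. code-block:: python

   >>> import dask.array as da

   >>> # A 100,000 x 100,000 float64 array split into (4000, 4000) chunks,
   >>> # so each chunk is 4000 * 4000 * 8 bytes = 128 MB
   >>> x = da.ones((100000, 100000), chunks=(4000, 4000))
   >>> x.numblocks
   (25, 25)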


Orient your chunks
------------------

When reading data you should align your chunks with your storage format.
Most array storage formats store data in chunks themselves.  If your Dask array
chunks aren't multiples of these chunk shapes then you will have to read the
same data repeatedly, which can be expensive.  Note though that storage
formats often choose chunk sizes that are much smaller than is ideal for Dask,
closer to 1 MB than 100 MB.  In these cases you should choose a Dask chunk
size that aligns with the storage chunk size, such that every Dask chunk
dimension is a multiple of the corresponding storage chunk dimension.

So, for example, if we have an HDF5 file with chunks of size ``(128, 64)``, we
might choose a chunk shape of ``(1280, 6400)``.

.. code-block:: python

   >>> import h5py
   >>> storage = h5py.File('myfile.hdf5')['x']
   >>> storage.chunks
   (128, 64)

   >>> import dask.array as da
   >>> x = da.from_array(storage, chunks=(1280, 6400))

Note that if you provide ``chunks='auto'`` then Dask Array will look for a
``.chunks`` attribute and use that to provide a good chunking.
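
For example, a minimal sketch (reusing the hypothetical ``storage`` dataset
from above) that lets Dask pick a chunking aligned with the storage layout:

.. code-block:: python

   >>> x = da.from_array(storage, chunks='auto')
   >>> x.chunks  # each chunk dimension is typically a multiple of (128, 64)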


Avoid Oversubscribing Threads
-----------------------------

.. tip::

   When using the ``distributed`` scheduler, the ``OMP_NUM_THREADS``,
   ``MKL_NUM_THREADS``, and ``OPENBLAS_NUM_THREADS`` environment variables
   are automatically set to ``1`` when using :ref:`nanny` workers.
   This helps avoid oversubscribing threads in common cases.

By default Dask will run as many concurrent tasks as you have logical cores.
It assumes that each task will consume about one core.  However, many
array-computing libraries are themselves multi-threaded, which can cause
contention and low performance.  In particular the BLAS/LAPACK libraries that
back most of NumPy's linear algebra routines are often multi-threaded, and need
to be told explicitly to use only one thread.  You can do this with the
following environment variables (shown with the bash ``export`` command below,
though the syntax may vary depending on your operating system).

.. code-block:: bash

   export OMP_NUM_THREADS=1
   export MKL_NUM_THREADS=1
   export OPENBLAS_NUM_THREADS=1

You need to run this before you start your Python process for it to take effect.
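
Alternatively, a minimal sketch of setting these variables from inside Python;
this only works under the assumption that it runs before NumPy (and hence its
BLAS backend) is first imported:

.. code-block:: python

   import os

   # Must run before NumPy loads its BLAS backend, otherwise it has no effect
   os.environ["OMP_NUM_THREADS"] = "1"
   os.environ["MKL_NUM_THREADS"] = "1"
   os.environ["OPENBLAS_NUM_THREADS"] = "1"

   import numpy as np
   import dask.array as da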


Consider Xarray
---------------

The `Xarray <http://xarray.pydata.org/en/stable/>`_ package wraps around Dask
Array, and so offers the same scalability, but also adds convenience when
dealing with complex datasets.  In particular Xarray can help with the
following:

1.  Manage multiple arrays together as a consistent dataset
2.  Read from a stack of HDF or NetCDF files at once
3.  Switch between Dask Array and NumPy with a consistent API
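
For example, a minimal sketch (the file pattern and variable name here are
hypothetical) of opening a stack of NetCDF files as Dask-backed arrays:

.. code-block:: python

   import xarray as xr

   # Open many NetCDF files as one dataset; each variable is a Dask array
   ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100})

   # Computations stay lazy until you call .compute() or .load()
   monthly_mean = ds["temperature"].groupby("time.month").mean()
   result = monthly_mean.compute()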

Xarray is used in a wide range of fields, including physics, astronomy,
geoscience, microscopy, bioinformatics, engineering, finance, and deep learning.
Xarray also has a thriving user community that is good at providing support.


Build your own Operations
-------------------------

Often we want to perform computations for which there is no exact function
in Dask Array.  In these cases we may be able to use some of the more generic
functions to build our own.  These include:

.. currentmodule:: dask.array

.. autosummary::
   blockwise
   map_blocks
   map_overlap
   reduction

These functions may help you apply a function written for NumPy arrays onto
larger Dask arrays.
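
For example, a minimal sketch using ``map_blocks`` to apply a plain NumPy
function to every block of a Dask array (the function itself is just a
placeholder):

.. code-block:: python

   import numpy as np
   import dask.array as da

   x = da.random.random((10000, 10000), chunks=(1000, 1000))

   def double_and_clip(block):
       # Ordinary NumPy code that runs on one in-memory block at a time
       return np.clip(block * 2, 0.0, 1.0)

   y = x.map_blocks(double_and_clip, dtype=x.dtype)  # lazy, block-by-block
   result = y.sum().compute()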