File: faq.rst

FAQ
===

**Question**: *Is Dask appropriate for adoption within a larger institutional context?*

**Answer**: *Yes.* Dask is used within the world's largest banks, national labs,
retailers, technology companies, and government agencies.  It is used in highly
secure environments.  It is used in conservative institutions as well as
fast-moving ones.

This page contains Frequently Asked Questions and concerns from institutions and users
when they first investigate Dask.

.. contents:: :local:

For Management
--------------

Briefly, what problem does Dask solve for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dask is a general purpose parallel programming solution.
As such it is used in *many* different ways.

However, the most common problem that Dask solves is connecting Python analysts
to distributed hardware, particularly for data science and machine learning
workloads.  The institutions for whom Dask has the greatest
impact are those who have a large body of Python users who are accustomed to
libraries like NumPy, Pandas, Jupyter, Scikit-Learn and others, but want to
scale those workloads across a cluster.  Often they also have distributed
computing resources that are going underused.

Dask removes both technological and cultural barriers to connect Python users
to computing resources in a way that is native to both the users and IT.

"*Help me scale my notebook onto the cluster*" is a common pain point for
institutions today, and it is a common entry point for Dask usage.


Is Dask mature?  Why should we trust it?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes.  While Dask itself is relatively new (it began in 2015) it is built by the
NumPy, Pandas, Jupyter, Scikit-Learn developer community, which is well trusted.
Dask is a relatively thin wrapper on top of these libraries and,
as a result, the project can be relatively small and simple.
It doesn't reinvent a whole new system.

Additionally, this tight integration with the broader technology stack
gives substantial benefits long term.  For example:

-   Because Pandas maintainers also maintain Dask,
    when Pandas issues a new release Dask issues a release at the same time to
    ensure continuity and compatibility.

-   Because Scikit-Learn maintainers maintain and use Dask when they train on
    large clusters, you can be assured that Dask-ML focuses on pragmatic and
    important solutions like XGBoost integration and hyper-parameter selection,
    and that the integration between the two feels natural for novice and
    expert users alike.

-   Because Jupyter maintainers also maintain Dask,
    powerful Jupyter technologies like JupyterHub and JupyterLab are designed
    with Dask's needs in mind, and new features are pushed quickly to provide a
    first class and modern user experience.

Additionally, Dask is maintained both by a broad community of maintainers
and through substantial institutional support (several full-time employees each)
from Anaconda, the company behind the leading data science distribution, and
NVIDIA, the leading hardware manufacturer of GPUs.  Despite large corporate
support, Dask remains a community-governed project, and is fiscally sponsored by
NumFOCUS, the same 501(c)(3) that fiscally sponsors NumPy, Pandas, Jupyter, and many others.


Who else uses Dask?
~~~~~~~~~~~~~~~~~~~

Dask is used by individual researchers in practically every field today.  It
has millions of downloads per month and is integrated into many PyData
software packages.

On an *institutional* level Dask is used by analytics and research groups in a
similarly broad set of domains, across both energetic startups and large,
conservative household names.  A web search shows articles by Capital One,
Barclays, Walmart, NASA, Los Alamos National Laboratory, and hundreds of
other similar institutions.


How does Dask compare with Apache Spark?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*This question has longer and more technical coverage* :doc:`here <spark>`.

Dask and Apache Spark are similar in that they both ...

-  Promise easy parallelism for data science Python users
-  Provide Dataframe and ML APIs for ETL, data science, and machine learning
-  Scale out to similar scales, around 1-1000 machines

Dask differs from Apache Spark in a few ways:

-  Dask is Python-native, while Spark is Scala/JVM-native with Python bindings.

   Python users may find Dask more comfortable,
   but Dask is only useful for Python users,
   while Spark can also be used from JVM languages.

-  Dask is one component in the broader Python ecosystem alongside libraries
   like Numpy, Pandas, and Scikit-Learn,
   while Spark is an all-in-one system that re-invents much of the Python world
   in a single package.

   This means that it's often easier to compose Dask with new problem domains,
   but also that you need to install multiple things (like Dask and Pandas or
   Dask and Numpy) rather than just having everything in an all-in-one solution.

-  Apache Spark focuses strongly on traditional business intelligence workloads,
   like ETL, SQL queries, and then some lightweight machine learning,
   while Dask is more general purpose.

   This means that Dask is much more flexible and can handle other problem
   domains like multi-dimensional arrays, GIS, advanced machine learning, and
   custom systems, but that it is less focused and less tuned on typical SQL
   style computations.

   If you mostly want to focus on SQL queries then Spark is probably a better
   bet.  If you want to support a wide variety of custom workloads then Dask
   might be more natural.

See the section :doc:`spark`.
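This composition can be seen in miniature with Dask Array, which wraps NumPy
rather than replacing it.  A small sketch:

```python
import numpy as np
import dask.array as da

# Wrap an existing NumPy array, splitting it into 2x2 blocks;
# each block is itself an ordinary NumPy array
x = da.from_array(np.arange(16).reshape(4, 4), chunks=(2, 2))

# Familiar NumPy-style operations run block-wise in parallel
total = int((x + 1).sum().compute())  # 136
```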


Are there companies that we can get support from?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are several companies that offer support for Dask in different capacities.  See
`Paid support <https://docs.dask.org/en/latest/support.html#paid-support>`_ for a full list.


For IT
------


How would I set up Dask on institutional hardware?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You already have cluster resources.
Dask can run on them today without significant change.

Most institutional clusters today have a resource manager.
This is typically managed by IT, with some mild permissions given to users to
launch jobs.  Dask works with all major resource managers today, including
those on Hadoop, HPC, Kubernetes, and Cloud clusters.

1.  **Hadoop/Spark**: If you have a Hadoop/Spark cluster, such as one purchased
    through Cloudera/Hortonworks/MapR, then you will likely want to deploy Dask
    with YARN, the resource manager that deploys services like Hadoop, Spark,
    Hive, and others.

    To help with this, you'll likely want to use `Dask-Yarn <https://yarn.dask.org>`_.

2.  **HPC**: If you have an HPC machine that runs resource managers like SGE,
    SLURM, PBS, LSF, Torque, Condor, or other job batch queuing systems, then
    users can launch Dask on these systems today using either:

    - `Dask Jobqueue <https://jobqueue.dask.org>`_, which uses typical
      ``qsub``, ``sbatch``, ``bsub``, or other submission tools in interactive
      settings.
    - `Dask MPI <https://mpi.dask.org>`_, which uses MPI for deployment in
      batch settings.

    For more information, see :doc:`deploying-hpc`.

3.  **Kubernetes/Cloud**: Newer clusters may employ Kubernetes for deployment.
    This is particularly common today on major cloud providers,
    all of which offer hosted Kubernetes as a service.  People today use Dask
    on Kubernetes using either of the following:

    - **Helm**: an easy way to stand up a long-running Dask cluster and
      Jupyter notebook

    - **Dask-Kubernetes**: for native Kubernetes integration for fast moving
      or ephemeral deployments.

    For more information, see :doc:`deploying-kubernetes`.

4. **Commercial Dask deployment:**

   - You can use `Coiled <https://coiled.io?utm_source=dask-docs&utm_medium=faq>`_ to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).
   - `Domino Data Lab <https://www.dominodatalab.com/>`_ lets users create Dask clusters in a hosted platform.
   - `Saturn Cloud <https://saturncloud.io/>`_ lets users create Dask clusters in a hosted platform or within their own AWS accounts.


Is Dask secure?
~~~~~~~~~~~~~~~

Dask is deployed today within highly secure institutions,
including major financial, healthcare, and government organizations.

That being said it's worth noting that, by its very nature, Dask enables the
execution of arbitrary user code on a large set of machines. Care should be
taken to isolate, authenticate, and govern access to these machines.  Fortunately,
your institution likely already does this and uses standard technologies like
SSL/TLS, Kerberos, and other systems with which Dask can integrate.


Do I need to purchase a new cluster?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No.  It is easy to run Dask today on most clusters.
If you have a pre-existing HPC or Spark/Hadoop cluster then that will be fine
to start running Dask.

You can start using Dask without any capital expenditure.


How do I manage users?
~~~~~~~~~~~~~~~~~~~~~~

Dask doesn't manage users; you likely have existing systems that do this well.
In a large institutional setting we assume that you already have a resource
manager like Yarn (Hadoop), Kubernetes, or PBS/SLURM/SGE/LSF/..., each of which
has excellent user management capabilities that are likely preferred by your
IT department anyway.

Dask is designed to operate with user-level permissions, which means that
your data science users should be able to ask those systems mentioned above for
resources, and have their processes tracked accordingly.

However, there are institutions where analyst-level users aren't given direct access to
the cluster.  This is particularly common in Cloudera/Hortonworks Hadoop/Spark deployments.
In these cases some level of explicit indirection may be required.  For this, we
recommend the `Dask Gateway project <https://gateway.dask.org>`_, which uses IT-level
permissions to properly route authenticated users into secure resources.

You may also want to consider a managed cluster solution (see :ref:`managed-cluster-solutions`).


How do I manage software environments?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This depends on your cluster resource manager:

-  Most HPC users use their network file system
-  Hadoop/Spark/Yarn users package their environment into a tarball and ship it
   around with HDFS (Dask-Yarn integrates with `Conda Pack
   <https://conda.github.io/conda-pack/>`_ for this capability)
-  Kubernetes or Cloud users use Docker images

In each case Dask integrates with existing processes and technologies
that are well understood and familiar to the institution.


How does Dask communicate data between machines?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dask usually communicates over TCP, using msgpack for small administrative
messages, and its own protocol for efficiently passing around large data.
The scheduler and each worker host their own TCP server, making Dask a
distributed peer-to-peer network that uses point-to-point communication.
We do not use Spark-style shuffle systems.  We do not use MPI-style
collectives.  Everything is direct point-to-point.

For high performance networks you can use either TCP-over-InfiniBand for about
1 GB/s bandwidth, or UCX (experimental) for full-speed communication.


Are deployments long running, or ephemeral?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We see both, but ephemeral deployments are more common.

Most Dask use today is about enabling data science or data engineering users to
scale their interactive workloads across the cluster.
These are typically either interactive sessions with Jupyter, or batch scripts
that run at a pre-defined time.  In both cases, the user asks the resource
manager for a bunch of machines, does some work, and then gives up those
machines.

Some institutions also use Dask in an always-on fashion, either handling
real-time traffic in a scalable way, or responding to a broad set of
interactive users with large datasets that it keeps resident in memory.


For Users
---------

Will Dask "just work" on our existing code?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No, you will need to make modifications,
but these modifications are usually small.

The vast majority of lines of business logic within your institution
will not have to change, assuming that they are in Python and use tooling like
Numpy, Pandas and Scikit-Learn.

How well does Dask scale?  What are Dask's limitations?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The largest Dask deployments that we see today are on around 1000 multi-core
machines, perhaps 20,000 cores in total, but these are rare.
Most institutional-level problems (1-100 TB) are well solved by deployments of 10-50 nodes.

Technically, the back-of-the-envelope number to keep in mind is that each task
(an individual Python function call) in Dask has an overhead of around *200
microseconds*.  So if these tasks take 1 second each, then Dask can saturate
around 5000 cores before scheduling overhead dominates costs.  As workloads
reach this limit they are encouraged to use larger chunk sizes to compensate.
The *vast majority* of institutional users though do not reach this limit.
For more information you may want to peruse our :doc:`best practices
<best-practices>`.
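The back-of-the-envelope arithmetic above (each Python function call becomes
one scheduled task) can be seen with ``dask.delayed``.  A toy sketch:

```python
import dask

@dask.delayed
def inc(i):
    # Each call becomes one task in the graph, each carrying
    # roughly 200 microseconds of scheduling overhead
    return i + 1

# 100 tasks; because each task is cheap, a real workload this small
# would be better served by fewer, larger chunks of work
total = dask.delayed(sum)([inc(i) for i in range(100)]).compute()  # 5050
```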

Is Dask resilient?  What happens when a machine goes down?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yes, Dask is resilient to the failure of worker nodes.  It knows how it came to
any result, and can replay the necessary work on other machines if one goes
down.

If Dask's centralized scheduler goes down then you would need to resubmit the
computation.  This is a fairly standard level of resiliency today, shared with
other tooling like Apache Spark, Flink, and others.

The resource managers that host Dask, like Yarn or Kubernetes, typically
provide long-term 24/7 resilience for always-on operation.

Is the API exactly the same as NumPy/Pandas/Scikit-Learn?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No, but it's very close.  That being said, your data scientists will still
have to learn some things.

What we find is that the Numpy/Pandas/Scikit-Learn APIs aren't the challenge
when institutions adopt Dask.  When API inconsistencies do exist, even
modestly skilled programmers are able to understand why and work around them
without much pain.

Instead, the challenge is building intuition around parallel performance.
We've all built up a mental model for what is fast and slow on a single
machine.  This model changes when we factor in network communication and
parallel algorithms, and the performance that we get for familiar operations
can be surprising.

Our main solution to build this intuition, other than
accumulated experience, is Dask's :doc:`Diagnostic Dashboard
<dashboard>`.
The dashboard delivers a ton of visual feedback to users as they are running
their computation to help them understand what is going on.  This both helps
them to identify and resolve immediate bottlenecks, and also builds up that
parallel performance intuition surprisingly quickly.


How much performance tuning does Dask require?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*Some other systems are notoriously hard to tune for optimal performance.
What is Dask's story here?  How many knobs are there that we need to be aware
of?*

Like the rest of the Python ecosystem, Dask puts a lot of effort into
having sane defaults.  Dask workers automatically detect available memory and
cores, and choose sensible defaults that are decent in most situations.  Dask
algorithms similarly provide decent choices by default, and informative warnings
when tricky situations arise, so that, in common cases, things should be fine.

The most common knobs to tune include the following:

-   The thread/process mixture to deal with GIL-holding computations (which are
    rare in Numpy/Pandas/Scikit-Learn workflows)
-   Partition size, such as whether you should have 100 MB chunks or 1 GB chunks

That being said, almost no institution's needs are met entirely by the common
case, and given the variety of problems that people throw at Dask,
exceptional problems are commonplace.
In these cases we recommend watching the dashboard during execution to see what
is going on.  It can commonly inform you what's going wrong, so that you can
make changes to your system.


What data formats does Dask support?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because Dask builds on NumPy and Pandas, it supports most formats that they
support, which is most formats.
That being said, not all formats are well suited for
parallel access.  In general people using the following formats are usually
pretty happy:

-  **Tabular:** Parquet, ORC, CSV, Line Delimited JSON, Avro, text
-  **Arrays:** HDF5, NetCDF, Zarr, GRIB

More generally, if you have a Python function that turns a chunk of your stored
data into a Pandas dataframe or Numpy array then Dask can probably call that
function many times without much effort.

For groups looking for advice on which formats to use, we recommend Parquet
for tables and Zarr or HDF5 for arrays.


Does Dask have a SQL interface?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Dask supports various ways to communicate with SQL databases, some
requiring extra packages to be installed; see the section
:doc:`dataframe-sql`.


Does Dask work on GPUs?
~~~~~~~~~~~~~~~~~~~~~~~

Yes! Dask works with GPUs in a few ways.

The `RAPIDS <https://rapids.ai>`_ libraries provide a GPU-accelerated
Pandas-like library,
`cuDF <https://github.com/rapidsai/cudf>`_,
which interoperates well and is tested against Dask DataFrame.

`Chainer's CuPy <https://cupy.chainer.org/>`_ library provides a GPU
accelerated NumPy-like library that interoperates nicely with Dask Array.

For custom workflows people use Dask alongside GPU-accelerated libraries like PyTorch and
TensorFlow to manage workloads across several machines.  They typically use
Dask's custom APIs, notably :doc:`Delayed <delayed>` and :doc:`Futures
<futures>`.

See the section :doc:`gpu`.

For Marketing
-------------

There is a special subsite dedicated to addressing marketing concerns. You can
find it at `dask.org/brand-guide <https://dask.org/brand-guide>`_.

Where can I find logos?
~~~~~~~~~~~~~~~~~~~~~~~

You can find them at :doc:`logos`.