FAQ
===
**Question**: *Is Dask appropriate for adoption within a larger institutional context?*
**Answer**: *Yes.* Dask is used within the world's largest banks, national labs,
retailers, technology companies, and government agencies. It is used in highly
secure environments. It is used in conservative institutions as well as fast
moving ones.
This page contains Frequently Asked Questions and concerns from institutions and users
when they first investigate Dask.
.. contents:: :local:
For Management
--------------
Briefly, what problem does Dask solve for us?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dask is a general purpose parallel programming solution.
As such it is used in *many* different ways.
However, the most common problem that Dask solves is connecting Python analysts
to distributed hardware, particularly for data science and machine learning
workloads. The institutions for whom Dask has the greatest
impact are those who have a large body of Python users who are accustomed to
libraries like NumPy, Pandas, Jupyter, Scikit-Learn and others, but want to
scale those workloads across a cluster. Often they also have distributed
computing resources that are going underused.
Dask removes both technological and cultural barriers to connect Python users
to computing resources in a way that is native to both the users and IT.
"*Help me scale my notebook onto the cluster*" is a common pain point for
institutions today, and it is a common entry point for Dask usage.
Is Dask mature? Why should we trust it?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes. While Dask itself is relatively new (it began in 2015), it is built by the
NumPy, Pandas, Jupyter, Scikit-Learn developer community, which is well trusted.
Dask is a relatively thin wrapper on top of these libraries and,
as a result, the project can be relatively small and simple.
It doesn't reinvent a whole new system.
Additionally, this tight integration with the broader technology stack
gives substantial benefits long term. For example:
- Because Pandas maintainers also maintain Dask,
when Pandas issues a new release, Dask issues a release at the same time to
ensure continuity and compatibility.
- Because Scikit-Learn maintainers maintain and use Dask when they train on large clusters,
you can be assured that Dask-ML focuses on pragmatic and important
solutions like XGBoost integration and hyper-parameter selection,
and that the integration between the two feels natural for novice and
expert users alike.
- Because Jupyter maintainers also maintain Dask,
powerful Jupyter technologies like JupyterHub and JupyterLab are designed
with Dask's needs in mind, and new features are pushed quickly to provide a
first class and modern user experience.
Additionally, Dask is maintained both by a broad community of maintainers and
with substantial institutional support (several full-time employees each) from
both Anaconda, the company behind the leading data science distribution, and
NVIDIA, the leading manufacturer of GPUs. Despite this corporate
support, Dask remains a community-governed project, and is fiscally sponsored by
NumFOCUS, the same 501(c)(3) that fiscally sponsors NumPy, Pandas, Jupyter, and many others.
Who else uses Dask?
~~~~~~~~~~~~~~~~~~~
Dask is used by individual researchers in practically every field today. It
has millions of downloads per month and is integrated into many PyData
software packages.
On an *institutional* level, Dask is used by analytics and research groups in a
similarly broad set of domains, by both energetic startups and large,
conservative household names. A web search shows articles by Capital One,
Barclays, Walmart, NASA, Los Alamos National Laboratories, and hundreds of
other similar institutions.
How does Dask compare with Apache Spark?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*This question has longer and more technical coverage* :doc:`here <spark>`
Dask and Apache Spark are similar in that they both ...
- Promise easy parallelism for data science Python users
- Provide Dataframe and ML APIs for ETL, data science, and machine learning
- Scale out to similar scales, around 1-1000 machines
Dask differs from Apache Spark in a few ways:
- Dask is Python native, while Spark is Scala/JVM native with Python bindings.
Python users may find Dask more comfortable,
but Dask is only useful for Python users,
while Spark can also be used from JVM languages.
- Dask is one component in the broader Python ecosystem alongside libraries
like Numpy, Pandas, and Scikit-Learn,
while Spark is an all-in-one system that re-invents much of the Python world
in a single package.
This means that it's often easier to compose Dask with new problem domains,
but also that you need to install multiple things (like Dask and Pandas or
Dask and Numpy) rather than just having everything in an all-in-one solution.
- Apache Spark focuses strongly on traditional business intelligence workloads,
like ETL, SQL queries, and lightweight machine learning,
while Dask is more general purpose.
This means that Dask is much more flexible and can handle other problem
domains like multi-dimensional arrays, GIS, advanced machine learning, and
custom systems, but that it is less focused on and less tuned for typical
SQL-style computations.
If you mostly want to focus on SQL queries then Spark is probably a better
bet. If you want to support a wide variety of custom workloads then Dask
might be more natural.
See the section :doc:`spark`.
Are there companies that we can get support from?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are several companies that offer support for Dask in different capacities. See
`Paid support <https://docs.dask.org/en/latest/support.html#paid-support>`_ for a full list.
For IT
------
How would I set up Dask on institutional hardware?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You already have cluster resources.
Dask can run on them today without significant change.
Most institutional clusters today have a resource manager.
This is typically managed by IT, with some mild permissions given to users to
launch jobs. Dask works with all major resource managers today, including
those on Hadoop, HPC, Kubernetes, and Cloud clusters.
1. **Hadoop/Spark**: If you have a Hadoop/Spark cluster, such as one purchased
through Cloudera/Hortonworks/MapR, then you will likely want to deploy Dask
with YARN, the resource manager that deploys services like Hadoop, Spark,
Hive, and others.
To help with this, you'll likely want to use `Dask-Yarn <https://yarn.dask.org>`_.
2. **HPC**: If you have an HPC machine that runs resource managers like SGE,
SLURM, PBS, LSF, Torque, Condor, or other job batch queuing systems, then
users can launch Dask on these systems today using either:
- `Dask Jobqueue <https://jobqueue.dask.org>`_ , which uses typical
``qsub``, ``sbatch``, ``bsub`` or other submission tools in interactive
settings.
- `Dask MPI <https://mpi.dask.org>`_ which uses MPI for deployment in
batch settings
For more information see :doc:`deploying-hpc`; a brief example appears after this list
3. **Kubernetes/Cloud**: Newer clusters may employ Kubernetes for deployment.
This is particularly common today on major cloud providers,
all of which provide hosted Kubernetes as a service. People today use Dask
on Kubernetes using either of the following:
- **Helm**: an easy way to stand up a long-running Dask cluster and
Jupyter notebook
- **Dask-Kubernetes**: for native Kubernetes integration for fast moving
or ephemeral deployments.
For more information see :doc:`deploying-kubernetes`
4. **Commercial Dask deployment:**
- You can use `Coiled <https://coiled.io?utm_source=dask-docs&utm_medium=faq>`_ to handle the creation and management of Dask clusters on cloud computing environments (AWS and GCP).
- `Domino Data Lab <https://www.dominodatalab.com/>`_ lets users create Dask clusters in a hosted platform.
- `Saturn Cloud <https://saturncloud.io/>`_ lets users create Dask clusters in a hosted platform or within their own AWS accounts.
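Whichever route you choose, the user-facing pattern is similar: create a cluster object for your system, scale it, and connect a client. As a rough illustration of the HPC route (option 2 above), a minimal sketch with Dask-Jobqueue might look like the following; the queue name, walltime, and resource amounts are placeholders that depend on your site's configuration.

.. code-block:: python

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # Each "job" asks SLURM for one Dask worker; values are illustrative
    cluster = SLURMCluster(cores=8, memory="32GB",
                           queue="regular", walltime="01:00:00")
    cluster.scale(jobs=10)    # submit ten worker jobs via sbatch

    client = Client(cluster)  # connect and start submitting work

Equivalent cluster classes exist for PBS, LSF, SGE, and other job schedulers.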
Is Dask secure?
~~~~~~~~~~~~~~~
Dask is deployed today within highly secure institutions,
including major financial, healthcare, and government organizations.
That being said, it's worth noting that, by its very nature, Dask enables the
execution of arbitrary user code on a large set of machines. Care should be
taken to isolate, authenticate, and govern access to these machines. Fortunately,
your institution likely already does this and uses standard technologies like
SSL/TLS, Kerberos, and other systems with which Dask can integrate.
Do I need to purchase a new cluster?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. It is easy to run Dask today on most clusters.
If you have a pre-existing HPC or Spark/Hadoop cluster, then that is a fine
place to start running Dask.
You can start using Dask without any capital expenditure.
How do I manage users?
~~~~~~~~~~~~~~~~~~~~~~
Dask doesn't manage users; you likely have existing systems that do this well.
In a large institutional setting we assume that you already have a resource
manager like Yarn (Hadoop), Kubernetes, or PBS/SLURM/SGE/LSF/..., each of which
has excellent user management capabilities that are likely preferred by your
IT department anyway.
Dask is designed to operate with user-level permissions, which means that
your data science users should be able to ask those systems mentioned above for
resources, and have their processes tracked accordingly.
However, there are institutions where analyst-level users aren't given direct access to
the cluster. This is particularly common in Cloudera/Hortonworks Hadoop/Spark deployments.
In these cases some level of explicit indirection may be required. For this, we
recommend the `Dask Gateway project <https://gateway.dask.org>`_, which uses IT-level
permissions to properly route authenticated users into secure resources.
You may also want to consider a managed cluster solution (see :ref:`managed-cluster-solutions`).
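If you do adopt Dask Gateway, a user's session might look roughly like the minimal sketch below. The gateway address and authentication details are site-specific placeholders.

.. code-block:: python

    from dask_gateway import Gateway

    # Placeholder address; your IT team provides the real one
    gateway = Gateway("https://dask-gateway.example.com")

    cluster = gateway.new_cluster()  # resources are requested under your identity
    cluster.scale(20)                # ask for twenty workers
    client = cluster.get_client()    # use like any other Dask client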
How do I manage software environments?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This depends on your cluster resource manager:
- Most HPC users use their network file system
- Hadoop/Spark/Yarn users package their environment into a tarball and ship it
around with HDFS (Dask-Yarn integrates with `Conda Pack
<https://conda.github.io/conda-pack/>`_ for this capability)
- Kubernetes or Cloud users use Docker images
In each case Dask integrates with existing processes and technologies
that are well understood and familiar to the institution.
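For example, a Hadoop/YARN user might ship a Conda environment with Conda Pack and Dask-Yarn. This is a minimal sketch; the environment name and resource amounts are placeholders.

.. code-block:: python

    # Package the environment once, from the shell:
    #   conda pack -n my-analysis-env -o environment.tar.gz

    from dask.distributed import Client
    from dask_yarn import YarnCluster

    # YARN distributes the archive to each worker and unpacks it there
    cluster = YarnCluster(environment="environment.tar.gz",
                          worker_vcores=2,
                          worker_memory="4GiB")
    cluster.scale(10)
    client = Client(cluster)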
How does Dask communicate data between machines?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dask usually communicates over TCP, using msgpack for small administrative
messages, and its own protocol for efficiently passing around large data.
The scheduler and each worker host their own TCP server, making Dask a
distributed peer-to-peer network that uses point-to-point communication.
We do not use Spark-style shuffle systems. We do not use MPI-style
collectives. Everything is direct point-to-point.
For high performance networks you can use either TCP-over-InfiniBand for about
1 GB/s of bandwidth, or UCX (experimental) for full-speed communication.
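From a user's perspective, connecting to a cluster just means pointing a client at the scheduler's address. A minimal sketch, with a placeholder hostname and the default scheduler port:

.. code-block:: python

    from dask.distributed import Client

    # Connect to a running scheduler over TCP (8786 is the default port)
    client = Client("tcp://scheduler-hostname:8786")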
Are deployments long running, or ephemeral?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We see both, but ephemeral deployments are more common.
Most Dask use today is about enabling data science or data engineering users to
scale their interactive workloads across the cluster.
These are typically either interactive sessions with Jupyter, or batch scripts
that run at a pre-defined time. In both cases, the user asks the resource
manager for a bunch of machines, does some work, and then gives up those
machines.
Some institutions also use Dask in an always-on fashion, either handling
real-time traffic in a scalable way, or responding to a broad set of
interactive users with large datasets that it keeps resident in memory.
For Users
---------
Will Dask "just work" on our existing code?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No, you will need to make modifications,
but these modifications are usually small.
The vast majority of lines of business logic within your institution
will not have to change, assuming that they are in Python and use tooling like
Numpy, Pandas and Scikit-Learn.
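As a small, hypothetical example of the kind of change involved, a Pandas script usually ports to Dask by swapping the import and adding a final ``.compute()`` call. The file paths and column names here are placeholders.

.. code-block:: python

    # Before: Pandas on a single machine
    import pandas as pd

    df = pd.read_csv("data/2021-01.csv")
    result = df.groupby("account").amount.mean()

    # After: the same logic with Dask DataFrame, now lazy and parallel
    import dask.dataframe as dd

    df = dd.read_csv("data/2021-*.csv")   # glob over many files
    result = df.groupby("account").amount.mean().compute()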
How well does Dask scale? What are Dask's limitations?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The largest Dask deployments that we see today are on around 1000 multi-core
machines, perhaps 20,000 cores in total, but these are rare.
Most institutional-level problems (1-100 TB) are well solved by deployments of 10-50 nodes.
Technically, the back-of-the-envelope number to keep in mind is that each task
(an individual Python function call) in Dask has an overhead of around *200
microseconds*. So if these tasks take 1 second each, then Dask can saturate
around 5000 cores before scheduling overhead dominates costs. As workloads
reach this limit they are encouraged to use larger chunk sizes to compensate.
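That estimate is simple arithmetic, sketched below with illustrative numbers:

.. code-block:: python

    overhead_per_task = 200e-6   # seconds of scheduling overhead per task
    task_duration = 1.0          # seconds of useful work per task

    # The scheduler can emit roughly 1 / overhead_per_task tasks per second,
    # so it can keep roughly this many cores busy:
    max_useful_cores = task_duration / overhead_per_task   # ~5,000

Larger chunks mean longer-running tasks, which raises this ceiling.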
The *vast majority* of institutional users though do not reach this limit.
For more information you may want to peruse our :doc:`best practices
<best-practices>`
Is Dask resilient? What happens when a machine goes down?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Yes, Dask is resilient to the failure of worker nodes. It knows how it came to
any result, and can replay the necessary work on other machines if one goes
down.
If Dask's centralized scheduler goes down then you would need to resubmit the
computation. This is a fairly standard level of resiliency today, shared with
other tools like Apache Spark and Flink.
The resource managers that host Dask, like Yarn or Kubernetes, typically
provide long-term 24/7 resilience for always-on operation.
Is the API exactly the same as NumPy/Pandas/Scikit-Learn?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No, but it's very close. That being said, your data scientists will still
have to learn some things.
What we find is that the Numpy/Pandas/Scikit-Learn APIs aren't the challenge
when institutions adopt Dask. When API inconsistencies do exist, even
modestly skilled programmers are able to understand why and work around them
without much pain.
Instead, the challenge is building intuition around parallel performance.
We've all built up a mental model for what is fast and slow on a single
machine. This model changes when we factor in network communication and
parallel algorithms, and the performance that we get for familiar operations
can be surprising.
Our main solution to build this intuition, other than
accumulated experience, is Dask's :doc:`Diagnostic Dashboard
<dashboard>`.
The dashboard delivers a ton of visual feedback to users as they are running
their computation to help them understand what is going on. This both helps
them to identify and resolve immediate bottlenecks, and also builds up that
parallel performance intuition surprisingly quickly.
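The dashboard is available whenever a cluster is running. For example, with a local test cluster:

.. code-block:: python

    from dask.distributed import Client

    client = Client()               # start a local cluster for experimentation
    print(client.dashboard_link)    # URL of the diagnostic dashboard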
How much performance tuning does Dask require?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*Some other systems are notoriously hard to tune for optimal performance.
What is Dask's story here? How many knobs are there that we need to be aware
of?*
Like the rest of the Python software ecosystem, Dask puts a lot of effort into
having sane defaults. Dask workers automatically detect available memory and
cores, and choose sensible defaults that are decent in most situations. Dask
algorithms similarly provide decent choices by default, and informative warnings
when tricky situations arise, so that, in common cases, things should be fine.
The most common knobs to tune include the following:
- The thread/process mixture to deal with GIL-holding computations (which are
rare in Numpy/Pandas/Scikit-Learn workflows)
- Partition size, such as whether you should have 100 MB chunks or 1 GB chunks
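For illustration, both knobs can be set in a couple of lines. This is a minimal sketch; the worker counts, file path, and partition size are placeholders.

.. code-block:: python

    from dask.distributed import Client
    import dask.dataframe as dd

    # Knob 1: thread/process mixture
    client = Client(n_workers=4, threads_per_worker=2)

    # Knob 2: partition size, here aiming for ~100 MB partitions
    df = dd.read_parquet("data/")
    df = df.repartition(partition_size="100MB")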
That being said, almost no institution's needs are met entirely by the common
case, and given the variety of problems that people throw at Dask,
exceptional problems are commonplace.
In these cases we recommend watching the dashboard during execution to see what
is going on. It can commonly inform you what's going wrong, so that you can
make changes to your system.
What data formats does Dask support?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because Dask builds on NumPy and Pandas, it supports most formats that they
support, which is most formats.
That being said, not all formats are well suited for
parallel access. In general people using the following formats are usually
pretty happy:
- **Tabular:** Parquet, ORC, CSV, Line Delimited JSON, Avro, text
- **Arrays:** HDF5, NetCDF, Zarr, GRIB
More generally, if you have a Python function that turns a chunk of your stored
data into a Pandas dataframe or Numpy array then Dask can probably call that
function many times without much effort.
For groups looking for advice on which formats to use, we recommend Parquet
for tables and Zarr or HDF5 for arrays.
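For instance, a sketch of reading both kinds of data (the locations are placeholders):

.. code-block:: python

    import dask.dataframe as dd
    import dask.array as da

    # Tabular: each Parquet file/row group becomes one DataFrame partition
    df = dd.read_parquet("s3://bucket/dataset/")

    # Arrays: Zarr chunks map naturally onto Dask array chunks
    x = da.from_zarr("data/array.zarr")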
Does Dask have a SQL interface?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dask supports various ways to communicate with SQL databases, some
requiring extra packages to be installed; see the section
:doc:`dataframe-sql`.
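For example, a table can be read in parallel directly from a database. This is a minimal sketch; the connection URI, table name, and index column are placeholders, and SQLAlchemy must be installed.

.. code-block:: python

    import dask.dataframe as dd

    # Partitions the table along the index column and reads them in parallel
    df = dd.read_sql_table("accounts",
                           "postgresql://user:password@host:5432/warehouse",
                           index_col="id")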
Does Dask work on GPUs?
~~~~~~~~~~~~~~~~~~~~~~~
Yes! Dask works with GPUs in a few ways.
The `RAPIDS <https://rapids.ai>`_ libraries provide a GPU-accelerated
Pandas-like library,
`cuDF <https://github.com/rapidsai/cudf>`_,
which interoperates well and is tested against Dask DataFrame.
`Chainer's CuPy <https://cupy.chainer.org/>`_ provides a GPU-accelerated
NumPy-like library that interoperates nicely with Dask Array.
For custom workflows people use Dask alongside GPU-accelerated libraries like PyTorch and
TensorFlow to manage workloads across several machines. They typically use
Dask's custom APIs, notably :doc:`Delayed <delayed>` and :doc:`Futures
<futures>`.
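As a small sketch, a CPU-backed Dask array can be moved onto GPUs chunk by chunk; this assumes CuPy and a CUDA-capable GPU are available.

.. code-block:: python

    import cupy
    import dask.array as da

    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    x = x.map_blocks(cupy.asarray)   # each chunk becomes a CuPy array on the GPU

    result = (x + x.T).mean().compute()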
See the section :doc:`gpu`.
For Marketing
-------------
There is a special subsite dedicated to addressing marketing concerns. You can
find it at `dask.org/brand-guide <https://dask.org/brand-guide>`_.
Where can I find logos?
~~~~~~~~~~~~~~~~~~~~~~~
You can find them at :doc:`logos`.