1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
|
Cloud
=====
There are a variety of ways to deploy Dask on the cloud.
Cloud providers offer managed services,
like VMs, Kubernetes, Yarn, or custom APIs with which Dask can connect easily.
.. grid:: 1 1 2 2
.. grid-item::
:columns: 7
Some common deployment options you may want to consider are:
- A commercial Dask deployment option like `Coiled <https://coiled.io?utm_source=dask-docs&utm_medium=deploying-cloud>`_
to handle the creation and management of Dask clusters on AWS, GCP, and Azure.
- A managed Kubernetes service and Dask's
:doc:`Kubernetes integration <deploying-kubernetes>`.
- Directly launching cloud resources such as VMs or containers via a cluster manager with
`Dask Cloud Provider <https://cloudprovider.dask.org/en/latest/>`_.
- A managed Yarn service,
like `Amazon EMR <https://aws.amazon.com/emr/>`_
or `Google Cloud DataProc <https://cloud.google.com/dataproc/>`_
and `Dask-Yarn <https://yarn.dask.org>`_
(specific documentation for the popular Amazon EMR service can be found
`here <https://yarn.dask.org/en/latest/aws-emr.html>`_.)
.. grid-item::
:columns: 5
:child-align: center
.. figure:: images/cloud-provider-logos.svg
Cloud Deployment Examples
-------------------------
.. tab-set::
.. tab-item:: Coiled
`Coiled <https://coiled.io?utm_source=dask-docs&utm_medium=deploying-cloud>`_
deploys managed Dask clusters on AWS, GCP, and Azure. It's free for most users and
has several features that address common
:doc:`deployment pain points <deployment-considerations>` like:
- Easy to use API
- Automatic software synchronization
- Easy access to any cloud hardware (like GPUs) in any region
- Robust logging, cost controls, and metrics collection
.. code-block:: python
>>> import coiled
>>> cluster = coiled.Cluster(
... n_workers=100, # Size of cluster
... region="us-west-2", # Same region as data
... vm_type="m6i.xlarge", # Hardware of your choosing
... )
>>> client = cluster.get_client()
Coiled is recommended for deploying Dask on the cloud.
Though there are non-commercial, open source options like
Dask Cloud Provider, Dask-Gateway, and Dask-Yarn that are also available
(see :ref:`cloud deployment options <cloud-deployment-options>`
for additional options.)
.. tab-item:: Dask Cloud Provider
Using `Dask Cloud Provider <https://cloudprovider.dask.org/en/latest/>`_ to launch a cluster of
VMs on a platform like `DigitalOcean <https://www.digitalocean.com/>`_ can be as convenient as
launching a local cluster.
.. code-block:: python
>>> import dask.config
>>> dask.config.set({"cloudprovider.digitalocean.token": "yourAPItoken"})
>>> from dask_cloudprovider.digitalocean import DropletCluster
>>> cluster = DropletCluster(n_workers=1)
Creating scheduler instance
Created droplet dask-38b817c1-scheduler
Waiting for scheduler to run
Scheduler is running
Creating worker instance
Created droplet dask-38b817c1-worker-dc95260d
Many of the cluster managers in Dask Cloud Provider work by launching VMs with a startup script
that pulls down the :doc:`Dask Docker image <deploying-docker>` and runs Dask components within that container.
As with all cluster managers the VM resources, Docker image, etc are all configurable.
You can then connect a client and work with the cluster as if it were on your local machine.
.. code-block:: python
>>> client = cluster.get_client()
Data Access
-----------
In addition to deploying Dask clusters on the cloud, most cloud users will also want
to access cloud-hosted data on their respective cloud provider.
We recommend installing additional libraries (listed below) for easy data access on your cloud provider.
See :doc:`how-to/connect-to-remote-data` for more information.
.. tab-set::
.. tab-item:: AWS
Use `s3fs <https://s3fs.readthedocs.io/>`_ for accessing data on Amazon's S3.
.. code-block:: bash
pip install s3fs
.. tab-item:: GCP
Use `gcsfs <https://gcsfs.readthedocs.io/>`_ for accessing data on Google's GCS.
.. code-block:: bash
pip install gcsfs
.. tab-item:: Azure
Use `adlfs <https://github.com/dask/adlfs/>`_ for accessing data on Microsoft's Data Lake or Blob Storage.
.. code-block:: bash
pip install adlfs
|