File: deploying-cloud.rst

package info (click to toggle)
dask 2024.12.1%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 20,024 kB
  • sloc: python: 105,182; javascript: 1,917; makefile: 159; sh: 88
file content (131 lines) | stat: -rw-r--r-- 4,797 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
Cloud
=====

There are a variety of ways to deploy Dask on the cloud.
Cloud providers offer managed services,
like VMs, Kubernetes, Yarn, or custom APIs with which Dask can connect easily.

.. grid:: 1 1 2 2

   .. grid-item::
      :columns: 7

      Some common deployment options you may want to consider are:

      -   A commercial Dask deployment option like `Coiled <https://coiled.io?utm_source=dask-docs&utm_medium=deploying-cloud>`_
          to handle the creation and management of Dask clusters on AWS, GCP, and Azure.
      -   A managed Kubernetes service and Dask's
          :doc:`Kubernetes integration <deploying-kubernetes>`.
      -   Directly launching cloud resources such as VMs or containers via a cluster manager with
          `Dask Cloud Provider <https://cloudprovider.dask.org/en/latest/>`_.
      -   A managed Yarn service,
          like `Amazon EMR <https://aws.amazon.com/emr/>`_
          or `Google Cloud DataProc <https://cloud.google.com/dataproc/>`_
          and `Dask-Yarn <https://yarn.dask.org>`_
          (specific documentation for the popular Amazon EMR service can be found
          `here <https://yarn.dask.org/en/latest/aws-emr.html>`_.)


   .. grid-item::
      :columns: 5
      :child-align: center

      .. figure:: images/cloud-provider-logos.svg

Cloud Deployment Examples
-------------------------

.. tab-set::

   .. tab-item:: Coiled

        `Coiled <https://coiled.io?utm_source=dask-docs&utm_medium=deploying-cloud>`_
        deploys managed Dask clusters on AWS, GCP, and Azure. It's free for most users and
        has several features that address common
        :doc:`deployment pain points <deployment-considerations>` like:

        - Easy to use API
        - Automatic software synchronization
        - Easy access to any cloud hardware (like GPUs) in any region
        - Robust logging, cost controls, and metrics collection

        .. code-block:: python

            >>> import coiled
            >>> cluster = coiled.Cluster(
            ...     n_workers=100,             # Size of cluster
            ...     region="us-west-2",        # Same region as data
            ...     vm_type="m6i.xlarge",      # Hardware of your choosing
            ... )
            >>> client = cluster.get_client()
        
        Coiled is recommended for deploying Dask on the cloud.
        Though there are non-commercial, open source options like
        Dask Cloud Provider, Dask-Gateway, and Dask-Yarn that are also available
        (see :ref:`cloud deployment options <cloud-deployment-options>`
        for additional options.)


   .. tab-item:: Dask Cloud Provider

        Using `Dask Cloud Provider <https://cloudprovider.dask.org/en/latest/>`_ to launch a cluster of
        VMs on a platform like `DigitalOcean <https://www.digitalocean.com/>`_ can be as convenient as
        launching a local cluster.

        .. code-block:: python

            >>> import dask.config
            >>> dask.config.set({"cloudprovider.digitalocean.token": "yourAPItoken"})
            >>> from dask_cloudprovider.digitalocean import DropletCluster
            >>> cluster = DropletCluster(n_workers=1)
            Creating scheduler instance
            Created droplet dask-38b817c1-scheduler
            Waiting for scheduler to run
            Scheduler is running
            Creating worker instance
            Created droplet dask-38b817c1-worker-dc95260d

        Many of the cluster managers in Dask Cloud Provider work by launching VMs with a startup script
        that pulls down the :doc:`Dask Docker image <deploying-docker>` and runs Dask components within that container.
        As with all cluster managers the VM resources, Docker image, etc are all configurable.

        You can then connect a client and work with the cluster as if it were on your local machine.

        .. code-block:: python

            >>> client = cluster.get_client()

Data Access
-----------

In addition to deploying Dask clusters on the cloud, most cloud users will also want
to access cloud-hosted data on their respective cloud provider.

We recommend installing additional libraries (listed below) for easy data access on your cloud provider.
See :doc:`how-to/connect-to-remote-data` for more information.

.. tab-set::

   .. tab-item:: AWS

    Use `s3fs <https://s3fs.readthedocs.io/>`_ for accessing data on Amazon's S3.

    .. code-block:: bash

        pip install s3fs

   .. tab-item:: GCP

    Use `gcsfs <https://gcsfs.readthedocs.io/>`_ for accessing data on Google's GCS.

    .. code-block:: bash

        pip install gcsfs

   .. tab-item:: Azure

    Use `adlfs <https://github.com/dask/adlfs/>`_ for accessing data on Microsoft's Data Lake or Blob Storage.

    .. code-block:: bash

        pip install adlfs