File: spec.rst

package info (click to toggle)
dask 2024.12.1%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 20,024 kB
  • sloc: python: 105,182; javascript: 1,917; makefile: 159; sh: 88
file content (153 lines) | stat: -rw-r--r-- 3,967 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
Specification
=============

Dask is a specification to encode a graph -- specifically, a directed 
acyclic graph of tasks with data dependencies -- using ordinary Python data 
structures, namely dicts, tuples, functions, and arbitrary Python
values. 


Definitions
-----------

A **Dask graph** is a dictionary mapping **keys** to **computations**:

.. code-block:: python

   {'x': 1,
    'y': 2,
    'z': (add, 'x', 'y'),
    'w': (sum, ['x', 'y', 'z']),
    'v': [(sum, ['w', 'z']), 2]}

A **key** is a str, bytes, int, float, or tuple thereof:

.. code-block:: python

   'x'
   ('x', 2, 3)

A **task** is a tuple with a callable first element.  Tasks represent atomic
units of work meant to be run by a single worker.  Example: 

.. code-block:: python

   (add, 'x', 'y')

We represent a task as a tuple such that the *first element is a callable
function* (like ``add``), and the succeeding elements are *arguments* for that
function. An *argument* may be any valid **computation**.

A **computation** may be one of the following:

1.  Any **key** present in the Dask graph like ``'x'``
2.  Any other value like ``1``, to be interpreted literally
3.  A **task** like ``(inc, 'x')`` (see below)
4.  A list of **computations**, like ``[1, 'x', (inc, 'x')]``

So all of the following are valid **computations**:

.. code-block:: python

   np.array([...])
   (add, 1, 2)
   (add, 'x', 2)
   (add, (inc, 'x'), 2)
   (sum, [1, 2])
   (sum, ['x', (inc, 'x')])
   (np.dot, np.array([...]), np.array([...]))
   [(sum, ['x', 'y']), 'z']

To encode keyword arguments, we recommend the use of ``functools.partial`` or
``toolz.curry``.


What functions should expect
----------------------------

In cases like ``(add, 'x', 'y')``, functions like ``add`` receive concrete
values instead of keys.  A Dask scheduler replaces keys (like ``'x'`` and ``'y'``) with
their computed values (like ``1``, and ``2``) *before* calling the ``add`` function.


Entry Point - The ``get`` function
----------------------------------

The ``get`` function serves as entry point to computation for all
:doc:`schedulers <scheduler-overview>`.  This function gets the value
associated to the given key.  That key may refer to stored data, as is the case
with ``'x'``, or to a task, as is the case with ``'z'``.  In the latter case,
``get`` should perform all necessary computation to retrieve the computed
value.

.. _scheduler: scheduler-overview.rst

.. code-block:: python

   >>> from dask.threaded import get

   >>> from operator import add

   >>> dsk = {'x': 1,
   ...        'y': 2,
   ...        'z': (add, 'x', 'y'),
   ...        'w': (sum, ['x', 'y', 'z'])}

.. code-block:: python

   >>> get(dsk, 'x')
   1

   >>> get(dsk, 'z')
   3

   >>> get(dsk, 'w')
   6

Additionally, if given a ``list``, get should simultaneously acquire values for
multiple keys:

.. code-block:: python

   >>> get(dsk, ['x', 'y', 'z'])
   [1, 2, 3]

Because we accept lists of keys as keys, we support nested lists:

.. code-block:: python

   >>> get(dsk, [['x', 'y'], ['z', 'w']])
   [[1, 2], [3, 6]]

Internally ``get`` can be arbitrarily complex, calling out to distributed
computing, using caches, and so on.


Why use tuples
--------------

With ``(add, 'x', 'y')``, we wish to encode the result of calling ``add`` on
the values corresponding to the keys ``'x'`` and ``'y'``.

We intend the following meaning:

.. code-block:: python

   add('x', 'y')  # after x and y have been replaced

But this will err because Python executes the function immediately
before we know values for ``'x'`` and ``'y'``.

We delay the execution by moving the opening parenthesis one term to the left,
creating a tuple:

.. code::

    Before: add( 'x', 'y')
    After: (add, 'x', 'y')

This lets us store the desired computation as data that we can analyze using
other Python code, rather than cause immediate execution.

LISP users will identify this as an s-expression, or as a rudimentary form of
quoting.