Development Guidelines
======================

Dask is a community maintained project.  We welcome contributions in the form
of bug reports, documentation, code, design proposals, and more.
This page provides resources on how best to contribute.

Where to ask for help
---------------------

Dask conversation happens in the following places:

1.  `StackOverflow #dask tag`_: for usage questions
2.  `Github Issue Tracker`_: for discussions around new features or established bugs
3.  `Gitter chat`_: for real-time discussion

For usage questions and bug reports we strongly prefer StackOverflow and
Github issues over Gitter chat.  Github and StackOverflow are more easily
searchable by future users, which makes them a more efficient use of
everyone's time.  Gitter chat is generally reserved for community discussion.

.. _`StackOverflow #dask tag`: http://stackoverflow.com/questions/tagged/dask
.. _`Github Issue Tracker`: https://github.com/dask/dask/issues/
.. _`Gitter chat`: https://gitter.im/dask/dask


Separate Code Repositories
--------------------------

Dask maintains code and documentation in a few git repositories hosted on the
Github ``dask`` organization, http://github.com/dask.  This includes the primary
repository and several other repositories for different components.  A
non-exhaustive list follows:

*  http://github.com/dask/dask: The main code repository holding parallel
   algorithms, the single-machine scheduler, and most documentation.
*  http://github.com/dask/distributed: The distributed memory scheduler
*  http://github.com/dask/dask-ml: Machine learning algorithms
*  http://github.com/dask/s3fs: S3 Filesystem interface
*  http://github.com/dask/gcsfs: GCS Filesystem interface
*  http://github.com/dask/hdfs3: Hadoop Filesystem interface
*  ...

Git and Github can be challenging at first.  Fortunately good materials exist
on the internet.  Rather than repeat these materials here we refer you to
Pandas' documentation and links on this subject at
http://pandas.pydata.org/pandas-docs/stable/contributing.html


Issues
------

The community discusses and tracks known bugs and potential features in the
`Github Issue Tracker`_.  If you have a new idea or have identified a bug then
you should raise it there to start public discussion.

If you are looking for an introductory issue to get started with development,
check out the `introductory label`_, which collects issues that are well
suited to new developers.  Familiarity with Python, NumPy, Pandas, and some
parallel computing is generally assumed.

.. _`introductory label`: https://github.com/dask/dask/issues?q=is%3Aissue+is%3Aopen+label%3Aintroductory


Development Environment
-----------------------

Download code
~~~~~~~~~~~~~

Clone the main dask git repository (or whichever repository you are working on)::

   git clone git@github.com:dask/dask.git


Install
~~~~~~~

You may want to install larger dependencies like NumPy and Pandas using a
binary package manager, like conda_.  You can skip this step if you already
have these libraries, don't care to use them, or have sufficient build
environment on your computer to compile them when installing with ``pip``::

   conda install -y numpy pandas scipy bokeh

.. _conda: http://conda.pydata.org/docs/

Install dask and dependencies::

   cd dask
   pip install -e .[complete]

For development dask uses the following additional dependencies::

   pip install pytest moto mock


Run Tests
~~~~~~~~~

Dask uses py.test_ for testing.  You can run tests from the main dask directory
as follows::

   py.test dask --verbose

.. _py.test: http://pytest.org/latest/


Contributing to Code
--------------------

Dask maintains development standards that are similar to most PyData projects.  These standards include
language support, testing, documentation, and style.

Python Versions
~~~~~~~~~~~~~~~

Dask supports Python versions 2.7, 3.4, 3.5, and 3.6 in a single codebase.
Name changes are handled by the :file:`dask/compatibility.py` file.
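
As a rough sketch of what handling name changes in one place can look like,
the module below picks the right import for the running interpreter and
re-exports it under a single name.  The names ``PY3``, ``string_types`` and
``is_text`` are illustrative assumptions and are not claimed to match the
actual contents of :file:`dask/compatibility.py`.

.. code-block:: python

   import sys

   # True on Python 3.x, False on Python 2.7
   PY3 = sys.version_info[0] >= 3

   if PY3:
       # Python 3 locations and names
       from urllib.parse import urlsplit
       string_types = (str,)
   else:
       # Python 2 locations and names
       from urlparse import urlsplit
       string_types = (basestring,)  # noqa: F821 (only defined on Python 2)


   def is_text(obj):
       """Return True if ``obj`` is a string type on this Python version"""
       return isinstance(obj, string_types)

The rest of the codebase can then import ``urlsplit`` or ``string_types`` from
this one module rather than branching on the Python version at every call
site.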

Test
~~~~

Dask employs extensive unit tests to ensure correctness of code both for today
and for the future.  Test coverage is expected for all code contributions.

Tests are written in a py.test style with bare functions.

.. code-block:: python

   import pytest

   def test_fibonacci():
       # Known values
       assert fib(0) == 0
       assert fib(1) == 1
       assert fib(10) == 55
       # Defining recurrence
       assert fib(8) == fib(7) + fib(6)

       # Invalid inputs should fail loudly
       for x in [-3, 'cat', 1.5]:
           with pytest.raises(ValueError):
               fib(x)

These tests should strike a good balance between covering all branches and
failure cases and running quickly (slow test suites get run less often).
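
For reference, a function that would satisfy a test like the one above might
look like the following.  This ``fib`` is a hypothetical helper written only
for illustration; it is not part of Dask.

.. code-block:: python

   def fib(n):
       """Return the n-th Fibonacci number, with fib(0) == 0 and fib(1) == 1"""
       # Reject anything that is not a non-negative integer.  bool is a
       # subclass of int, so it is excluded explicitly.
       if isinstance(n, bool) or not isinstance(n, int) or n < 0:
           raise ValueError("n must be a non-negative integer, got %r" % (n,))

       # Iterate rather than recurse so that the function stays fast
       a, b = 0, 1
       for _ in range(n):
           a, b = b, a + b
       return a

Validating the input up front keeps the failure cases cheap to test, and the
iterative loop keeps the test itself fast.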

You can run tests locally by running ``py.test`` in the local dask directory::

   py.test dask --verbose

You can also test certain modules or individual tests for faster response::

   py.test dask/dataframe --verbose

   py.test dask/dataframe/tests/test_dataframe_core.py::test_set_index

Tests run automatically on the Travis CI and AppVeyor continuous testing
services on every push to every pull request on GitHub.

Tests are organized within the various modules' subdirectories::

    dask/array/tests/test_*.py
    dask/bag/tests/test_*.py
    dask/dataframe/tests/test_*.py
    dask/diagnostics/tests/test_*.py

For Dask collections like dask.array and dask.dataframe, behavior is
typically tested directly against the NumPy or Pandas libraries using the
``assert_eq`` functions:

.. code-block:: python

   import numpy as np
   import dask.array as da
   from dask.array.utils import assert_eq

   def test_aggregations():
       nx = np.random.random(100)
       dx = da.from_array(nx, chunks=(10,))

       assert_eq(nx.sum(), dx.sum())
       assert_eq(nx.min(), dx.min())
       assert_eq(nx.max(), dx.max())
       ...

This technique helps to ensure compatibility with upstream libraries, and tends
to be simpler than testing correctness directly.  Additionally, by passing Dask
collections directly to the ``assert_eq`` function rather than calling compute
manually, the testing suite is able to run a number of checks on the lazy
collections themselves.
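
The same pattern carries over to dask.dataframe.  The sketch below assumes
that ``pandas`` and ``dask`` are installed and uses the ``assert_eq`` helper
from ``dask.dataframe.utils``; the test name and the data are made up for
illustration.

.. code-block:: python

   import pandas as pd
   import dask.dataframe as dd
   from dask.dataframe.utils import assert_eq

   def test_dataframe_matches_pandas():
       # The Pandas object is the reference; the Dask object is under test
       pdf = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                           'y': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
       ddf = dd.from_pandas(pdf, npartitions=3)

       # Pass the lazy Dask results directly, without calling compute
       assert_eq(pdf.x.sum(), ddf.x.sum())
       assert_eq(pdf.y.mean(), ddf.y.mean())
       assert_eq(pdf[pdf.x > 2], ddf[ddf.x > 2])

Using several partitions, as here, exercises the code paths that combine
results across partitions, which a single-partition collection would not.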


Docstrings
~~~~~~~~~~

User facing functions should roughly follow the numpydoc_ standard, including
sections for ``Parameters``, ``Examples`` and general explanatory prose.

By default examples will be doc-tested.  Reproducible examples in documentation
are valuable both for testing and, more importantly, for communicating common
usage to the user.  Documentation trumps testing in this case, and clear
examples should take precedence over using the docstring as testing space.
To skip a test in the examples add the comment ``# doctest: +SKIP`` directly
after the line.

.. code-block:: python

   def fib(i):
       """ A single line with a brief explanation

       A more thorough description of the function, consisting of multiple
       lines or paragraphs.

       Parameters
       ----------
       i: int
            A short description of the argument if not immediately clear

       Examples
       --------
       >>> fib(4)
       3
       >>> fib(5)
       5
       >>> fib(6)
       8
       >>> fib(-1)  # Robust to bad inputs
       ValueError(...)
       """

.. _numpydoc: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
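
To see how ``# doctest: +SKIP`` behaves, the sketch below runs the examples
in a single docstring with the standard library's ``doctest`` module.  The
``mean`` function and the ``run_examples`` helper are hypothetical and exist
only for this illustration; the second example is skipped because it would be
slow to run.

.. code-block:: python

   import doctest


   def mean(values):
       """ Arithmetic mean of a non-empty sequence

       Examples
       --------
       >>> mean([1, 2, 3])
       2.0
       >>> mean(range(10 ** 9))  # doctest: +SKIP
       499999999.5
       """
       values = list(values)
       return sum(values) / float(len(values))


   def run_examples(func):
       """ Run the doctest examples found in ``func.__doc__`` """
       parser = doctest.DocTestParser()
       # Give the examples a namespace in which ``func`` is available
       test = parser.get_doctest(func.__doc__, {func.__name__: func},
                                 func.__name__, None, 0)
       runner = doctest.DocTestRunner(verbose=False)
       return runner.run(test)  # result exposes a ``failed`` count


   results = run_examples(mean)

The skipped example still documents the intended result for readers, but the
runner never executes it.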

Docstrings are currently tested under Python 3.6 on Travis CI.  You can test
docstrings with pytest as follows::

   py.test dask --doctest-modules

Docstring testing requires graphviz to be installed. This can be done via::

   conda install -y graphviz


Style
~~~~~

Dask verifies style uniformity with the ``flake8`` tool::

   pip install flake8
   flake8 dask


Changelog
~~~~~~~~~

Every significant code contribution should be listed in the
:doc:`changelog` under the corresponding version.  When submitting a Pull
Request on Github, please add an entry to that file explaining what was added
or modified.


Contributing to Documentation
-----------------------------

Dask uses Sphinx_ for documentation, hosted on http://readthedocs.org.
Documentation is maintained in the reStructuredText markup language (``.rst``
files) in ``dask/docs/source``.  The documentation consists of both prose
and API documentation.

To build the documentation locally, first install requirements::

   cd docs/
   pip install -r requirements-docs.txt

Then build documentation with ``make``::

   make html

The resulting HTML files end up in the ``build/html`` directory.

You can now make edits to rst files and run ``make html`` again to update
the affected pages.

.. _Sphinx: http://www.sphinx-doc.org/