.. _roadmap:

Development roadmap
===================

Authors: Xarray developers

Date: September 7, 2021

Xarray is an open source Python library for labeled multidimensional
arrays and datasets.

Our philosophy
--------------

Why has xarray been successful? In our opinion:

-  Xarray does a great job of solving **specific use-cases** for
   multidimensional data analysis:

   -  The dominant use-case for xarray is the analysis of gridded
      datasets in the geosciences, e.g., as part of the
      `Pangeo <https://pangeo.io>`__ project.
   -  Xarray is also used more broadly in the physical sciences, where
      we've found the needs for analyzing multidimensional datasets are
      remarkably consistent (e.g., see
      `SunPy <https://github.com/sunpy/ndcube>`__ and
      `PlasmaPy <https://github.com/PlasmaPy/PlasmaPy/issues/59>`__).
   -  Finally, xarray is used in a variety of other domains, including
      finance, `probabilistic
      programming <https://arviz-devs.github.io/arviz/>`__ and
      genomics.

-  Xarray is also a **domain agnostic** solution:

   -  We focus on providing a flexible set of functionality related to
      labeled multidimensional arrays, rather than solving particular
      problems.
   -  This facilitates collaboration between users with different needs,
      and helps us attract a broad community of contributors.
   -  Importantly, this retains flexibility for use cases that don't
      fit particularly well into existing frameworks.

-  Xarray **integrates well** with other libraries in the scientific
   Python stack.

   -  We leverage first-class external libraries for core features of
      xarray (e.g., NumPy for ndarrays, pandas for indexing, dask for
      parallel computing).
   -  We expose our internal abstractions to users (e.g.,
      ``apply_ufunc()``), which facilitates extending xarray in various
      ways.

Together, these features have made xarray a first-class choice for
labeled multidimensional arrays in Python.

We want to double down on xarray's strengths by making it an even more
flexible and powerful tool for multidimensional data analysis. We want
to continue to engage xarray's core geoscience users, and to also reach
out to new domains to learn from other successful data models like those
of `yt <https://yt-project.org>`__ or the `OLAP
cube <https://en.wikipedia.org/wiki/OLAP_cube>`__.

Specific needs
--------------

The user community has voiced a number of specific needs related to how
xarray interfaces with domain-specific problems. Xarray may not solve
all of these issues directly, but these areas provide opportunities for
xarray to provide better, more extensible, interfaces. Some examples of
these common needs are:

-  Non-regular grids (e.g., staggered and unstructured meshes).
-  Physical units.
-  Lazily computed arrays (e.g., for coordinate systems).
-  New file-formats.

Technical vision
----------------

We think the right approach to extending xarray's user community and the
usefulness of the project is to focus on improving key interfaces that
can be used externally to meet domain-specific needs.

We can generalize the community's needs into four main categories:

-  More flexible grids/indexing.
-  More flexible arrays/computing.
-  More flexible storage backends.
-  More flexible data structures.

Each of these is detailed further in the subsections below.

Flexible indexes
~~~~~~~~~~~~~~~~

.. note::
   Work on flexible grids and indexes is currently underway. See
   `GH Project #1 <https://github.com/pydata/xarray/projects/1>`__ for more detail.

Xarray currently keeps track of indexes associated with coordinates by
storing them in the form of a ``pandas.Index`` in special
``xarray.IndexVariable`` objects.

The limitations of this model became clear with the addition of
``pandas.MultiIndex`` support in xarray 0.9, where a single index
corresponds to multiple xarray variables. MultiIndex support is highly
useful, but xarray now has numerous special cases to check for
MultiIndex levels.

A cleaner model would be to elevate ``indexes`` to an explicit part of
xarray's data model, e.g., as attributes on the ``Dataset`` and
``DataArray`` classes. Indexes would need to be propagated along with
coordinates in xarray operations, but would no longer need to have
a one-to-one correspondence with coordinate variables. Instead, an index
should be able to refer to multiple (possibly multidimensional)
coordinates that define it. See :issue:`1603` for full details.

Specific tasks:

-  Add an ``indexes`` attribute to ``xarray.Dataset`` and
   ``xarray.DataArray``, as dictionaries that map from coordinate names to
   xarray index objects.
-  Use the new index interface to write wrappers for ``pandas.Index``,
   ``pandas.MultiIndex`` and ``scipy.spatial.KDTree``.
-  Expose the interface externally to allow third-party libraries to
   implement custom indexing routines, e.g., for geospatial look-ups on
   the surface of the Earth.

In addition to the new features it directly enables, this cleanup will
allow xarray to more easily implement some long-awaited features that
build upon indexing, such as groupby operations with multiple variables.
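
The data model described in this section can be illustrated with a small toy
sketch. It does not use xarray's actual classes: ``ToyIndex`` and
``build_indexes`` are hypothetical names introduced only for this example.
The sketch shows an ``indexes`` dictionary keyed by coordinate name in which a
single index object is shared by several coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyIndex:
    """Hypothetical stand-in for an xarray index object (not xarray's real API)."""

    kind: str
    coord_names: tuple  # every coordinate this index is built from


def build_indexes(*indexes):
    """Map each coordinate name to the index object that covers it."""
    mapping = {}
    for index in indexes:
        for name in index.coord_names:
            if name in mapping:
                raise ValueError(f"coordinate {name!r} already has an index")
            mapping[name] = index
    return mapping


time_index = ToyIndex("pandas.Index", ("time",))
# One spatial index defined by two coordinates.
spatial_index = ToyIndex("KDTree", ("lat", "lon"))

indexes = build_indexes(time_index, spatial_index)

# Both coordinate names resolve to the same index object, so there is no
# one-to-one correspondence between coordinates and indexes.
shared = indexes["lat"] is indexes["lon"]
unique_indexes = {id(ix) for ix in indexes.values()}
```

Here three coordinate names map to only two distinct index objects, which is
the relationship that the special-cased MultiIndex handling had to emulate.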

Flexible arrays
~~~~~~~~~~~~~~~

.. note::
   Work on flexible arrays is currently underway. See
   `GH Project #2 <https://github.com/pydata/xarray/projects/2>`__ for more detail.

Xarray currently supports wrapping multidimensional arrays defined by
NumPy, dask and, to a limited extent, pandas. It would be nice to have
interfaces that allow xarray to wrap alternative N-D array
implementations, e.g.:

-  Arrays holding physical units.
-  Lazily computed arrays.
-  Other ndarray objects, e.g., sparse, xnd, xtensor.

Our strategy has been to pursue upstream improvements in NumPy (see
`NEP-22 <https://numpy.org/neps/nep-0022-ndarray-duck-typing-overview.html>`__)
for supporting a complete duck-typing interface using NumPy's
higher-level array API. Improvements in NumPy's support for custom data
types would also be highly useful for xarray users.

By pursuing these improvements in NumPy we hope to extend the benefits
to the full scientific Python community, and avoid tight coupling
between xarray and specific third-party libraries (e.g., for
implementing units). This will allow xarray to maintain its domain
agnostic strengths.

We expect that we may eventually add some minimal interfaces in xarray
for features that we delegate to external array libraries (e.g., for
getting units and changing units). If we do add these features, we
expect them to be thin wrappers, with core functionality implemented by
third-party libraries.
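
A minimal sketch of what such a thin wrapper could look like is given below.
All names here (``ToyQuantity``, ``ToyLabeledArray``, ``convert_units``) are
hypothetical and are not part of xarray or of any units library. The point is
only that the labeled container forwards unit requests to the wrapped array,
which owns the actual logic.

```python
class ToyQuantity:
    """Hypothetical third-party array type that carries physical units."""

    def __init__(self, values, units):
        self.values = list(values)
        self.units = units

    def to(self, new_units, factor):
        # The conversion logic lives in the array library, not in the wrapper.
        return ToyQuantity([v * factor for v in self.values], new_units)


class ToyLabeledArray:
    """Hypothetical labeled container; it only forwards unit requests."""

    def __init__(self, dims, data):
        self.dims = tuple(dims)
        self.data = data

    @property
    def units(self):
        # Duck typing: any wrapped object exposing ``units`` works.
        return getattr(self.data, "units", None)

    def convert_units(self, new_units, factor):
        if not hasattr(self.data, "to"):
            raise TypeError("wrapped array does not support unit conversion")
        return ToyLabeledArray(self.dims, self.data.to(new_units, factor))


distance = ToyLabeledArray(("x",), ToyQuantity([1.0, 2.5], "km"))
in_metres = distance.convert_units("m", 1000.0)

# A plain list has no units, and the wrapper does not invent any.
plain = ToyLabeledArray(("x",), [1.0, 2.5])
```

Because the wrapper relies on duck typing rather than on a specific units
package, swapping in a different array library would not require changes to
the container.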

Flexible storage
~~~~~~~~~~~~~~~~

.. note::
   Work on flexible storage backends is currently underway. See
   `GH Project #3 <https://github.com/pydata/xarray/projects/3>`__ for more detail.

The xarray backends module has grown in size and complexity. Much of
this growth has been "organic" and mostly to support incremental
additions to the supported backends. This has left us with a fragile
internal API that is difficult for even experienced xarray developers to
use. Moreover, the lack of a public-facing API for building xarray
backends means that users cannot easily build backend interfaces for
xarray in third-party libraries.

The idea of refactoring the backends API and exposing it to users was
originally proposed in :issue:`1970`. The idea would be to develop a
well-tested and generic backend base class and associated utilities
for external use. Specific tasks for this development would include:

-  Exposing an abstract backend for writing new storage systems.
-  Exposing utilities for features like automatic closing of files,
   LRU-caching and explicit/lazy indexing.
-  Possibly moving some infrequently used backends to third-party
   packages.
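
The first two tasks can be sketched in plain Python as follows. This is an
illustration under stated assumptions, not xarray's backend API:
``ToyBackend``, ``LRUFileCache``, ``InMemoryStore`` and ``InMemoryBackend`` are
hypothetical names, and the "files" are in-memory stand-ins so that the
example stays self-contained.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class ToyBackend(ABC):
    """Hypothetical abstract backend; not xarray's actual backend API."""

    @abstractmethod
    def open_store(self, path):
        """Return an object with a ``close()`` method for ``path``."""


class LRUFileCache:
    """Keep at most ``maxsize`` stores open, closing the least recently used."""

    def __init__(self, backend, maxsize=2):
        self.backend = backend
        self.maxsize = maxsize
        self._open = OrderedDict()

    def acquire(self, path):
        if path in self._open:
            self._open.move_to_end(path)  # mark as most recently used
            return self._open[path]
        store = self.backend.open_store(path)
        self._open[path] = store
        if len(self._open) > self.maxsize:
            _, evicted = self._open.popitem(last=False)
            evicted.close()  # automatic closing of the evicted store
        return store


class InMemoryStore:
    """Stand-in for an open file handle."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        self.closed = True


class InMemoryBackend(ToyBackend):
    """A concrete backend only has to implement ``open_store``."""

    def open_store(self, path):
        return InMemoryStore(path)


cache = LRUFileCache(InMemoryBackend(), maxsize=2)
a = cache.acquire("a.nc")
b = cache.acquire("b.nc")
cache.acquire("a.nc")      # "a.nc" becomes the most recently used entry
c = cache.acquire("c.nc")  # exceeds maxsize, so "b.nc" is evicted and closed
```

Keeping the caching utility separate from the abstract base class means a
third-party backend would only need to describe how to open its own storage
format.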

Flexible data structures
~~~~~~~~~~~~~~~~~~~~~~~~

Xarray provides two primary data structures, the ``xarray.DataArray`` and
the ``xarray.Dataset``. This section describes two possible data model
extensions.

Tree-like data structure
++++++++++++++++++++++++

.. note::

   After some time, the community DataTree project has now been updated and
   merged into xarray exposing :py:class:`xarray.DataTree`. This is just
   released and a bit experimental, but please try it out and let us know what
   you think. Take a look at our :ref:`quick-overview-datatrees` quickstart.

Xarray’s highest-level object was previously an ``xarray.Dataset``, whose data
model echoes that of a single netCDF group. However, real-world datasets are
often better represented by a collection of related Datasets. Particularly common
examples include:

-  Multi-resolution datasets,
-  Collections of time series datasets with differing lengths,
-  Heterogeneous datasets comprising multiple different types of related
   observational or simulation data,
-  Bayesian workflows involving various statistical distributions over multiple
   variables,
-  Whole netCDF files containing multiple groups,
-  Comparison of output from many similar models (such as in the IPCC's Coupled Model Intercomparison Projects).

A new tree-like data structure, ``xarray.DataTree``, which is essentially a
structured hierarchical collection of Datasets, represents these cases and
instead maps to multiple netCDF groups (see :issue:`4118`).

Currently there are several libraries which have wrapped xarray in order to
build domain-specific data structures (e.g. `xarray-multiscale
<https://github.com/JaneliaSciComp/xarray-multiscale>`__), but the general
``xarray.DataTree`` object obviates the need for these and consolidates effort
in a single domain-agnostic tool, much as xarray has already achieved.
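
The idea of a structured hierarchical collection of Datasets can be
illustrated with a toy sketch. It deliberately avoids the real
``xarray.DataTree`` class: ``ToyNode`` and its ``get`` and ``walk`` methods are
hypothetical, and plain dictionaries stand in for Datasets. Paths follow the
netCDF-style group convention of slash-separated names.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical tree node: one dataset-like mapping plus named children."""

    dataset: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)

    def get(self, path):
        """Look up a node by a netCDF-style group path such as '/a/b'."""
        node = self
        for name in path.strip("/").split("/"):
            if name:
                node = node.children[name]
        return node

    def walk(self, prefix=""):
        """Yield (path, node) pairs for this node and all descendants."""
        yield prefix or "/", self
        for name, child in self.children.items():
            yield from child.walk(f"{prefix}/{name}")


# A multi-resolution collection: each group has its own grid size.
root = ToyNode(
    dataset={"title": "multi-resolution example"},
    children={
        "coarse": ToyNode(dataset={"temperature": [280.0, 281.5]}),
        "fine": ToyNode(dataset={"temperature": [280.0, 280.7, 281.1, 281.5]}),
    },
)

paths = [path for path, _ in root.walk()]
fine_size = len(root.get("/fine").dataset["temperature"])
```

Each node keeps its own variables, so the two resolutions can differ in
length without being forced onto a shared dimension.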


Labeled array without coordinates
+++++++++++++++++++++++++++++++++

There is a need for a lightweight array structure with named dimensions for
convenient indexing and broadcasting. Xarray includes such a structure internally
(``xarray.Variable``). We want to factor out xarray's “Variable” object into a
standalone package with minimal dependencies for integration with libraries that
don't want to inherit xarray's dependency on pandas (e.g. scikit-learn).
The new “Variable” class will follow established array protocols and the new
data-apis standard. It will be capable of wrapping multiple array-like objects
(e.g. NumPy, Dask, Sparse, Pint, CuPy, PyTorch). While “DataArray” fits some of
these requirements, it offers a more complex data model than is desired for
many applications and depends on pandas.
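
To make "named dimensions for convenient broadcasting" concrete, the
following toy function computes the dimensions and shape that result when
operands are matched by dimension name rather than by position.
``broadcast_shapes_by_name`` is a hypothetical helper written for this
example; it works on shapes only and assumes that dimensions with the same
name must have the same size.

```python
def broadcast_shapes_by_name(*operands):
    """Return the (dims, shape) that named-dimension broadcasting would produce.

    Each operand is a (dims, shape) pair. Dimensions are matched by name,
    not by position, and appear in order of first occurrence.
    """
    sizes = {}
    for dims, shape in operands:
        if len(dims) != len(shape):
            raise ValueError("dims and shape must have the same length")
        for name, size in zip(dims, shape):
            # setdefault records the first size seen for each name.
            if sizes.setdefault(name, size) != size:
                raise ValueError(f"conflicting sizes for dimension {name!r}")
    return tuple(sizes), tuple(sizes.values())


# A (time, x) array combined with an (x, y) array: "x" is aligned by name.
dims, shape = broadcast_shapes_by_name(
    (("time", "x"), (3, 4)),
    (("x", "y"), (4, 5)),
)
```

With positional broadcasting, shapes ``(3, 4)`` and ``(4, 5)`` would be
incompatible. Matching on names yields dimensions ``("time", "x", "y")`` with
shape ``(3, 4, 5)``.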

Engaging more users
-------------------

.. note::
   Work on improving xarray’s documentation and user engagement is
   currently underway. See `GH Project #4 <https://github.com/pydata/xarray/projects/4>`__
   for more detail.

Like many open-source projects, the documentation of xarray has grown
together with the library's features. While we think that the xarray
documentation is comprehensive already, we acknowledge that the adoption
of xarray might be slowed down because of the substantial time
investment required to learn its working principles. In particular,
non-computer scientists or users less familiar with the pydata ecosystem
might find it difficult to learn xarray and realize how xarray can help
them in their daily work.

In order to lower this adoption barrier, we propose to:

-  Develop entry-level tutorials for users with different backgrounds. For
   example, we would like to develop tutorials for users with or without
   previous knowledge of pandas, NumPy, netCDF, etc. These tutorials may be
   built as part of xarray's documentation or included in a separate repository
   to enable interactive use (e.g. mybinder.org).
-  Document typical user workflows on a dedicated website, following the example
   of `dask-stories
   <https://matthewrocklin.com/blog/work/2018/07/16/dask-stories>`__.
-  Write a basic glossary that defines terms that might not be familiar to all
   (e.g. "lazy", "labeled", "serialization", "indexing", "backend").


Administrative
--------------

NumFOCUS
~~~~~~~~

On July 16, 2018, Joe and Stephan submitted xarray's fiscal sponsorship
application to NumFOCUS.