##############
Quick overview
##############

Here are some quick examples of what you can do with :py:class:`xarray.DataArray`
objects. Everything is explained in much more detail in the rest of the
documentation.

To begin, import numpy, pandas and xarray using their customary abbreviations:

.. jupyter-execute::

    import numpy as np
    import pandas as pd
    import xarray as xr

Create a DataArray
------------------

You can make a DataArray from scratch by supplying data in the form of a numpy
array or list, with optional *dimensions* and *coordinates*:

.. jupyter-execute::

    data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
    data

In this case, we have generated a 2D array, assigned the names *x* and *y* to the two dimensions respectively, and associated two *coordinate labels*, ``10`` and ``20``, with the two locations along the x dimension. If you supply a pandas :py:class:`~pandas.Series` or :py:class:`~pandas.DataFrame`, metadata is copied directly:

.. jupyter-execute::

    xr.DataArray(pd.Series(range(3), index=list("abc"), name="foo"))
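
As a reproducible variant of the constructor call above, the following sketch builds a small array from fixed values instead of random ones. The names ``temps``, ``city`` and ``day`` are purely illustrative and are not used elsewhere in this guide. Note that only one of the two dimensions is given coordinate labels:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    # Fixed values make the result reproducible (unlike np.random.randn).
    temps = xr.DataArray(
        np.array([[11.0, 12.5, 9.0], [20.0, 21.5, 19.0]]),
        dims=("city", "day"),
        coords={"city": ["Oslo", "Rome"]},  # labels only along "city"
        name="temps",
    )

    print(temps.dims)          # dimension names, in axis order
    print(temps.shape)         # one length per dimension
    print(list(temps.coords))  # "day" has no coordinate labels
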

Here are the key properties for a ``DataArray``:

.. jupyter-execute::

    # like in pandas, values is a numpy array that you can modify in-place
    data.values
    data.dims
    data.coords
    # you can use this dictionary to store arbitrary metadata
    data.attrs


Indexing
--------

Xarray supports four kinds of indexing. Since we have assigned coordinate labels to the x dimension we can use label-based indexing along that dimension just like pandas. The four examples below all yield the same result (the value at ``x=10``) but at varying levels of convenience and intuitiveness.

.. jupyter-execute::

    # positional and by integer label, like numpy
    data[0, :]

    # loc or "location": positional and coordinate label, like pandas
    data.loc[10]

    # isel or "integer select":  by dimension name and integer label
    data.isel(x=0)

    # sel or "select": by dimension name and coordinate label
    data.sel(x=10)


Unlike positional indexing, label-based indexing frees us from having to know how our array is organized. All we need to know are the dimension name and the label we wish to index, i.e., ``data.sel(x=10)`` works regardless of whether ``x`` is the first or second dimension of the array and regardless of whether ``10`` is the first or second element of ``x``. We have already told xarray that x is the first dimension when we created ``data``: xarray keeps track of this so we don't have to. For more, see :ref:`indexing`.
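
To make the point about dimension order concrete, here is a minimal sketch using a hypothetical array ``grid`` and a transposed copy ``flipped`` (both names are introduced only for this illustration). The positional expressions have to change with the axis order, while the ``sel`` call stays the same:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    grid = xr.DataArray(
        np.arange(6).reshape(2, 3),
        dims=("x", "y"),
        coords={"x": [10, 20]},
    )
    flipped = grid.transpose("y", "x")  # same data, "x" is now the second axis

    # Positional indexing depends on the axis order ...
    print(grid[0, :].values)
    print(flipped[:, 0].values)

    # ... while label-based selection is written the same way for both.
    print(grid.sel(x=10).values)
    print(flipped.sel(x=10).values)
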


Attributes
----------

While you're setting up your DataArray, it's often a good idea to set metadata attributes. A useful choice is to set ``data.attrs['long_name']`` and ``data.attrs['units']`` since xarray will use these, if present, to automatically label your plots. These special names were chosen following the `NetCDF Climate and Forecast (CF) Metadata Conventions <https://cfconventions.org/cf-conventions/cf-conventions.html>`_. ``attrs`` is just a Python dictionary, so you can assign anything you wish.

.. jupyter-execute::

    data.attrs["long_name"] = "random velocity"
    data.attrs["units"] = "metres/sec"
    data.attrs["description"] = "A random variable created as an example."
    data.attrs["random_attribute"] = 123
    data.attrs
    # you can add metadata to coordinates too
    data.x.attrs["units"] = "x units"


Computation
-----------

Data arrays work very similarly to numpy ndarrays:

.. jupyter-execute::

    data + 10
    np.sin(data)
    # transpose
    data.T
    data.sum()

However, aggregation operations can use dimension names instead of axis
numbers:

.. jupyter-execute::

    data.mean(dim="x")
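
For comparison, this small sketch computes the same reduction once by dimension name and once with a plain NumPy axis number. The array ``arr`` and its values are made up for the example:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    raw = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    arr = xr.DataArray(raw, dims=("x", "y"))

    by_name = arr.mean(dim="x")  # reduce over the dimension called "x"
    by_axis = raw.mean(axis=0)   # NumPy needs the axis position instead

    print(by_name.dims)          # only "y" remains
    print(by_name.values)
    print(np.allclose(by_name.values, by_axis))
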

Arithmetic operations broadcast based on dimension name. This means you don't
need to insert dummy dimensions for alignment:

.. jupyter-execute::

    a = xr.DataArray(np.random.randn(3), [data.coords["y"]])
    b = xr.DataArray(np.random.randn(4), dims="z")

    a
    b

    a + b

It also means that in most cases you do not need to worry about the order of
dimensions:

.. jupyter-execute::

    data - data.T
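
The two broadcasting behaviours described above can be checked with fixed values. This is a sketch with illustrative names (``row``, ``col``, ``square``), not part of the running example:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    row = xr.DataArray(np.array([1, 2, 3]), dims="y")
    col = xr.DataArray(np.array([10, 20, 30, 40]), dims="z")

    total = row + col  # no np.newaxis or reshape needed
    print(total.dims)
    print(total.shape)

    # Subtracting a transposed copy matches dimensions by name,
    # so every element of the difference is zero.
    square = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "y"))
    diff = square - square.T
    print(diff.dims)
    print(abs(diff).max().item())

With plain NumPy arrays of shapes ``(2, 3)`` and ``(3, 2)``, the last subtraction would fail because the shapes cannot be broadcast against each other.
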

Operations also align based on index labels:

.. jupyter-execute::

    data[:-1] - data[:1]
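
A sketch of label-based alignment with partially overlapping coordinates is shown below. It assumes xarray's default settings, in which arithmetic uses an inner join on index labels (the ``arithmetic_join`` option); the arrays ``left`` and ``right`` are invented for the illustration:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    left = xr.DataArray([1, 2, 3], dims="x", coords={"x": [10, 20, 30]})
    right = xr.DataArray([100, 200, 300], dims="x", coords={"x": [20, 30, 40]})

    # Only labels present in both operands (20 and 30) survive
    # the default alignment.
    summed = left + right
    print(summed.x.values)
    print(summed.values)
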

For more, see :ref:`compute`.

GroupBy
-------

Xarray supports grouped operations using a very similar API to pandas (see :ref:`groupby`):

.. jupyter-execute::

    labels = xr.DataArray(["E", "F", "E"], [data.coords["y"]], name="labels")
    labels
    data.groupby(labels).mean("y")
    data.groupby(labels).map(lambda x: x - x.min())
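
Because ``data`` holds random numbers, the grouped means above differ from run to run. The following sketch repeats the grouped mean on fixed values so the arithmetic can be checked by hand. The names ``obs`` and ``groups`` are hypothetical:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    obs = xr.DataArray(
        np.array([[1.0, 2.0, 5.0], [4.0, 5.0, 8.0]]),
        dims=("x", "y"),
        coords={"x": [10, 20], "y": [0, 1, 2]},
    )
    groups = xr.DataArray(
        ["E", "F", "E"], dims="y", coords={"y": [0, 1, 2]}, name="groups"
    )

    group_means = obs.groupby(groups).mean("y")
    print(group_means.sel(groups="E").values)  # mean of columns y=0 and y=2
    print(group_means.sel(groups="F").values)  # column y=1 on its own
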

Plotting
--------

Visualizing your datasets is quick and convenient:

.. jupyter-execute::

    data.plot()

Note the automatic labeling with names and units. Our effort in adding metadata attributes has paid off! Many aspects of these figures are customizable: see :ref:`plotting`.

pandas
------

Xarray objects can be easily converted to and from pandas objects using the :py:meth:`~xarray.DataArray.to_series`, :py:meth:`~xarray.DataArray.to_dataframe` and :py:meth:`~pandas.DataFrame.to_xarray` methods:

.. jupyter-execute::

    series = data.to_series()
    series

    # convert back
    series.to_xarray()
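
As a quick check of the round trip, this sketch converts a small, fully labelled array to a pandas ``Series`` and back, then compares the values. The array ``original`` and its name ``speed`` are assumptions made for the example:

.. jupyter-execute::

    import numpy as np
    import pandas as pd
    import xarray as xr

    original = xr.DataArray(
        np.arange(6.0).reshape(2, 3),
        dims=("x", "y"),
        coords={"x": [10, 20], "y": [1, 2, 3]},
        name="speed",
    )

    as_series = original.to_series()
    # A multi-dimensional array becomes a Series with one index level
    # per dimension.
    print(isinstance(as_series.index, pd.MultiIndex))
    print(list(as_series.index.names))

    restored = as_series.to_xarray()
    print(restored.dims)
    print(np.array_equal(restored.values, original.values))
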

Datasets
--------

:py:class:`xarray.Dataset` is a dict-like container of aligned ``DataArray``
objects. You can think of it as a multi-dimensional generalization of the
:py:class:`pandas.DataFrame`:

.. jupyter-execute::

    ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))
    ds


This creates a dataset with three DataArrays named ``foo``, ``bar`` and ``baz``. Use dictionary or dot indexing to pull out ``Dataset`` variables as ``DataArray`` objects but note that assignment only works with dictionary indexing:

.. jupyter-execute::

    ds["foo"]
    ds.foo
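
The difference between reading and assigning can be sketched as follows. The dataset ``demo`` and the variable names ``double_foo`` and ``triple_foo`` are hypothetical and exist only in this example:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    demo = xr.Dataset({"foo": ("x", [1.0, 2.0])}, coords={"x": [10, 20]})

    # Reading works with either style.
    print(demo["foo"].equals(demo.foo))

    # Adding a variable requires dictionary-style assignment.
    demo["double_foo"] = demo["foo"] * 2
    print(list(demo.data_vars))

    # Attribute-style assignment of a new variable is rejected.
    try:
        demo.triple_foo = demo["foo"] * 3
    except AttributeError:
        print("dot assignment raised AttributeError")
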


When creating ``ds``, we specified that ``foo`` is identical to ``data`` created earlier, ``bar`` is one-dimensional with single dimension ``x`` and associated values ``1`` and ``2``, and ``baz`` is a scalar not associated with any dimension in ``ds``. Variables in datasets can have different ``dtype`` and even different dimensions, but all dimensions are assumed to refer to points in the same shared coordinate system, i.e., if two variables have dimension ``x``, that dimension must be identical in both variables.

For example, when creating ``ds`` xarray automatically *aligns* ``bar`` with ``DataArray`` ``foo``, i.e., they share the same coordinate system so that ``ds.bar['x'] == ds.foo['x'] == ds['x']``. Consequently, the following works without explicitly specifying the coordinate ``x`` when creating ``ds['bar']``:

.. jupyter-execute::

    ds.bar.sel(x=10)



You can do almost everything you can do with ``DataArray`` objects with
``Dataset`` objects (including indexing and arithmetic) if you prefer to work
with multiple variables at once.

Read & write netCDF files
-------------------------

NetCDF is the recommended file format for xarray objects. Users
from the geosciences will recognize that the :py:class:`~xarray.Dataset` data
model looks very similar to a netCDF file (which, in fact, inspired it).

You can directly read and write xarray objects to disk using :py:meth:`~xarray.Dataset.to_netcdf`, :py:func:`~xarray.open_dataset` and
:py:func:`~xarray.open_dataarray`:

.. jupyter-execute::

    ds.to_netcdf("example.nc")
    reopened = xr.open_dataset("example.nc")
    reopened

.. jupyter-execute::
    :hide-code:

    import os

    reopened.close()
    os.remove("example.nc")
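
If you would rather not leave files behind while experimenting, a round trip through a temporary directory is one option. This is only a sketch: it assumes that a netCDF backend such as ``netCDF4``, ``h5netcdf`` or ``scipy`` is installed, and the file name ``sample.nc`` is arbitrary:

.. jupyter-execute::

    import os
    import tempfile

    import numpy as np
    import xarray as xr

    sample = xr.Dataset(
        {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
        coords={"x": [10, 20]},
    )

    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "sample.nc")  # arbitrary file name
        sample.to_netcdf(path)
        # The context manager closes the file again; load() reads the
        # data into memory before that happens.
        with xr.open_dataset(path) as on_disk:
            roundtrip = on_disk.load()

    print(roundtrip["foo"].dims)
    print(np.array_equal(roundtrip["foo"].values, sample["foo"].values))
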


It is common for datasets to be distributed across multiple files (often one file per timestep). Xarray supports this use case with the :py:func:`~xarray.open_mfdataset` and :py:func:`~xarray.save_mfdataset` functions. For more, see :ref:`io`.


.. _quick-overview-datatrees:

DataTrees
---------

:py:class:`xarray.DataTree` is a tree-like container of :py:class:`~xarray.DataArray` objects, organised into multiple mutually alignable groups. You can think of it like a (recursive) ``dict`` of :py:class:`~xarray.Dataset` objects, where coordinate variables and their indexes are inherited down to children.

Let's first make some example xarray datasets:

.. jupyter-execute::

    import numpy as np
    import xarray as xr

    data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})
    ds = xr.Dataset({"foo": data, "bar": ("x", [1, 2]), "baz": np.pi})
    ds

    ds2 = ds.interp(coords={"x": [10, 12, 14, 16, 18, 20]})
    ds2

    ds3 = xr.Dataset(
        {"people": ["alice", "bob"], "heights": ("people", [1.57, 1.82])},
        coords={"species": "human"},
    )
    ds3

Now we'll put these datasets into a hierarchical DataTree:

.. jupyter-execute::

    dt = xr.DataTree.from_dict(
        {"simulation/coarse": ds, "simulation/fine": ds2, "/": ds3}
    )
    dt

This created a DataTree with nested groups. We have one root group, containing information about individual
people.  This root group can be named, but here it is unnamed, and is referenced with ``"/"``. This structure is similar to a
unix-like filesystem.  The root group then has one subgroup ``simulation``, which contains no data itself but does
contain another two subgroups, named ``fine`` and ``coarse``.

The (sub)subgroups ``fine`` and ``coarse`` contain two very similar datasets.  They both have an ``"x"``
dimension, but the dimension is of different lengths in each group, which makes the data in each group
unalignable.  In the root group we placed some completely unrelated information, in order to show how a tree can
store heterogeneous data.

Remember to keep unalignable dimensions in sibling groups because a DataTree inherits coordinates down through its
child nodes.  You can see this inheritance in the above representation of the DataTree.  The coordinates
``people`` and ``species`` defined in the root ``/`` node are shown in both child nodes,
``/simulation/coarse`` and ``/simulation/fine``.  All coordinates in a parent-descendant lineage must be
alignable to form a DataTree.  If your input data is not aligned, you can still get a nested ``dict`` of
:py:class:`~xarray.Dataset` objects with :py:func:`~xarray.open_groups` and then apply any required changes to ensure alignment
before converting to a :py:class:`~xarray.DataTree`.

The constraints on each group are the same as the constraints on DataArrays within a single dataset, with the
additional requirement of parent-descendant coordinate agreement.

We created the subgroups using a filesystem-like syntax, and accessing groups works the same way.  We can access
individual DataArrays in a similar fashion.

.. jupyter-execute::

    dt["simulation/coarse/foo"]
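
The path-based access described above can be sketched on a smaller tree with fixed values. The names ``sim_tree``, ``coarse`` and ``fine`` are chosen for this illustration and are separate from the ``dt`` object used in this section:

.. jupyter-execute::

    import xarray as xr

    coarse = xr.Dataset({"foo": ("x", [1.0, 2.0])}, coords={"x": [10, 20]})
    fine = xr.Dataset({"foo": ("x", [1.0, 1.5, 2.0])}, coords={"x": [10, 15, 20]})

    sim_tree = xr.DataTree.from_dict(
        {"simulation/coarse": coarse, "simulation/fine": fine}
    )

    # Names of the sibling groups below "simulation".
    print(sorted(sim_tree["simulation"].children))
    # A path to a group returns a DataTree node ...
    print(type(sim_tree["simulation/fine"]).__name__)
    # ... and a path to a variable returns a DataArray.
    print(type(sim_tree["simulation/fine/foo"]).__name__)
    print(sim_tree["simulation/fine/foo"].sizes["x"])
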

We can also view the data in a particular group as a read-only ``DatasetView`` using :py:attr:`xarray.DataTree.dataset`:

.. jupyter-execute::

    dt["simulation/coarse"].dataset

We can get a copy of the :py:class:`~xarray.Dataset` including the inherited coordinates by calling the :py:meth:`~xarray.DataTree.to_dataset` method:

.. jupyter-execute::

    ds_inherited = dt["simulation/coarse"].to_dataset()
    ds_inherited

You can also get a copy of just the node-local values as a :py:class:`~xarray.Dataset` by setting the ``inherit`` keyword to ``False``:

.. jupyter-execute::

    ds_node_local = dt["simulation/coarse"].to_dataset(inherit=False)
    ds_node_local
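
To see the effect of ``inherit`` in isolation, consider this minimal sketch. The tree ``weather_tree`` and its ``station`` group are invented for the example; the root defines an indexed ``time`` coordinate, and the child only has a variable along ``time``:

.. jupyter-execute::

    import xarray as xr

    root = xr.Dataset(coords={"time": [0, 1, 2]})
    # The child has a "time" dimension but no "time" coordinate of its own.
    child = xr.Dataset({"temp": ("time", [14.0, 15.5, 13.0])})

    weather_tree = xr.DataTree.from_dict({"/": root, "/station": child})
    node = weather_tree["station"]

    with_inherited = node.to_dataset()           # inherited coordinates included
    node_local = node.to_dataset(inherit=False)  # only what the node stores itself

    print("time" in with_inherited.coords)
    print("time" in node_local.coords)
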

.. note::

    We intend to eventually implement most :py:class:`~xarray.Dataset` methods
    (indexing, aggregation, arithmetic, etc) on :py:class:`~xarray.DataTree`
    objects, but many methods have not been implemented yet.

.. Operations map over subtrees, so we can take a mean over the ``x`` dimension of both the ``fine`` and ``coarse`` groups just by:

.. .. jupyter-execute::

..     avg = dt["simulation"].mean(dim="x")
..     avg

.. Here the ``"x"`` dimension used is always the one local to that subgroup.


.. You can do almost everything you can do with :py:class:`~xarray.Dataset` objects with :py:class:`~xarray.DataTree` objects
.. (including indexing and arithmetic), as operations will be mapped over every subgroup in the tree.
.. This allows you to work with multiple groups of non-alignable variables at once.

.. tip::

    If all of your variables are mutually alignable (i.e., they live on the same
    grid, such that every common dimension name maps to the same length), then
    you probably don't need :py:class:`xarray.DataTree`, and should consider
    just sticking with :py:class:`xarray.Dataset`.