.. _groupby:

GroupBy: split-apply-combine
----------------------------

xarray supports `"group by"`__ operations with the same API as pandas to
implement the `split-apply-combine`__ strategy:

__ http://pandas.pydata.org/pandas-docs/stable/groupby.html
__ http://www.jstatsoft.org/v40/i01/paper

- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.

Group by operations work on both :py:class:`~xarray.Dataset` and
:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
a single one-dimensional variable, although grouping over a multi-dimensional
variable is also supported (see :ref:`groupby.multidim`). Note that for
one-dimensional data, it is usually faster to rely on pandas' implementation of
the same pipeline.
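
The three steps can also be written out by hand, which makes clear what
``groupby`` automates. The following is only a minimal sketch, not a description
of how xarray implements grouping internally; the array and the names
``pieces``, ``means`` and ``manual`` are made up for illustration:

.. code-block:: python

    import numpy as np
    import xarray as xr

    arr = xr.DataArray(
        np.arange(12.0).reshape(4, 3) ** 2,
        dims=("x", "y"),
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    labels = ["a", "b"]

    # Split: pick the positions along "x" that carry each label.
    pieces = [
        arr.isel(x=np.flatnonzero(arr["letters"].values == label))
        for label in labels
    ]

    # Apply: reduce every piece over "x".
    means = [piece.mean("x") for piece in pieces]

    # Combine: join the results along a new "letters" dimension.
    manual = xr.concat(means, dim="letters").assign_coords(letters=labels)

    # The same pipeline in a single call.
    grouped = arr.groupby("letters").mean("x")

Both ``manual`` and ``grouped`` should hold the same per-letter means.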

Split
~~~~~

Let's create a simple example dataset:

.. ipython:: python
    :suppress:

    import numpy as np
    import pandas as pd
    import xarray as xr

    np.random.seed(123456)

.. ipython:: python

    ds = xr.Dataset(
        {"foo": (("x", "y"), np.random.rand(4, 3))},
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    arr = ds["foo"]
    ds

If we group by the name of a variable or coordinate in a dataset (we can also
use a DataArray directly), we get back a ``GroupBy`` object:

.. ipython:: python

    ds.groupby("letters")

This object works very similarly to a pandas GroupBy object. You can view
the group indices with the ``groups`` attribute:

.. ipython:: python

    ds.groupby("letters").groups

You can also iterate over groups in ``(label, group)`` pairs:

.. ipython:: python

    list(ds.groupby("letters"))

Just like in pandas, creating a GroupBy object is cheap: it does not actually
split the data until you access particular values.
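
As a small self-contained sketch of the two ways to inspect a ``GroupBy``
object described above, the snippet below collects the integer positions
stored in ``groups`` and the ``x`` labels seen while iterating. The names
``positions`` and ``members`` are illustrative only, and the conversions to
``str`` and ``int`` are there merely to get plain Python values:

.. code-block:: python

    import numpy as np
    import xarray as xr

    ds = xr.Dataset(
        {"foo": (("x", "y"), np.zeros((4, 3)))},
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    grouped = ds.groupby("letters")

    # ``groups`` maps each label to integer positions along "x".
    positions = {
        str(label): [int(i) for i in idx] for label, idx in grouped.groups.items()
    }

    # Iterating yields (label, group) pairs; each group is a Dataset.
    members = {str(label): group["x"].values.tolist() for label, group in grouped}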

Binning
~~~~~~~

Sometimes you don't want to use all the unique values to determine the groups
but instead want to "bin" the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the
:py:meth:`~xarray.Dataset.groupby_bins` method.

.. ipython:: python

    x_bins = [0, 25, 50]
    ds.groupby_bins("x", x_bins).groups

The binning is implemented via :func:`pandas.cut`, whose documentation details how
the bins are assigned. As seen in the example above, by default, the bins are
labeled with strings using interval notation to precisely identify the bin limits. To
override this behavior, you can specify the bin labels explicitly. Here we
choose ``float`` labels which identify the bin centers:

.. ipython:: python

    x_bin_labels = [12.5, 37.5]
    ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups
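
To see how bin assignment relates to :func:`pandas.cut`, the following sketch
calls it directly on the ``x`` values and then averages within the same bins.
It assumes the default right-closed intervals of ``pandas.cut``; the
one-dimensional array is made up for illustration:

.. code-block:: python

    import pandas as pd
    import xarray as xr

    arr = xr.DataArray(
        [1.0, 2.0, 3.0, 4.0], dims="x", coords={"x": [10, 20, 30, 40]}
    )
    x_bins = [0, 25, 50]

    # pandas.cut places each value in a right-closed interval by default:
    # (0, 25] and (25, 50]. ``codes`` holds the bin number of each value.
    codes = pd.cut(arr["x"].values, x_bins).codes

    # Average within each bin, labelling the bins by their centers.
    binned = arr.groupby_bins("x", x_bins, labels=[12.5, 37.5]).mean()

The result is indexed by a new ``x_bins`` dimension rather than by ``x``.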


Apply
~~~~~

To apply a function to each group, you can use the flexible
:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
concatenated back together along the group axis:

.. ipython:: python

    def standardize(x):
        return (x - x.mean()) / x.std()


    arr.groupby("letters").map(standardize)

GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and
methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
aggregation function:

.. ipython:: python

    arr.groupby("letters").mean(dim="x")

Using a groupby is thus also a convenient shortcut for aggregating over all
dimensions *other than* the provided one:

.. ipython:: python

    ds.groupby("x").std(...)

.. note::

    We use an ellipsis (``...``) here to indicate we want to reduce over all
    other dimensions.
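
The difference between ``map`` and the reducing methods can be checked with a
short sketch. It assumes the same ``standardize`` function as above and a
made-up array: ``reduce`` with ``np.mean`` should agree with the ``mean``
shortcut, while ``map`` returns an object with the shape of the input:

.. code-block:: python

    import numpy as np
    import xarray as xr

    arr = xr.DataArray(
        np.arange(12.0).reshape(4, 3) ** 2,
        dims=("x", "y"),
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    grouped = arr.groupby("letters")

    # ``mean`` is a shortcut for ``reduce`` with the matching NumPy function.
    via_reduce = grouped.reduce(np.mean, dim="x")
    via_mean = grouped.mean(dim="x")


    def standardize(x):
        # Center and scale using the statistics of the whole group.
        return (x - x.mean()) / x.std()


    # ``map`` keeps every element, so the result has the input's shape.
    standardized = grouped.map(standardize)

After standardizing, each group should have mean 0 and standard deviation 1
(up to floating point error).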


First and last
~~~~~~~~~~~~~~

There are two special aggregation operations that are currently only found on
groupby objects: ``first`` and ``last``. These return the first or last value
of each group along the grouped dimension:

.. ipython:: python

    ds.groupby("letters").first()

By default, they skip missing values (control this with ``skipna``).
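
A minimal sketch of the effect of ``skipna``, using a made-up array whose
first element is missing:

.. code-block:: python

    import numpy as np
    import xarray as xr

    arr = xr.DataArray(
        [np.nan, 2.0, 3.0, 4.0],
        dims="x",
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    grouped = arr.groupby("letters")

    # Group "a" holds the values at x=10 (NaN) and x=40 (4.0).
    first_valid = grouped.first()  # skips the NaN
    first_raw = grouped.first(skipna=False)  # keeps the NaN
    last_valid = grouped.last()

With the default, group ``"a"`` yields the value at ``x=40``; with
``skipna=False`` it yields ``NaN``.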

Grouped arithmetic
~~~~~~~~~~~~~~~~~~

GroupBy objects also support a limited set of binary arithmetic operations, as
a shortcut for mapping over all unique labels. Binary arithmetic is supported
for ``(GroupBy, Dataset)`` and ``(GroupBy, DataArray)`` pairs, as long as the
dataset or data array uses the unique grouped values as one of its index
coordinates. For example:

.. ipython:: python

    alt = arr.groupby("letters").mean(...)
    alt
    ds.groupby("letters") - alt

This last line is roughly equivalent to the following::

    results = []
    for label, group in ds.groupby('letters'):
        results.append(group - alt.sel(letters=label))
    xr.concat(results, dim='x')
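
The equivalence can be checked on a small made-up array. This is only a
sketch: the explicit loop visits the groups one label at a time, so its result
is sorted by ``x`` before comparing it with the grouped arithmetic, which keeps
the original order:

.. code-block:: python

    import numpy as np
    import xarray as xr

    arr = xr.DataArray(
        [1.0, 2.0, 4.0, 8.0],
        dims="x",
        coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
    )
    alt = arr.groupby("letters").mean()

    # Grouped arithmetic: subtract each group's mean from its members.
    anomalies = arr.groupby("letters") - alt

    # Explicit loop: group "a" (x=10, 40) comes before group "b" (x=20, 30),
    # so the concatenated pieces are sorted to restore the original order.
    pieces = [
        group - alt.sel(letters=label) for label, group in arr.groupby("letters")
    ]
    manual = xr.concat(pieces, dim="x").sortby("x")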

Squeezing
~~~~~~~~~

When grouping over a dimension, you can control whether the dimension is
squeezed out or if it should remain with length one on each group by using
the ``squeeze`` parameter:

.. ipython:: python

    next(iter(arr.groupby("x")))

.. ipython:: python

    next(iter(arr.groupby("x", squeeze=False)))

Although xarray will attempt to automatically
:py:meth:`~xarray.DataArray.transpose` dimensions back into their original order
when you use ``map``, it is sometimes useful to set ``squeeze=False`` to
guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray
:py:meth:`~xarray.DataArray.squeeze` methods.

.. _groupby.multidim:

Multidimensional Grouping
~~~~~~~~~~~~~~~~~~~~~~~~~

Many datasets have a multidimensional coordinate variable (e.g. longitude)
which is different from the logical grid dimensions (e.g. nx, ny). Such
variables are valid under the `CF conventions`__. Xarray supports groupby
operations over multidimensional coordinate variables:

__ http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables

.. ipython:: python

    da = xr.DataArray(
        [[0, 1], [2, 3]],
        coords={
            "lon": (["ny", "nx"], [[30, 40], [40, 50]]),
            "lat": (["ny", "nx"], [[10, 10], [20, 20]]),
        },
        dims=["ny", "nx"],
    )
    da
    da.groupby("lon").sum(...)
    da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)

Because multidimensional groups have the ability to generate a very large
number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
may be desirable:

.. ipython:: python

    da.groupby_bins("lon", [0, 45, 50]).sum()
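
The following sketch contrasts the two approaches on a made-up array whose
values differ from cell to cell, so that the pooled sums are easy to tell
apart:

.. code-block:: python

    import xarray as xr

    da = xr.DataArray(
        [[1, 2], [3, 4]],
        dims=["ny", "nx"],
        coords={"lon": (["ny", "nx"], [[30, 40], [40, 50]])},
    )

    # Cells sharing a longitude are pooled, whatever their grid position:
    # the two cells with lon=40 hold 2 and 3.
    by_lon = da.groupby("lon").sum(...)

    # Coarser groups: (0, 45] holds lon 30 and 40, (45, 50] holds lon 50.
    by_bin = da.groupby_bins("lon", [0, 45, 50]).sum(...)

Grouping by value gives one entry per distinct longitude, while binning gives
one entry per interval along a new ``lon_bins`` dimension.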

These methods group by ``lon`` values. It is also possible to group by each
cell in a grid, regardless of value, by stacking multiple dimensions,
applying your function, and then unstacking the result:

.. ipython:: python

    stacked = da.stack(gridcell=["ny", "nx"])
    stacked.groupby("gridcell").sum(...).unstack("gridcell")