1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370
|
.. currentmodule:: xarray
.. _groupby:
GroupBy: Group and Bin Data
---------------------------
Often we want to bin or group data, produce statistics (mean, variance) on
the groups, and then return a reduced data set. To do this, Xarray supports
`"group by"`__ operations with the same API as pandas to implement the
`split-apply-combine`__ strategy:
__ https://pandas.pydata.org/pandas-docs/stable/groupby.html
__ https://www.jstatsoft.org/v40/i01/paper
- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.
Group by operations work on both :py:class:`Dataset` and
:py:class:`DataArray` objects. Most of the examples focus on grouping by
a single one-dimensional variable, although support for grouping
over a multi-dimensional variable has recently been implemented. Note that for
one-dimensional data, it is usually faster to rely on pandas' implementation of
the same pipeline.
.. tip::
`Install the flox package <https://flox.readthedocs.io>`_ to substantially improve the performance
of GroupBy operations, particularly with dask. flox
`extends Xarray's in-built GroupBy capabilities <https://flox.readthedocs.io/en/latest/xarray.html>`_
by allowing grouping by multiple variables, and lazy grouping by dask arrays.
If installed, Xarray will automatically use flox by default.
Split
~~~~~
Let's create a simple example dataset:
.. jupyter-execute::
:hide-code:
import numpy as np
import pandas as pd
import xarray as xr
np.random.seed(123456)
.. jupyter-execute::
ds = xr.Dataset(
{"foo": (("x", "y"), np.random.rand(4, 3))},
coords={"x": [10, 20, 30, 40], "letters": ("x", list("abba"))},
)
arr = ds["foo"]
ds
If we groupby the name of a variable or coordinate in a dataset (we can also
use a DataArray directly), we get back a ``GroupBy`` object:
.. jupyter-execute::
ds.groupby("letters")
This object works very similarly to a pandas GroupBy object. You can view
the group indices with the ``groups`` attribute:
.. jupyter-execute::
ds.groupby("letters").groups
You can also iterate over groups in ``(label, group)`` pairs:
.. jupyter-execute::
list(ds.groupby("letters"))
You can index out a particular group:
.. jupyter-execute::
ds.groupby("letters")["b"]
To group by multiple variables, see :ref:`this section <groupby.multiple>`.
Binning
~~~~~~~
Sometimes you don't want to use all the unique values to determine the groups
but instead want to "bin" the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the
:py:meth:`Dataset.groupby_bins` method.
.. jupyter-execute::
x_bins = [0, 25, 50]
ds.groupby_bins("x", x_bins).groups
The binning is implemented via :func:`pandas.cut`, whose documentation details how
the bins are assigned. As seen in the example above, by default, the bins are
labeled with strings using set notation to precisely identify the bin limits. To
override this behavior, you can specify the bin labels explicitly. Here we
choose ``float`` labels which identify the bin centers:
.. jupyter-execute::
x_bin_labels = [12.5, 37.5]
ds.groupby_bins("x", x_bins, labels=x_bin_labels).groups
Apply
~~~~~
To apply a function to each group, you can use the flexible
:py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
concatenated back together along the group axis:
.. jupyter-execute::
def standardize(x):
return (x - x.mean()) / x.std()
arr.groupby("letters").map(standardize)
GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and
methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
aggregation function:
.. jupyter-execute::
arr.groupby("letters").mean(dim="x")
Using a groupby is thus also a convenient shortcut for aggregating over all
dimensions *other than* the provided one:
.. jupyter-execute::
ds.groupby("x").std(...)
.. note::
We use an ellipsis (`...`) here to indicate we want to reduce over all
other dimensions
First and last
~~~~~~~~~~~~~~
There are two special aggregation operations that are currently only found on
groupby objects: first and last. These provide the first or last example of
values for group along the grouped dimension:
.. jupyter-execute::
ds.groupby("letters").first(...)
By default, they skip missing values (control this with ``skipna``).
Grouped arithmetic
~~~~~~~~~~~~~~~~~~
GroupBy objects also support a limited set of binary arithmetic operations, as
a shortcut for mapping over all unique labels. Binary arithmetic is supported
for ``(GroupBy, Dataset)`` and ``(GroupBy, DataArray)`` pairs, as long as the
dataset or data array uses the unique grouped values as one of its index
coordinates. For example:
.. jupyter-execute::
alt = arr.groupby("letters").mean(...)
alt
.. jupyter-execute::
ds.groupby("letters") - alt
This last line is roughly equivalent to the following::
results = []
for label, group in ds.groupby('letters'):
results.append(group - alt.sel(letters=label))
xr.concat(results, dim='x')
.. _groupby.multidim:
Multidimensional Grouping
~~~~~~~~~~~~~~~~~~~~~~~~~
Many datasets have a multidimensional coordinate variable (e.g. longitude)
which is different from the logical grid dimensions (e.g. nx, ny). Such
variables are valid under the `CF conventions`__. Xarray supports groupby
operations over multidimensional coordinate variables:
__ https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dimensional_latitude_longitude_coordinate_variables
.. jupyter-execute::
da = xr.DataArray(
[[0, 1], [2, 3]],
coords={
"lon": (["ny", "nx"], [[30, 40], [40, 50]]),
"lat": (["ny", "nx"], [[10, 10], [20, 20]]),
},
dims=["ny", "nx"],
)
da
.. jupyter-execute::
da.groupby("lon").sum(...)
.. jupyter-execute::
da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)
Because multidimensional groups have the ability to generate a very large
number of bins, coarse-binning via :py:meth:`Dataset.groupby_bins`
may be desirable:
.. jupyter-execute::
da.groupby_bins("lon", [0, 45, 50]).sum()
These methods group by ``lon`` values. It is also possible to groupby each
cell in a grid, regardless of value, by stacking multiple dimensions,
applying your function, and then unstacking the result:
.. jupyter-execute::
stacked = da.stack(gridcell=["ny", "nx"])
stacked.groupby("gridcell").sum(...).unstack("gridcell")
Alternatively, you can groupby both ``lat`` and ``lon`` at the :ref:`same time <groupby.multiple>`.
.. _groupby.groupers:
Grouper Objects
~~~~~~~~~~~~~~~
Both ``groupby_bins`` and ``resample`` are specializations of the core ``groupby`` operation for binning,
and time resampling. Many problems demand more complex GroupBy application: for example, grouping by multiple
variables with a combination of categorical grouping, binning, and resampling; or more specializations like
spatial resampling; or more complex time grouping like special handling of seasons, or the ability to specify
custom seasons. To handle these use-cases and more, Xarray is evolving to providing an
extension point using ``Grouper`` objects.
.. tip::
See the `grouper design`_ doc for more detail on the motivation and design ideas behind
Grouper objects.
.. _grouper design: https://github.com/pydata/xarray/blob/main/design_notes/grouper_objects.md
For now Xarray provides three specialized Grouper objects:
1. :py:class:`groupers.UniqueGrouper` for categorical grouping
2. :py:class:`groupers.BinGrouper` for binned grouping
3. :py:class:`groupers.TimeResampler` for resampling along a datetime coordinate
These provide functionality identical to the existing ``groupby``, ``groupby_bins``, and ``resample`` methods.
That is,
.. code-block:: python
ds.groupby("x")
is identical to
.. code-block:: python
from xarray.groupers import UniqueGrouper
ds.groupby(x=UniqueGrouper())
Similarly,
.. code-block:: python
ds.groupby_bins("x", bins=bins)
is identical to
.. code-block:: python
from xarray.groupers import BinGrouper
ds.groupby(x=BinGrouper(bins))
and
.. code-block:: python
ds.resample(time="ME")
is identical to
.. code-block:: python
from xarray.groupers import TimeResampler
ds.resample(time=TimeResampler("ME"))
The :py:class:`groupers.UniqueGrouper` accepts an optional ``labels`` kwarg that is not present
in :py:meth:`DataArray.groupby` or :py:meth:`Dataset.groupby`.
Specifying ``labels`` is required when grouping by a lazy array type (e.g. dask or cubed).
The ``labels`` are used to construct the output coordinate (say for a reduction), and aggregations
will only be run over the specified labels.
You may use ``labels`` to also specify the ordering of groups to be used during iteration.
The order will be preserved in the output.
.. _groupby.multiple:
Grouping by multiple variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use grouper objects to group by multiple dimensions:
.. jupyter-execute::
from xarray.groupers import UniqueGrouper
da.groupby(["lat", "lon"]).sum()
The above is sugar for using ``UniqueGrouper`` objects directly:
.. jupyter-execute::
da.groupby(lat=UniqueGrouper(), lon=UniqueGrouper()).sum()
Different groupers can be combined to construct sophisticated GroupBy operations.
.. jupyter-execute::
from xarray.groupers import BinGrouper
ds.groupby(x=BinGrouper(bins=[5, 15, 25]), letters=UniqueGrouper()).sum()
Time Grouping and Resampling
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. seealso::
See :ref:`resampling`.
Shuffling
~~~~~~~~~
Shuffling is a generalization of sorting a DataArray or Dataset by another DataArray, named ``label`` for example, that follows from the idea of grouping by ``label``.
Shuffling reorders the DataArray or the DataArrays in a Dataset such that all members of a group occur sequentially. For example,
Shuffle the object using either :py:class:`DatasetGroupBy` or :py:class:`DataArrayGroupBy` as appropriate.
.. jupyter-execute::
da = xr.DataArray(
dims="x",
data=[1, 2, 3, 4, 5, 6],
coords={"label": ("x", "a b c a b c".split(" "))},
)
da.groupby("label").shuffle_to_chunks()
For chunked array types (e.g. dask or cubed), shuffle may result in a more optimized communication pattern when compared to direct indexing by the appropriate indexer.
Shuffling also makes GroupBy operations on chunked arrays an embarrassingly parallel problem, and may significantly improve workloads that use :py:meth:`DatasetGroupBy.map` or :py:meth:`DataArrayGroupBy.map`.
|