# flox: fast & furious GroupBy reductions for `dask.array`
## Overview
`flox` primarily provides strategies for fast GroupBy reductions with `dask.array`. It uses the MapReduce paradigm (also called a "tree reduction")
to run the GroupBy operation in a parallel-native way, avoiding a sort or shuffle operation entirely. It was motivated by
1. Dask Dataframe GroupBy
[blogpost](https://blog.dask.org/2019/10/08/df-groupby)
1. numpy_groupies in Xarray
[issue](https://github.com/pydata/xarray/issues/4473)
See a presentation ([video](https://discourse.pangeo.io/t/november-17-2021-flox-fast-furious-groupby-reductions-with-dask-at-pangeo-scale/2016), [slides](https://docs.google.com/presentation/d/1YubKrwu9zPHC_CzVBhvORuQBW-z148BvX3Ne8XcvWsQ/edit?usp=sharing)) about this package, from the Pangeo Showcase.
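The map-reduce idea mentioned above can be illustrated with a minimal NumPy-only sketch. This is not how `flox` is implemented internally; the helper names `blockwise_sum` and `tree_reduce` are hypothetical and exist only for illustration. Each chunk is reduced independently into per-group partial sums (the "map" step), and the partial results are then combined pairwise (the "reduce" step), so the data is never sorted or shuffled.

```python
import numpy as np


def blockwise_sum(values, labels, num_groups):
    """Map step (hypothetical helper): per-group sums for one chunk."""
    # bincount with weights sums the values that share an integer label.
    return np.bincount(labels, weights=values, minlength=num_groups)


def tree_reduce(partials):
    """Reduce step (hypothetical helper): combine partial sums pairwise."""
    partials = list(partials)
    while len(partials) > 1:
        merged = [
            partials[i] + partials[i + 1]
            for i in range(0, len(partials) - 1, 2)
        ]
        if len(partials) % 2:
            # An odd leftover is carried to the next round unchanged.
            merged.append(partials[-1])
        partials = merged
    return partials[0]


values = np.arange(12, dtype=float)
labels = np.array([0, 1, 2] * 4)
num_groups = 3

# Split into four chunks, standing in for the blocks of a dask array.
value_chunks = np.array_split(values, 4)
label_chunks = np.array_split(labels, 4)

partials = [
    blockwise_sum(v, l, num_groups)
    for v, l in zip(value_chunks, label_chunks)
]
result = tree_reduce(partials)
print(result)  # [18. 22. 26.]
```

Because every partial result has the same fixed shape (one entry per group), the combine step is a plain elementwise addition, which is what makes the reduction easy to parallelize.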
## Why flox?
1. {py:func}`flox.groupby_reduce` [wraps](engines.md) the `numpy-groupies` package for performant GroupBy reductions on nD arrays.
1. {py:func}`flox.groupby_reduce` provides [parallel-friendly strategies](implementation.md) for GroupBy reductions by wrapping `numpy-groupies` for dask arrays.
1. `flox` [integrates with xarray](xarray.md) to provide more performant GroupBy and resampling operations.
1. {py:func}`flox.xarray.xarray_reduce` [extends](xarray.md) Xarray's GroupBy operations, allowing lazy grouping by dask arrays, grouping by multiple arrays,
   as well as combining categorical grouping and histogram-style binning operations using multiple variables.
1. `flox` also provides utility functions for rechunking both dask arrays and Xarray objects along a single dimension using the group labels as a guide:
   1. To rechunk for blockwise operations: {py:func}`flox.rechunk_for_blockwise`, {py:func}`flox.xarray.rechunk_for_blockwise`.
   1. To rechunk so that "cohorts", or groups of labels, tend to occur in the same chunks: {py:func}`flox.rechunk_for_cohorts`, {py:func}`flox.xarray.rechunk_for_cohorts`.
## Installing
```shell
$ pip install flox
```
```shell
$ conda install -c conda-forge flox
```
## Acknowledgements
This work was funded in part by
1. NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System
Data in the Cloud" (PI J. Hamman),
1. NASA-OSTFL 80NSSC22K0345 "Enhancing analysis of NASA data with the open-source Python Xarray Library" (PIs Scott Henderson, University of Washington;
Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
1. [NCAR's Earth System Data Science Initiative](https://ncar.github.io/esds/).
It was motivated by many discussions in the [Pangeo](https://pangeo.io) community.
## Contents
```{eval-rst}
.. toctree::
   :maxdepth: 1

   intro.md
   aggregations.md
   engines.md
   arrays.md
   implementation.md
   xarray.md
   user-stories.md
   api.rst
```