File: dataframe-indexing.rst

package info (click to toggle)
dask 2024.12.1%2Bdfsg-2
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 20,024 kB
sloc: python: 105,182; javascript: 1,917; makefile: 159; sh: 88
file content (205 lines) | stat: -rw-r--r-- 6,115 bytes
.. _dataframe.indexing:

Indexing into Dask DataFrames
=============================

Dask DataFrame supports some of Pandas' indexing behavior.

.. currentmodule:: dask.dataframe

.. autosummary::

   DataFrame.iloc
   DataFrame.loc

Label-based Indexing
--------------------

Just like Pandas, Dask DataFrame supports label-based indexing with the ``.loc``
accessor for selecting rows or columns, and ``__getitem__`` (square brackets)
for selecting just columns.

.. note::

   To select rows, the DataFrame's divisions must be known (see
   :ref:`dataframe.design` and :ref:`dataframe.performance` for more information.)

.. code-block:: python

   >>> import dask.dataframe as dd
   >>> import pandas as pd
   >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]},
   ...                   index=['a', 'b', 'c'])
   >>> ddf = dd.from_pandas(df, npartitions=2)
   >>> ddf
   Dask DataFrame Structure:
                      A      B
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: from_pandas, 1 tasks

Selecting columns:

.. code-block:: python

   >>> ddf[['B', 'A']]
   Dask DataFrame Structure:
                      B      A
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: getitem, 2 tasks


Selecting a single column reduces to a Dask Series:

.. code-block:: python

   >>> ddf['A']
   Dask Series Structure:
   npartitions=1
   a    int64
   c      ...
   Name: A, dtype: int64
   Dask Name: getitem, 2 tasks

Slicing rows and (optionally) columns with ``.loc``:

.. code-block:: python

   >>> ddf.loc[['b', 'c'], ['A']]
   Dask DataFrame Structure:
                      A
   npartitions=1
   b              int64
   c                ...
   Dask Name: loc, 2 tasks
   
   >>> ddf.loc[df["A"] > 1, ["B"]]
   Dask DataFrame Structure:
                      B
   npartitions=1
   a              int64
   c                ...
   Dask Name: try_loc, 2 tasks

   >>> ddf.loc[lambda df: df["A"] > 1, ["B"]]
   Dask DataFrame Structure:
                      B
   npartitions=1
   a              int64
   c                ...
   Dask Name: try_loc, 2 tasks

Dask DataFrame supports Pandas' `partial-string indexing <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#partial-string-indexing>`_:

.. code-block:: python

   >>> ts = dd.demo.make_timeseries()
   >>> ts
   Dask DataFrame Structure:
                      id    name        x        y
   npartitions=11
   2000-01-31      int64  object  float64  float64
   2000-02-29        ...     ...      ...      ...
   ...               ...     ...      ...      ...
   2000-11-30        ...     ...      ...      ...
   2000-12-31        ...     ...      ...      ...
   Dask Name: make-timeseries, 11 tasks

   >>> ts.loc['2000-02-12']
   Dask DataFrame Structure:
                                     id    name        x        y
   npartitions=1
   2000-02-12 00:00:00.000000000  int64  object  float64  float64
   2000-02-12 23:59:59.999999999    ...     ...      ...      ...
   Dask Name: loc, 12 tasks


Positional Indexing
-------------------

Dask DataFrame does not track the length of partitions, making positional
indexing with ``.iloc`` inefficient for selecting rows. :meth:`DataFrame.iloc`
only supports indexers where the row indexer is ``slice(None)`` (which ``:`` is
a shorthand for.)

.. code-block:: python

   >>> ddf.iloc[:, [1, 0]]
   Dask DataFrame Structure:
                      B      A
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: iloc, 2 tasks

Trying to select specific rows with ``iloc`` will raise an exception:

.. code-block:: python

   >>> ddf.iloc[[0, 2], [1]]
   Traceback (most recent call last)
     File "<stdin>", line 1, in <module>
   ValueError: 'DataFrame.iloc' does not support slicing rows. The indexer must be a 2-tuple whose first item is 'slice(None)'.


Partition Indexing
------------------

In addition to pandas-style indexing, Dask DataFrame also supports indexing at a
partition level with :meth:`DataFrame.get_partition` and
:attr:`DataFrame.partitions`. These can be used to select subsets of the data by
partition, rather than by position in the entire DataFrame or index label.

Use :meth:`DataFrame.get_partition` to select a single partition by position.

.. code-block:: python

   >>> import dask
   >>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h")
   >>> ddf.get_partition(0)
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=1
   2021-01-01     object  int64  float64  float64
   2021-01-02        ...    ...      ...      ...
   Dask Name: get-partition, 2 graph layers

Note that the result is also a Dask DatFrame.

Index into :attr:`DataFrame.partitions` to select one or more partitions. For
example, you can select every other partition with a slice:


.. code-block:: python

   >>> ddf.partitions[::2]
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=3
   2021-01-01     object  int64  float64  float64
   2021-01-03        ...    ...      ...      ...
   2021-01-05        ...    ...      ...      ...
   2021-01-06        ...    ...      ...      ...
   Dask Name: blocks, 2 graph layers

Or even more complicated selections based on the data in the partitions
themselves (at the cost of computing the DataFrame up until that point). For
example, we can create a boolean mask with the partitions that have more than
some number of unique IDs using :meth:`DataFrame.map_partitions`:

.. code-block:: python

   >>> mask = ddf.id.map_partitions(lambda p: len(p.unique()) > 20).compute()
   >>> ddf.partitions[mask]
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=5
   2021-01-01     object  int64  float64  float64
   2021-01-02        ...    ...      ...      ...
   ...               ...    ...      ...      ...
   2021-01-06        ...    ...      ...      ...
   2021-01-07        ...    ...      ...      ...
   Dask Name: blocks, 2 graph layers