File: dataframe-indexing.rst

package info (click to toggle)
dask 2024.12.1%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 20,024 kB
  • sloc: python: 105,182; javascript: 1,917; makefile: 159; sh: 88
file content (205 lines) | stat: -rw-r--r-- 6,115 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
.. _dataframe.indexing:

Indexing into Dask DataFrames
=============================

Dask DataFrame supports some of Pandas' indexing behavior.

.. currentmodule:: dask.dataframe

.. autosummary::

   DataFrame.iloc
   DataFrame.loc

Label-based Indexing
--------------------

Just like Pandas, Dask DataFrame supports label-based indexing with the ``.loc``
accessor for selecting rows or columns, and ``__getitem__`` (square brackets)
for selecting just columns.

.. note::

   To select rows, the DataFrame's divisions must be known (see
   :ref:`dataframe.design` and :ref:`dataframe.performance` for more information.)

.. code-block:: python

   >>> import dask.dataframe as dd
   >>> import pandas as pd
   >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]},
   ...                   index=['a', 'b', 'c'])
   >>> ddf = dd.from_pandas(df, npartitions=2)
   >>> ddf
   Dask DataFrame Structure:
                      A      B
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: from_pandas, 1 tasks

Selecting columns:

.. code-block:: python

   >>> ddf[['B', 'A']]
   Dask DataFrame Structure:
                      B      A
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: getitem, 2 tasks


Selecting a single column reduces to a Dask Series:

.. code-block:: python

   >>> ddf['A']
   Dask Series Structure:
   npartitions=1
   a    int64
   c      ...
   Name: A, dtype: int64
   Dask Name: getitem, 2 tasks

Slicing rows and (optionally) columns with ``.loc``:

.. code-block:: python

   >>> ddf.loc[['b', 'c'], ['A']]
   Dask DataFrame Structure:
                      A
   npartitions=1
   b              int64
   c                ...
   Dask Name: loc, 2 tasks
   
   >>> ddf.loc[df["A"] > 1, ["B"]]
   Dask DataFrame Structure:
                      B
   npartitions=1
   a              int64
   c                ...
   Dask Name: try_loc, 2 tasks

   >>> ddf.loc[lambda df: df["A"] > 1, ["B"]]
   Dask DataFrame Structure:
                      B
   npartitions=1
   a              int64
   c                ...
   Dask Name: try_loc, 2 tasks

Dask DataFrame supports Pandas' `partial-string indexing <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#partial-string-indexing>`_:

.. code-block:: python

   >>> ts = dd.demo.make_timeseries()
   >>> ts
   Dask DataFrame Structure:
                      id    name        x        y
   npartitions=11
   2000-01-31      int64  object  float64  float64
   2000-02-29        ...     ...      ...      ...
   ...               ...     ...      ...      ...
   2000-11-30        ...     ...      ...      ...
   2000-12-31        ...     ...      ...      ...
   Dask Name: make-timeseries, 11 tasks

   >>> ts.loc['2000-02-12']
   Dask DataFrame Structure:
                                     id    name        x        y
   npartitions=1
   2000-02-12 00:00:00.000000000  int64  object  float64  float64
   2000-02-12 23:59:59.999999999    ...     ...      ...      ...
   Dask Name: loc, 12 tasks


Positional Indexing
-------------------

Dask DataFrame does not track the length of partitions, making positional
indexing with ``.iloc`` inefficient for selecting rows. :meth:`DataFrame.iloc`
only supports indexers where the row indexer is ``slice(None)`` (which ``:`` is
a shorthand for.)

.. code-block:: python

   >>> ddf.iloc[:, [1, 0]]
   Dask DataFrame Structure:
                      B      A
   npartitions=1
   a              int64  int64
   c                ...    ...
   Dask Name: iloc, 2 tasks

Trying to select specific rows with ``iloc`` will raise an exception:

.. code-block:: python

   >>> ddf.iloc[[0, 2], [1]]
   Traceback (most recent call last)
     File "<stdin>", line 1, in <module>
   ValueError: 'DataFrame.iloc' does not support slicing rows. The indexer must be a 2-tuple whose first item is 'slice(None)'.


Partition Indexing
------------------

In addition to pandas-style indexing, Dask DataFrame also supports indexing at a
partition level with :meth:`DataFrame.get_partition` and
:attr:`DataFrame.partitions`. These can be used to select subsets of the data by
partition, rather than by position in the entire DataFrame or index label.

Use :meth:`DataFrame.get_partition` to select a single partition by position.

.. code-block:: python

   >>> import dask
   >>> ddf = dask.datasets.timeseries(start="2021-01-01", end="2021-01-07", freq="1h")
   >>> ddf.get_partition(0)
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=1
   2021-01-01     object  int64  float64  float64
   2021-01-02        ...    ...      ...      ...
   Dask Name: get-partition, 2 graph layers

Note that the result is also a Dask DatFrame.

Index into :attr:`DataFrame.partitions` to select one or more partitions. For
example, you can select every other partition with a slice:


.. code-block:: python

   >>> ddf.partitions[::2]
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=3
   2021-01-01     object  int64  float64  float64
   2021-01-03        ...    ...      ...      ...
   2021-01-05        ...    ...      ...      ...
   2021-01-06        ...    ...      ...      ...
   Dask Name: blocks, 2 graph layers

Or even more complicated selections based on the data in the partitions
themselves (at the cost of computing the DataFrame up until that point). For
example, we can create a boolean mask with the partitions that have more than
some number of unique IDs using :meth:`DataFrame.map_partitions`:

.. code-block:: python

   >>> mask = ddf.id.map_partitions(lambda p: len(p.unique()) > 20).compute()
   >>> ddf.partitions[mask]
   Dask DataFrame Structure:
                    name     id        x        y
   npartitions=5
   2021-01-01     object  int64  float64  float64
   2021-01-02        ...    ...      ...      ...
   ...               ...    ...      ...      ...
   2021-01-06        ...    ...      ...      ...
   2021-01-07        ...    ...      ...      ...
   Dask Name: blocks, 2 graph layers