1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
|
.. _dataframe.indexing:
Indexing into Dask DataFrames
=============================
Dask DataFrame supports some of Pandas' indexing behavior.
.. currentmodule:: dask.dataframe
.. autosummary::
DataFrame.iloc
DataFrame.loc
Label-based Indexing
--------------------
Just like Pandas, Dask DataFrame supports label-based indexing with the ``.loc``
accessor for selecting rows or columns, and ``__getitem__`` (square brackets)
for selecting just columns.
.. note::
To select rows, the DataFrame's divisions must be known (see
:ref:`dataframe.design` and :ref:`dataframe.performance` for more information.)
.. code-block:: python
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]},
... index=['a', 'b', 'c'])
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf
Dask DataFrame Structure:
A B
npartitions=1
a int64 int64
c ... ...
Dask Name: from_pandas, 1 tasks
Selecting columns:
.. code-block:: python
>>> ddf[['B', 'A']]
Dask DataFrame Structure:
B A
npartitions=1
a int64 int64
c ... ...
Dask Name: getitem, 2 tasks
Selecting a single column reduces to a Dask Series:
.. code-block:: python
>>> ddf['A']
Dask Series Structure:
npartitions=1
a int64
c ...
Name: A, dtype: int64
Dask Name: getitem, 2 tasks
Slicing rows and (optionally) columns with ``.loc``:
.. code-block:: python
>>> ddf.loc[['b', 'c'], ['A']]
Dask DataFrame Structure:
A
npartitions=1
b int64
c ...
Dask Name: loc, 2 tasks
>>> ddf.loc[df["A"] > 1, ["B"]]
Dask DataFrame Structure:
B
npartitions=1
a int64
c ...
Dask Name: try_loc, 2 tasks
>>> ddf.loc[lambda df: df["A"] > 1, ["B"]]
Dask DataFrame Structure:
B
npartitions=1
a int64
c ...
Dask Name: try_loc, 2 tasks
Dask DataFrame supports Pandas' `partial-string indexing <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#partial-string-indexing>`_:
.. code-block:: python
>>> ts = dd.demo.make_timeseries()
>>> ts
Dask DataFrame Structure:
id name x y
npartitions=11
2000-01-31 int64 object float64 float64
2000-02-29 ... ... ... ...
... ... ... ... ...
2000-11-30 ... ... ... ...
2000-12-31 ... ... ... ...
Dask Name: make-timeseries, 11 tasks
>>> ts.loc['2000-02-12']
Dask DataFrame Structure:
id name x y
npartitions=1
2000-02-12 00:00:00.000000000 int64 object float64 float64
2000-02-12 23:59:59.999999999 ... ... ... ...
Dask Name: loc, 12 tasks
Positional Indexing
-------------------
Dask DataFrame does not track the length of partitions, making positional
indexing with ``.iloc`` inefficient for selecting rows. :meth:`DataFrame.iloc`
only supports indexers where the row indexer is ``slice(None)`` (which ``:`` is
a shorthand for.)
.. code-block:: python
>>> ddf.iloc[:, [1, 0]]
Dask DataFrame Structure:
B A
npartitions=1
a int64 int64
c ... ...
Dask Name: iloc, 2 tasks
Trying to select specific rows with ``iloc`` will raise an exception:
.. code-block:: python
>>> ddf.iloc[[0, 2], [1]]
Traceback (most recent call last)
File "<stdin>", line 1, in <module>
ValueError: 'DataFrame.iloc' does not support slicing rows. The indexer must be a 2-tuple whose first item is 'slice(None)'.
|