1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129
|
.. _dataframe.indexing:
Indexing into Dask DataFrames
=============================
Dask DataFrame supports some of Pandas' indexing behavior.
.. currentmodule:: dask.dataframe
.. autosummary::
DataFrame.iloc
DataFrame.loc
Label-based Indexing
--------------------
Just like Pandas, Dask DataFrame supports label-based indexing with the ``.loc``
accessor for selecting rows or columns, and ``__getitem__`` (square brackets)
for selecting just columns.
.. note::
To select rows, the DataFrame's divisions must be known (see
:ref:`dataframe.design` and :ref:`dataframe.performance` for more information.)
.. code-block:: python
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [3, 4, 5]},
... index=['a', 'b', 'c'])
>>> ddf = dd.from_pandas(df, npartitions=2)
>>> ddf
Dask DataFrame Structure:
A B
npartitions=1
a int64 int64
c ... ...
Dask Name: from_pandas, 1 tasks
Selecting columns:
.. code-block:: python
>>> ddf[['B', 'A']]
Dask DataFrame Structure:
B A
npartitions=1
a int64 int64
c ... ...
Dask Name: getitem, 2 tasks
Selecting a single column reduces to a Dask Series:
.. code-block:: python
>>> ddf['A']
Dask Series Structure:
npartitions=1
a int64
c ...
Name: A, dtype: int64
Dask Name: getitem, 2 tasks
Slicing rows and (optionally) columns with ``.loc``:
.. code-block:: python
>>> ddf.loc[['b', 'c'], ['A']]
Dask DataFrame Structure:
A
npartitions=1
b int64
c ...
Dask Name: loc, 2 tasks
Dask DataFrame supports Pandas' `partial-string indexing <https://pandas.pydata.org/pandas-docs/stable/timeseries.html#partial-string-indexing>`_:
.. code-block:: python
>>> ts = dd.demo.make_timeseries()
>>> ts
Dask DataFrame Structure:
id name x y
npartitions=11
2000-01-31 int64 object float64 float64
2000-02-29 ... ... ... ...
... ... ... ... ...
2000-11-30 ... ... ... ...
2000-12-31 ... ... ... ...
Dask Name: make-timeseries, 11 tasks
>>> ts.loc['2000-02-12']
Dask DataFrame Structure:
id name x y
npartitions=1
2000-02-12 00:00:00.000000000 int64 object float64 float64
2000-02-12 23:59:59.999999999 ... ... ... ...
Dask Name: loc, 12 tasks
Positional Indexing
-------------------
Dask DataFrame does not track the length of partitions, making positional
indexing with ``.iloc`` inefficient for selecting rows. :meth:`DataFrame.iloc`
only supports indexers where the row indexer is ``slice(None)`` (which ``:`` is
a shorthand for.)
.. code-block:: python
>>> ddf.iloc[:, [1, 0]]
Dask DataFrame Structure:
B A
npartitions=1
a int64 int64
c ... ...
Dask Name: iloc, 2 tasks
Trying to select specific rows with ``iloc`` will raise an exception:
.. code-block:: python
>>> ddf.iloc[[0, 2], [1]]
Traceback (most recent call last)
File "<stdin>", line 1, in <module>
ValueError: 'DataFrame.iloc' does not support slicing rows. The indexer must be a 2-tuple whose first item is 'slice(None)'.
|