1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194
|
.. _pyarrow:
{{ header }}
*********************
PyArrow Functionality
*********************
pandas can utilize `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ to extend functionality and improve the performance
of various APIs. This includes:
* More extensive `data types <https://arrow.apache.org/docs/python/api/datatypes.html>`__ compared to NumPy
* Missing data support (NA) for all data types
* Performant IO reader integration
* Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)
To use this functionality, please ensure you have :ref:`installed the minimum supported PyArrow version. <install.optional_dependencies>`
Data Structure Integration
--------------------------
A :class:`Series`, :class:`Index`, or the columns of a :class:`DataFrame` can be directly backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray`
which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by
``[pyarrow]``, e.g. ``"int64[pyarrow]""`` into the ``dtype`` parameter
.. ipython:: python
ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
ser
idx = pd.Index([True, None], dtype="bool[pyarrow]")
idx
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
df
.. note::
The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
will return :class:`ArrowDtype`.
.. ipython:: python
import pyarrow as pa
data = list("abc")
ser_sd = pd.Series(data, dtype="string[pyarrow]")
ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))
ser_ad.dtype == ser_sd.dtype
ser_sd.str.contains("a")
ser_ad.str.contains("a")
For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters
into :class:`ArrowDtype` to use in the ``dtype`` parameter.
.. ipython:: python
import pyarrow as pa
list_str_type = pa.list_(pa.string())
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
ser
.. ipython:: python
from datetime import time
idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
idx
.. ipython:: python
from decimal import Decimal
decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
df = pd.DataFrame(data, dtype=decimal_type)
df
If you already have an :external+pyarrow:py:class:`pyarrow.Array` or :external+pyarrow:py:class:`pyarrow.ChunkedArray`,
you can pass it into :class:`.arrays.ArrowExtensionArray` to construct the associated :class:`Series`, :class:`Index`
or :class:`DataFrame` object.
.. ipython:: python
pa_array = pa.array(
[{"1": "2"}, {"10": "20"}, None],
type=pa.map_(pa.string(), pa.string()),
)
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
ser
To retrieve a pyarrow :external+pyarrow:py:class:`pyarrow.ChunkedArray` from a :class:`Series` or :class:`Index`, you can call
the pyarrow array constructor on the :class:`Series` or :class:`Index`.
.. ipython:: python
ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")
pa.array(ser)
idx = pd.Index(ser)
pa.array(idx)
To convert a :external+pyarrow:py:class:`pyarrow.Table` to a :class:`DataFrame`, you can call the
:external+pyarrow:py:meth:`pyarrow.Table.to_pandas` method with ``types_mapper=pd.ArrowDtype``.
.. ipython:: python
table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])
df = table.to_pandas(types_mapper=pd.ArrowDtype)
df
df.dtypes
Operations
----------
PyArrow data structure integration is implemented through pandas' :class:`~pandas.api.extensions.ExtensionArray` :ref:`interface <extending.extension-types>`;
therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality
is accelerated with PyArrow `compute functions <https://arrow.apache.org/docs/python/api/compute.html>`__ where available. This includes:
* Numeric aggregations
* Numeric arithmetic
* Numeric rounding
* Logical and comparison functions
* String functionality
* Datetime functionality
The following are just some examples of operations that are accelerated by native PyArrow compute functions.
.. ipython:: python
import pyarrow as pa
ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
ser.mean()
ser + ser
ser > (ser + 1)
ser.dropna()
ser.isna()
ser.fillna(0)
.. ipython:: python
ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
ser_str.str.startswith("a")
.. ipython:: python
from datetime import datetime
pa_type = pd.ArrowDtype(pa.timestamp("ns"))
ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
ser_dt.dt.strftime("%Y-%m")
I/O Reading
-----------
PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. The following
functions provide an ``engine`` keyword that can dispatch to PyArrow to accelerate reading from an IO source.
* :func:`read_csv`
* :func:`read_json`
* :func:`read_orc`
* :func:`read_feather`
.. ipython:: python
import io
data = io.StringIO("""a,b,c
1,2.5,True
3,4.5,False
""")
df = pd.read_csv(data, engine="pyarrow")
df
By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return
PyArrow-backed data by specifying the parameter ``dtype_backend="pyarrow"``. A reader does not need to set
``engine="pyarrow"`` to necessarily return PyArrow-backed data.
.. ipython:: python
import io
data = io.StringIO("""a,b,c,d,e,f,g,h,i
1,2.5,True,a,,,,,
3,4.5,False,b,6,7.5,True,a,
""")
df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
df_pyarrow.dtypes
Several non-IO reader functions can also use the ``dtype_backend`` argument to return PyArrow-backed data including:
* :func:`to_numeric`
* :meth:`DataFrame.convert_dtypes`
* :meth:`Series.convert_dtypes`
|