.. _pyarrow:

{{ header }}

*********************
PyArrow Functionality
*********************

pandas can utilize `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ to extend functionality and improve the performance
of various APIs. This includes:

* More extensive `data types <https://arrow.apache.org/docs/python/api/datatypes.html>`__ compared to NumPy
* Missing data support (NA) for all data types
* Performant IO reader integration
* Interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)

To use this functionality, please ensure you have :ref:`installed the minimum supported PyArrow version <install.optional_dependencies>`.


Data Structure Integration
--------------------------

A :class:`Series`, :class:`Index`, or the columns of a :class:`DataFrame` can be directly backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray`,
which is similar to a NumPy array. To construct these from the main pandas data structures, pass a string consisting of the type name followed by
``[pyarrow]``, e.g. ``"int64[pyarrow]"``, into the ``dtype`` parameter.

.. ipython:: python

   ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
   ser

   idx = pd.Index([True, None], dtype="bool[pyarrow]")
   idx

   df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
   df

.. note::

   The string alias ``"string[pyarrow]"`` maps to ``pd.StringDtype("pyarrow")`` which is not equivalent to
   specifying ``dtype=pd.ArrowDtype(pa.string())``. Generally, operations on the data will behave similarly
   except ``pd.StringDtype("pyarrow")`` can return NumPy-backed nullable types while ``pd.ArrowDtype(pa.string())``
   will return :class:`ArrowDtype`.

   .. ipython:: python

      import pyarrow as pa
      data = list("abc")
      ser_sd = pd.Series(data, dtype="string[pyarrow]")
      ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))
      ser_ad.dtype == ser_sd.dtype
      ser_sd.str.contains("a")
      ser_ad.str.contains("a")

For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters
into :class:`ArrowDtype` to use in the ``dtype`` parameter.

.. ipython:: python

   import pyarrow as pa
   list_str_type = pa.list_(pa.string())
   ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
   ser

.. ipython:: python

   from datetime import time
   idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
   idx

.. ipython:: python

   from decimal import Decimal
   decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
   data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
   df = pd.DataFrame(data, dtype=decimal_type)
   df

If you already have an :external+pyarrow:py:class:`pyarrow.Array` or :external+pyarrow:py:class:`pyarrow.ChunkedArray`,
you can pass it into :class:`.arrays.ArrowExtensionArray` to construct the associated :class:`Series`, :class:`Index`
or :class:`DataFrame` object.

.. ipython:: python

   pa_array = pa.array(
       [{"1": "2"}, {"10": "20"}, None],
       type=pa.map_(pa.string(), pa.string()),
   )
   ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
   ser

To retrieve a :external+pyarrow:py:class:`pyarrow.ChunkedArray` from a :class:`Series` or :class:`Index`, you can pass
the :class:`Series` or :class:`Index` to the ``pyarrow.array`` constructor.

.. ipython:: python

   ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")
   pa.array(ser)

   idx = pd.Index(ser)
   pa.array(idx)

To convert a :external+pyarrow:py:class:`pyarrow.Table` to a :class:`DataFrame`, you can call the
:external+pyarrow:py:meth:`pyarrow.Table.to_pandas` method with ``types_mapper=pd.ArrowDtype``.

.. ipython:: python

   table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])

   df = table.to_pandas(types_mapper=pd.ArrowDtype)
   df
   df.dtypes


Operations
----------

PyArrow data structure integration is implemented through pandas' :class:`~pandas.api.extensions.ExtensionArray` :ref:`interface <extending.extension-types>`;
therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality
is accelerated with PyArrow `compute functions <https://arrow.apache.org/docs/python/api/compute.html>`__ where available. This includes:

* Numeric aggregations
* Numeric arithmetic
* Numeric rounding
* Logical and comparison functions
* String functionality
* Datetime functionality

The following are just some examples of operations that are accelerated by native PyArrow compute functions.

.. ipython:: python

   import pyarrow as pa
   ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
   ser.mean()
   ser + ser
   ser > (ser + 1)

   ser.dropna()
   ser.isna()
   ser.fillna(0)

.. ipython:: python

   ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
   ser_str.str.startswith("a")

.. ipython:: python

   from datetime import datetime
   pa_type = pd.ArrowDtype(pa.timestamp("ns"))
   ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
   ser_dt.dt.strftime("%Y-%m")

I/O Reading
-----------

PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. The following
functions provide an ``engine`` keyword that can dispatch to PyArrow to accelerate reading from an IO source.

* :func:`read_csv`
* :func:`read_json`
* :func:`read_orc`
* :func:`read_feather`

.. ipython:: python

   import io
   data = io.StringIO("""a,b,c
      1,2.5,True
      3,4.5,False
   """)
   df = pd.read_csv(data, engine="pyarrow")
   df

By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return
PyArrow-backed data by specifying the parameter ``dtype_backend="pyarrow"``. The two options are independent:
a reader does not need ``engine="pyarrow"`` in order to return PyArrow-backed data.

.. ipython:: python

    import io
    data = io.StringIO("""a,b,c,d,e,f,g,h,i
        1,2.5,True,a,,,,,
        3,4.5,False,b,6,7.5,True,a,
    """)
    df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
    df_pyarrow.dtypes

Several functions that are not IO readers also accept the ``dtype_backend`` argument to return PyArrow-backed data, including:

* :func:`to_numeric`
* :meth:`DataFrame.convert_dtypes`
* :meth:`Series.convert_dtypes`