.. currentmodule:: pandas

{{ header }}

.. _integer_na:

**************************
Nullable integer data type
**************************

.. note::

   IntegerArray is currently experimental. Its API or implementation may
   change without warning.

.. versionchanged:: 1.0.0

   Now uses :attr:`pandas.NA` as the missing value rather
   than :attr:`numpy.nan`.

In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
missing data. Because ``NaN`` is a float, this forces an array of integers with
any missing values to become floating point. In some cases, this may not matter
much. But if your integer column is, say, an identifier, casting to float can
be problematic. Some integers cannot even be represented as floating point
numbers.
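
As a minimal sketch of that last point, the following plain-Python snippet
(no pandas required; the variable names are purely illustrative) shows an
integer that does not survive a round trip through a 64-bit float:

```python
# 2**53 + 1 is the smallest positive integer that a 64-bit float
# cannot represent exactly.
identifier = 2**53 + 1

# Converting to float silently rounds to a neighbouring representable value.
as_float = float(identifier)
round_trip = int(as_float)

print(identifier)                # 9007199254740993
print(round_trip)                # 9007199254740992
print(round_trip == identifier)  # False
```

An identifier column that is silently cast to float could therefore end up
pointing at the wrong record.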

Construction
------------

pandas can represent integer data with possibly missing values using
:class:`arrays.IntegerArray`. This is an :ref:`extension type <extending.extension-types>`
implemented within pandas.

.. ipython:: python

   arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())
   arr

Or use the string alias ``"Int64"`` (note the capital ``"I"``, to differentiate it
from NumPy's ``'int64'`` dtype):

.. ipython:: python

   pd.array([1, 2, np.nan], dtype="Int64")

All NA-like values are replaced with :attr:`pandas.NA`.

.. ipython:: python

   pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")

This array can be stored in a :class:`DataFrame` or :class:`Series` like any
NumPy array.

.. ipython:: python

   pd.Series(arr)

You can also pass a list-like object directly to the :class:`Series`
constructor together with the dtype.
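
For instance, a short sketch of that direct construction might look like the
following (the input values are arbitrary and chosen only for illustration):

```python
import pandas as pd

# Passing dtype="Int64" keeps the values as integers and stores None as pd.NA.
s = pd.Series([1, 2, None], dtype="Int64")

print(s.dtype)            # Int64
print(s.isna().tolist())  # [False, False, True]
```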

.. warning::

   Currently :func:`pandas.array` and :class:`pandas.Series` use different
   rules for dtype inference. :func:`pandas.array` will infer a
   nullable-integer dtype:

   .. ipython:: python

      pd.array([1, None])
      pd.array([1, 2])

   For backwards compatibility, :class:`Series` infers these as either
   integer or float dtype:

   .. ipython:: python

      pd.Series([1, None])
      pd.Series([1, 2])

   We recommend explicitly providing the dtype to avoid confusion.

   .. ipython:: python

      pd.array([1, None], dtype="Int64")
      pd.Series([1, None], dtype="Int64")

   In the future, we may provide an option for :class:`Series` to infer a
   nullable-integer dtype.
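
If a :class:`Series` has already been inferred as float because of a missing
value, one possible way to recover a nullable integer column is to cast it
explicitly. The sketch below assumes every non-missing value is a whole
number; the variable names are illustrative only:

```python
import pandas as pd

inferred = pd.Series([1, None])       # falls back to float64 with NaN
converted = inferred.astype("Int64")  # whole-number floats convert cleanly

print(inferred.dtype)             # float64
print(converted.dtype)            # Int64
print(converted.isna().tolist())  # [False, True]
```

Passing ``dtype="Int64"`` at construction time, as recommended above, avoids
the intermediate float representation altogether.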

Operations
----------

Operations involving an integer array will behave similarly to NumPy arrays.
Missing values will be propagated, and the data will be coerced to another
dtype if needed.

.. ipython:: python

   s = pd.Series([1, 2, None], dtype="Int64")

   # arithmetic
   s + 1

   # comparison
   s == 1

   # indexing
   s.iloc[1:3]

   # operate with other dtypes
   s + s.iloc[1:3].astype("Int8")

   # coerce when needed
   s + 0.01
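
The propagation of missing values described above can also be checked
programmatically. The following is a small sketch (variable names are
illustrative) that inspects the result dtypes and the positions of the
missing values:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

shifted = s + 1  # arithmetic keeps the nullable integer dtype
is_one = s == 1  # comparison yields a nullable boolean result

print(shifted.dtype)            # Int64
print(is_one.dtype)             # boolean
print(shifted.isna().tolist())  # [False, False, True]
print(is_one.isna().tolist())   # [False, False, True]
```

In both cases the missing entry stays missing rather than being turned into
a number or ``False``.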

These dtypes can operate as part of a :class:`DataFrame`.

.. ipython:: python

   df = pd.DataFrame({"A": s, "B": [1, 1, 3], "C": list("aab")})
   df
   df.dtypes


These dtypes can be merged, reshaped, and cast.

.. ipython:: python

   pd.concat([df[["A"]], df[["B", "C"]]], axis=1).dtypes
   df["A"].astype(float)

Reduction and groupby operations such as ``sum`` work as well.

.. ipython:: python

   df.sum()
   df.groupby("B").A.sum()
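
As a brief sketch of how a reduction treats the missing entry, the snippet
below compares the default behaviour with ``skipna=False`` on a standalone
:class:`Series` (the names ``total`` and ``total_strict`` are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

total = s.sum()                     # missing value is skipped by default
total_strict = s.sum(skipna=False)  # missing value propagates to the result

print(total)         # 3
print(total_strict)  # <NA>
```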

Scalar NA Value
---------------

:class:`arrays.IntegerArray` uses :attr:`pandas.NA` as its scalar
missing value. Indexing a single element that is missing returns
:attr:`pandas.NA`:

.. ipython:: python

   a = pd.array([1, None], dtype="Int64")
   a[1]
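
The returned scalar is the ``pd.NA`` singleton itself, so it can be tested
by identity or with ``pd.isna``. A minimal sketch (the name ``missing`` is
illustrative):

```python
import pandas as pd

a = pd.array([1, None], dtype="Int64")
missing = a[1]

print(missing is pd.NA)        # True
print(pd.isna(missing))        # True
print((missing + 1) is pd.NA)  # True

# pd.NA has no definite truth value, so using it in a boolean context raises.
try:
    bool(missing)
except TypeError as exc:
    print(type(exc).__name__)  # TypeError
```

Because of that last behaviour, it is usually safer to test for missing
values with ``pd.isna`` than with a bare ``if`` on the scalar.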