1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386
|
{{ header }}
.. _string_migration_guide:
=========================================================
Migration guide for the new string data type (pandas 3.0)
=========================================================
The upcoming pandas 3.0 release introduces a new, default string data type. This
will most likely cause some work when upgrading to pandas 3.0, and this page
provides an overview of the issues you might run into and gives guidance on how
to address them.
This new dtype is already available in the pandas 2.3 release, and you can
enable it with:
.. code-block:: python
pd.options.future.infer_string = True
This allows you to test your code before the final 3.0 release.
Background
----------
Historically, pandas has always used the NumPy ``object`` dtype as the default
to store text data. This has two primary drawbacks. First, ``object`` dtype is
not specific to strings: any Python object can be stored in an ``object``-dtype
array, not just strings, and seeing ``object`` as the dtype for a column with
strings is confusing for users. Second, this is not always very efficient (both
performance wise and for memory usage).
Since pandas 1.0, an opt-in string data type has been available, but this has
not yet been made the default, and uses the ``pd.NA`` scalar to represent
missing values.
Pandas 3.0 changes the default dtype for strings to a new string data type,
a variant of the existing optional string data type but using ``NaN`` as the
missing value indicator, to be consistent with the other default data types.
To improve performance, the new string data type will use the ``pyarrow``
package by default, if installed (and otherwise it uses object dtype under the
hood as a fallback).
See `PDEP-14: Dedicated string data type for pandas 3.0 <https://pandas.pydata.org/pdeps/0014-string-dtype.html>`__
for more background and details.
.. - brief primer on the new dtype
.. - Main characteristics:
.. - inferred by default (Default inference of a string dtype)
.. - only strings (setitem with non string fails)
.. - missing values sentinel is always NaN and uses NaN semantics
.. - Breaking changes:
.. - dtype is no longer object dtype
.. - None gets coerced to NaN
.. - setitem raises an error for non-string data
Brief introduction to the new default string dtype
--------------------------------------------------
By default, pandas will infer this new string dtype instead of object dtype for
string data (when creating pandas objects, such as in constructors or IO
functions).
Being a default dtype means that the string dtype will be used in IO methods or
constructors when the dtype is being inferred and the input is inferred to be
string data:
.. code-block:: python
>>> pd.Series(["a", "b", None])
0 a
1 b
2 NaN
dtype: str
It can also be specified explicitly using the ``"str"`` alias:
.. code-block:: python
>>> pd.Series(["a", "b", None], dtype="str")
0 a
1 b
2 NaN
dtype: str
Similarly, functions like :func:`read_csv`, :func:`read_parquet`, and others
will now use the new string dtype when reading string data.
In contrast to the current object dtype, the new string dtype will only store
strings. This also means that it will raise an error if you try to store a
non-string value in it (see below for more details).
Missing values with the new string dtype are always represented as ``NaN`` (``np.nan``),
and the missing value behavior is similar to other default dtypes.
This new string dtype should otherwise behave the same as the existing
``object`` dtype users are used to. For example, all string-specific methods
through the ``str`` accessor will work the same:
.. code-block:: python
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser.str.upper()
0 A
1 B
2 NaN
dtype: str
.. note::
The new default string dtype is an instance of the :class:`pandas.StringDtype`
class. The dtype can be constructed as ``pd.StringDtype(na_value=np.nan)``,
but for general usage we recommend to use the shorter ``"str"`` alias.
Overview of behavior differences and how to address them
---------------------------------------------------------
The dtype is no longer object dtype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When inferring or reading string data, the data type of the resulting DataFrame
column or Series will silently start being the new ``"str"`` dtype instead of
``"object"`` dtype, and this can have some impact on your code.
Checking the dtype
^^^^^^^^^^^^^^^^^^
When checking the dtype, code might currently do something like:
.. code-block:: python
>>> ser = pd.Series(["a", "b", "c"])
>>> ser.dtype == "object"
to check for columns with string data (by checking for the dtype being
``"object"``). This will no longer work in pandas 3+, since ``ser.dtype`` will
now be ``"str"`` with the new default string dtype, and the above check will
return ``False``.
To check for columns with string data, you should instead use:
.. code-block:: python
>>> ser.dtype == "str"
**How to write compatible code**
For code that should work on both pandas 2.x and 3.x, you can use the
:func:`pandas.api.types.is_string_dtype` function:
.. code-block:: python
>>> pd.api.types.is_string_dtype(ser.dtype)
True
This will return ``True`` for both the object dtype and the string dtypes.
Hardcoded use of object dtype
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you have code where the dtype is hardcoded in constructors, like
.. code-block:: python
>>> pd.Series(["a", "b", "c"], dtype="object")
this will keep using the object dtype. You will want to update this code to
ensure you get the benefits of the new string dtype.
**How to write compatible code?**
First, in many cases it can be sufficient to remove the specific data type, and
let pandas do the inference. But if you want to be specific, you can specify the
``"str"`` dtype:
.. code-block:: python
>>> pd.Series(["a", "b", "c"], dtype="str")
This is actually compatible with pandas 2.x as well, since in pandas < 3,
``dtype="str"`` was essentially treated as an alias for object dtype.
The missing value sentinel is now always NaN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When using object dtype, multiple possible missing value sentinels are
supported, including ``None`` and ``np.nan``. With the new default string dtype,
the missing value sentinel is always NaN (``np.nan``):
.. code-block:: python
# with object dtype, None is preserved as None and seen as missing
>>> ser = pd.Series(["a", "b", None], dtype="object")
>>> ser
0 a
1 b
2 None
dtype: object
>>> print(ser[2])
None
# with the new string dtype, any missing value like None is coerced to NaN
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser
0 a
1 b
2 NaN
dtype: str
>>> print(ser[2])
nan
Generally this should be no problem when relying on missing value behavior in
pandas methods (for example, ``ser.isna()`` will give the same result as before).
But when you relied on the exact value of ``None`` being present, that can
impact your code.
**How to write compatible code?**
When checking for a missing value, instead of checking for the exact value of
``None`` or ``np.nan``, you should use the :func:`pandas.isna` function. This is
the most robust way to check for missing values, as it will work regardless of
the dtype and the exact missing value sentinel:
.. code-block:: python
>>> pd.isna(ser[2])
True
One caveat: this function works both on scalars and on array-likes, and in the
latter case it will return an array of bools. When using it in a Boolean context
(for example, ``if pd.isna(..): ..``) be sure to only pass a scalar to it.
"setitem" operations will now raise an error for non-string data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
With the new string dtype, any attempt to set a non-string value in a Series or
DataFrame will raise an error:
.. code-block:: python
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser[1] = 2.5
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
TypeError: Invalid value '2.5' for dtype 'str'. Value should be a string or missing value, got 'float' instead.
If you relied on the flexible nature of object dtype being able to hold any
Python object, but your initial data was inferred as strings, your code might be
impacted by this change.
**How to write compatible code?**
You can update your code to ensure you only set string values in such columns,
or otherwise you can explicitly ensure the column has object dtype first. This
can be done by specifying the dtype explicitly in the constructor, or by using
the :meth:`~pandas.Series.astype` method:
.. code-block:: python
>>> ser = pd.Series(["a", "b", None], dtype="str")
>>> ser = ser.astype("object")
>>> ser[1] = 2.5
This ``astype("object")`` call will be redundant when using pandas 2.x, but
this code will work for all versions.
Invalid unicode input
~~~~~~~~~~~~~~~~~~~~~
Python allows to have a built-in ``str`` object that represents invalid unicode
data. And since the ``object`` dtype can hold any Python object, you can have a
pandas Series with such invalid unicode data:
.. code-block:: python
>>> ser = pd.Series(["\u2600", "\ud83d"], dtype=object)
>>> ser
0 ☀
1 \ud83d
dtype: object
However, when using the string dtype using ``pyarrow`` under the hood, this can
only store valid unicode data, and otherwise it will raise an error:
.. code-block:: python
>>> ser = pd.Series(["\u2600", "\ud83d"])
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
If you want to keep the previous behaviour, you can explicitly specify
``dtype=object`` to keep working with object dtype.
When you have byte data that you want to convert to strings using ``decode()``,
the :meth:`~pandas.Series.str.decode` method now has a ``dtype`` parameter to be
able to specify object dtype instead of the default of string dtype for this use
case.
Notable bug fixes
~~~~~~~~~~~~~~~~~
``astype(str)`` preserving missing values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is a long standing "bug" or misfeature, as discussed in https://github.com/pandas-dev/pandas/issues/25353.
With pandas < 3, when using ``astype(str)`` (using the built-in :func:`str`, not
``astype("str")``!), the operation would convert every element to a string,
including the missing values:
.. code-block:: python
# OLD behavior in pandas < 3
>>> ser = pd.Series(["a", np.nan], dtype=object)
>>> ser
0 a
1 NaN
dtype: object
>>> ser.astype(str)
0 a
1 nan
dtype: object
>>> ser.astype(str).to_numpy()
array(['a', 'nan'], dtype=object)
Note how ``NaN`` (``np.nan``) was converted to the string ``"nan"``. This was
not the intended behavior, and it was inconsistent with how other dtypes handled
missing values.
With pandas 3, this behavior has been fixed, and now ``astype(str)`` is an alias
for ``astype("str")``, i.e. casting to the new string dtype, which will preserve
the missing values:
.. code-block:: python
# NEW behavior in pandas 3
>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", np.nan], dtype=object)
>>> ser.astype(str)
0 a
1 NaN
dtype: str
>>> ser.astype(str).values
array(['a', nan], dtype=object)
If you want to preserve the old behaviour of converting every object to a
string, you can use ``ser.map(str)`` instead.
``prod()`` raising for string data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In pandas < 3, calling the :meth:`~pandas.Series.prod` method on a Series with
string data would generally raise an error, except when the Series was empty or
contained only a single string (potentially with missing values):
.. code-block:: python
>>> ser = pd.Series(["a", None], dtype=object)
>>> ser.prod()
'a'
When the Series contains multiple strings, it will raise a ``TypeError``. This
behaviour stays the same in pandas 3 when using the flexible ``object`` dtype.
But by virtue of using the new string dtype, this will generally consistently
raise an error regardless of the number of strings:
.. code-block:: python
>>> ser = pd.Series(["a", None], dtype="str")
>>> ser.prod()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
...
TypeError: Cannot perform reduction 'prod' with string dtype
.. For existing users of the nullable ``StringDtype``
.. --------------------------------------------------
.. TODO
|