1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
|
.. _NEP34:
===========================================================
NEP 34 — Disallow inferring ``dtype=object`` from sequences
===========================================================
:Author: Matti Picus
:Status: Final
:Type: Standards Track
:Created: 2019-10-10
:Resolution: https://mail.python.org/pipermail/numpy-discussion/2019-October/080200.html
Abstract
--------
When users create arrays with sequences-of-sequences, they sometimes err in
matching the lengths of the nested sequences_, commonly called "ragged
arrays". Here we will refer to them as ragged nested sequences. Creating such
arrays via ``np.array([<ragged_nested_sequence>])`` with no ``dtype`` keyword
argument will today default to an ``object``-dtype array. Change the behaviour to
raise a ``ValueError`` instead.
Motivation and Scope
--------------------
Users who specify lists-of-lists when creating a `numpy.ndarray` via
``np.array`` may mistakenly pass in lists of different lengths. Currently we
accept this input and automatically create an array with ``dtype=object``. This
can be confusing, since it is rarely what is desired. Changing the automatic
dtype detection to never return ``object`` for ragged nested sequences (defined as a
recursive sequence of sequences, where not all the sequences on the same
level have the same length) will force users who actually wish to create
``object`` arrays to specify that explicitly. Note that ``lists``, ``tuples``,
and ``nd.ndarrays`` are all sequences [0]_. See for instance `issue 5303`_.
Usage and Impact
----------------
After this change, array creation with ragged nested sequences must explicitly
define a dtype:
>>> np.array([[1, 2], [1]])
ValueError: cannot guess the desired dtype from the input
>>> np.array([[1, 2], [1]], dtype=object)
# succeeds, with no change from current behaviour
The deprecation will affect any call that internally calls ``np.asarray``. For
instance, the ``assert_equal`` family of functions calls ``np.asarray``, so
users will have to change code like::
np.assert_equal(a, [[1, 2], 3])
to::
np.assert_equal(a, np.array([[1, 2], 3], dtype=object))
Detailed description
--------------------
To explicitly set the shape of the object array, since it is sometimes hard to
determine what shape is desired, one could use:
>>> arr = np.empty(correct_shape, dtype=object)
>>> arr[...] = values
We will also reject mixed sequences of non-sequence and sequence, for instance
all of these will be rejected:
>>> arr = np.array([np.arange(10), [10]])
>>> arr = np.array([[range(3), range(3), range(3)], [range(3), 0, 0]])
Related Work
------------
`PR 14341`_ tried to raise an error when ragged nested sequences were specified
with a numeric dtype ``np.array, [[1], [2, 3]], dtype=int)`` but failed due to
false-positives, for instance ``np.array([1, np.array([5])], dtype=int)``.
.. _`PR 14341`: https://github.com/numpy/numpy/pull/14341
Implementation
--------------
The code to be changed is inside ``PyArray_GetArrayParamsFromObject`` and the
internal ``discover_dimensions`` function. The first implementation in `PR
14794`_ caused a number of downstream library failures and was reverted before
the release of 1.18. Subsequently downstream libraries fixed the places they
were using ragged arrays. The reimplementation became `PR 15119`_ which was
merged for the 1.19 release.
Backward compatibility
----------------------
Anyone depending on creating object arrays from ragged nested sequences will
need to modify their code. There will be a deprecation period during which the
current behaviour will emit a ``DeprecationWarning``.
Alternatives
------------
- We could continue with the current situation.
- It was also suggested to add a kwarg ``depth`` to array creation, or perhaps
to add another array creation API function ``ragged_array_object``. The goal
was to eliminate the ambiguity in creating an object array from ``array([[1,
2], [1]], dtype=object)``: should the returned array have a shape of
``(1,)``, or ``(2,)``? This NEP does not deal with that issue, and only
deprecates the use of ``array`` with no ``dtype=object`` for ragged nested
sequences. Users of ragged nested sequences may face another deprecation
cycle in the future. Rationale: we expect that there are very few users who
intend to use ragged arrays like that, this was never intended as a use case
of NumPy arrays. Users are likely better off with `another library`_ or just
using list of lists.
- It was also suggested to deprecate all automatic creation of ``object``-dtype
arrays, which would require adding an explicit ``dtype=object`` for something
like ``np.array([Decimal(10), Decimal(10)])``. This too is out of scope for
the current NEP. Rationale: it's harder to asses the impact of this larger
change, we're not sure how many users this may impact.
Discussion
----------
Comments to `issue 5303`_ indicate this is unintended behaviour as far back as
2014. Suggestions to change it have been made in the ensuing years, but none
have stuck. The WIP implementation in `PR 14794`_ seems to point to the
viability of this approach.
References and Footnotes
------------------------
.. _`issue 5303`: https://github.com/numpy/numpy/issues/5303
.. _sequences: https://docs.python.org/3.7/glossary.html#term-sequence
.. _`PR 14794`: https://github.com/numpy/numpy/pull/14794
.. _`PR 15119`: https://github.com/numpy/numpy/pull/15119
.. _`another library`: https://github.com/scikit-hep/awkward-array
.. [0] ``np.ndarrays`` are not recursed into, rather their shape is used
directly. This will not emit warnings::
ragged = np.array([[1], [1, 2, 3]], dtype=object)
np.array([ragged, ragged]) # no dtype needed
Copyright
---------
This document has been placed in the public domain.
|