A Design Specification for `nan_policy`
=======================================

Many functions in `scipy.stats` have a parameter called ``nan_policy``
that determines how the function handles data that contains ``nan``.  In
this section, we provide SciPy developer guidelines for how ``nan_policy``
is intended to be used, to ensure that as this parameter is added to new
functions, we maintain a consistent API.

The basic API
-------------

The parameter ``nan_policy`` accepts three possible strings: ``'omit'``,
``'raise'`` and ``'propagate'``.  The meanings are:

* ``nan_policy='omit'``:
  Ignore occurrences of ``nan`` in the input.  Do not generate a warning
  if the input contains ``nan``. For example, for the simple case of a
  function that accepts a single array (and ignoring the possible use of
  ``axis`` for the moment)::

      func([1.0, 3.0, np.nan, 5.0], nan_policy='omit')

  should behave the same as::

      func([1.0, 3.0, 5.0])

  More generally, ``func(a, nan_policy='omit')`` should behave the same as
  ``func(a[~np.isnan(a)])``.

  Unit tests for this property should be used to test functions that
  handle ``nan_policy``.

  For functions that accept two or more arguments but whose values are
  not related, the same idea applies to each input array.  So::

      func(a, b, nan_policy='omit')

  should behave the same as::

      func(a[~np.isnan(a)], b[~np.isnan(b)])

  For inputs with *related* or *paired* values, the recommended behavior
  is to omit all the values for which any of the related values are ``nan``.
  For a function with two related array inputs, this means::

      y = func(a, b, nan_policy='omit')

  should behave the same as::

      hasnan = np.isnan(a) | np.isnan(b)  # Union of the isnan masks.
      y = func(a[~hasnan], b[~hasnan])

  The docstring for such a function should clearly state this behavior.

* ``nan_policy='raise'``:
  Raise a ``ValueError`` if the input contains ``nan``.
* ``nan_policy='propagate'``:
  Propagate the ``nan`` value to the output.  Typically, this means just
  executing the function without checking for ``nan``, but see
  https://github.com/scipy/scipy/issues/7818 for an example where that
  might lead to unexpected output.
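
The three policies can be illustrated with a minimal sketch.  The helper
names ``apply_nan_policy`` and ``omit_paired`` are hypothetical and exist
only for this example; they are not part of SciPy, and a real function
would normally handle ``nan_policy`` internally rather than through a
wrapper like this.

```python
import numpy as np


def apply_nan_policy(func, a, nan_policy="propagate"):
    """Hypothetical helper: apply a 1-d reducer `func` under `nan_policy`."""
    if nan_policy not in ("propagate", "raise", "omit"):
        raise ValueError("nan_policy must be 'propagate', 'raise' or 'omit'")
    a = np.asarray(a, dtype=float)
    isnan = np.isnan(a)
    if nan_policy == "raise" and isnan.any():
        raise ValueError("The input contains nan values")
    if nan_policy == "omit":
        # Same as calling func(a[~np.isnan(a)]).
        a = a[~isnan]
    # 'propagate': call func without any special handling of nan.
    return func(a)


def omit_paired(a, b):
    """Hypothetical helper: drop every pair in which either value is nan."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    hasnan = np.isnan(a) | np.isnan(b)  # Union of the isnan masks.
    return a[~hasnan], b[~hasnan]


# np.mean stands in for `func` in this sketch.
omitted = apply_nan_policy(np.mean, [1.0, 3.0, np.nan, 5.0], nan_policy="omit")
propagated = apply_nan_policy(np.mean, [1.0, 3.0, np.nan, 5.0])
a_kept, b_kept = omit_paired([1.0, np.nan, 3.0], [4.0, 5.0, np.nan])
```

A unit test for the ``'omit'`` property can then compare ``omitted`` with
``np.mean([1.0, 3.0, 5.0])`` directly.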


``nan_policy`` combined with an ``axis`` parameter
--------------------------------------------------
There is nothing surprising here: the principle mentioned above still
applies when the function has an ``axis`` parameter.  Suppose, for example,
``func`` reduces a 1-d array to a scalar, and handles n-d arrays as a
collection of 1-d arrays, with the ``axis`` parameter specifying the axis
along which the reduction is to be applied.  If, say::

    func([1, 3, 4])     -> 10.0
    func([2, -3, 8, 2]) ->  4.2
    func([7, 8])        ->  9.5
    func([])            -> -inf

then::

    func([[  1, nan,   3,   4],
          [  2,  -3,   8,   2],
          [nan,   7, nan,   8],
          [nan, nan, nan, nan]], nan_policy='omit', axis=-1)

must give the result::

    np.array([10.0, 4.2, 9.5, -inf])
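
As a rough sketch of this slice-by-slice behavior, the hypothetical helper
``reduce_omitting_nan`` below removes ``nan`` from each 1-d slice before
reducing it.  ``np.sum`` is used as a stand-in for ``func`` (an assumption
made only so the example is runnable), so the all-``nan`` row produces
``np.sum([])``, which is ``0.0``.  A production implementation would
probably avoid the Python-level loop that ``np.apply_along_axis`` implies.

```python
import numpy as np


def reduce_omitting_nan(func, a, axis=-1):
    """Hypothetical helper: apply the 1-d reducer `func` along `axis`,
    dropping nan from each 1-d slice first."""
    a = np.asarray(a, dtype=float)

    def reduce_slice(x):
        return func(x[~np.isnan(x)])

    return np.apply_along_axis(reduce_slice, axis, a)


nan = np.nan
data = [[1, nan, 3, 4],
        [2, -3, 8, 2],
        [nan, 7, nan, 8],
        [nan, nan, nan, nan]]

# Each row is reduced independently, after its own nan values are removed.
row_result = reduce_omitting_nan(np.sum, data, axis=-1)
# The same rule applies along any other axis.
col_result = reduce_omitting_nan(np.sum, data, axis=0)
```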


Edge cases
----------
A function that implements the ``nan_policy`` parameter should gracefully
handle the case where *all* the values in the input array(s) are ``nan``.
The basic principle described above still applies::

    func([nan, nan, nan], nan_policy='omit')

should behave the same as::

    func([])

In practice, when adding ``nan_policy`` to an existing function, it is
not unusual to find that the function doesn't already handle this case
in a well-defined manner, and some thought and design may have to be
applied to ensure that it works.  The correct behavior (whether that be
to return ``nan``, return some other value, raise an exception, or something
else) will be determined on a case-by-case basis.
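
For illustration only, here is one way such a decision might look for a
mean-like function.  The name ``mean_omit_nan`` is hypothetical, and
returning ``nan`` for an empty sample is an assumption made for this
sketch, not a rule; as noted above, the right answer depends on the
function.

```python
import numpy as np


def mean_omit_nan(a):
    """Hypothetical 1-d mean that omits nan and handles the empty case."""
    a = np.asarray(a, dtype=float)
    kept = a[~np.isnan(a)]
    if kept.size == 0:
        # Assumed design decision for this sketch: an empty sample has
        # no mean, so return nan explicitly instead of relying on
        # whatever the underlying computation happens to do.
        return np.nan
    return kept.mean()


# All-nan input and empty input take the same code path.
all_nan_result = mean_omit_nan([np.nan, np.nan, np.nan])
empty_result = mean_omit_nan([])
```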


Why doesn't ``nan_policy`` also apply to ``inf``?
--------------------------------------------------
Although we learn in grade school that "infinity is not a number", the
floating point values ``nan`` and ``inf`` are qualitatively different.
The values ``inf`` and ``-inf`` act much more like regular floating
point values than ``nan``.

* One can compare ``inf`` to other floating point values and it behaves
  as expected, e.g. ``3 < inf`` is True.
* For the most part, arithmetic works "as expected" with ``inf``,
  e.g. ``inf + inf = inf``, ``-2*inf = -inf``, ``1/inf = 0``,
  etc.
* Many existing functions work "as expected" with ``inf``:
  ``np.log(inf) = inf``, ``np.exp(-inf) = 0``,
  ``np.array([1.0, -1.0, np.inf]).min() = -1.0``, etc.

So while ``nan`` almost always means "something went wrong" or "something
is missing", ``inf`` can in many cases be treated as a useful floating
point value.

Leaving ``inf`` in place is also consistent with the NumPy ``nan``
functions, which ignore ``nan`` but not ``inf``::

    >>> np.nanmax([1, 2, 3, np.inf, np.nan])
    inf
    >>> np.nansum([1, 2, 3, np.inf, np.nan])
    inf
    >>> np.nanmean([8, -np.inf, 9, 1, np.nan])
    -inf
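
The contrast between ``inf`` and ``nan`` described in this section can be
checked directly.  The snippet below is only a sanity check of the
statements above; the variable names are arbitrary.

```python
import numpy as np

inf, nan = np.inf, np.nan

# inf orders and combines like an ordinary (very large) float.
inf_checks = [
    3 < inf,
    inf + inf == inf,
    -2 * inf == -inf,
    1 / inf == 0.0,
    np.log(inf) == inf,
    np.exp(-inf) == 0.0,
    np.array([1.0, -1.0, inf]).min() == -1.0,
]

# nan compares False with everything, including itself.
nan_checks = [nan < 3, nan > 3, nan == nan]

# Unlike inf, a single nan takes over an ordinary reduction.
nan_min = np.array([1.0, -1.0, nan]).min()
```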


How *not* to implement ``nan_policy``
-------------------------------------
In the past (and possibly currently), some ``stats`` functions handled
``nan_policy`` by using a masked array to mask the ``nan`` values, and
then computing the result using the functions in the ``mstats`` subpackage.
The problem with this approach is that the masked array code might convert
``inf`` to a masked value, which we don't want to do (see above).  It also
means that, if care is not taken, the return value will be a masked array,
which will likely be a surprise to the user if they passed in regular arrays.
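
A small sketch of both pitfalls follows.  It uses ``np.ma.masked_invalid``
as one example of masked array code that masks ``inf`` along with ``nan``;
whether a given ``stats`` function ever used that particular call is not
claimed here.

```python
import numpy as np

a = np.array([1.0, np.inf, np.nan])

# masked_invalid masks every non-finite value, so inf is lost too.
masked = np.ma.masked_invalid(a)
masked_max = masked.max()

# Removing only nan keeps inf, which is the intended behavior.
plain_max = a[~np.isnan(a)].max()

# Reductions on masked input return masked arrays unless converted back.
b = np.ma.masked_invalid(np.array([[1.0, np.nan], [3.0, 4.0]]))
row_means = b.mean(axis=-1)
```

Here ``masked_max`` is ``1.0`` while ``plain_max`` is ``inf``, and
``row_means`` is a ``numpy.ma.MaskedArray`` rather than a regular array.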