---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.1
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
How to create arrays of missing data
====================================
Data at any level of an Awkward Array can be "missing," represented by `None` in Python.
This functionality is somewhat like NumPy's [masked arrays](https://numpy.org/doc/stable/reference/maskedarray.html), but masked arrays can only declare numerical values to be missing (not, for instance, a row of a 2-dimensional array) and they represent missing data with an `np.ma.masked` object instead of `None`.
Pandas also handles missing data, but in several different ways. For floating point columns, `NaN` (not a number) is used to mean "missing," and [as of version 1.0](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data-na), Pandas has a `pd.NA` object for missing data in other data types.
In Awkward Array, floating point `NaN` and a missing value are clearly distinct. Missing data, like all data in Awkward Arrays, are also not represented by any Python object; they are converted _to_ and _from_ `None` by {func}`ak.to_list` and {func}`ak.from_iter`.
```{code-cell} ipython3
import awkward as ak
import numpy as np
```
From Python None
----------------
The {class}`ak.Array` constructor and {func}`ak.from_iter` interpret `None` as a missing value, and {func}`ak.to_list` converts missing values back into `None`.
```{code-cell} ipython3
ak.Array([1, 2, 3, None, 4, 5])
```
The missing values can be deeply nested (missing integers):
```{code-cell} ipython3
ak.Array([[[[], [1, 2, None]]], [[[3]]], []])
```
They can be shallow (missing lists):
```{code-cell} ipython3
ak.Array([[[[], [1, 2]]], None, [[[3]]], []])
```
Or both:
```{code-cell} ipython3
ak.Array([[[[], [3]]], None, [[[None]]], []])
```
Records can also be missing:
```{code-cell} ipython3
ak.Array([{"x": 1, "y": 1}, None, {"x": 2, "y": 2}])
```
Potentially missing values are represented in the type string as "`?`" before the type name, or as "`option[...]`" when the nested type is a list, which needs to be bracketed for clarity.
+++
From NumPy arrays
-----------------
Normal NumPy arrays can't represent missing data, but masked arrays can. Here is how one is constructed in NumPy:
```{code-cell} ipython3
numpy_array = np.ma.MaskedArray([1, 2, 3, 4, 5], [False, False, True, True, False])
numpy_array
```
It returns `np.ma.masked` objects if you try to access missing values:
```{code-cell} ipython3
numpy_array[0], numpy_array[1], numpy_array[2], numpy_array[3], numpy_array[4]
```
But it uses `None` for missing values in `tolist`:
```{code-cell} ipython3
numpy_array.tolist()
```
The {func}`ak.from_numpy` function converts masked arrays into Awkward Arrays with missing values, as does the {class}`ak.Array` constructor.
```{code-cell} ipython3
awkward_array = ak.Array(numpy_array)
awkward_array
```
The reverse, {func}`ak.to_numpy`, returns masked arrays if the Awkward Array has missing data.
```{code-cell} ipython3
ak.to_numpy(awkward_array)
```
But [np.asarray](https://numpy.org/doc/stable/reference/generated/numpy.asarray.html), the usual way of casting data as NumPy arrays, does not: it raises an error when the Awkward Array has missing data. ([np.asarray](https://numpy.org/doc/stable/reference/generated/numpy.asarray.html) is supposed to return a plain [np.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html), which [np.ma.masked_array](https://numpy.org/doc/stable/reference/generated/numpy.ma.masked_array.html) is not.)
```{code-cell} ipython3
:tags: [raises-exception]
np.asarray(awkward_array)
```
Missing rows vs missing numbers
-------------------------------
In Awkward Array, a missing list is a different thing from a list whose values are missing. NumPy masked arrays can only mask individual values, so {func}`ak.to_numpy` converts a missing list into a row of masked values.
```{code-cell} ipython3
missing_row = ak.Array([[1, 2, 3], None, [4, 5, 6]])
missing_row
```
```{code-cell} ipython3
ak.to_numpy(missing_row)
```
NaN is not missing
------------------
Floating point `NaN` values are simply unrelated to missing values, in both Awkward Array and NumPy.
```{code-cell} ipython3
missing_with_nan = ak.Array([1.1, 2.2, np.nan, None, 3.3])
missing_with_nan
```
```{code-cell} ipython3
ak.to_numpy(missing_with_nan)
```
Missing values as empty lists
-----------------------------
Sometimes, it's useful to think about a potentially missing value as a length-1 list if it is not missing and a length-0 list if it is. (Some languages define the [option type as a kind of list](https://www.scala-lang.org/api/2.13.3/scala/Option.html).)
The Awkward functions {func}`ak.singletons` and {func}`ak.firsts` convert between "`None` form" and "lists form": {func}`ak.singletons` goes from `None` form to lists form, and {func}`ak.firsts` goes back.
```{code-cell} ipython3
none_form = ak.Array([1, 2, 3, None, None, 5])
none_form
```
```{code-cell} ipython3
lists_form = ak.singletons(none_form)
lists_form
```
```{code-cell} ipython3
ak.firsts(lists_form)
```
Masking instead of slicing
--------------------------
The most common way of filtering data is to slice it with an array of booleans (usually the result of a calculation).
```{code-cell} ipython3
array = ak.Array([1, 2, 3, 4, 5])
array
```
```{code-cell} ipython3
booleans = ak.Array([True, True, False, False, True])
booleans
```
```{code-cell} ipython3
array[booleans]
```
The data can also be effectively filtered by replacing values with `None`. The following syntax does that:
```{code-cell} ipython3
array.mask[booleans]
```
(Or use the {func}`ak.mask` function.)
+++
An advantage of masking is that the length and nesting structure of the masked array are the same as those of the original array, so anything that broadcasts with one broadcasts with the other (so that unfiltered data can be used interchangeably with filtered data).
```{code-cell} ipython3
array + array.mask[booleans]
```
whereas
```{code-cell} ipython3
:tags: [raises-exception]
array + array[booleans]
```
With ArrayBuilder
-----------------
{class}`ak.ArrayBuilder` is described in more detail [in this tutorial](how-to-create-arraybuilder), but you can add missing values to an array using the `null` method or appending `None`.
(This is what {func}`ak.from_iter` uses internally to accumulate data.)
```{code-cell} ipython3
builder = ak.ArrayBuilder()
builder.append(1)
builder.append(2)
builder.null()
builder.append(None)
builder.append(3)
array = builder.snapshot()
array
```
In Numba
--------
Functions that Numba Just-In-Time (JIT) compiles can use {class}`ak.ArrayBuilder` or construct a boolean array for {func}`ak.mask`.
({class}`ak.ArrayBuilder` can't be constructed or converted to an array using `snapshot` inside a JIT-compiled function, but can be outside the compiled context.)
```{code-cell} ipython3
import numba as nb
```
```{code-cell} ipython3
@nb.jit
def example(builder):
    builder.append(1)
    builder.append(2)
    builder.null()
    builder.append(None)
    builder.append(3)
    return builder


builder = example(ak.ArrayBuilder())

array = builder.snapshot()
array
```
```{code-cell} ipython3
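@nb.jit
def faster_example():
    # np.empty leaves the contents uninitialized; every slot is set below
    # or hidden by the mask
    data = np.empty(5, np.int64)
    mask = np.empty(5, np.bool_)
    # True in the mask means that the value is valid (not missing)
    data[0] = 1
    mask[0] = True
    data[1] = 2
    mask[1] = True
    # data[2] and data[3] are never assigned; they are masked out
    mask[2] = False
    mask[3] = False
    data[4] = 5
    mask[4] = True
    return data, mask


data, mask = faster_example()

array = ak.mask(data, mask)
array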
```