File: how-to-filter-cut-mask.md

package info (click to toggle)
python-awkward 2.6.5-1
links: PTS, VCS
area: main
in suites: sid
size: 23,088 kB
sloc: python: 148,689; cpp: 33,562; sh: 432; makefile: 21; javascript: 8
file content (175 lines) | stat: -rw-r--r-- 4,889 bytes
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.10.3
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

How to filter arrays: cutting vs. masking
=========================================

```{code-cell} ipython3
import awkward as ak
import numpy as np
```

## The problem with slicing

When you write a mathematical formula using binary operators like `+` and `*`, or [NumPy universal functions (ufuncs)](https://numpy.org/doc/stable/reference/ufuncs.html) like `np.sqrt`, the shapes of nested lists must align. If the arrays in an expression were derived from a single array, this is often automatic. For instance,

```{code-cell} ipython3
original_array = ak.Array([
    [
        {"title": "zero", "x": 0, "y": 0},
        {"title": "one", "x": 1, "y": 1.1},
        {"title": "two", "x": 2, "y": 2.2},
    ],
    [],
    [
        {"title": "three", "x": 3, "y": 3.3},
        {"title": "four", "x": 4, "y": 4.4},
    ],
    [
        {"title": "five", "x": 5, "y": 5.5},
    ],
    [
        {"title": "six", "x": 6, "y": 6.6},
        {"title": "seven", "x": 7, "y": 7.7},
        {"title": "eight", "x": 8, "y": 8.8},
        {"title": "nine", "x": 9, "y": 9.9},
    ],
])
```

```{code-cell} ipython3
array_x = original_array.x
array_y = original_array.y
```

The `array_x` and `array_y` have the same number of lists and the same numbers of items in each list because they were both slices of the `original_array`.

```{code-cell} ipython3
array_x
```

```{code-cell} ipython3
array_y
```

Thus, they can be used together in a mathematical formula.

```{code-cell} ipython3
array_x**2 + array_y**2
```

However, if one array is sliced, or if the two arrays are sliced by different criteria, they would no longer line up:

```{code-cell} ipython3
sliced_x = array_x[array_x > 3]
sliced_y = array_y[array_y > 3]
```

```{code-cell} ipython3
sliced_x
```

```{code-cell} ipython3
sliced_y
```

Notice that the first was sliced with `array_x > 3` and the second was sliced with `array_y > 3`, and as a result, the third list differs in length between the two arrays:

```{code-cell} ipython3
sliced_x[2], sliced_y[2]
```

+++ {"editable": true, "slideshow": {"slide_type": ""}}

If we try to use these together, we get a ValueError:

```{code-cell} ipython3
---
editable: true
slideshow:
  slide_type: ''
tags: [raises-exception]
---
sliced_x**2 + sliced_y**2
```

Sometimes, these misalignments are overt, but sometimes they're subtle and embedded deep within a very large array. You can start investigating a problem like this with {func}`ak.num`:

```{code-cell} ipython3
ak.num(sliced_x) != ak.num(sliced_y)
```

```{code-cell} ipython3
np.nonzero(ak.to_numpy(ak.num(sliced_x) != ak.num(sliced_y)))
```

But it's also possible to avoid them in the first place.

## Masking with missing values

The problem was that the two arrays' shapes changed differently; instead, we'll slice them in such a way that their shapes don't change at all.

The {func}`ak.mask` function uses a boolean array like a slice, but takes values that line up with `False` and returns `None` instead of removing them.

```{code-cell} ipython3
ak.mask(array_x, array_x > 3)
```

It can also be accessed as an array property, with square brackets, so that it resembles a slice:

```{code-cell} ipython3
masked_x = array_x.mask[array_x > 3]
masked_y = array_y.mask[array_y > 3]
```

```{code-cell} ipython3
masked_x
```

```{code-cell} ipython3
masked_y
```

The results of these two masks can be used in a mathematical expression because they line up:

```{code-cell} ipython3
result = masked_x**2 + masked_y**2
result
```

Now only one problem remains: the `None` (missing) values might be undesirable in the output. There are several ways to get rid of them:

* {func}`ak.drop_none` eliminates `None`, like a slice, but it can be done once at the end of a calculation,
* {func}`ak.fill_none` replaces `None` with a chosen value,
* {func}`ak.flatten` removes list structure, and if the `None` values are at the level of a list (the ones in `result` aren't), they'll be removed too,
* {func}`ak.singletons` replaces `None` with `[]` and any other value `x` with `[x]`. The resulting lists all have length 0 or length 1.

```{code-cell} ipython3
ak.drop_none(result, axis=1)
```

```{code-cell} ipython3
ak.fill_none(result, -1, axis=1)
```

```{code-cell} ipython3
ak.singletons(result, axis=1)
```

As a final note, the difference between using {func}`ak.drop_none` and slicing with the result of {func}`ak.is_none` is that {func}`ak.drop_none` also removes "missingness" from the data type; a slice does not.

```{code-cell} ipython3
result[~ak.is_none(result, axis=1)]
```

(Note the `?` for "option-type" before `float64`. This could have consequences, good or bad, at a later stage in processing.)