File: how-to-combinatorics-best-match.md

package info (click to toggle)
python-awkward 2.6.5-1
links: PTS, VCS
area: main
in suites: sid
size: 23,088 kB
sloc: python: 148,689; cpp: 33,562; sh: 432; makefile: 21; javascript: 8
file content (168 lines) | stat: -rw-r--r-- 5,727 bytes
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.10.3
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

How to find the best match between two collections using Cartesian (cross) product
==================================================================================

In high energy physics (HEP), {func}`ak.combinations` is often needed to find particles whose trajectories are close to each other, separately in many high-energy collision events (`axis=1`). In some applications, the two collections that need to be matched are simulated particles and reconstructed versions of those particles ("gen-reco matching"), and in other applications, the two collections are different types of particles, such as electrons and jets.

I'll describe how to solve such a problem on this page, but avoid domain-specific jargon by casting it as a problem of finding the distance between bunnies and foxes—if a bunny is too close to a fox, it will get eaten!

```{code-cell} ipython3
import awkward as ak
import numpy as np
```

## Setting up the problem

In 1000 separate yards (big suburb), there's a random number of bunnies and a random number of foxes, each with random _x_, _y_ positions. We're making ragged arrays of records using {func}`ak.unflatten` and {func}`ak.zip`.

```{code-cell} ipython3
np.random.seed(12345)

number_of_bunnies = np.random.poisson(3.5, 1000)   # average of 3.5 bunnies/yard
number_of_foxes = np.random.poisson(1.5, 1000)     # average of 1.5 foxes/yard

bunny_xy = np.random.normal(0, 1, (number_of_bunnies.sum(), 2))
fox_xy = np.random.normal(0, 1, (number_of_foxes.sum(), 2))

bunnies = ak.unflatten(ak.zip({"x": bunny_xy[:, 0], "y": bunny_xy[:, 1]}), number_of_bunnies)
foxes = ak.unflatten(ak.zip({"x": fox_xy[:, 0], "y": fox_xy[:, 1]}), number_of_foxes)
```

```{code-cell} ipython3
bunnies
```

```{code-cell} ipython3
foxes
```

## Find all combinations

In each yard, we find all bunny-fox pairs, regardless of whether they're close or not using {func}`ak.cartesian`, and then unpacking the pairs with {func}`ak.unzip`.

```{code-cell} ipython3
pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes]))
```

These two arrays, `pair_bunnies` and `pair_foxes`, have the same type as `bunnies` and `foxes`, but different numbers of items in each list because now they're paired to match each other. Both kinds of animals are duplicated to enable this match.

```{code-cell} ipython3
pair_bunnies
```

```{code-cell} ipython3
pair_foxes
```

The two arrays have the same list lengths as each other because they came from the same {func}`ak.unzip`.

```{code-cell} ipython3
ak.num(pair_bunnies), ak.num(pair_foxes)
```

## Calculating distances

Since the arrays have the same shapes, they can be used in the same mathematical formula. Here's the formula for distance:

```{code-cell} ipython3
distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances
```

Let's say that 1 unit is close enough for a bunny to be eaten.

```{code-cell} ipython3
eaten = (distances < 1)
eaten
```

This is great (not for the bunnies, but perhaps for the foxes). However, if we want to use this information on the original arrays, we're stuck: this array has a different shape from the original `bunnies` (and the original `foxes`).

Perhaps the question we really wanted to ask is, "For each bunny, is there _any_ fox that can eat it?"

## Combinations with `nested=True`

Asking a question about _any_ fox means performing a reducer, {func}`ak.any`, over lists, one list per bunny. The list would be all of the foxes in its yard. For that, we'll need to pass `nested=True` to {func}`ak.cartesian`.

```{code-cell} ipython3
pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes], nested=True))
```

Now `pair_bunnies` and `pair_foxes` are one list-depth deeper than the original `bunnies` and `foxes`.

```{code-cell} ipython3
pair_bunnies
```

```{code-cell} ipython3
pair_foxes
```

We can compute `distances` in the same way, though it's also one list-depth deeper.

```{code-cell} ipython3
distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances
```

Similarly for `eaten`.

```{code-cell} ipython3
eaten = (distances < 1)
eaten
```

Now each inner list of booleans is answering the questions, "Can fox 0 eat me?", "Can fox 1 eat me?", ..., "Can fox _n_ eat me?" and there are exactly as many of these lists as there are bunnies. Applying {func}`ak.any` over the innermost lists (`axis=-1`),

```{code-cell} ipython3
bunny_eaten = ak.any(eaten, axis=-1)
bunny_eaten
```

We've now answered the question, "Can any fox eat me?" for each bunny. After the mayhem, these are the bunnies we have left:

```{code-cell} ipython3
bunnies[~bunny_eaten]
```

Whereas there was originally an average of 3.5 bunnies per yard, by construction,

```{code-cell} ipython3
ak.mean(ak.num(bunnies, axis=1))
```

Now there's only

```{code-cell} ipython3
ak.mean(ak.num(bunnies[~bunny_eaten], axis=1))
```

left.

## Asymmetry in the problem

The way we performed this calculation was asymmetric: for each bunny, we asked if it was eaten. We could have performed a similar, but different, calculation to ask, which foxes get to eat? To do that, we must reverse the order of arguments because `nested=True` groups from the left.

```{code-cell} ipython3
pair_foxes, pair_bunnies = ak.unzip(ak.cartesian([foxes, bunnies], nested=True))

distances = np.sqrt((pair_foxes.x - pair_bunnies.x)**2 + (pair_foxes.y - pair_bunnies.y)**2)

eating = (distances < 1)

fox_eats = ak.any(eating, axis=-1)

foxes[fox_eats]
```