File: how-to-combinatorics-best-match.md

package info (click to toggle)
python-awkward 2.6.5-1
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 23,088 kB
  • sloc: python: 148,689; cpp: 33,562; sh: 432; makefile: 21; javascript: 8
file content (168 lines) | stat: -rw-r--r-- 5,727 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.10.3
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

How to find the best match between two collections using Cartesian (cross) product
==================================================================================

In high energy physics (HEP), {func}`ak.combinations` is often needed to find particles whose trajectories are close to each other, separately in many high-energy collision events (`axis=1`). In some applications, the two collections that need to be matched are simulated particles and reconstructed versions of those particles ("gen-reco matching"), and in other applications, the two collections are different types of particles, such as electrons and jets.

I'll describe how to solve such a problem on this page, but avoid domain-specific jargon by casting it as a problem of finding the distance between bunnies and foxes—if a bunny is too close to a fox, it will get eaten!

```{code-cell} ipython3
import awkward as ak
import numpy as np
```

## Setting up the problem

In 1000 separate yards (big suburb), there's a random number of bunnies and a random number of foxes, each with random _x_, _y_ positions. We're making ragged arrays of records using {func}`ak.unflatten` and {func}`ak.zip`.

```{code-cell} ipython3
np.random.seed(12345)

number_of_bunnies = np.random.poisson(3.5, 1000)   # average of 3.5 bunnies/yard
number_of_foxes = np.random.poisson(1.5, 1000)     # average of 1.5 foxes/yard

bunny_xy = np.random.normal(0, 1, (number_of_bunnies.sum(), 2))
fox_xy = np.random.normal(0, 1, (number_of_foxes.sum(), 2))

bunnies = ak.unflatten(ak.zip({"x": bunny_xy[:, 0], "y": bunny_xy[:, 1]}), number_of_bunnies)
foxes = ak.unflatten(ak.zip({"x": fox_xy[:, 0], "y": fox_xy[:, 1]}), number_of_foxes)
```

```{code-cell} ipython3
bunnies
```

```{code-cell} ipython3
foxes
```

## Find all combinations

In each yard, we find all bunny-fox pairs, regardless of whether they're close or not using {func}`ak.cartesian`, and then unpacking the pairs with {func}`ak.unzip`.

```{code-cell} ipython3
pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes]))
```

These two arrays, `pair_bunnies` and `pair_foxes`, have the same type as `bunnies` and `foxes`, but different numbers of items in each list because now they're paired to match each other. Both kinds of animals are duplicated to enable this match.

```{code-cell} ipython3
pair_bunnies
```

```{code-cell} ipython3
pair_foxes
```

The two arrays have the same list lengths as each other because they came from the same {func}`ak.unzip`.

```{code-cell} ipython3
ak.num(pair_bunnies), ak.num(pair_foxes)
```

## Calculating distances

Since the arrays have the same shapes, they can be used in the same mathematical formula. Here's the formula for distance:

```{code-cell} ipython3
distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances
```

Let's say that 1 unit is close enough for a bunny to be eaten.

```{code-cell} ipython3
eaten = (distances < 1)
eaten
```

This is great (not for the bunnies, but perhaps for the foxes). However, if we want to use this information on the original arrays, we're stuck: this array has a different shape from the original `bunnies` (and the original `foxes`).

Perhaps the question we really wanted to ask is, "For each bunny, is there _any_ fox that can eat it?"

## Combinations with `nested=True`

Asking a question about _any_ fox means performing a reducer, {func}`ak.any`, over lists, one list per bunny. The list would be all of the foxes in its yard. For that, we'll need to pass `nested=True` to {func}`ak.cartesian`.

```{code-cell} ipython3
pair_bunnies, pair_foxes = ak.unzip(ak.cartesian([bunnies, foxes], nested=True))
```

Now `pair_bunnies` and `pair_foxes` are one list-depth deeper than the original `bunnies` and `foxes`.

```{code-cell} ipython3
pair_bunnies
```

```{code-cell} ipython3
pair_foxes
```

We can compute `distances` in the same way, though it's also one list-depth deeper.

```{code-cell} ipython3
distances = np.sqrt((pair_bunnies.x - pair_foxes.x)**2 + (pair_bunnies.y - pair_foxes.y)**2)
distances
```

Similarly for `eaten`.

```{code-cell} ipython3
eaten = (distances < 1)
eaten
```

Now each inner list of booleans is answering the questions, "Can fox 0 eat me?", "Can fox 1 eat me?", ..., "Can fox _n_ eat me?" and there are exactly as many of these lists as there are bunnies. Applying {func}`ak.any` over the innermost lists (`axis=-1`),

```{code-cell} ipython3
bunny_eaten = ak.any(eaten, axis=-1)
bunny_eaten
```

We've now answered the question, "Can any fox eat me?" for each bunny. After the mayhem, these are the bunnies we have left:

```{code-cell} ipython3
bunnies[~bunny_eaten]
```

Whereas there was originally an average of 3.5 bunnies per yard, by construction,

```{code-cell} ipython3
ak.mean(ak.num(bunnies, axis=1))
```

Now there's only

```{code-cell} ipython3
ak.mean(ak.num(bunnies[~bunny_eaten], axis=1))
```

left.

## Asymmetry in the problem

The way we performed this calculation was asymmetric: for each bunny, we asked if it was eaten. We could have performed a similar, but different, calculation to ask, which foxes get to eat? To do that, we must reverse the order of arguments because `nested=True` groups from the left.

```{code-cell} ipython3
pair_foxes, pair_bunnies = ak.unzip(ak.cartesian([foxes, bunnies], nested=True))

distances = np.sqrt((pair_foxes.x - pair_bunnies.x)**2 + (pair_foxes.y - pair_bunnies.y)**2)

eating = (distances < 1)

fox_eats = ak.any(eating, axis=-1)

foxes[fox_eats]
```