1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447
|
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.14.0
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---
How to create arrays with ArrayBuilder (easy and general)
=========================================================
If you're not getting data from a file or conversion from another format, you may need to create it from scratch. {class}`ak.ArrayBuilder` is a general, high-level way to do that, though it has performance limitations.
The biggest difference between an {class}`ak.ArrayBuilder` and an {class}`ak.Array` is that you can append data to a builder, but an array is immutable ([see qualifications on mutability](how-to-convert-numpy.html#mutability-of-awkward-arrays-from-numpy)). It's a bit like a Python list, which has an `append` method, but {class}`ak.ArrayBuilder` has many methods for appending different types of structures.
+++
Appending
---------
```{code-cell} ipython3
import awkward as ak
```
When a builder is first created, it has zero length and unknown type.
```{code-cell} ipython3
builder = ak.ArrayBuilder()
builder
```
Calling its `append` method adds data and also determines its type.
```{code-cell} ipython3
builder.append(1)
builder
```
```{code-cell} ipython3
builder.append(2.2)
builder
```
```{code-cell} ipython3
builder.append(3 + 1j)
builder
```
Note that this can include missing data by promoting to an {class}`option-type <ak.types.OptionType>`,
```{code-cell} ipython3
builder.append(None)
builder
```
and mix types by promoting to a {class}`union-type <ak.types.UnionType>`:
```{code-cell} ipython3
builder.append("five")
builder
```
```{code-cell} ipython3
builder.type.show()
```
We've been using "`append`" because it is generic (it recognizes the types of its arguments and builds that), but there are also methods for building structure explicitly.
```{code-cell} ipython3
builder = ak.ArrayBuilder()
builder.boolean(False)
builder.integer(1)
builder.real(2.2)
builder.complex(3 + 1j)
builder.null()
builder.string("five")
builder
```
```{code-cell} ipython3
builder.type.show()
```
Snapshot
--------
To turn an {class}`ak.ArrayBuilder` into an {class}`ak.Array`, call `snapshot`. This is an inexpensive operation (may be done multiple times; the builder is unaffected).
```{code-cell} ipython3
array = builder.snapshot()
array
```
Builders don't have all the high-level methods that arrays do, so if you want to use the array for normal analysis, remember to take a snapshot.
+++
Nested lists
------------
The most useful of these create nested data structures:
* `begin_list`/`end_list`
* `begin_record`/`end_record`
* `begin_tuple`/`end_tuple`
which switch into a mode that starts filling inside of a list, record, or tuple. For records and tuples, you additionally have to specify the `field` or `index` of the record or tuple (respectively).
```{code-cell} ipython3
builder = ak.ArrayBuilder()
builder.begin_list()
builder.append(1.1)
builder.append(2.2)
builder.append(3.3)
builder.end_list()
builder.begin_list()
builder.end_list()
builder.begin_list()
builder.append(4.4)
builder.append(5.5)
builder.end_list()
builder
```
Appending after the `begin_list` puts data inside the list, rather than outside:
```{code-cell} ipython3
builder.append(9.9)
builder
```
This `9.9` is outside of the lists, and hence the type is now "lists of numbers *or* numbers."
```{code-cell} ipython3
builder.type.show()
```
Since `begin_list` and `end_list` are imperative, the nesting structure of an array can be determined by program flow:
```{code-cell} ipython3
def arbitrary_nesting(builder, depth):
if depth == 0:
builder.append(1)
builder.append(2)
builder.append(3)
else:
builder.begin_list()
arbitrary_nesting(builder, depth - 1)
builder.end_list()
builder = ak.ArrayBuilder()
arbitrary_nesting(builder, 5)
builder
```
Often, you'll know the exact depth of nesting you want. The Python `with` statement can be used to restrict the generality (nd free you from having to remember to `end` what you `begin`).
```{code-cell} ipython3
builder = ak.ArrayBuilder()
with builder.list():
with builder.list():
builder.append(1)
builder.append(2)
builder.append(3)
builder
```
(Note that the Python `with` statement, a.k.a. "context manager," is not available in Numba jit-compiled functions, in case you're using {class}`ak.ArrayBuilder` in Numba.)
+++
Nested records
--------------
When using `begin_record`/`end_record` (or the equivalent `record` in the `with` statement), you have to specify which field each "`append`" is associated with.
* `field("fieldname")`: switches to fill a field with a given name (and returns the builder, for convenience).
```{code-cell} ipython3
builder = ak.ArrayBuilder()
with builder.record():
builder.field("x").append(1)
builder.field("y").append(2.2)
builder.field("z").append("three")
builder
```
The record type can also be given a name.
```{code-cell} ipython3
builder = ak.ArrayBuilder()
with builder.record("Point"):
builder.field("x").real(1.1)
builder.field("y").real(2.2)
builder.field("z").real(3.3)
builder
```
This gives the resulting records a type named "`Point`", which might have specialized behaviors.
```{code-cell} ipython3
array = builder.snapshot()
array
```
Nested tuples
-------------
The same is true for tuples, but the next field to fill is selected by "`index`" (integer), rather than "`field`" (string), and the tuple size has to be given up-front.
```{code-cell} ipython3
builder = ak.ArrayBuilder()
with builder.tuple(3):
builder.index(0).append(1)
builder.index(1).append(2.2)
builder.index(2).append("three")
builder
```
Records and unions
------------------
If the set of fields changes while collecting records, the builder algorithm could handle it one of two possible ways:
1. Assume that the new field or fields have simply been missing up to this point, and that any now-unspecified fields are also missing.
2. Assume that a different set of fields means a different type and make a union.
By default, {class}`ak.ArrayBuilder` follows policy (1), but it can be made to follow policy (2) if the names of the records are different.
```{code-cell} ipython3
policy1 = ak.ArrayBuilder()
with policy1.record():
policy1.field("x").append(1)
policy1.field("y").append(1.1)
with policy1.record():
policy1.field("y").append(2.2)
policy1.field("z").append("three")
policy1.type.show()
```
```{code-cell} ipython3
policy2 = ak.ArrayBuilder()
with policy2.record("First"):
policy2.field("x").append(1)
policy2.field("y").append(1.1)
with policy2.record("Second"):
policy2.field("y").append(2.2)
policy2.field("z").append("three")
policy2.type.show()
```
Comments on union-type
----------------------
Although it's easy to make {class}`union-type <ak.types.UnionType>` data with {class}`ak.ArrayBuilder`, the applications of union-type data are more limited. For instance, we can select a field that belongs to _all_ types of the union, but not any fields that don't share that field.
```{code-cell} ipython3
array2 = policy2.snapshot()
array2
```
```{code-cell} ipython3
array2.y
```
```{code-cell} ipython3
:tags: [raises-exception]
array2.x
```
The above would be no problem for records collected using policy 1 (see previous section).
```{code-cell} ipython3
array1 = policy1.snapshot()
array1
```
```{code-cell} ipython3
array1.y
```
```{code-cell} ipython3
array1.x
```
At the time of writing, {class}`union-type <ak.types.UnionType>` are not supported in Numba ([issue 174](https://github.com/scikit-hep/awkward-1.0/issues/174)).
+++
Use in Numba
------------
{class}`ak.ArrayBuilder` can be used in Numba-compiled functions, and that can often be the most convenient way to build up an array, relatively quickly (see below).
There are a few limitations, though:
* At the time of writing, [Numba doesn't support Python's `with` statement](https://numba.pydata.org/numba-doc/dev/reference/pysupported.html#language) (context manager), so `begin_list`/`end_list` will have to be used instead.
* Builders cannot be constructed inside of the compiled function; they have to be passed in.
* The `snapshot` method cannot be called inside of the compiled function; it has to be applied to the output.
Therefore, a common pattern is:
```{code-cell} ipython3
import numba as nb
@nb.jit
def build(builder):
builder.begin_list()
builder.append(1.1)
builder.append(2.2)
builder.append(3.3)
builder.end_list()
builder.begin_list()
builder.end_list()
builder.begin_list()
builder.append(4.4)
builder.append(5.5)
builder.end_list()
return builder
array = build(ak.ArrayBuilder()).snapshot()
array
```
Setting the type of empty lists
-------------------------------
In addition to supporting type-discovery at execution time, {class}`ak.ArrayBuilder` also makes it convenient to work with complex, ragged arrays when the type is known ahead of time. Although it is not the most performant means of constructing an array whose type is already known, it provides a readable abstraction in the event that building the array is not a limiting factor for performance. However, due to this "on-line" type-discovery, it is possible that for certain data the result of {meth}`ak.ArrayBuilder.snapshot` will have different types. Consider this function that builds an array from the contents of some iterable:
```{code-cell} ipython3
def process_data(builder, data):
for item in data:
if item < 0:
builder.null()
else:
builder.integer(item)
return builder.snapshot()
```
If we pass in only positive integers, the result is an array of integers:
```{code-cell} ipython3
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4],
)
```
If we pass in only negative integers, the result is an array of `None`s with an unknown type:
```{code-cell} ipython3
process_data(
ak.ArrayBuilder(),
[-1, -2, -3, -4],
)
```
It is only if we pass in a mix of these values that we see the "full" array type:
```{code-cell} ipython3
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4, -1, -2, -3, -4],
)
```
A simple way to solve this problem is to explore all code branches explicitly, and remove the generated entry(ies) from the final array:
```{code-cell} ipython3
def process_data(builder, data):
for item in data:
if item < 0:
builder.null()
else:
builder.integer(item)
# Ensure we have the proper type
builder.integer(1)
builder.null()
return builder.snapshot()[:-2]
```
The previous examples now have the same type:
```{code-cell} ipython3
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4],
)
```
```{code-cell} ipython3
process_data(
ak.ArrayBuilder(),
[-1, -2, -3, -4],
)
```
```{code-cell} ipython3
process_data(
ak.ArrayBuilder(),
[1, 2, 3, 4, -1, -2, -3, -4],
)
```
Comments on performance
-----------------------
Although {class}`ak.ArrayBuilder` is implemented in C++, it is dynamically typed by design. The advantage of compiled code over interpreted code often comes in the knowledge of data types at compile-time, enabling fewer runtime checks and more compiler optimizations.
If you're using a builder in Python, there's also the overhead of calling from Python.
If you're using a builder in Numba, the builder calls are external function calls and LLVM can't inline them for optimizations.
Whenever you have a choice between
1. using the {class}`ak.ArrayBuilder`,
2. constructing an array manually from layouts (next chapter), or
3. filling a NumPy array and using it as an index,
the alternatives are often faster. The point of {class}`ak.ArrayBuilder` is that it is *easy*.
|