File: how-to-strings-read-binary.md

---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Read strings from a binary stream

Awkward Array represents strings as ragged lists of UTF-8 [code-units](https://en.wikipedia.org/wiki/UTF-8). Successive strings are therefore packed contiguously in memory, which lets string operations run over a single buffer efficiently.
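
The packed representation can be pictured with a small sketch that uses only the Python standard library. It illustrates the idea and is not Awkward Array's implementation; the names `content`, `offsets`, and `get_string` are hypothetical.

```python
# Illustrative sketch only: a pure-Python model of a ragged list of code-units.
words = ["one", "two", "three"]

# All UTF-8 code-units are stored back to back in a single buffer.
encoded = [w.encode("utf-8") for w in words]
content = b"".join(encoded)

# offsets[i] and offsets[i + 1] delimit the i-th string inside the buffer.
offsets = [0]
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))


def get_string(i):
    """Decode the i-th string from the packed buffer (hypothetical helper)."""
    return content[offsets[i]:offsets[i + 1]].decode("utf-8")


print(content)        # b'onetwothree'
print(offsets)        # [0, 3, 6, 11]
print(get_string(2))  # three
```

Because the strings share one buffer, no per-string Python object is needed until a value is actually extracted.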

+++

Suppose that we want to read some logging output that is stored in a text file, for example [a subset of logs from the Android Application framework](https://zenodo.org/record/8196385).

```{code-cell} ipython3
import gzip
import itertools
import pathlib

# Preview logs
log_path = pathlib.Path("..", "samples", "Android.head.log.gz")
with gzip.open(log_path, "rt") as f:
    for line in itertools.islice(f, 8):
        print(line, end="")
```

To begin with, we can read the decompressed log file as a NumPy array of {data}`np.uint8` dtype, and convert the result to an Awkward Array.

```{code-cell} ipython3
import awkward as ak
import numpy as np

with gzip.open(log_path, "rb") as f:
    # `gzip.open` returns a file-like object rather than a real file that
    # NumPy can read from directly, so we read the decompressed bytes into memory.
    arr = np.frombuffer(f.read(), dtype=np.uint8)

raw_bytes = ak.from_numpy(arr)
raw_bytes.type.show()
```
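
The same read-into-memory pattern can be tried without the sample file. The following is a minimal sketch using only the standard library; the `sample` bytes are hypothetical stand-ins for a compressed log, and the in-memory `io.BytesIO` object replaces the file on disk.

```python
import gzip
import io

# Hypothetical sample data standing in for a compressed log file.
sample = b"first line\nsecond line\n"
compressed = gzip.compress(sample)

# Read the whole decompressed stream into memory in one call.
with gzip.open(io.BytesIO(compressed), "rb") as f:
    data = f.read()

# Iterating over a bytes object yields unsigned 8-bit integers,
# which is why a uint8 dtype is the natural choice for the buffer.
code_units = list(data[:5])
print(code_units)  # [102, 105, 114, 115, 116]
```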

Awkward Array doesn't support scalar values, so we can't treat these characters as a single string. Instead, we need at least one dimension. Let's unflatten our array of characters to form a length-1 array of lists of characters.

```{code-cell} ipython3
array_of_chars = ak.unflatten(raw_bytes, len(raw_bytes))
array_of_chars
```

We can then ask Awkward Array to treat this array of lists of characters as an array of strings, using {func}`ak.enforce_type`.

```{code-cell} ipython3
string = ak.enforce_type(array_of_chars, "string")
string.type.show()
```

The underlying mechanism for implementing strings as lists of code-units can be seen if we inspect the low-level layout that builds the array.

```{code-cell} ipython3
string.layout
```

The `__array__` parameter is special. It is reserved by Awkward Array, and signals that the layout is a special, pre-understood built-in type. In this case, the type of the outer {class}`ak.contents.ListOffsetArray` is "string". The inner {class}`ak.contents.NumpyArray` also has an `__array__` parameter, this time with a value of `char`. In Awkward Array, an array of strings *must* have this layout: a list with the `__array__="string"` parameter wrapping a {class}`ak.contents.NumpyArray` with the `__array__="char"` parameter.

+++

A single (very long) string isn't much use. Let's split this string at the line boundaries.

```{code-cell} ipython3
split_at_newlines = ak.str.split_pattern(string, "\n")
split_at_newlines
```

Now we can remove the temporary length-1 outer dimension that was required to treat the data as a string.

```{code-cell} ipython3
lines = split_at_newlines[0]
lines
```

In the low-level layout, we can see that these lines are still just variable-length lists.

```{code-cell} ipython3
lines.layout
```
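
The split-then-index steps above have a close analogue in plain Python, sketched here on hypothetical sample data with only the standard library. One detail worth noting is that Python's `str.split` leaves an empty final entry when the text ends with a newline; this sketch makes no claim about {func}`ak.str.split_pattern`, so it is worth checking whether the last entry of `lines` is empty before relying on it.

```python
# Standard-library analogue of the steps above, on hypothetical sample data.
buffer = b"alpha\nbeta\ngamma\n"

# A length-1 outer list holding one long string.
outer = [buffer.decode("utf-8")]

# Split every string in the outer list at newlines.
split = [s.split("\n") for s in outer]

# Drop the temporary outer dimension.
py_lines = split[0]
print(py_lines)  # ['alpha', 'beta', 'gamma', '']

# With `str.split`, a trailing newline leaves an empty final entry,
# which can be filtered out if it is not wanted.
non_empty = [line for line in py_lines if line]
print(non_empty)  # ['alpha', 'beta', 'gamma']
```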

## Bytestrings vs strings

+++

In general, whilst strings can fundamentally be described as lists of bytes (code-units), many string operations do not operate at the byte level. The {mod}`ak.str` submodule provides a suite of vectorised operations that operate at the code-point (*not* code-unit) level, such as computing the string length. Consider the following simple string:

```{code-cell} ipython3
large_code_point = ak.Array(["Å"])
```

In Awkward Array, strings are UTF-8 encoded, meaning that a single code-point may comprise up to four code-units (bytes). Although this looks like a single character, the layout makes it clear that the number of code-units is in fact two.

```{code-cell} ipython3
large_code_point.layout
```

This is reflected in the result of {func}`ak.num`, which counts the elements of each list (here, code-units).

```{code-cell} ipython3
ak.num(large_code_point)
```

The {mod}`ak.str` module provides a function for computing the length of a string, {func}`ak.str.length`.

```{code-cell} ipython3
ak.str.length(large_code_point)
```

Clearly _this_ function is code-point aware.
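
The same distinction between code-points and code-units can be checked in plain Python, independently of Awkward Array. This is a minimal standard-library sketch; the variable names are illustrative, and the character is written as an escape so that the precomposed form U+00C5 is used.

```python
text = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE (precomposed form)

code_points = len(text)                 # Python's str counts code-points
code_units = len(text.encode("utf-8"))  # the encoded bytes are the code-units

print(code_points, code_units)     # 1 2
print(list(text.encode("utf-8")))  # [195, 133]

# A code-point outside the Basic Multilingual Plane needs four code-units.
emoji = "\U0001F600"
print(len(emoji), len(emoji.encode("utf-8")))  # 1 4
```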

+++

If one wants to drop the UTF-8 string abstraction, and instead deal with strings as raw byte arrays, there is the `bytes` type.

```{code-cell} ipython3
large_code_point_bytes = ak.enforce_type(large_code_point, "bytes")
large_code_point_bytes
```

The layout of this array has different `__array__` parameters: `"bytestring"` on the list and `"byte"` on the inner array.

```{code-cell} ipython3
large_code_point_bytes.layout
```

Many of the functions in the {mod}`ak.str` module treat bytestrings and strings differently: strings are often manipulated in terms of code-points, whereas bytestrings are manipulated in terms of code-units. Consider {func}`ak.str.length` for this array.

```{code-cell} ipython3
ak.str.length(large_code_point_bytes)
```

This is clearly counting the bytes (code-units), not code-points.
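
One practical consequence of working at the byte level is that a slice can cut a multi-byte code-point in half. The following standard-library sketch shows this in plain Python; it illustrates the general UTF-8 behaviour and makes no claim about how any particular {mod}`ak.str` function handles such input.

```python
encoded = "\u00c5ngstr\u00f6m".encode("utf-8")
print(len(encoded))  # 10 code-units for 8 code-points

# Byte-level slicing can leave an incomplete UTF-8 sequence behind.
first_byte = encoded[:1]
try:
    first_byte.decode("utf-8")
    is_valid = True
except UnicodeDecodeError:
    is_valid = False

print(is_valid)  # False

# Slicing the decoded string works on code-points and stays valid.
first_char = encoded.decode("utf-8")[:1]
print(first_char == "\u00c5")  # True
```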