1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
|
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.15.0
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---
# Read strings from a binary stream
Awkward Array implements support for ragged strings as ragged lists of [code-units](https://en.wikipedia.org/wiki/UTF-8). As such, successive strings are closely packed in memory, leading to high-performance operations.
+++
Let's imagine that we want to read some logging output that is stored in a text file. For example, [a subset of logs from the Android Application framework](https://zenodo.org/record/8196385).
```{code-cell} ipython3
import gzip
import itertools
import pathlib
# Preview logs
log_path = pathlib.Path("..", "samples", "Android.head.log.gz")
with gzip.open(log_path, "rt") as f:
for line in itertools.islice(f, 8):
print(line, end="")
```
To begin with, we can read the decompressed log-files as an array of {data}`np.uint8` dtype using NumPy, and convert the resulting array to an Awkward Array
```{code-cell} ipython3
import awkward as ak
import numpy as np
with gzip.open(log_path, "rb") as f:
# `gzip.open` doesn't return a true file descriptor that NumPy can ingest
# So, instead we read into memory.
arr = np.frombuffer(f.read(), dtype=np.uint8)
raw_bytes = ak.from_numpy(arr)
raw_bytes.type.show()
```
Awkward Array doesn't support scalar values, so we can't treat these characters as a single-string. Instead we need at least one dimension. Let's unflatten our array of characters, to form a length-1 array of characters.
```{code-cell} ipython3
array_of_chars = ak.unflatten(raw_bytes, len(raw_bytes))
array_of_chars
```
We can then ask Awkward Array to treat this array of lists of characters as an array of strings, using {func}`ak.enforce_type`
```{code-cell} ipython3
string = ak.enforce_type(array_of_chars, "string")
string.type.show()
```
The underlying mechanism for implementing strings as lists of code-units can be seen if we inspect the low-level layout that builds the array
```{code-cell} ipython3
string.layout
```
The `__array__` parameter is special. It is reserved by Awkward Array, and signals that the layout is a special pre-undertood built-in type. In this case, that type of the outer {class}`ak.contents.ListOffsetArray` is "string". It can also be seen that the inner {class}`ak.contents.NumpyArray` also has an `__array__` parameter, this time with a value of `char`. In Awkward Array, an array of strings *must* look like this layout; a list with the `__array__="string"` parameter wrapping a {class}`ak.contents.NumpyArray` with the `__array__="char"` parameter.
+++
A single (very long) string isn't much use. Let's split this string at the line boundaries
```{code-cell} ipython3
split_at_newlines = ak.str.split_pattern(string, "\n")
split_at_newlines
```
Now we can remove the temporary length-1 outer dimension that was required to treat the data as a string
```{code-cell} ipython3
lines = split_at_newlines[0]
lines
```
In the low-level layout, we can see that these lines are still just variable-length lists
```{code-cell} ipython3
lines.layout
```
## Bytestrings vs strings
+++
In general, whilst strings can fundamentally be described as lists of bytes (code-units), many string operations do not operate at the byte-level. The {mod}`ak.str` submodule provides a suite of vectorised operations that operate at the code-point (*not* code-unit) level, such as computing the string length. Consider the following simple string
```{code-cell} ipython3
large_code_point = ak.Array(["Å"])
```
In Awkward Array, strings are UTF-8 encoded, meaning that a single code-point may comprise up to four code-units (bytes). Although it looks like this is a single character, if we look at the layout it's clear that the number of code-units is in-fact two
```{code-cell} ipython3
large_code_point.layout
```
This is reflected in the {func}`ak.num` function
```{code-cell} ipython3
ak.num(large_code_point)
```
The {mod}`ak.str` module provides a function for computing the length of a string
```{code-cell} ipython3
ak.str.length(large_code_point)
```
Clearly _this_ function is code-point aware.
+++
If one wants to drop the UTF-8 string abstraction, and instead deal with strings as raw byte arrays, there is the `bytes` type
```{code-cell} ipython3
large_code_point_bytes = ak.enforce_type(large_code_point, "bytes")
large_code_point_bytes
```
The layout of this array has different `"bytestring"` and `"byte"` parameters
```{code-cell} ipython3
large_code_point_bytes.layout
```
Many of the functions in the {mod}`ak.str` module treat bytestrings and strings differently; in the latter case, strings are often manipulated in terms of code-points instead of code-units. Consider {func}`ak.str.length` for this array
```{code-cell} ipython3
ak.str.length(large_code_point_bytes)
```
This is clearly counting the bytes (code-units), not code-points.
|