---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.10.3
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

How to create arrays of strings
===============================

Awkward Arrays can contain strings, which are just a special view of variable-length lists of `uint8` numbers (the UTF-8 bytes of each string). Because they are stored as variable-length lists, strings of different lengths are stored efficiently, without padding.

By contrast, NumPy's strings are padded to a common width, and Pandas's strings are Python objects. Awkward Array doesn't have nearly as many functions for manipulating arrays of strings as NumPy and Pandas do, though.

```{code-cell} ipython3
import awkward as ak
import numpy as np
```

From Python strings
-------------------

The {class}`ak.Array` constructor and {func}`ak.from_iter` recognize strings, and strings are returned by {func}`ak.to_list`.

```{code-cell} ipython3
ak.Array(["one", "two", "three"])
```

They may be nested within anything.

```{code-cell} ipython3
ak.Array([["one", "two"], [], ["three"]])
```

From NumPy arrays
-----------------

NumPy arrays of strings are also recognized by {func}`ak.from_numpy` and {func}`ak.to_numpy`, as well as by the {class}`ak.Array` constructor.

```{code-cell} ipython3
numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
```

```{code-cell} ipython3
awkward_array = ak.Array(numpy_array)
awkward_array
```

Operations with strings
-----------------------

Since strings are really just lists, some of the list operations "just work" on strings.

```{code-cell} ipython3
ak.num(awkward_array)
```

```{code-cell} ipython3
awkward_array[:, 1:]
```

Other operations, such as string equality, had to be specially overloaded for strings. Without the overload, `==` would descend to the lowest level and compare numbers (characters, in this case).

```{code-cell} ipython3
awkward_array == "three"
```

```{code-cell} ipython3
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
```

Similarly, {func}`ak.sort` and {func}`ak.argsort` sort whole strings lexicographically, rather than sorting the individual characters within each string.

```{code-cell} ipython3
ak.sort(awkward_array)
```

Still other operations had to be inhibited, since they wouldn't make sense for strings.

```{code-cell} ipython3
:tags: [raises-exception]

np.sqrt(awkward_array)
```

Categorical strings
-------------------

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called [categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html); in R, it's called a [factor](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/factor); and in Arrow and Parquet, it's called [dictionary encoding](https://arrow.apache.org/blog/2019/09/05/faster-strings-cpp-parquet/).
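The idea can be sketched in plain NumPy, independently of how Awkward Array implements it: store each unique string once, and represent the data as small integer codes that point into that set. `np.unique` is used here only as an illustration of the concept.

```python
import numpy as np

words = np.array(["three", "one", "two", "two", "three", "one", "one", "one"])

# categories holds each unique string once; codes indexes into it.
categories, codes = np.unique(words, return_inverse=True)
print(categories.tolist())   # ['one', 'three', 'two']
print(codes.tolist())        # [1, 0, 2, 2, 1, 0, 0, 0]

# Equality becomes an integer comparison against a single code.
code_of_one = np.flatnonzero(categories == "one")[0]
is_one = codes == code_of_one

# The original strings are recovered by indexing the categories.
decoded = categories[codes]
```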

The {func}`ak.str.to_categorical` function (which requires PyArrow) makes Awkward Arrays categorical in this sense. {func}`ak.to_arrow` and {func}`ak.to_parquet` recognize categorical data and convert it to the corresponding Arrow and Parquet types.

```{code-cell} ipython3
uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
```

```{code-cell} ipython3
categorized = ak.str.to_categorical(uncategorized)
categorized
```

Internally, the data now have an index that selects from a set of unique strings.

```{code-cell} ipython3
categorized.layout.index
```

```{code-cell} ipython3
ak.Array(categorized.layout.content)
```

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is performed using the index integers.

```{code-cell} ipython3
categorized == "one"
```

With ArrayBuilder
-----------------

{func}`ak.ArrayBuilder` is described in more detail [in this tutorial](how-to-create-arraybuilder), but you can add strings by calling the `string` method or simply appending them.

(This is what {func}`ak.from_iter` uses internally to accumulate data.)

```{code-cell} ipython3
builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
```