File: how-to-strings-split-and-join.md

package info (click to toggle)
python-awkward 2.6.5-1
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 23,088 kB
  • sloc: python: 148,689; cpp: 33,562; sh: 432; makefile: 21; javascript: 8
file content (184 lines) | stat: -rw-r--r-- 5,734 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Splitting and joining strings

+++

Strings in Awkward Array can arbitrarily be joined together, and split into sublists. Let's start by creating an array of strings that we can later manipulate. The following `timestamps` array contains a list of timestamp-like strings

```{code-cell} ipython3
import awkward as ak
timestamp = ak.from_iter(
    [
        "12-17 19:31:36.263",
        "12-17 19:31:36.263",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.267",
        "12-17 19:31:36.270",
        "12-17 19:31:36.271",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.276",
        "12-17 19:31:36.278",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.281",
        "12-17 19:31:36.282",
        "12-17 19:31:36.283",
        "12-17 19:31:36.284",
        "12-17 19:31:36.285",
        "12-17 19:31:36.285",
        "12-17 19:31:36.289",
        "12-17 19:31:36.295",
        "12-17 19:31:36.297",
        "12-17 19:31:36.297",
        "12-17 19:31:36.298",
        "12-17 19:31:36.299",
        "12-17 19:31:36.300",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.302",
        "12-17 19:31:36.304",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.313",
    ]
)
```

## Joining strings together

Parsing datetimes in a performant manner is tricky. Pandas has such an ability, but it uses NumPy's fixed-width strings. Arrow provides `strptime`, but it does not handle fractional seconds or timedeltas and requires a full date. In order to use Arrow's {func}`pyarrow.compute.strptime` function, we can manipulate the string to prepend the date, operating only on the non-fraction part of the match.

+++

Let's assume that these timestamps were recorded in the year 2022. We can prepend the string "2022" with the "-" delimiter to complete the timestamp string

```{code-cell} ipython3
timestamp_with_year = ak.str.join_element_wise(["2022"], timestamp, ["-"])
timestamp_with_year
```

The `["2022"]` and `["-"]` arrays are broadcast with the `timestamp` array before joining element-wise.

+++

{func}`ak.str.join_element_wise` is useful for building new strings from separate arrays. It might also be the case that one has a single array of strings that they wish to join along the final axis (like a reducer). There exists a separate function {func}`ak.str.join` for such a purpose

```{code-cell} ipython3
ak.str.join(
    [
        ["do", "re", "me"],
        ["fa", "so"],
        ["la"],
        ["ti", "da"],
    ],
    separator="-🎵-",
)
```

## Splitting strings apart

+++

The timestamps above still cannot be parsed by Arrow; the fractional time component is not (at time of writing) yet supported. To fix this, we can split the fractional component from the timestamp, and add it as a `timedelta64[ms]` later on.

+++

Let's split the fractional time component into two parts using {func}`ak.str.split_pattern`.

```{code-cell} ipython3
timestamp_split = ak.str.split_pattern(timestamp_with_year, ".", max_splits=1)
timestamp_split
```

```{code-cell} ipython3
timestamp_non_fractional = timestamp_split[:, 0]
timestamp_fractional = timestamp_split[:, 1]
```

Now we can parse these timestamps using Arrow!

```{code-cell} ipython3
import pyarrow.compute

datetime = ak.from_arrow(
    pyarrow.compute.strptime(
        ak.to_arrow(timestamp_non_fractional, extensionarray=False),
        "%Y-%m-%d %H:%M:%S",
        "ms",
    )
)
datetime
```

Finally, we build an offset for the fractional component (in milliseconds) using {func}`ak.strings_astype`

```{code-cell} ipython3
import numpy as np

datetime_offset = ak.strings_astype(timestamp_fractional, np.dtype("timedelta64[ms]"))
datetime_offset
```

This offset is added to the absolute datetime to obtain a timestamp

```{code-cell} ipython3
timestamp = datetime + datetime_offset
timestamp
```

If we had a different parsing library that could only handle dates and times separately, then we could also split on the whitespace. Although {func}`ak.str.split_pattern` supports whitespace, it is more performant (and versatile) to use {func}`ak.str.split_whitespace`

```{code-cell} ipython3
ak.str.split_whitespace(timestamp_with_year)
```

If we also needed to split off the fractional component (and manually build the time delta), then we could have used {func}`ak.str.split_pattern_regex` to split on both whitespace *and* the period

```{code-cell} ipython3
ak.str.split_pattern_regex(timestamp_with_year, r"\.|\s")
```