---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.15.0
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---
# Splitting and joining strings
+++
Strings in Awkward Array can be joined together and split into sublists. Let's start by creating an array of strings that we can manipulate later. The following `timestamp` array contains a list of timestamp-like strings:
```{code-cell} ipython3
import awkward as ak
timestamp = ak.from_iter(
[
"12-17 19:31:36.263",
"12-17 19:31:36.263",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.267",
"12-17 19:31:36.270",
"12-17 19:31:36.271",
"12-17 19:31:36.275",
"12-17 19:31:36.275",
"12-17 19:31:36.275",
"12-17 19:31:36.276",
"12-17 19:31:36.278",
"12-17 19:31:36.279",
"12-17 19:31:36.279",
"12-17 19:31:36.279",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.281",
"12-17 19:31:36.282",
"12-17 19:31:36.283",
"12-17 19:31:36.284",
"12-17 19:31:36.285",
"12-17 19:31:36.285",
"12-17 19:31:36.289",
"12-17 19:31:36.295",
"12-17 19:31:36.297",
"12-17 19:31:36.297",
"12-17 19:31:36.298",
"12-17 19:31:36.299",
"12-17 19:31:36.300",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.302",
"12-17 19:31:36.304",
"12-17 19:31:36.311",
"12-17 19:31:36.311",
"12-17 19:31:36.311",
"12-17 19:31:36.311",
"12-17 19:31:36.313",
]
)
```
## Joining strings together
Parsing datetimes in a performant manner is tricky. Pandas can do it, but it relies on NumPy's fixed-width strings. Arrow provides `strptime`, but it requires a full date and does not handle fractional seconds or timedeltas. To use Arrow's {func}`pyarrow.compute.strptime` function, we can prepend the missing year to each string and then parse only the non-fractional part of the timestamp.
+++
Let's assume that these timestamps were recorded in the year 2022. We can prepend the string `"2022"`, joined with the `"-"` delimiter, to complete the timestamp string:
```{code-cell} ipython3
timestamp_with_year = ak.str.join_element_wise(["2022"], timestamp, ["-"])
timestamp_with_year
```
The final argument, `["-"]`, is the separator. The `["2022"]` and `["-"]` arrays are broadcast with the `timestamp` array before joining element-wise.
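
To make the broadcasting rule concrete, here is a plain-Python sketch of the same operation for flat lists. It only illustrates the semantics and is not how Awkward Array implements the join; the helper name `join_element_wise_sketch` is hypothetical, and it assumes every input is either length 1 or the common length.

```python
def join_element_wise_sketch(*arrays):
    """Join strings position by position; the last argument holds the separators.

    Hypothetical helper: length-1 lists are repeated to match the longest input.
    """
    length = max(len(a) for a in arrays)
    # Broadcast: stretch every length-1 list to the common length.
    stretched = [list(a) * length if len(a) == 1 else list(a) for a in arrays]
    *parts, separators = stretched
    # Each zipped row is (piece, piece, ..., separator).
    return [sep.join(items) for *items, sep in zip(*parts, separators)]


sample = ["12-17 19:31:36.263", "12-17 19:31:36.264"]
joined = join_element_wise_sketch(["2022"], sample, ["-"])
print(joined)
```

Each output string is the year, the separator, and the original entry, which is the shape `timestamp_with_year` should have.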
+++
{func}`ak.str.join_element_wise` is useful for building new strings from separate arrays. Sometimes, though, we have a single array of strings that we want to join along the final axis (like a reducer). A separate function, {func}`ak.str.join`, serves that purpose:
```{code-cell} ipython3
ak.str.join(
    [
        ["do", "re", "me"],
        ["fa", "so"],
        ["la"],
        ["ti", "da"],
    ],
    separator="-🎵-",
)
```
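
As a rough mental model of the reducer behaviour, the same result can be written in plain Python with `str.join` applied to each sublist. This is only a sketch of the semantics under the assumption of a two-level nested list; the variable names are illustrative.

```python
notes = [
    ["do", "re", "me"],
    ["fa", "so"],
    ["la"],
    ["ti", "da"],
]
separator = "-🎵-"

# One output string per sublist: the innermost axis is reduced away.
joined_notes = [separator.join(sublist) for sublist in notes]
print(joined_notes)
```

A single-element sublist such as `["la"]` comes back unchanged, because there is nothing to place the separator between.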
## Splitting strings apart
+++
The timestamps above still cannot be parsed by Arrow; at the time of writing, the fractional time component is not supported. To fix this, we can split the fractional component from the timestamp and add it back later as a `timedelta64[ms]`.
+++
Let's split each timestamp into its non-fractional and fractional parts using {func}`ak.str.split_pattern`:
```{code-cell} ipython3
timestamp_split = ak.str.split_pattern(timestamp_with_year, ".", max_splits=1)
timestamp_split
```
```{code-cell} ipython3
timestamp_non_fractional = timestamp_split[:, 0]
timestamp_fractional = timestamp_split[:, 1]
```
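
The `[:, 0]` and `[:, 1]` selections rely on every sublist having exactly two entries, which is what limiting the number of splits to one gives us when each string contains a period. A plain-Python analogue using `str.split` with a split limit shows the effect; this is a sketch of the idea, not the Arrow implementation.

```python
stamp = "2022-12-17 19:31:36.263"

# Split at most once, so a string containing "." always yields two parts.
non_fractional, fractional = stamp.split(".", 1)
print(non_fractional, fractional)

# Without a limit, every "." is a split point.
dotted = "1.2.3"
unlimited = dotted.split(".")
limited = dotted.split(".", 1)
print(unlimited, limited)
```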
Now we can parse these timestamps using Arrow!
```{code-cell} ipython3
import pyarrow.compute
datetime = ak.from_arrow(
    pyarrow.compute.strptime(
        ak.to_arrow(timestamp_non_fractional, extensionarray=False),
        "%Y-%m-%d %H:%M:%S",
        "ms",
    )
)
datetime
```
Finally, we build an offset for the fractional component (in milliseconds) using {func}`ak.strings_astype`:
```{code-cell} ipython3
import numpy as np
datetime_offset = ak.strings_astype(timestamp_fractional, np.dtype("timedelta64[ms]"))
datetime_offset
```
This offset is added to the absolute datetime to obtain the full timestamp:
```{code-cell} ipython3
timestamp = datetime + datetime_offset
timestamp
```
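
To sanity-check a single value, the same split, parse, and add-offset recipe can be reproduced with only the standard library. This is a sketch for one string, assuming the fractional part always has exactly three digits (milliseconds), as in the sample data; a two-digit fraction such as `"26"` would be read as 26 ms rather than 260 ms.

```python
from datetime import datetime as dt, timedelta

stamp = "2022-12-17 19:31:36.263"
whole, fraction = stamp.split(".", 1)

# Parse the whole-second part, then add the milliseconds as an offset.
parsed = dt.strptime(whole, "%Y-%m-%d %H:%M:%S")
result = parsed + timedelta(milliseconds=int(fraction))
print(result.isoformat(timespec="milliseconds"))
```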
If we had a different parsing library that could only handle dates and times separately, then we could also split on the whitespace. Although {func}`ak.str.split_pattern` supports whitespace, it is more performant (and versatile) to use {func}`ak.str.split_whitespace`:
```{code-cell} ipython3
ak.str.split_whitespace(timestamp_with_year)
```
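
The difference between a literal one-space pattern and whitespace-aware splitting is easy to see in plain Python, where `str.split(" ")` and `str.split()` play analogous roles. This is an analogy only; the exact handling of whitespace runs by the Awkward functions should be checked against their documentation.

```python
double_space = "2022-12-17  19:31:36.263"
tabbed = "2022-12-17\t19:31:36.263"

# A literal single-space pattern produces an empty string between two spaces
# and does not recognise a tab at all.
literal_double = double_space.split(" ")
literal_tabbed = tabbed.split(" ")
print(literal_double, literal_tabbed)

# Whitespace-aware splitting treats any run of whitespace as one delimiter.
ws_double = double_space.split()
ws_tabbed = tabbed.split()
print(ws_double, ws_tabbed)
```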
If we also needed to split off the fractional component (and manually build the time delta), then we could have used {func}`ak.str.split_pattern_regex` to split on both whitespace *and* the period:
```{code-cell} ipython3
ak.str.split_pattern_regex(timestamp_with_year, r"\.|\s")
```
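
For a single string, the same pattern can be tried with Python's `re` module, followed by the manual time delta. This is a sketch under the assumption that the pattern behaves the same in both regex engines (it uses only an escaped period and `\s`) and that the fraction is in milliseconds.

```python
import re
from datetime import timedelta

stamp = "2022-12-17 19:31:36.263"

# "\." matches a literal period, "\s" a single whitespace character.
date_part, time_part, fraction = re.split(r"\.|\s", stamp)
print(date_part, time_part, fraction)

# Build the offset by hand from the fractional digits.
offset = timedelta(milliseconds=int(fraction))
print(offset)
```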