
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---
# Splitting and joining strings
+++
Strings in Awkward Array can be joined together and split into sublists. Let's start by creating an array of strings that we can later manipulate. The following `timestamp` array contains a list of timestamp-like strings:
```{code-cell} ipython3
import awkward as ak

timestamp = ak.from_iter(
    [
        "12-17 19:31:36.263",
        "12-17 19:31:36.263",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.267",
        "12-17 19:31:36.270",
        "12-17 19:31:36.271",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.276",
        "12-17 19:31:36.278",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.281",
        "12-17 19:31:36.282",
        "12-17 19:31:36.283",
        "12-17 19:31:36.284",
        "12-17 19:31:36.285",
        "12-17 19:31:36.285",
        "12-17 19:31:36.289",
        "12-17 19:31:36.295",
        "12-17 19:31:36.297",
        "12-17 19:31:36.297",
        "12-17 19:31:36.298",
        "12-17 19:31:36.299",
        "12-17 19:31:36.300",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.302",
        "12-17 19:31:36.304",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.313",
    ]
)
```
## Joining strings together
Parsing datetimes in a performant manner is tricky. Pandas has such an ability, but it uses NumPy's fixed-width strings. Arrow provides `strptime`, but it does not handle fractional seconds or timedeltas, and it requires a full date. In order to use Arrow's {func}`pyarrow.compute.strptime` function, we can prepend the missing year to each string and then parse only the non-fractional part of the timestamp.
+++
Let's assume that these timestamps were recorded in the year 2022. We can prepend the string `"2022"`, joined with the `"-"` delimiter, to complete the timestamp string:
```{code-cell} ipython3
timestamp_with_year = ak.str.join_element_wise(["2022"], timestamp, ["-"])
timestamp_with_year
```
The `["2022"]` and `["-"]` arrays are broadcast with the `timestamp` array before joining element-wise.
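
As a mental model only, the broadcast-and-join step behaves like the plain-Python sketch below. It works on a Python list rather than an Awkward Array, it is not how Awkward Array implements the operation, and the helper name `join_with_year` is hypothetical.

```python
# Plain-Python mental model of the broadcast join (not Awkward's implementation).
# `join_with_year` is a hypothetical helper used only for illustration.
def join_with_year(timestamps, year="2022", separator="-"):
    # The single `year` and `separator` values are reused for every element,
    # which mirrors broadcasting the length-1 arrays against `timestamps`.
    return [separator.join((year, t)) for t in timestamps]


sample = ["12-17 19:31:36.263", "12-17 19:31:36.264"]
joined = join_with_year(sample)
print(joined)
```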
+++
{func}`ak.str.join_element_wise` is useful for building new strings from separate arrays. Sometimes we instead have a single array of lists of strings, and want to join the strings in each list along the final axis (like a reducer). The separate function {func}`ak.str.join` exists for this purpose:
```{code-cell} ipython3
ak.str.join(
    [
        ["do", "re", "me"],
        ["fa", "so"],
        ["la"],
        ["ti", "da"],
    ],
    separator="-🎵-",
)
```
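
To make the "reducer" behaviour concrete, here is a plain-Python sketch of joining along the final axis. It is a mental model on nested Python lists, not Awkward Array's implementation, and `join_sublists` is a hypothetical helper name. A simple `"-"` separator is assumed for readability.

```python
# Plain-Python mental model of joining along the final axis.
# `join_sublists` is a hypothetical helper, not part of Awkward Array.
def join_sublists(lists, separator):
    # Each inner list collapses to a single string, so the result
    # has one fewer level of nesting than the input.
    return [separator.join(sublist) for sublist in lists]


notes = [["do", "re", "me"], ["fa", "so"], ["la"], ["ti", "da"]]
joined_notes = join_sublists(notes, "-")
print(joined_notes)
```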
## Splitting strings apart
+++
The timestamps above still cannot be parsed by Arrow; the fractional time component is not (at time of writing) yet supported. To fix this, we can split the fractional component from the timestamp, and add it as a `timedelta64[ms]` later on.
+++
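Let's split each timestamp at the period, separating the fractional time component from the rest, using {func}`ak.str.split_pattern`. Passing `max_splits=1` limits this to a single split, so each string yields at most two parts.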
```{code-cell} ipython3
timestamp_split = ak.str.split_pattern(timestamp_with_year, ".", max_splits=1)
timestamp_split
```
```{code-cell} ipython3
timestamp_non_fractional = timestamp_split[:, 0]
timestamp_fractional = timestamp_split[:, 1]
```
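
For intuition, the split-then-index steps above resemble the following plain-Python sketch on a list of strings. It is only a mental model, not Awkward Array's implementation, and the helper name `split_once` is hypothetical.

```python
# Plain-Python mental model of splitting once on a literal pattern.
# `split_once` is a hypothetical helper used only for illustration.
def split_once(strings, pattern="."):
    # The second argument to str.split caps the number of splits at one,
    # so each string produces at most two parts.
    return [s.split(pattern, 1) for s in strings]


sample = ["2022-12-17 19:31:36.263", "2022-12-17 19:31:36.264"]
parts = split_once(sample)

# Equivalent in spirit to indexing with [:, 0] and [:, 1].
non_fractional = [p[0] for p in parts]
fractional = [p[1] for p in parts]
print(non_fractional, fractional)
```

In this sketch, a string without a period would produce a one-element list, and taking position 1 would raise an `IndexError`, so the approach assumes every timestamp contains a fractional component.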
Now we can parse these timestamps using Arrow!
```{code-cell} ipython3
import pyarrow.compute

datetime = ak.from_arrow(
    pyarrow.compute.strptime(
        ak.to_arrow(timestamp_non_fractional, extensionarray=False),
        "%Y-%m-%d %H:%M:%S",
        "ms",
    )
)
datetime
```
Finally, we build an offset for the fractional component (in milliseconds) using {func}`ak.strings_astype`:
```{code-cell} ipython3
import numpy as np
datetime_offset = ak.strings_astype(timestamp_fractional, np.dtype("timedelta64[ms]"))
datetime_offset
```
This offset is added to the absolute datetime to obtain the full timestamp:
```{code-cell} ipython3
timestamp = datetime + datetime_offset
timestamp
```
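
As a sanity check on the parse-then-offset approach, the sketch below repeats it for a single value using only Python's standard library and compares the result with a direct parse. This is an illustrative cross-check, not part of the Awkward Array workflow; the module is imported as `dt` to avoid clashing with the `datetime` array defined above.

```python
import datetime as dt

raw = "2022-12-17 19:31:36.263"

# Same two-step approach, for one value: parse the whole seconds,
# then add the fractional part as a millisecond offset.
whole, fraction = raw.split(".", 1)
parsed = dt.datetime.strptime(whole, "%Y-%m-%d %H:%M:%S")
combined = parsed + dt.timedelta(milliseconds=int(fraction))

# The standard library can also parse the fraction directly with %f,
# which gives an independent value to compare against.
expected = dt.datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
print(combined == expected)
```

Interpreting the fraction as an integer number of milliseconds relies on it having exactly three digits, as every timestamp in this example does. A fraction such as `.5` would need to be padded to `500` before conversion.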
If we had a different parsing library that could only handle dates and times separately, then we could also split on the whitespace. Although {func}`ak.str.split_pattern` can split on a whitespace character, it is more performant (and versatile) to use {func}`ak.str.split_whitespace`:
```{code-cell} ipython3
ak.str.split_whitespace(timestamp_with_year)
```
If we also needed to split off the fractional component (and manually build the time delta), then we could have used {func}`ak.str.split_pattern_regex` to split on both whitespace *and* the period:
```{code-cell} ipython3
ak.str.split_pattern_regex(timestamp_with_year, r"\.|\s")
```
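
The same pattern can be tried on a single string with Python's built-in `re` module, as in the sketch below. Regular-expression dialects can differ between engines, so treat this as an illustration of what the pattern matches rather than of Awkward Array's exact behaviour.

```python
import re

sample = "2022-12-17 19:31:36.263"

# Split on either a literal period or a single whitespace character.
date_part, time_part, fraction_part = re.split(r"\.|\s", sample)
print(date_part, time_part, fraction_part)
```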