---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Splitting and joining strings

+++

Strings in Awkward Array can be joined together and split into sublists. Let's start by creating an array of strings that we can manipulate later. The following `timestamp` array contains a list of timestamp-like strings.

```{code-cell} ipython3
import awkward as ak
timestamp = ak.from_iter(
    [
        "12-17 19:31:36.263",
        "12-17 19:31:36.263",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.267",
        "12-17 19:31:36.270",
        "12-17 19:31:36.271",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.276",
        "12-17 19:31:36.278",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.281",
        "12-17 19:31:36.282",
        "12-17 19:31:36.283",
        "12-17 19:31:36.284",
        "12-17 19:31:36.285",
        "12-17 19:31:36.285",
        "12-17 19:31:36.289",
        "12-17 19:31:36.295",
        "12-17 19:31:36.297",
        "12-17 19:31:36.297",
        "12-17 19:31:36.298",
        "12-17 19:31:36.299",
        "12-17 19:31:36.300",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.302",
        "12-17 19:31:36.304",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.313",
    ]
)
```

## Joining strings together

Parsing datetimes in a performant manner is tricky. Pandas has such an ability, but it uses NumPy's fixed-width strings. Arrow provides `strptime`, but it does not handle fractional seconds or timedeltas and requires a full date. In order to use Arrow's {func}`pyarrow.compute.strptime` function, we can manipulate each string to prepend the missing year, and then parse only the non-fractional part of the result.

+++

Let's assume that these timestamps were recorded in the year 2022. We can prepend the string "2022", joined with the "-" delimiter, to complete the date in each timestamp string.

```{code-cell} ipython3
timestamp_with_year = ak.str.join_element_wise(["2022"], timestamp, ["-"])
timestamp_with_year
```

The `["2022"]` and `["-"]` arrays are broadcast with the `timestamp` array before joining element-wise.
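The broadcasting step can be pictured in plain Python. The sketch below is only a conceptual model for flat lists, not how Awkward Array implements the operation; `join_element_wise_sketch` is a hypothetical helper introduced purely for illustration.

```python
def join_element_wise_sketch(*columns):
    """Conceptual model: the last column holds the separators."""
    *parts, separators = columns
    length = max(len(col) for col in columns)

    def broadcast(col):
        # Length-1 lists are repeated to match the common length.
        if len(col) == 1:
            return col * length
        if len(col) != length:
            raise ValueError("cannot broadcast lists of different lengths")
        return col

    parts = [broadcast(col) for col in parts]
    separators = broadcast(separators)
    # Join the i-th entry of every part with the i-th separator.
    return [sep.join(row) for sep, *row in zip(separators, *parts)]


sample = ["12-17 19:31:36.263", "12-17 19:31:36.264"]
joined = join_element_wise_sketch(["2022"], sample, ["-"])
print(joined)
# ['2022-12-17 19:31:36.263', '2022-12-17 19:31:36.264']
```

Because the separator is itself broadcast, this model would also allow a different separator per element, provided the separator list has the same length as the other columns.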

+++

{func}`ak.str.join_element_wise` is useful for building new strings from separate arrays. Sometimes one instead has a single array of lists of strings, and wishes to join the strings in each list along the final axis (like a reducer). The separate function {func}`ak.str.join` exists for this purpose.

```{code-cell} ipython3
ak.str.join(
    [
        ["do", "re", "me"],
        ["fa", "so"],
        ["la"],
        ["ti", "da"],
    ],
    separator="-🎵-",
)
```
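As a rough mental model, joining along the final axis behaves like calling Python's `str.join` on each sublist. The following plain-Python sketch (not the Awkward implementation; the variable names are chosen here for illustration) mirrors the example above.

```python
notes = [
    ["do", "re", "me"],
    ["fa", "so"],
    ["la"],
    ["ti", "da"],
]
separator = "-🎵-"

# Each sublist collapses to a single string, so one level of nesting disappears.
joined_notes = [separator.join(sublist) for sublist in notes]
print(joined_notes)
# ['do-🎵-re-🎵-me', 'fa-🎵-so', 'la', 'ti-🎵-da']
```

Note that a sublist with a single element, such as `["la"]`, is returned without any separator.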

## Splitting strings apart

+++

The timestamps above still cannot be parsed by Arrow; the fractional time component is not supported at the time of writing. To fix this, we can split the fractional component from the timestamp, and add it back as a `timedelta64[ms]` later on.

+++

Let's split each timestamp into two parts, at the period that precedes the fractional time component, using {func}`ak.str.split_pattern`.

```{code-cell} ipython3
timestamp_split = ak.str.split_pattern(timestamp_with_year, ".", max_splits=1)
timestamp_split
```

```{code-cell} ipython3
timestamp_non_fractional = timestamp_split[:, 0]
timestamp_fractional = timestamp_split[:, 1]
```

Now we can parse these timestamps using Arrow!

```{code-cell} ipython3
import pyarrow.compute

datetime = ak.from_arrow(
    pyarrow.compute.strptime(
        ak.to_arrow(timestamp_non_fractional, extensionarray=False),
        "%Y-%m-%d %H:%M:%S",
        "ms",
    )
)
datetime
```

Finally, we build an offset for the fractional component (in milliseconds) using {func}`ak.strings_astype`.

```{code-cell} ipython3
import numpy as np

datetime_offset = ak.strings_astype(timestamp_fractional, np.dtype("timedelta64[ms]"))
datetime_offset
```

This offset is added to the absolute datetime to obtain the full timestamp.

```{code-cell} ipython3
timestamp = datetime + datetime_offset
timestamp
```
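To sanity-check the result for one value, the same split, parse, and add-offset steps can be reproduced with only the standard library. This is a minimal sketch that assumes the fractional part always has exactly three digits (so that it can be read directly as milliseconds); the variable names are illustrative.

```python
import datetime as dt

text = "2022-12-17 19:31:36.263"

# Separate the part that strptime can parse from the fractional part.
whole, millis = text.split(".", 1)

# Parse the non-fractional part, then add the milliseconds as an offset.
parsed = dt.datetime.strptime(whole, "%Y-%m-%d %H:%M:%S")
result = parsed + dt.timedelta(milliseconds=int(millis))

print(result.isoformat(timespec="milliseconds"))
# 2022-12-17T19:31:36.263
```

If the fraction could have fewer or more than three digits, reading it as an integer number of milliseconds would give the wrong offset, and the digits would need to be rescaled first.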

If we had a different parsing library that could only handle dates and times separately, then we could also split on the whitespace. Although {func}`ak.str.split_pattern` supports whitespace, it is more performant (and versatile) to use {func}`ak.str.split_whitespace`.

```{code-cell} ipython3
ak.str.split_whitespace(timestamp_with_year)
```

If we also needed to split off the fractional component (and manually build the time delta), then we could have used {func}`ak.str.split_pattern_regex` to split on both whitespace *and* the period.

```{code-cell} ipython3
ak.str.split_pattern_regex(timestamp_with_year, r"\.|\s")
```
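The difference between the two kinds of split can be checked on a single string with the standard library. This sketch uses `str.split` and `re.split` as stand-ins for intuition only; it does not claim that the Awkward functions share their exact semantics in every edge case.

```python
import re

text = "2022-12-17 19:31:36.263"

# Splitting on whitespace separates the date from the time.
date_and_time = text.split()
print(date_and_time)  # ['2022-12-17', '19:31:36.263']

# The regex matches either a literal period or a whitespace character,
# so the fractional component is split off as well.
pieces = re.split(r"\.|\s", text)
print(pieces)  # ['2022-12-17', '19:31:36', '263']
```

The period is escaped in the pattern because an unescaped `.` matches any character in a regular expression.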