---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

How to convert to/from Arrow and Parquet
========================================

The [Apache Arrow](https://arrow.apache.org/) data format is very similar to Awkward Array's, but they're not exactly the same. As such, arrays can _usually_ be shared without copying, but not always.

The [Apache Parquet](https://parquet.apache.org/) file format has strong connections to Arrow with a large overlap in available tools, and while it's also a columnar format like Awkward and Arrow, it is implemented in a different way, which emphasizes compact storage over random access.

```{code-cell} ipython3
import awkward as ak
import pyarrow as pa
import pyarrow.csv
import urllib.request
```

From Arrow to Awkward
---------------------

The function for Arrow → Awkward conversion is {func}`ak.from_arrow`.

The argument to this function can be any of the following types from the pyarrow library:

   * {class}`pyarrow.lib.Array`
   * {class}`pyarrow.lib.ChunkedArray`
   * {class}`pyarrow.lib.RecordBatch`
   * {class}`pyarrow.lib.Table`

and they are converted into contiguous, fully materialized Awkward Arrays. (Any disjoint chunks in the Arrow array are concatenated.)
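
For example, here is a quick illustration of that chunk concatenation, using a {class}`pyarrow.lib.ChunkedArray` built on the spot:

```{code-cell} ipython3
# Two disjoint chunks in Arrow become one contiguous Awkward Array.
pa_chunked = pa.chunked_array([[1.1, 2.2], [3.3, 4.4, 5.5]])
ak.from_arrow(pa_chunked)
```

The simplest case is a plain {class}`pyarrow.lib.Array`: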

```{code-cell} ipython3
pa_array = pa.array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
pa_array
```

```{code-cell} ipython3
ak.from_arrow(pa_array)
```
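
As the introduction says, this conversion usually shares memory with the original Arrow buffers. Below is a minimal sketch of one way to check that; it assumes layout details (an option-type node, since Arrow data are always nullable, wrapping a NumpyArray) that are illustrative rather than guaranteed:

```{code-cell} ipython3
pa_values = pa.array([1.1, 2.2, 3.3])
ak_view = ak.from_arrow(pa_values)

# buffers()[0] is the validity bitmap (None when there are no nulls);
# buffers()[1] holds the floating-point values.
arrow_address = pa_values.buffers()[1].address

# Assumption: the converted layout is an option-type wrapper whose .content
# is a NumpyArray viewing the same buffer.
awkward_address = ak_view.layout.content.data.ctypes.data

arrow_address == awkward_address
```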

Here is an example of an Arrow Table, derived from CSV. (Printing a table shows its field types.)

```{code-cell} ipython3
pokemon = urllib.request.urlopen(
    "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
)
table = pyarrow.csv.read_csv(pokemon)
table
```

Awkward Array doesn't make a deep distinction between "arrays" and "tables" the way Arrow does: the Awkward equivalent of an Arrow table is just an Awkward Array of record type.

```{code-cell} ipython3
array = ak.from_arrow(table)
array
```

The Awkward equivalent of Arrow's schemas is {func}`ak.type`.

```{code-cell} ipython3
ak.type(array)
```

```{code-cell} ipython3
ak.to_list(array[0])
```

This array is ready for data analysis.

```{code-cell} ipython3
array[array.Legendary].Attack - array[array.Legendary].Defense
```

From Awkward to Arrow
---------------------

The function for Awkward → Arrow conversion is {func}`ak.to_arrow`. This function always returns a {class}`pyarrow.lib.Array`.

```{code-cell} ipython3
ak_array = ak.Array(
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}]
)
ak_array
```

```{code-cell} ipython3
pa_array = ak.to_arrow(ak_array)
pa_array
```

```{code-cell} ipython3
type(pa_array)
```

```{code-cell} ipython3
isinstance(pa_array, pa.lib.Array)
```

If you need {class}`pyarrow.lib.RecordBatch`, you can build this using pyarrow:

```{code-cell} ipython3
pa_batch = pa.RecordBatch.from_arrays(
    [
        ak.to_arrow(ak_array.x),
        ak.to_arrow(ak_array.y),
    ],
    ["x", "y"],
)
pa_batch
```

If you need {class}`pyarrow.lib.Table`, you can build this using pyarrow:

```{code-cell} ipython3
pa_table = pa.Table.from_batches([pa_batch])
pa_table
```
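
Alternatively, {func}`ak.to_arrow_table` goes from an Awkward Array of record type to a {class}`pyarrow.lib.Table` in one step:

```{code-cell} ipython3
ak.to_arrow_table(ak_array)
```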

The columns of this Table are {class}`pyarrow.lib.ChunkedArray` instances:

```{code-cell} ipython3
pa_table[0]
```

```{code-cell} ipython3
pa_table[1]
```

Conversion through Arrow, such as building a Pandas DataFrame from this Table, shares memory as much as is possible, which [can be faster than constructing Pandas directly](https://ursalabs.org/blog/fast-pandas-loading/).
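
For example, assuming Pandas is installed:

```{code-cell} ipython3
pa_table.to_pandas()
```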

+++

Reading/writing data streams and random access files
----------------------------------------------------

Arrow has several methods for interfacing to data streams and disk-bound files; [see the official documentation](https://arrow.apache.org/docs/python/ipc.html) for instructions.

When following those instructions, remember that {func}`ak.from_arrow` can accept {class}`pyarrow.lib.Array`, {class}`pyarrow.lib.ChunkedArray`, {class}`pyarrow.lib.RecordBatch`, and {class}`pyarrow.lib.Table`, but {func}`ak.to_arrow` _only returns {class}`pyarrow.lib.Array`_.

For instance, when writing to an IPC stream, Arrow requires {class}`pyarrow.lib.RecordBatch` objects, so you need to build them:

```{code-cell} ipython3
ak_array = ak.Array(
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}]
)
ak_array
```

```{code-cell} ipython3
first_batch = pa.RecordBatch.from_arrays(
    [
        ak.to_arrow(ak_array.x),
        ak.to_arrow(ak_array.y),
    ],
    ["x", "y"],
)
first_batch.schema
```

```{code-cell} ipython3
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, first_batch.schema)

writer.write_batch(first_batch)

for i in range(5):
    next_batch = pa.RecordBatch.from_arrays(
        [
            ak.to_arrow(ak_array.x),
            ak.to_arrow(ak_array.y),
        ],
        ["x", "y"],
    )

    writer.write_batch(next_batch)

writer.close()
```

```{code-cell} ipython3
bytes(sink.getvalue())
```

But when reading them back, we can just pass the record batches (yielded by the {class}`pyarrow.lib.RecordBatchStreamReader` `reader`) to {func}`ak.from_arrow`:

```{code-cell} ipython3
reader = pa.ipc.open_stream(sink.getvalue())
reader.schema
```

```{code-cell} ipython3
for batch in reader:
    print(repr(ak.from_arrow(batch)))
```
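
If you want one Awkward Array rather than one per batch, you can concatenate them with {func}`ak.concatenate`; here is a short sketch (the stream is reopened because the reader above has been exhausted):

```{code-cell} ipython3
reader = pa.ipc.open_stream(sink.getvalue())
ak.concatenate([ak.from_arrow(batch) for batch in reader])
```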

Reading/writing the Feather file format
---------------------------------------

Feather is a lightweight file format that puts Arrow Tables in disk-bound files; [see the official documentation](https://arrow.apache.org/docs/python/feather.html) for instructions.

When following those instructions, remember that {func}`ak.from_arrow` can accept {class}`pyarrow.lib.Table`, but {func}`ak.to_arrow` _only returns {class}`pyarrow.lib.Array`_.

For instance, when writing to a Feather file, Arrow requires a {class}`pyarrow.lib.Table`, so you need to build one:

```{code-cell} ipython3
ak_array = ak.Array(
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}]
)
ak_array
```

```{code-cell} ipython3
pa_batch = pa.RecordBatch.from_arrays(
    [
        ak.to_arrow(ak_array.x),
        ak.to_arrow(ak_array.y),
    ],
    ["x", "y"],
)

pa_table = pa.Table.from_batches([pa_batch])
pa_table
```

```{code-cell} ipython3
import pyarrow.feather

pyarrow.feather.write_feather(pa_table, "/tmp/example.feather")
```

But when reading it back, we can just pass the Arrow Table to {func}`ak.from_arrow`.

```{code-cell} ipython3
from_feather = pyarrow.feather.read_table("/tmp/example.feather")
from_feather
```

```{code-cell} ipython3
type(from_feather)
```

```{code-cell} ipython3
ak.from_arrow(from_feather)
```

Reading/writing the Parquet file format
---------------------------------------

With data converted to and from Arrow, it can then be saved and loaded from Parquet files. [Arrow's official Parquet documentation](https://arrow.apache.org/docs/python/parquet.html) provides instructions for converting Arrow to and from Parquet, but Parquet is a sufficiently important file format that Awkward has specialized functions for it.

The {func}`ak.to_parquet` function writes Awkward Arrays as Parquet files. It has relatively few options.

```{code-cell} ipython3
ak_array = ak.Array(
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}]
)
ak_array
```

```{code-cell} ipython3
ak.to_parquet(ak_array, "/tmp/example.parquet")
```

The {func}`ak.from_parquet` function reads Parquet files as Awkward Arrays, with quite a few more options. Basic usage just gives you the Awkward Array back.

```{code-cell} ipython3
ak.from_parquet("/tmp/example.parquet")
```

Since the data in a Parquet file may be huge, there are `columns` and `row_groups` options to read back only _part_ of the file.

   * Parquet's "columns" correspond to Awkward's record "fields," though Parquet columns cannot be nested.
   * Parquet's "row groups" are ranges of contiguous elements, such as "`1000-2000`"; {func}`ak.from_parquet` selects them by index (see the example below). They correspond to Awkward's "partitioning." Neither Parquet row groups nor Awkward partitions can be nested.

For instance, the expression

```{code-cell} ipython3
ak.from_parquet("/tmp/example.parquet", columns=["x"])
```

doesn't read column `"y"`.
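
The `row_groups` option similarly reads back only part of the file, selecting row groups by index. This small file has only one row group, so index `0` selects everything, but the pattern is the same for large files:

```{code-cell} ipython3
ak.from_parquet("/tmp/example.parquet", row_groups=[0])
```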