---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.1
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---
How to convert to/from ROOT RDataFrame
======================================
The [ROOT RDataFrame](https://root.cern.ch/doc/master/classROOT_1_1RDataFrame.html) is a declarative, parallel framework for data analysis and manipulation. `RDataFrame` reads columnar data via a data source. Transformations can be applied to the data to select rows, to define new columns, and to produce results such as histograms.
```{code-cell} ipython3
import awkward as ak
import ROOT
```
From Awkward to RDataFrame
--------------------------
The function for Awkward → `RDataFrame` conversion is {func}`ak.to_rdataframe`.
This function takes a dictionary as its argument: `{ <column name string> : <awkward array> }`. It always returns a
* {class}`cppyy.gbl.ROOT.RDF.RInterface<ROOT::Detail::RDF::RLoopManager,void>`
object.
```{code-cell} ipython3
array_x = ak.Array(
    [
        {"x": [1.1, 1.2, 1.3]},
        {"x": [2.1, 2.2]},
        {"x": [3.1]},
        {"x": [4.1, 4.2, 4.3, 4.4]},
        {"x": [5.1]},
    ]
)
array_y = ak.Array([1, 2, 3, 4, 5])
array_z = ak.Array([[1.1], [2.1, 2.3, 2.4], [3.1], [4.1, 4.2, 4.3], [5.1]])
```
The arrays given for each column must have equal length:
```{code-cell} ipython3
assert len(array_x) == len(array_y) == len(array_z)
```
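A check like this can be wrapped in a small helper that reports which column is off. The sketch below is plain Python and assumes only that each column supports `len()`; the name `check_equal_lengths` is hypothetical and not part of Awkward Array.

```python
def check_equal_lengths(columns):
    """Raise ValueError unless every column in the dict has the same length.

    `columns` maps column names to any sized objects (lists, ak.Array, ...).
    Hypothetical helper for illustration; not part of the Awkward Array API.
    """
    lengths = {name: len(column) for name, column in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    # Return the common length, or 0 for an empty dictionary.
    return next(iter(lengths.values()), 0)


# Plain lists stand in for the arrays defined above.
n_rows = check_equal_lengths(
    {"y": [1, 2, 3, 4, 5], "z": [[1.1], [2.1], [3.1], [4.1], [5.1]]}
)
```

Calling the helper before the conversion surfaces a mismatch together with the per-column lengths, which is easier to act on than a bare assertion failure.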
Each dictionary key defines a column name in the `RDataFrame`.
```{code-cell} ipython3
df = ak.to_rdataframe({"x": array_x, "y": array_y, "z": array_z})
```
The {func}`ak.to_rdataframe` function presents a generated-on-demand Awkward Array view as an `RDataFrame` source. There is a small overhead of generating Awkward RDataSource C++ code. This operation does not execute the `RDataFrame` event loop. The array data are not copied.
The column readers are generated based on the run-time type of the views. Here is a description of the `RDataFrame` columns:
```{code-cell} ipython3
df.Describe().Print()
```
The `x` column contains an Awkward Array with a made-up type: `awkward::Record_cKnX5DyNVM`.
Awkward Arrays are dynamically typed, so in a C++ context, the type name is hashed. In practice, there is no need to know the type. C++ code should use the placeholder type specifier `auto`, so that the type of the declared variable is deduced automatically from its initializer.
From RDataFrame to Awkward
--------------------------
The function for `RDataFrame` → Awkward conversion is {func}`ak.from_rdataframe`. Its `columns` argument accepts a tuple of strings that are the `RDataFrame` column names. By default this function returns
* {class}`ak.Array`
type.
```{code-cell} ipython3
array = ak.from_rdataframe(
    df,
    columns=(
        "x",
        "y",
        "z",
    ),
)
array
```
When `RDataFrame` runs multi-threaded event loops, the entry processing order is not guaranteed:
```{code-cell} ipython3
ROOT.ROOT.EnableImplicitMT()
```
+++
Let's recreate the dataframe so that it reflects the new multi-threading mode:
```{code-cell} ipython3
df = ak.to_rdataframe({"x": array_x, "y": array_y, "z": array_z})
```
+++
If the `keep_order` parameter is set to `True`, the entries keep their original order after filtering:
```{code-cell} ipython3
df = df.Filter("y % 2 == 0")
array = ak.from_rdataframe(
    df,
    columns=(
        "x",
        "y",
        "z",
    ),
    keep_order=True,
)
array
```
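One way to picture what order preservation involves: tag every entry with its index, let workers finish in any order, then sort by that index. The sketch below uses only the Python standard library. It illustrates the idea and is not a description of how {func}`ak.from_rdataframe` is implemented; the names `process_entry` and `run_unordered` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_entry(index, value):
    # Keep the entry index alongside the result so the order can be restored.
    return index, value * 10


def run_unordered(values, max_workers=4):
    """Process entries on a thread pool; results arrive in completion order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_entry, i, v) for i, v in enumerate(values)]
        return [future.result() for future in as_completed(futures)]


y_values = [2, 4]  # the entries of `y` that pass the filter above
tagged = run_unordered(y_values)  # completion order may vary between runs
restored = [result for _, result in sorted(tagged)]  # sort by entry index
```

Sorting costs extra work after the loop, which is a plausible reason for order preservation to be opt-in.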