File: arrow.Rmd

package info (click to toggle)
apache-arrow 23.0.1-1
  • links: PTS
  • area: main
  • in suites: sid
  • size: 76,220 kB
  • sloc: cpp: 654,608; python: 70,522; ruby: 45,964; ansic: 18,742; sh: 7,365; makefile: 669; javascript: 125; xml: 41
file content (221 lines) | stat: -rw-r--r-- 12,717 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
title: "Get started with Arrow"
description: >
  An overview of the Apache Arrow project and the arrow R package
output: rmarkdown::html_vignette
---

Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to improve the performance of data analysis methods, and to increase the efficiency of moving data from one system or programming language to another.

The arrow package provides a standard way to use Apache Arrow in R. It provides a low-level interface to the [Arrow C++ library](https://arrow.apache.org/docs/cpp), and some higher-level tools for working with it in a way designed to feel natural to R users. This article provides an overview of how the pieces fit together, and it describes the conventions that the classes and methods follow in R.

## Package conventions

The arrow R package builds on top of the Arrow C++ library, and C++ is an object oriented language. As a consequence, the core logic of the Arrow C++ library is encapsulated in classes and methods. In the arrow R package these are implemented as [`R6`](https://r6.r-lib.org) classes that all adopt "TitleCase" naming conventions. Some examples of these include:

- Two-dimensional, tabular data structures such as `Table`, `RecordBatch`, and `Dataset`
- One-dimensional, vector-like data structures such as `Array` and `ChunkedArray`
- Classes for reading, writing, and streaming data such as `ParquetFileReader` and `CsvTableReader`

This low-level interface allows you to interact with the Arrow C++ library in a very flexible way, but in many common situations you may never need to use it at all, because arrow also supplies a high-level interface using functions that follow a "snake_case" naming convention. Some examples of this include:

- `arrow_table()` allows you to create Arrow tables without directly using the `Table` object
- `read_parquet()` allows you to open Parquet files without directly using the `ParquetFileReader` object

All the examples used in this article rely on this high-level interface.

For developers interested in learning more about the package structure, see the [developer guide](./developing.html).


## Tabular data in Arrow 

A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. In the arrow R package, the `Table` class is used to store these objects. Tables are roughly analogous to data frames and have similar behavior. The `arrow_table()` function allows you to generate new Arrow Tables in much the same way that `data.frame()` is used to create new data frames:

```{r}
library(arrow, warn.conflicts = FALSE)

dat <- arrow_table(x = 1:3, y = c("a", "b", "c"))
dat
```

You can use `[` to specify subsets of Arrow Table in the same way you would for a data frame:

```{r}
dat[1:2, 1:2]
```

Along the same lines, the `$` operator can be used to extract named columns:

```{r}
dat$y
```

Note the output: individual columns in an Arrow Table are represented as Chunked Arrays, which are one-dimensional data structures in Arrow that are roughly analogous to vectors in R. 

Tables are the primary way to represent rectangular data in-memory using Arrow, but they are not the only rectangular data structure used by the Arrow C++ library: there are also Datasets which are used for data stored on-disk rather than in-memory, and Record Batches which are fundamental building blocks but not typically used in data analysis. 

To learn more about the different data object classes in arrow, see the article on [data objects](./data_objects.html).

## Converting Tables to data frames

Tables are a data structure used to represent rectangular data within memory allocated by the Arrow C++ library, but they can be coerced to native R data frames (or tibbles) using `as.data.frame()`

```{r}
as.data.frame(dat)
```

When this coercion takes place, each of the columns in the original Arrow Table must be converted to native R data objects. In the `dat` Table, for instance, `dat$x` is stored as the Arrow data type int32 inherited from C++, which becomes an R integer type when `as.data.frame()` is called. 

It is possible to exercise fine-grained control over this conversion process. To learn more about the different types and how they are converted, see the [data types](./data_types.html) article. 


## Reading and writing data

One of the main ways to use arrow is to read and write data files in
several common formats. The arrow package supplies extremely fast CSV reading and writing capabilities, but in addition supports data formats like Parquet and Arrow (also called Feather) that are not widely supported in other packages. In addition, the arrow package supports multi-file data sets in which a single rectangular data set is stored across multiple files. 

### Individual files

When the goal is to read a single data file into memory, there are several functions you can use:

-   `read_parquet()`: read a file in Parquet format
-   `read_feather()`: read a file in Arrow/Feather format
-   `read_delim_arrow()`: read a delimited text file 
-   `read_csv_arrow()`: read a comma-separated values (CSV) file
-   `read_tsv_arrow()`: read a tab-separated values (TSV) file
-   `read_json_arrow()`: read a JSON data file

In every case except JSON, there is a corresponding `write_*()` function 
that allows you to write data files in the appropriate format. 

By default, the `read_*()` functions will return a data frame or tibble, but you can also use them to read data into an Arrow Table. To do this, you need to set the `as_data_frame` argument to `FALSE`. 

In the example below, we take the `starwars` data provided by the dplyr package and write it to a Parquet file using `write_parquet()`

```{r}
library(dplyr, warn.conflicts = FALSE)

file_path <- tempfile(fileext = ".parquet")
write_parquet(starwars, file_path)
```

We can then use `read_parquet()` to load the data from this file. As shown below, the default behavior is to return a data frame (`sw_frame`) but when we set `as_data_frame = FALSE` the data are read as an Arrow Table (`sw_table`):

```{r}
sw_frame <- read_parquet(file_path)
sw_table <- read_parquet(file_path, as_data_frame = FALSE)
sw_table
```

To learn more about reading and writing individual data files, see the [read/write article](./read_write.html).

### Multi-file data sets

When a tabular data set becomes large, it is often good practice to partition the data into meaningful subsets and store each one in a separate file. Among other things, this means that if only one subset of the data are relevant to an analysis, only one (smaller) file needs to be read. The arrow package provides the Dataset interface, a convenient way to read, write, and analyze a single data file that is larger-than-memory and multi-file data sets. 

To illustrate the concepts, we'll create a nonsense data set with 100000 rows that can be split into 10 subsets:

```{r}
set.seed(1234)
nrows <- 100000
random_data <- data.frame(
  x = rnorm(nrows),
  y = rnorm(nrows),
  subset = sample(10, nrows, replace = TRUE)
)
```

What we might like to do is partition this data and then write it to 10 separate Parquet files, one corresponding to each value of the `subset` column. To do this we first specify the path to a folder into which we will write the data files:

```{r}
dataset_path <- file.path(tempdir(), "random_data")
```

We can then use `group_by()` function from dplyr to specify that the data will be partitioned using the `subset` column, and then pass the grouped data to `write_dataset()`:

```{r}
random_data |>
  group_by(subset) |>
  write_dataset(dataset_path)
```

This creates a set of 10 files, one for each subset. These files are named according to the "hive partitioning" format as shown below:

```{r}
list.files(dataset_path, recursive = TRUE)
```

Each of these Parquet files can be opened individually using `read_parquet()` but is often more convenient -- especially for very large data sets -- to scan the folder and "connect" to the data set without loading it into memory. We can do this using `open_dataset()`:

```{r}
dset <- open_dataset(dataset_path)
dset
```

This `dset` object does not store the data in-memory, only some metadata. However, as discussed in the next section, it is possible to analyze the data referred to be `dset` as if it had been loaded.

To learn more about Arrow Datasets, see the [dataset article](./dataset.html).

## Analyzing Arrow data with dplyr

Arrow Tables and Datasets can be analyzed using dplyr syntax. This is possible because the arrow R package supplies a backend that translates dplyr verbs into commands that are understood by the Arrow C++ library, and will similarly translate R expressions that appear within a call to a dplyr verb. For example, although the `dset` Dataset is not a data frame (and does not store the data values in memory), you can still pass it to a dplyr pipeline like the one shown below:

```{r}
dset |>
  group_by(subset) |>
  summarize(mean_x = mean(x), min_y = min(y)) |>
  filter(mean_x > 0) |>
  arrange(subset) |>
  collect()
```

Notice that we call `collect()` at the end of the pipeline. No actual computations are performed until `collect()` (or the related `compute()` function) is called. This "lazy evaluation" makes it possible for the Arrow C++ compute engine to optimize how the computations are performed. 

To learn more about analyzing Arrow data, see the [data wrangling article](./data_wrangling.html). The [list of functions available in dplyr queries](https://arrow.apache.org/docs/r/reference/acero.html) page may also be useful.

## Connecting to cloud storage

Another use for the arrow R package is to read, write, and analyze data sets stored remotely on cloud services. The package currently supports both Amazon Simple Storage Service (S3) and Google Cloud Storage (GCS). The example below illustrates how you can use `s3_bucket()` to refer to a an S3 bucket, and use `open_dataset()` to connect to the data set stored there:

```{r, eval=FALSE}
bucket <- s3_bucket("arrow-datasets/nyc-taxi")
nyc_taxi <- open_dataset(bucket)
```

To learn more about the support for cloud services in arrow, see the [cloud storage](./fs.html) article.

## Efficient data interchange between R and Python

The [reticulate](https://rstudio.github.io/reticulate/) package provides an interface that allows you to call Python code from R. The arrow package is designed to be interoperable with reticulate. If the Python environment has the pyarrow library installed (the Python equivalent to the arrow package), you can pass an Arrow Table from R to Python using the `r_to_py()` function in reticulate as shown below:

```{r}
library(reticulate)

sw_table_python <- r_to_py(sw_table)
```

The `sw_table_python` object is now stored as a pyarrow Table: the Python equivalent of the Table class. You can see this when you print the object:

```{r}
sw_table_python
```

It is important to recognize that when this transfer takes place, only the C++ pointer (i.e., metadata referring to the underlying data object stored by the Arrow C++ library) is copied. The data values themselves in the same place within memory. The consequence of this is that it is much faster to pass an Arrow Table from R to Python than to copy a data frame in R to a Pandas DataFrame in Python. 

To learn more about passing Arrow data between R and Python, see the article on [python integrations](./python.html).

## Access to Arrow messages, buffers, and streams

The arrow package also provides many lower-level bindings to the C++ library, which enable you
to access and manipulate Arrow objects. You can use these to build connectors
to other applications and services that use Arrow. One example is Spark: the
[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to
move data to and from Spark, yielding [significant performance
gains](https://arrow.apache.org/blog/2019/01/25/r-spark-improvements/).

## Contributing to arrow

Apache Arrow is an extensive project spanning multiple languages, and the arrow R package is only one part of this large project. Because of this there are a number of special considerations for developers who would like to contribute to the package. To help make this process easier, there are several articles in the arrow documentation that discuss topics that are relevant to arrow developers, but are very unlikely to be needed by users.

For an overview of the development process and a list of related articles for developers, see the [developer guide](./developing.html).