1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
|
---
title: "Metadata"
description: >
Learn how Arrow uses Schemas to document structure of data objects,
and how R metadata are supported in Arrow
output: rmarkdown::html_vignette
---
This article describes the various data and metadata object types supplied by arrow, and documents how these objects are structured.
```{r include=FALSE}
library(arrow, warn.conflicts = FALSE)
```
## Arrow metadata classes
The arrow package defines the following classes for representing metadata:
- A `Schema` is a list of `Field` objects used to describe the structure of a tabular data object; where
- A `Field` specifies a character string name and a `DataType`; and
- A `DataType` is an attribute controlling how values are represented
Consider this:
```{r}
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- arrow_table(df)
tb$schema
```
The schema that has been automatically inferred could also be manually created:
```{r}
schema(
field(name = "x", type = int32()),
field(name = "y", type = utf8())
)
```
The `schema()` function allows the following shorthand to define fields:
```{r}
schema(x = int32(), y = utf8())
```
Sometimes it is important to specify the schema manually, particularly if you want fine-grained control over the Arrow data types:
```{r}
arrow_table(df, schema = schema(x = int64(), y = utf8()))
arrow_table(df, schema = schema(x = float64(), y = utf8()))
```
## R object attributes
Arrow supports custom key-value metadata attached to Schemas. When we convert a `data.frame` to an Arrow Table or RecordBatch, the package stores any `attributes()` attached to the columns of the `data.frame` in the Arrow object Schema. Attributes added to objects in this fashion are stored under the `r` key, as shown below:
```{r}
# data frame with custom metadata
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
attr(df, "df_meta") <- "custom data frame metadata"
attr(df$y, "col_meta") <- "custom column metadata"
# when converted to a Table, the metadata is preserved
tb <- arrow_table(df)
tb$metadata
```
It is also possible to assign additional string metadata under any other key you wish, using a command like this:
```{r}
tb$metadata$new_key <- "new value"
```
Metadata attached to a Schema is preserved when writing the Table to Arrow/Feather or Parquet formats. When reading those files into R, or when calling `as.data.frame()` on a Table or RecordBatch, the column attributes are restored to the columns of the resulting `data.frame`. This means that custom data types, including `haven::labelled`, `vctrs` annotations, and others, are preserved when doing a round-trip through Arrow.
Note that the attributes stored in `$metadata$r` are only understood by R. If you write a `data.frame` with `haven` columns to a Feather file and read that in Pandas, the `haven` metadata won't be recognized there. Similarly, Pandas writes its own custom metadata, which the R package does not consume. You are free, however, to define custom metadata conventions for your application and assign any (string) values you want to other metadata keys.
## Further reading
- To learn more about arrow metadata, see the documentation for `schema()`.
- To learn more about data types, see the [data types article](./data_types.html).
|