1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
|
---
title: The Data Lake and Schema On Read
author: Carl Boettiger
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{The Data Lake and Schema On Read}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r include = FALSE}
knitr::opts_chunk$set(message=FALSE, warning = FALSE, cache = TRUE)
```
A provocative recent analogy (e.g. [Archer, 2017](https://www.w3.org/blog/2017/04/dive-into-the-semantic-data-lake/)) for thinking about RDF is that of the Data Lake. Whereas adding new data to a traditional relational database frequently involves laborious wrangling into the an existing rigid data *schema* of tables and columns, RDF can enable a far more simple approach of just 'tossing everything into the data lake.' While this may sound like a recipe for a terrible mess down the road, RDF gives the promise of *schema-on-read*: rather than dictating the shape of your data when first adding it to the database (i.e. schema-on-write), a SPARQL query auto-magically reaches into the lake and retrieves all the relevant data, returning it in exactly the schema you ask for; that is, as a nice single table containing only the requested columns.
This can serve as a very effective means of data integration (provided a reasonably consistent and diligent use of URIs in identifying subjects and properties (predicates)), since just about any data can be added to the lake without worrying about whether it comes in a schema that matches the existing architecture of the database. It is this flexibility not to have to define your database schema at the start that is the primary strength of the RDF approach.
A key advantage of this approach is that it is equally as easy to extract your desired data.frame from non-rectangular data as it is from tables or relational database systems. While the previous vignette focused on simple examples including data from single `data.frames` or small JSON files, this vignette showcases the Data Lake approach on a more complex relational database organized over several tables (`nyflights13` data, used to teach `joins` and other relational data in [Grolemund & Wickham, 2017](http://r4ds.had.co.nz)), and also in a large non-tabular file returned from the GitHub API (e.g. as used in [Bryan & Wickham, 2017 ](https://dcl-2017-04.github.io/curriculum/rectangling.html) to teach data rectangling.)
Load the necessary libraries to get started:
```{r libraries}
## Data
library(nycflights13)
library(repurrrsive)
## for comparison approaches
library(dplyr)
library(purrr)
library(jsonlite)
## Our focal package:
library(rdflib)
## experimental functions for rdflib package
source(system.file("examples/tidy_schema.R", package="rdflib"))
```
Configure RDF storage to use the BDB backend for on-disk storage.
```{r options, include = TRUE}
rdf <- rdf(storage = "BDB", new_db = TRUE)
```
# Relational data
## The `tidyverse` approach
`tidyverse` operations are incredibly effective for working with relational data. These `dplyr` on the `nyflights13` dataset are easy to write and fast to execute:
```{r tidyverse}
df <- flights %>%
left_join(airlines) %>%
left_join(planes, by="tailnum") %>%
select(carrier, name, manufacturer, model) %>%
distinct()
head(df)
```
Still, joins are often a challenge in data preparation. Tabular formats can often be sloppy about what is a key column and what is a literal value, and also whether a column with the same name in different tables means the same thing in both. Both of these things pose challenges for later use when joining data. RDF representation encourages greater discipline through the use of URIs (though we'll run a bit roughshod over that with the caviler use of `x:` here.)
This example uses only data that is already part of the relational database. Adding additional information to the database is frequently more tricky, and can result in a rapidly expanding number of tables that can become difficult to work across.
## RDF approach
Okay, now let's dump the `nyflights13` into the data lake. Foreign keys in any table must be represented as URIs and not literal strings:
```{r as_uri}
uri_flights <- flights %>%
mutate(tailnum = paste0("planes:", tailnum),
carrier = paste0("airlines:", carrier))
```
We write the `data.frame`s out as nquads. Recall that each cell of a `data.frame` can be represented as a triple, in which the column is the predicate, the primary key (or row number) the subject, and the cell value the object. We turn column names and primary keys into URIs using a prefix based on the table name. (Note that `rdflib` does this conversion by merely munging cells and calling `write.table`, it is not a standard `redland` library transform).
```{r into_quads}
system.time({
write_nquads(airlines, "airlines.nq", key = "carrier", prefix = "airlines:")
write_nquads(planes, "planes.nq", key = "tailnum", prefix = "planes:")
write_nquads(uri_flights, "flights.nq", prefix = "flights:")
})
```
We can now read these into our RDF data lake:
```{r write_rdf}
system.time({
read_nquads("airlines.nq", rdf = rdf)
read_nquads("flights.nq", rdf = rdf)
read_nquads("planes.nq", rdf = rdf)
})
```
Note that flights does not have a natural key (somewhat surprisingly, `flight` number is not a unique key for this table, as the same flight number is reused on the same route at different times.) So, we will treat each row as a unique anonymous key by setting the key to `NULL`.
## Schema on read
We simply define the columns we want and we immediately get back the desired `data.frame`:
```{r query}
s <-
'SELECT ?carrier ?name ?manufacturer ?model ?dep_delay
WHERE {
?flight <flights:tailnum> ?tailnum .
?flight <flights:carrier> ?carrier .
?flight <flights:dep_delay> ?dep_delay .
?tailnum <planes:manufacturer> ?manufacturer .
?tailnum <planes:model> ?model .
?carrier <airlines:name> ?name
}'
system.time(
df <- rdf_query(rdf, s)
)
head(df)
```
Note that in place of joins, we give more semantically meaningful statements about the data:
e.g. `manufacturer` is a property of a `tailnum` (corresponding to a particular physical aircraft), not of a `flight` number. Departure delay `dep_delay` is a property of a flight, not of an aircraft (`tailnum`).
This is reminiscent of the way in which these data are organized in the relational database tables to begin with: we find `deb_delay` in the `flights` table and `manufacturer` in the `planes` table. Good relational design encourages this, but to work with the data the user is often left having to do the required joins, which also creates tables where these semantics are less clear.
# Non-tabular data
## The `tidyverse` approach
We start with data from [Bryan & Wickham, 2017 ](https://dcl-2017-04.github.io/curriculum/rectangling.html) lesson on data rectangling:
```{r}
f <- system.file("extdata/gh_repos.json", package="repurrrsive")
gh_data <- jsonlite::read_json(f)
```
The original lesson illustrates the power and reasonably concise syntax of the `tidyverse` package, `purrr`, to iterate over this complex structure to extract the necessary data. In this approach, nesting of the data is largely a nuisance to overcome rather than an asset to the data analyst:
```{r }
gh_flat <- gh_data %>% purrr::flatten() # abandon nested structure and hope we didn't need it
gh_tibble <- tibble(
name = gh_flat %>% map_chr("name"),
issues = gh_flat %>% map_int("open_issues_count"),
wiki = gh_flat %>% map_lgl("has_wiki"),
homepage = gh_flat %>% map_chr("homepage", .default = ""),
owner = gh_flat %>% map_chr(c("owner", "login"))
)
gh_tibble %>% arrange(name) %>% head()
```
## RDF on non-tabular data
The RDF approach merely treats JSON as JSON-LD within a given vocabulary. In this context, nesting implicitly provides important semantic information about relationships between the data, which are captured in the RDF triples. Here, we import the JSON data as RDF (and add it to our existing triplestore just for fun)
```{r gh_add}
gh_rdf <- as_rdf(gh_data, rdf = rdf, prefix = "gh:")
```
And we can query it back out of the lake just by selecting the columns of interest.
```{r gh_query}
s <-
'SELECT ?name ?issues ?wiki ?homepage ?owner
WHERE {
?repo <gh:homepage> ?homepage .
?repo <gh:has_wiki> ?wiki .
?repo <gh:open_issues_count> ?issues .
?repo <gh:name> ?name .
?repo <gh:owner> ?owner_id .
?owner_id <gh:login> ?owner
}'
system.time(
rdf_tibble <- rdf_query(rdf, s)
)
head(rdf_tibble)
```
## Going further: A `dplyr`-style syntax for SPARQL?
`dplyr` provides a reasonably intuitive, powerful, and concise interface for many common SQL commands. Indeed, `dplyr` function calls are literally serialized to the corresponding SQL query when working with a relational database back-end (via `dbplyr`). Can a similar API be developed for SPARQL queries?
SPARQL syntax is obviously inspired by SQL syntax, and includes many of the same operations found in SQL and `dplyr` (e.g. SELECT, WHERE, FILTER, DISTINCT) as well as other more RDF specific queries.
If we are willing to make some assumptions about the most common queries, we can start to make some simplifying functions. For instance, in the above patterns, the variables being returned are always objects from the triples, with columns being named using the corresponding predicate. Assuming some additional convention to define the prefixes and indicate graph traversal (i.e. nested values), we could have constructed the above query from a call:
```r
tidy_schema(name, open_issues_count, has_wiki, homepage owner.login)
```
Though to generalize to arbitrary labels and predicates this might need to support something like:
```r
tidy_schema(list = list(name = "gh:name",
issues = "gh:open_issues_count",
wiki = "gh:has_wiki",
homepage = "gh:homepage",
owner = list("gh:owner", login = "gh:login")))
```
|