File: storage.Rmd

package info (click to toggle)
r-cran-rdflib 0.2.5%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm
  • size: 596 kB
  • sloc: xml: 66; sh: 13; makefile: 2
file content (385 lines) | stat: -rw-r--r-- 12,192 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
---
title: "Working with Database Storage Backends in `rdflib`"
author: "Carl Boettiger"
date: "2020-01-09"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---




By default, `rdflib` stores triples in memory, which can be read in from and written out to files on disk using a variety of serializations (`rdfxml`, `json-ld`, `ntriples`, `nquads` or `turtle`).  This is analgous to reading in a `data.frame` object from a `csv`, `tsv`, or other text-delimited file, which requires the full data object to be read into memory before a user can manipulate it.  This approach runs into limitations with very large datasets, which may exceed the memory available.  Just as R packages such as `dplyr` offer the ability to perform manipulations with very large data through a connection to a database backend such as SQLite, MySQL or PostgreSQL, `rdflib` can rely on these and other databases to provide a disk-based backend. 

### Installation

Unfortunately, at this time, support for these storage devices is not included in the prebuild Mac and Windows binaries of the `redland` R package.  Users wanting to take advantage of disk-based storage must instead build and install the `redland` R package from source; and in some cases, also build the `redland` C libraries from source.  This package provides a [Dockerfile](https://github.com/ropensci/rdflib/blob/master/inst/docker/Dockerfile) containing a portable recipe for building the library with support for the 5 main backend storage devices: SQLite, MySQL, PostgreSQL, Virtuoso, and Berkeley DB.  This vignette documents the installation and use of these devices.  

Also see the [official documentation](http://librdf.org/docs/api/redland-storage-modules.html) for the redland C library discussing all supported storage devices.  


# Benchmarks

Here we show examples of reading in a modest set of 190,000 triples from an `nquads` file and executing a simple SPARQL query using each the five backends supported by `rdflib`.  First we load the libraries and prepare an example file by triple-ifying 10,000 rows of the `flights` data.frame:


```r
library(rdflib)
library(nycflights13)
```


```r
example <- flights[1e4,]

system.time(
write_nquads(example, "example.nq", prefix = "flights")
)
#>    user  system elapsed 
#>   0.009   0.001   0.008

system.time(
write_nquads(flights, "flights.nq", prefix = "flights")
)
#>    user  system elapsed 
#>  15.107   0.474  15.632
```

## In Memory

Because the dataset is small enough to easily fit in memory, the default in-memory option is an obvious choice with excellent performance.  Note that this option will not be possible with larger triplestores (e.g. with millions or more triples).  Our testing has found that even on machines with 100GB+ of RAM, the redland in-memory backend is not able to take advantage of that additional memory and disk-based backends are required. 


```r
triplestore <- rdf()
```


```r

system.time(
  read_nquads("example.nq", rdf = triplestore) # smaller set
)
#>    user  system elapsed 
#>   0.002   0.000   0.002
```


```r
query <- 
  'SELECT  ?carrier ?flight ?origin ?dep_delay
WHERE {
?flight <flights:carrier>  ?carrier .
?flight <flights:dep_delay>  ?dep_delay .
?flight <flights:origin> ?origin
}'
```


```r
system.time(
df <- rdf_query(triplestore, query)
)
#>    user  system elapsed 
#>   0.011   0.008   0.019
```


```r
rdf_free(triplestore)
```


## BDB

The [Berkeley DB](https://en.wikipedia.org/wiki/Berkeley_DB) is a simple key-value store 

> It is the most mature and primary persistent store and suitable for large models, tested in the 2-3 million range.

Berkeley DB is a simple disk-based storage option.  Install both the redland libraries and the berkeley-db (e.g. `bd-dev` on Debian/Ubuntu) libraries, and then install `redland` from source.  BDB is relatively fast for data too large for memory.



```r
triplestore <- rdf(storage="BDB", new_db = TRUE)
system.time(
  read_nquads("flights.nq", rdf = triplestore)
)
#>    user  system elapsed 
#>  55.985  15.686  71.787
```



```r
system.time(
df <- rdf_query(triplestore, query)
)
#>    user  system elapsed 
#>  15.001   0.216  15.222
```


```r
rdf_free(triplestore)
## Becuase BDB is just a hash table, redland needs three separate files:
unlink("rdflib-po2s.db")
unlink("rdflib-so2p.db")
unlink("rdflib-sp2o.db")
```

## Virtuoso

Unlike the other backends which use general purpose relational databases or key-value stores, [Virtuoso](https://en.wikipedia.org/wiki/Virtuoso_Universal_Server) dedicated to RDF-based data.  Virtuoso is a popular open source database for RDF with a rich set of built-in interfaces and features, but we can also interact with it directly through the `redland` bindings just like we do any other backend in `rdflib`.  Virtuoso setup may be slightly more involved for individuals unfamiliar with it, but will probably provide the best performance in the case of very large triplestores.  The example below shows a new database setup, but `rdflib` can also connect to any existing Virtuoso database. 


```r
triplestore <- rdf(storage = "virtuoso", 
                   user = "dba", 
                   password = "dba", 
                   dsn = "Local Virtuoso"
                   )

system.time(
  read_nquads("flights.nq", rdf = triplestore)
)
#>     user   system  elapsed 
#>  111.500   84.856 3336.717
```



```r
system.time(
df <- rdf_query(triplestore, query)
)
#>    user  system elapsed 
#>  17.899   9.056 133.143
df <- rdf_query(triplestore, "SELECT ?s ?p ?o WHERE{ ?s ?p ?o }")
#> Warning: 1391187 parsing failures.
#>     row col expected     actual         file
#> 1338850   o a double _:r1011167 literal data
#> 1338851   o a double _:r1011168 literal data
#> 1338852   o a double _:r1011169 literal data
#> 1338853   o a double _:r1011170 literal data
#> 1338854   o a double _:r1012106 literal data
#> ....... ... ........ .......... ............
#> See problems(...) for more details.
df
#> # A tibble: 6,398,744 x 3
#>    s          p                o
#>    <chr>      <chr>        <dbl>
#>  1 flights:1  flights:year  2013
#>  2 flights:2  flights:year  2013
#>  3 flights:3  flights:year  2013
#>  4 flights:4  flights:year  2013
#>  5 flights:5  flights:year  2013
#>  6 flights:6  flights:year  2013
#>  7 flights:7  flights:year  2013
#>  8 flights:8  flights:year  2013
#>  9 flights:9  flights:year  2013
#> 10 flights:10 flights:year  2013
#> # … with 6,398,734 more rows
```



```r
rdf_free(triplestore)
```

Or on remote virtuoso:



```r
triplestore <- rdf(storage = "virtuoso", 
                   user = "dba", 
                   password = "dba", 
                   host = "virtuoso:1111"
                   )

system.time(
  read_nquads("flights.nq", rdf = triplestore)
)
#>       user     system    elapsed 
#> 129710.695   2036.842 149941.651
```




## POSTGRES

Postgres and MySQL are ubiquitious relational databases.  This backend requires the `redland` binaries are built from source with this support enabled, which is not the case for pre-built Mac or Linux binaries.   


```r
triplestore <- rdf(storage = "postgres", 
                   host = "postgres", 
                   user = "postgres", 
                   password = "rdflib", 
                   new_db = TRUE)
#> Warning in rdf_storage(storage, world, host, port, user, password,
#> database, : postgres driver not found. Falling back on in-memory storage
system.time(
  read_nquads("flights.nq", rdf = triplestore)
)
#>    user  system elapsed 
#>  31.803  20.715  52.586
```




```r
system.time(
df <- rdf_query(triplestore, query)
)
#>    user  system elapsed 
#>   5.449   0.032   5.483
```


```r
rdf_free(triplestore)
```

## MySQL


```r
triplestore <- rdf(storage = "mysql", 
                   host = "mariadb", 
                   user = "root", 
                   password = "rdflib", 
                   database = "mysql",
                   new_db = TRUE
                  )
#> Warning in rdf_storage(storage, world, host, port, user, password,
#> database, : mysql driver not found. Falling back on in-memory storage
  read_nquads("flights.nq", rdf = triplestore)
#> Total of 1996496 triples, stored in hashes
#> -------------------------------
#>  
#>  (preview supressed for performance)
```



```r
system.time(df <- rdf_query(triplestore, query))
#>    user  system elapsed 
#>   5.510   0.004   5.515
```


```r
rdf_free(triplestore)
```




## SQLite


[SQLite](https://en.wikipedia.org/wiki/SQLite) is relatively easy to set up, but appears to have rather poor overall performance. Requires SQLite development libraries installed (should work 'out-of-the-box' with Mac binaries for `redland` package).  



```r
triplestore <- rdf(storage="sqlite", new_db = TRUE, name="rdflib.sqlite")
#> Warning in rdf_storage(storage, world, host, port, user, password,
#> database, : sqlite driver not found. Falling back on in-memory storage

system.time(
  read_nquads("flights.nq", rdf = triplestore)
)
#>    user  system elapsed 
#>  31.607  20.219  51.857
```



```r
system.time(
df <- rdf_query(triplestore, query)
)
#>    user  system elapsed 
#>   5.559   0.028   5.588
```


```r
rdf_free(triplestore)
```

# Building redland with full backend support

Getting full support for the above backend databases through the `redland` R package is not trivial.  The `redland` R package provides bindings to the redland libraries.  Unfortunately, commonly available binary versions of those libraries, such as `librdf0-dev` on Debian, `redland` on Mac OSX brew, and the statically linked versions for Mac and Windows shipping in the R package, do not build those libraries with the optional support for all backends. (NOTE: it is the C library itself which must be built from source with these options, not just the R package source).  Consequently, users must build `librdf` from the original sources, <https://github.com/dajobe/librdf> with all backend linking libraries available, and then also build the `redland` R package from source, to be able to access all of these backends. On a Debian or Ubuntu system this looks like the following:

```bash
apt-get update && apt-get -yq install \
libxml2-dev \
libcurl4-openssl-dev \
libssl-dev \
git \
automake \
libtool \
gtk-doc-tools \
bison \
flex \
libgmp-dev  \
libmhash-dev \
libgcrypt20-dev \
libpcre3-dev \
libv8-dev \
libjq-dev \
libpq-dev \
libdb-dev \
libsqlite3-dev \
libmariadbclient-dev \
librdf-storage-virtuoso \
virtuoso-server \
unixodbc-dev
```

Now we can build `raptor` (parsers), `rasqal` (sparql queries) and `rdflib` from source:

```
git clone git://github.com/dajobe/raptor.git && \
cd raptor && \
./autogen.sh && \
make && \
make install && \
cd .. && \
git clone git://github.com/dajobe/rasqal.git && \
cd rasqal && \
./autogen.sh && \
make && \
make install && \
cd .. && \
git clone git://github.com/dajobe/librdf.git && \
cd librdf && \
./autogen.sh && \
make && \
make install
```

A See the [Dockerfile](https://github.com/ropensci/rdflib/tree/master/inst/docker/Dockerfile) in `inst/docker` for an example of this, or simply use the [Rocker-based](https://rocker-project.org) image `ropensci/rdflib`.  

# Testing

`rdflib` uses Circle-CI to test database backends using `docker-compose`.  `docker-compose` pulls down dedicated docker containers for `postgres` and `mariadb`, along with the `ropensci/rdflib` container, which includes a version of `redland` compiled with support for all five major backend storage systems.  See the [Dockerfile](https://github.com/ropensci/rdflib/tree/master/inst/docker/Dockerfile) in `inst/docker` and associated [docker-compose.yml]((https://github.com/ropensci/rdflib/tree/master/docker-compose.yml) used in testing for an example of this configuration.  You can also pull the Docker image `ropensci/rdflib` from Docker Hub for testing with all these libraries installed.     





The badge below (also on the package README) indicates that these dedicated tests are passing.

[![CircleCI](https://circleci.com/gh/ropensci/rdflib.svg?style=svg)](https://circleci.com/gh/ropensci/rdflib)