File: translation.Rmd

package info (click to toggle)
r-cran-dtplyr 1.3.1-1
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 860 kB
sloc: sh: 12; makefile: 2
file content (287 lines) | stat: -rw-r--r-- 8,101 bytes
parent folder | download | duplicates (2)
---
title: "Translation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{translation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

This vignette shows the details of how dtplyr translates dplyr expressions into the equivalent [data.table](http://r-datatable.com/) code. If you see places where you think I could generate better data.table code, please [let me know](https://github.com/tidyverse/dtplyr/issues)!

This document assumes that you're familiar with the basics of data.table; if you're not, I recommend starting at `vignette("datatable-intro.html")`.

```{r setup, message = FALSE}
library(dtplyr)
library(data.table)
library(dplyr)
```

## The basics

To get started, I'll create a simple lazy table with `lazy_dt()`:

```{r}
df <- data.frame(a = 1:5, b = 1:5, c = 1:5, d = 1:5)
dt <- lazy_dt(df)
```

The actual data doesn't matter here since we're just looking at the translation.

When you print a lazy frame, it tells you that it's a local data table with four rows. It also prints the call that dtplyr will evaluate when we execute the lazy table. In this case it's very simple: 

```{r}
dt
```

If we just want to see the generated code, you can use `show_query()`. I'll use that a lot in this vignette.

```{r}
dt %>% show_query()
```

## Simple verbs

Many dplyr verbs have a straightforward translation to either the `i` or `j` component of `[.data.table`. 

### `filter()` and `arrange()`

`filter()` and `arrange()` become elements of `i`:

```{r}
dt %>% arrange(a, b, c) %>% show_query()

dt %>% filter(b == c) %>% show_query()
dt %>% filter(b == c, c == d) %>% show_query()
```

### `select()`, `summarise()`, `transmute()`

`select()`, `summarise()` and `transmute()` all become elements of `j`:

```{r}
dt %>% select(a:b) %>% show_query()
dt %>% summarise(a = mean(a)) %>% show_query()
dt %>% transmute(a2 = a * 2) %>% show_query()
```

`mutate()` also uses the `j` component with data.table's special `:=` operator:

```{r}
dt %>% mutate(a2 = a * 2, b2 = b * 2) %>% show_query()
```

Note that dplyr will not copy the input data by default, see below for more details.

`mutate()` allows to refer to variables that you just created using an "extended `j`" expression:

```{r}
dt %>% mutate(a2 = a * 2, b2 = b * 2, a4 = a2 * 2) %>% show_query()
```

`transmute()` works similarly:

```{r}
dt %>% transmute(a2 = a * 2, b2 = b * 2, a4 = a2 * 2) %>% show_query()
```

## Other calls

Other verbs require calls to other functions:

### `rename()`

`rename()` uses `setnames()`:

```{r}
dt %>% rename(x = a, y = b) %>% show_query()
```

### `distinct()`

`distinct()` uses `unique()`:

```{r}
dt %>% distinct() %>% show_query()
dt %>% distinct(a, b) %>% show_query()
dt %>% distinct(a, b, .keep_all = TRUE) %>% show_query()
```

`distinct()` on a computed column uses an intermediate mutate:

```{r}
dt %>% distinct(c = a + b) %>% show_query()
dt %>% distinct(c = a + b, .keep_all = TRUE) %>% show_query()
```

### Joins

Most joins use the `[.data.table` equivalent:

```{r}
dt2 <- lazy_dt(data.frame(a = 1))

dt %>% inner_join(dt2, by = "a") %>% show_query()
dt %>% right_join(dt2, by = "a") %>% show_query()
dt %>% left_join(dt2, by = "a") %>% show_query()
dt %>% anti_join(dt2, by = "a") %>% show_query()
```

But `full_join()` uses `merge()`

```{r}
dt %>% full_join(dt2, by = "a") %>% show_query()
```

In some case extra calls to `data.table::setcolorder()` and `data.table::setnames()`
are required to ensure correct column order and names in:

```{r}
dt3 <- lazy_dt(data.frame(b = 1, a = 1))

dt %>% left_join(dt3, by = "a") %>% show_query()
dt %>% full_join(dt3, by = "b") %>% show_query()
```

Semi-joins are little more complex:

```{r}
dt %>% semi_join(dt2, by = "a") %>% show_query()
```

### Set operations

Set operations use the fast data.table alternatives:

```{r}
dt %>% intersect(dt2) %>% show_query()
dt %>% setdiff(dt2) %>% show_query()
dt %>% union(dt2) %>% show_query()
```

## Grouping

Just like in dplyr, `group_by()` doesn't do anything by itself, but instead modifies the operation of downstream verbs. This generally just involves using the `keyby` argument:

```{r}
dt %>% group_by(a) %>% summarise(b = mean(b)) %>% show_query()
```

You may use `by` instead of `keyby` if you set `arrange = FALSE`:

```{r}
dt %>% group_by(a, arrange = FALSE) %>% summarise(b = mean(b)) %>% show_query()
```

Often, there won't be too much of a difference between these, but for larger grouped operations, the overhead of reordering data may become significant. In these situations, using `arrange = FALSE` becomes preferable.

The primary exception is grouped `filter()`, which requires the use of `.SD`:

```{r}
dt %>% group_by(a) %>% filter(b < mean(b)) %>% show_query()
```

## Combinations

dtplyr tries to generate generate data.table code as close as possible to what you'd write by hand, as this tends to unlock data.table's tremendous speed. For example, if you `filter()` and then `select()`, dtplyr generates a single `[`:

```{r}
dt %>% 
  filter(a == 1) %>% 
  select(-a) %>% 
  show_query()
```

And similarly when combining filtering and summarising:

```{r}
dt %>% 
  group_by(a) %>% 
  filter(b < mean(b)) %>% 
  summarise(c = max(c)) %>% 
  show_query()
```

This is particularly nice when joining two tables together because you can select variables after you have joined and data.table will only carry those into the join:

```{r}
dt3 <- lazy_dt(data.frame(x = 1, y = 2))
dt4 <- lazy_dt(data.frame(x = 1, a = 2, b = 3, c = 4, d = 5, e = 7))

dt3 %>% 
  left_join(dt4) %>% 
  select(x, a:c) %>% 
  show_query()
```

Note, however, that `select()`ing and then `filter()`ing must generate two separate calls to `[`, because data.table evaluates `i` before `j`.

```{r}
dt %>% 
  select(X = a, Y = b) %>% 
  filter(X == 1) %>% 
  show_query()
```

Similarly, a `filter()` and `mutate()` can't be combined because `dt[a == 1, .(b2 := b * 2)]` would modify the selected rows in place:

```{r}
dt %>% 
  filter(a == 1) %>% 
  mutate(b2 = b * 2) %>% 
  show_query()
```

## Copies

By default dtplyr avoids mutating the input data, automatically creating a `copy()` if needed:

```{r}
dt %>% mutate(a2 = a * 2, b2 = b * 2) %>% show_query()
```

Note that dtplyr does its best to avoid needless copies, so it won't explicitly copy if there's already an implicit copy produced by `[`, `head()`, `merge()` or similar:

```{r}
dt %>% 
  filter(x == 1) %>% 
  mutate(a2 = a * 2, b2 = b * 2) %>% 
  show_query()
```

You can choose to opt out of this copy, and take advantage of data.table's reference semantics (see `vignette("datatable-reference-semantics")` for more details). Do this by setting `immutable = FALSE` on construction:

```{r}
dt2 <- data.table(a = 1:10)

dt_inplace <- lazy_dt(dt2, immutable = FALSE)
dt_inplace %>% mutate(a2 = a * 2, b2 = b * 2) %>% show_query()
```

## Performance

There are two components to the performance of dtplyr: how long it takes to generate the translation, and how well the translation performs. Given my explorations so far, I'm reasonably confident that we're generating high-quality data.table code, so most of the cost should be in the translation itself. 

The following code briefly explores the performance of a few different translations. A significant amount of work is done by the dplyr verbs, so we benchmark the whole process. 

```{r}
bench::mark(
  filter = dt %>% filter(a == b, c == d),
  mutate = dt %>% mutate(a = a * 2, a4 = a2 * 2, a8 = a4 * 2) %>% show_query(),
  summarise = dt %>% group_by(a) %>% summarise(b = mean(b)) %>% show_query(),
  check = FALSE
)[1:6]
```

These translations all take less than a millisecond, suggesting that the performance overhead of dtplyr should be negligible for realistic data sizes. Note that dtplyr run-time scales with the complexity of the pipeline, not the size of the data, so these timings should apply regardless of the size of the underlying data[^copy].

[^copy]: Unless a copy is performed.