---
title: "Column formats"
output: rmarkdown::html_vignette
always_allow_html: true
vignette: >
  %\VignetteIndexEntry{Column formats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = (Sys.getenv("IN_PKGDOWN") == "")
)
```

```{r setup}
library(tibble)
```

## Overview

This vignette shows how to decorate columns for custom formatting.
We use the formattable package for demonstration because it already provides vector classes that apply custom formatting to numbers.

```{r}
library(formattable)

tbl <- tibble(x = digits(9:11, 3))
tbl
```

```{r echo = FALSE}
vec_ptype_abbr.formattable <- function(x, ...) {
  "dbl:fmt"
}

pillar_shaft.formattable <- function(x, ...) {
  pillar::new_pillar_shaft_simple(format(x), align = "right")
}
```

The `x` column in the tibble above is a regular numeric vector with a formatting method.
It is always shown with three digits after the decimal point.
This also applies to columns derived from `x`.

```{r}
library(dplyr)
tbl2 <- 
  tbl %>%
  mutate(
    y = x + 1, 
    z = x * x, 
    v = y + z,
    lag = lag(x, default = x[[1]]),
    sin = sin(x),
    mean = mean(v),
    var = var(x)
  )

tbl2
```
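
The claim that the column is still a regular number can be checked directly. The following is a minimal sketch, assuming the formattable package is installed; the object names `x` and `derived` are illustrative only.

```{r}
library(formattable)

x <- digits(9:11, 3)

# The underlying data is still numeric, so it can be used in computations.
is.numeric(x)
as.numeric(x)

# The format only affects how the values are rendered as text.
as.character(format(x))

# A derived vector is rendered with the same format.
derived <- x * x + 1
as.numeric(derived)
as.character(format(derived))
```

The format lives in the attributes of the vector, which is why it travels along with results computed from `x`.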

Summaries also maintain the formatting.

```{r}
tbl2 %>% 
  group_by(lag) %>% 
  summarize(z = mean(z)) %>% 
  ungroup()
```

The same holds for pivoting operations.

```{r}
library(tidyr)

stocks <- 
  expand_grid(id = factor(1:4), year = 2018:2022) %>% 
  mutate(stock = currency(runif(20) * 10000))

stocks %>% 
  pivot_wider(id_cols = id, names_from = year, values_from = stock)
```

For ggplot2 we need to do [some work](https://github.com/tidyverse/ggplot2/pull/4065) to apply the formatting to the scales.

```{r}
library(ggplot2)

# Needs https://github.com/tidyverse/ggplot2/pull/4065 or similar
stocks %>% 
  ggplot(aes(x = year, y = stock, color = id)) +
  geom_line()
```
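
Until the scales pick up the format on their own, one possible workaround is to pass a labelling function to the scale. This is a sketch only and not the approach taken in the linked pull request. The helper `currency_labels()` and the `prices` data are hypothetical names introduced for illustration, and the data is plain numeric rather than a `currency()` column.

```{r}
library(ggplot2)
library(formattable)

# Hypothetical helper, not part of ggplot2 or formattable:
# renders axis breaks with a currency format, leaving missing breaks as NA.
currency_labels <- function(breaks) {
  labels <- rep(NA_character_, length(breaks))
  ok <- !is.na(breaks)
  if (any(ok)) {
    labels[ok] <- as.character(format(currency(breaks[ok], digits = 2)))
  }
  labels
}

# Small deterministic data set for illustration.
prices <- data.frame(
  year = 2018:2022,
  stock = c(1200, 3400.5, 2100, 5600.25, 4300)
)

ggplot(prices, aes(x = year, y = stock)) +
  geom_line() +
  scale_y_continuous(labels = currency_labels)
```

The drawback of this workaround is that the format has to be repeated in the plotting code instead of being taken from the column.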

It pays off to specify formatting very early in the process.
The diagram below shows the principal stages of data analysis and exploration from "R for Data Science".

```{r echo = FALSE}
DiagrammeR::mermaid("r4ds.mmd")
```

The next diagram adds data formats, communication options, and explicit data formatting.
The original r4ds transitions are highlighted in bold.
There are two principal points at which results can be formatted: right before communicating them, or right after importing.

```{r echo = FALSE}
DiagrammeR::mermaid("formats.mmd")
```

Applying formatting early in the process gives the added benefit of showing the data in a useful format during the "Tidy", "Transform", and "Visualize" stages.
For this to be useful, we need to ensure that the formatting options applied early:

- give a good user experience for analysis
    - are easy to set up
    - stay sticky throughout data analysis and exploration
    - support the analyst in asking the right questions about the data
    - convey the critical information at a glance, while making it easy to go into greater detail
- look good for communication
    - are applied in the various communication options
    - support everything necessary to present the data in the desired way
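
To make the "sticky" requirement in the list above concrete, here is a minimal sketch, assuming the formattable package is installed. Arithmetic keeps the format, whereas an explicit conversion with `as.numeric()` discards it. The object names are illustrative.

```{r}
library(formattable)

x <- digits(c(1.5, 2.25), 3)

# Arithmetic keeps the class, so the format sticks.
inherits(x * 2, "formattable")

# Converting to a bare double drops the class and with it the format.
bare <- as.numeric(x)
inherits(bare, "formattable")

# The format then has to be reapplied explicitly.
inherits(digits(bare, 3), "formattable")
```

Any step in a pipeline that returns a bare vector has the same effect, which is why stickiness cannot be taken for granted.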

Ensuring stickiness is difficult, and it is not enough for a dbplyr workflow, where parts of the "Tidy", "Transform", or even "Visualize" stages run on the database.
Often it's possible to derive a rule-based approach for formatting.

```{r}
tbl3 <- 
  tibble(id = letters[1:3], x = 9:11) %>% 
  mutate(
    y = x + 1, 
    z = x * x, 
    v = y + z,
    lag = lag(x, default = x[[1]]),
    sin = sin(x),
    mean = mean(v),
    var = var(x)
  )

tbl3

tbl3 %>% 
  mutate(
    across(where(is.numeric), ~ digits(.x, 3)),
    across(where(~ is.numeric(.x) && mean(.x) > 50), ~ digits(.x, 1))
  )
```

These rules can be stored in `quos()`:

```{r}
rules <- quos(
  across(where(is.numeric), ~ digits(.x, 3)),
  across(where(~ is.numeric(.x) && mean(.x) > 50), ~ digits(.x, 1))
)

tbl3 %>% 
  mutate(!!!rules)
```

This approach has a few drawbacks:

- The syntax is repetitive and not very intuitive
- Rules that match multiple columns must be given in reverse order due to the way `mutate()` works, and are executed multiple times
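
The ordering drawback can be seen in a small sketch, assuming dplyr and formattable are installed. The data and object names are illustrative. Because `mutate()` evaluates its expressions in sequence, a later rule overwrites the result of an earlier one for every column that both rules match.

```{r}
library(tibble)
library(dplyr)
library(formattable)

small <- tibble(a = c(1, 2, 3), b = c(100, 200, 300))

# General rule first, specific rule last: the specific rule wins for `b`.
general_first <- small %>%
  mutate(
    across(where(is.numeric), ~ digits(.x, 3)),
    across(where(~ is.numeric(.x) && mean(.x) > 50), ~ digits(.x, 1))
  )

# Specific rule first: the general rule runs afterwards and overrides it.
specific_first <- small %>%
  mutate(
    across(where(~ is.numeric(.x) && mean(.x) > 50), ~ digits(.x, 1)),
    across(where(is.numeric), ~ digits(.x, 3))
  )

as.character(format(general_first$b))
as.character(format(specific_first$b))
```

In both variants the column `b` is formatted twice, and only the order of the rules decides which format survives.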

What would a good API for rule-based formatting look like?