File: blog_v2.Rmd

package info (click to toggle)
r-cran-skimr 2.1.5%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm, forky, sid, trixie
  • size: 2,188 kB
  • sloc: sh: 13; makefile: 2
file content (342 lines) | stat: -rw-r--r-- 13,746 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
---
slug: skimrv2
title: "(Re)introducing `skimr` v2"
subtitle: "A year in the life of an open source R project"
package_version: 2.0
authors:
  - name: Michael Quinn
    url: https://github.com/michaelquinn32
  - name: Elin Waring
    url: https://github.com/elinw
date: "10/09/2019"
categories: blog
tags:
  - software peer review
  - R
  - packages
  - community
  - skimr
---

[Theme song: *PSA* by Jay-Z](https://www.youtube.com/watch?v=-LzdKH1naok)

We announced the testing version of `skimr` v2 on
[June 19, 2018](https://github.com/ropensci/skimr/issues/341). After more than
a year of (admittedly intermittent) work, we're thrilled to be able to say that
the package is ready to go to CRAN. So, what happened over the last year? And
why are we so excited for V2?

## Setting the stage

Before we can talk about the last year of `skimr` development, we need to lay
out the timeline that got us to this point. For those deeply enmeshed in `skimr`
lore, all [dozens](https://imgur.com/gallery/R1fdEt3) of you, bear with.

`skimr` was originally an [rOpenSci
unconf17](https://ropensci.org/blog/2017/07/11/skimr/) project, a big
collaboration between eight different participants that resulted in a conceptual
outline of the package and a basic working version. Participating in the unconf
was a truly magical experience, with everyone bringing a tremendous amount of
energy and ideas to the project, and implementation happening over a flurry of
["fancy git commits"](https://twitter.com/AmeliaMN/status/867818976666976256).

About six months later, we released our first version on CRAN. The time between
these two milestones was mostly spent on fleshing out all of the different ideas
that were generated during the unconf (like handling grouped data frames) and
fixing all the bugs we discovered along the way.

Getting the package on CRAN opened the gates for bug reports and feature
requests on [GitHub](https://github.com/ropensci/skimr/issues). About the same
time we pushed our first version to CRAN, Elin got `skimr`'s rOpenSci's package
[peer review](https://github.com/ropensci/software-review/issues/175) started
(thank you Jennifer and Jim!), opening another incredibly useful channel for
collecting feedback on the package. All of these new ideas and suggestions gave
us the opportunity to really push `skimr` to the next level, but doing that
would require rethinking the package, from the ground up.

A month after finishing the peer review (and six months after the process
began), we announced v2. Over the first phase of `skimr`'s life, we accumulated
700 commits, two release, 400 GitHub stars, 95 percent code coverage and a
lifetime's worth of [unicode rendering
bugs](https://github.com/ropensci/skimr#support-for-spark-histograms)!

Just kidding! We love our little histograms, even when they don't love us back!
For those of you that might have never seen `skimr`, using the package typically
boils down to a single function call:

```{r render = knitr::normal_print, message=FALSE}
library(skimr)
library(dplyr)
options(width = 90)

skim(iris)
```

## Getting it right

Under normal circumstances (i.e. not during a hackathon), most software
engineering projects begin with a design phase and series of increasingly
detailed design docs. `skimr` is only a few hundred lines of code, which means
"increasingly detailed design docs" translates to one doc. But we did actually
write it! [It's
here](https://docs.google.com/document/d/18lBStDZzd1rJq08O-4Sw2qHhuHEZ79QX4sBkeyzWNFY/edit#heading=h.5x0d5h95i329).
And it still goes a good job of laying out some of the big ideas we were
interested in taking on for v2.

* Eliminating frictions that resulted from differences in the way we stored data
  vs how it was displayed to users
* Getting away from using a global environment to configure `skimr`
* Making it easier for others to extend `skimr`
* Create more useful ways to use `skimr`

## Better internal data structures

In v1, `skimr` stored all of its data in a "long format", data frame. Although
hidden from the user by its print methods, this format would appear any time
you'd try do something with the results of a `skim()` call. It looked something
like this:

```{r eval = FALSE}
skim(mtcars) %>% dplyr::filter(stat=="hist")
```
```
# A tibble: 11 x 6
   variable type    stat  level value formatted
   <chr>    <chr>   <chr> <chr> <dbl> <chr>
 1 mpg      numeric hist  .all     NA ▃▇▇▇▃▂▂▂
 2 cyl      numeric hist  .all     NA ▆▁▁▃▁▁▁▇
 3 disp     numeric hist  .all     NA ▇▆▁▂▅▃▁▂
 4 hp       numeric hist  .all     NA ▃▇▃▅▂▃▁▁
 5 drat     numeric hist  .all     NA ▃▇▁▅▇▂▁▁
 6 wt       numeric hist  .all     NA ▃▃▃▇▆▁▁▂
 7 qsec     numeric hist  .all     NA ▃▂▇▆▃▃▁▁
 8 vs       numeric hist  .all     NA ▇▁▁▁▁▁▁▆
 9 am       numeric hist  .all     NA ▇▁▁▁▁▁▁▆
10 gear     numeric hist  .all     NA ▇▁▁▆▁▁▁▂
11 carb     numeric hist  .all     NA ▆▇▂▇▁▁▁▁
```

Big ups to anyone who looked at the rendered output and saw that this was how
you actually filtered the results. Hopefully there are even better applications
of your near-telepathic abilities.

Now, working with `skimr` is a bit more sane.

```{r render = knitr::normal_print}
skimmed <- iris %>%
  skim() %>%
  dplyr::filter(numeric.sd > 1)

skimmed
```

And

```{r render = knitr::normal_print}
dplyr::glimpse(skimmed)
```

It's still not perfect, as you need to rely on a *pseudo-namespace* to refer to
the column that you want. But this is unfortunately a necessary trade-off. As
the Rstats Bible, errr Hadley Wickham's *Advanced R*, states, all elements of
[an atomic vector must have the same
type](https://adv-r.hadley.nz/vectors-chap.html). This normally isn't something
that you have to think too much about, that is until you try to combine the
means of all your `Date` columns with the means of your `numeric` columns and
everything comes out utterly garbled. So instead of that basket of laughs, we
prefix columns names by their data type.

There's a couple of other nuances here:

* The data frame `skim()` produces always starts off with some metadata columns
* Functions that always produce the same, regardless of input type, can
  be treated as `base_skimmers` and don't need a namespace

### Manipulating internal data

A better representation of internal data comes with better tools for reshaping
the data and getting it for other contexts. A common request in v1 was tooling
to handle the `skimr` subtables separately. We now do this with `partition()`.
It replaces the v1 function `skim_to_list()`.

```{r render = knitr::normal_print}
partition(skimmed)
```

You can undo a call to `partition()` with `bind()`, which joins the subtables
into the original `skim_df` object and properly accounts for metadata. You
can skip a step with the function `yank()`, which calls partition and pulls
out a particular subtable


```{r render = knitr::normal_print}
yank(skimmed, "numeric")
```

Last, with support something close to the older format with the `to_long()`
function. This can be added for something close to backwards compatibility.
Being realistic on open source sustainability means that we are not able to
support 100% backward compatibility in v2 even with new functions.  Meanwhile
you can keep using v1 if you are happy with it.  However,
because `skimr`'s dependencies are under ongoing development, sooner or later
skimr v1 will no longer work with updates to them.

### Working with dplyr

Using `skimr` in a `dplyr` pipeline was part of the original package design, and
we've needed to devote some extra love to making sure that everything is as
seamless as possible. Part of this is due to the object produce by `skim()`,
which we call `skim_df`. It's a little weird in that it needs both metadata and
columns in the underlying data frame.

In practice, this means that you can coerce it into a different type through
normal `dplyr` operations. Here's one:

```{r render = knitr::normal_print}
select(skimmed, numeric.mean)
```

To get around this, we've added some helper functions and methods. The more
`skimr`-like replacement for `select()` is `focus()`, which preserves
metadata columns.

```{r render = knitr::normal_print}
focus(skimmed, numeric.mean)
```

## Configuring and extending skimr

Most of `skimr`'s magic, to [steal a
term](https://resources.rstudio.com/rstudio-conf-2019/our-colour-of-magic-the-open-sourcery-of-fantastic-r-packages),
comes from the fact that you can do
most everything with one function. But believe it or not, there's actually a bit
more to the package.

One big one is customization. We like the `skimr` defaults, but that doesn't
guarantee you will. So what if you want to do something different, we have
a function factory for that!

```{r render = knitr::normal_print}
my_skim <- skim_with(numeric = sfl(iqr = IQR, p25 = NULL, p75 = NULL))
my_skim(faithful)
```

Those of you familiar with customizing `skim()` in v1 will notice a couple
differences:

* we now has an object called `sfl()` for managing `skimr` function lists; more
  below
* instead of setting global options, we now have a *function factory*

Yes! A function factory. `skim_with()` gives us a new function each time
we call it, and the returned function is configured by the arguments in
`skim_with()`. This works the same way as `ecdf()` in the `stats` package or
`colorRamp` in `grDevices`. Creating new functions has a few advantages over
the previous approach.

* you can export a `skim()` function in a package or create it in a `.Rprofile`
* you avoid a bunch of potential side effects from setting options with
  `skim_with()`

The other big change is how we now handle different data types. Although many
will never see it, a key piece of `skimr` customization comes from the
`get_skimmers()` generic. It's used to detect different column types in your
data and set the appropriate summary functions for that type. It's also
designed to work with `sfl()`. Here's an example from the "Supporting additional
objects" vignette. Here, we'll create some skimmers for
[`sf`](https://cran.r-project.org/web/packages/sf/index.html) data types:

```{r eval = FALSE}
get_skimmers.sfc_POINT <- function(column) {
  sfl(
    skim_type = "sfc_POINT",
    n_unique = n_unique,
    valid = ~ sum(sf::st_is_valid(.))
  )
}
```

While it was required in `skim_with()`, users must provide a `skim_type` value
when creating new methods. With that, you can export this method in a new
package (be sure to import the generic), and the new default skimmer is added
when you load the package.

```{r, eval = FALSE}
get_default_skimmer_names()
```
```
...
$sfc_POINT
[1] "missing"  "complete" "n"        "n_unique" "valid"
...
```

Even if you don't go the full route of supporting a new data type, creating a
couple of `skimr` function lists has other benefits. For example, you can add
some to your `.Rprofile` as a way to quickly configure `skimr` interactively.

```{r eval = FALSE}
sfc_point_sfl <- sfl(
  n_unique = n_unique,
  valid = ~ sum(sf::st_is_valid(.))
)

my_skimmer <- skim_with(sfc_POINT = sfc_point_sfl)
```

## Using skimr in other contexts

In `skimr` v1, we developed some slightly hacky approaches to getting nicer
`skim()` output in RMarkdown docs. These have been removed in favor of the
[actually-supported](https://github.com/yihui/knitr/issues/1493) `knit_print`
API. Now, calling `skim()`, within an RMarkdown doc should produce something
nice by default.

```{r}
skim(chickwts)
```

You get a nice html version of both the summary header and the `skimr` subtables
for each type of data.

In this context, you configure the output the same way you handle other `knitr`
code chunks.

~~~
```{r skimr_digits = 4, skimr_summary = TRUE}
```
~~~

This means that we're dropping direct support for `kable.skim_df()` and
`pander.skim_df()`. But you can still get pretty similar results to these
functions by using the reshaping functions described above to get subtables. You
can also still use `Pander` and other nice rendering packages on an ad hoc basis
as you would for other data frames or tibbles.

We also have a similarly-nice rendered output in
[Jupyter](https://github.com/ropensci/skimr/blob/8c2263c4fd4796af0e5e8f32aafc4980bd58d43a/inst/other_docs/skimr_in_jupyter.ipynb)
and RMarkdown notebooks. In the latter, the summary is separated from the rest
of the output when working interactively. We like it that way, but we'd be happy
to hear what the rest of you think!

## Wait, that took over a year?

Well, we think that's a lot! But to be fair, it wasn't exactly simple to keep up
with `skimr`. Real talk, open source development takes up a lot of time, and
the `skimr` developers have additional important priorities.  Michael's family
added a new baby, and despite swearing up and down otherwise,
he got absolutely nothing not-baby-related done during his paternity leave (take
note new dads!). Elin
ended up taking a much bigger role on at Lehman, really limiting time for any
other work.

Even so, these are just the highlights in the normal ebb and flow of this sort
of work. Since it's no one's real job, it might not always be the first focus.
And that's OK! We've been really lucky to have a group of new users that have
been very patient with this slow development cycle while still providing really
good feedback throughout. Thank you all!

We're really excited about this next step in the `skimr` journey. We've put a
huge amount of work into this new version. Hopefully it shows. And hopefully
it inspires some of you to send more feedback and help us find even more ways
to improve!