File: comparedf.Rmd

package info (click to toggle)
r-cran-arsenal 3.6.3-2
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 2,788 kB
  • sloc: sh: 18; makefile: 5
file content (389 lines) | stat: -rw-r--r-- 13,700 bytes parent folder | download | duplicates (6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
---
title: "The comparedf function"
author: "Ethan Heinzen, Ryan Lennon, Andrew Hanson"
output:
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 3
vignette: |
  %\VignetteIndexEntry{The comparedf function}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r include = FALSE}
knitr::opts_chunk$set(eval = TRUE, message = FALSE, results = 'asis', comment='')
options(width = 120)
```

# Introduction

The `comparedf()` function can be used to
determine and report differences between two data.frames. It was written in the spirit of replacing `PROC COMPARE`
from SAS.

```{r results = 'asis'}
library(arsenal)
```

Why "comparedf"? We originally called this function `compare.data.frame()`, using `testthat::compare()` as our S3 generic,
but that ended up getting us in trouble because of conflicting object structures. Why this didn't occur to us at the time remains a mystery.
To replace it, we brainstormed several ideas (`comparedf()`, `dfcompare()`, `collate()`, `comparison()`) but settled on the former for three
reasons:

1. There were no other objects with that generic or class (see `testthat::compare()` and `compare::compare()`).

2. It is mnemonically easy to remember (we "compare data.frames", not "data.frames compare").

3. It tab auto-completes from the original "compare".

# Basic examples

We first build two similar data.frames to compare.

```{r}
df1 <- data.frame(id = paste0("person", 1:3),
                  a = c("a", "b", "c"),
                  b = c(1, 3, 4),
                  c = c("f", "e", "d"),
                  row.names = paste0("rn", 1:3),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = paste0("person", 3:1),
                  a = c("c", "b", "a"),
                  b = c(1, 3, 4),
                  d = paste0("rn", 1:3),
                  row.names = paste0("rn", c(1,3,2)),
                  stringsAsFactors = FALSE)
```

To compare these datasets, simply pass them to the `comparedf()` function:

```{r results='markup'}
comparedf(df1, df2)
```

Use `summary()` to get a more detailed summary

```{r}
summary(comparedf(df1, df2))
```

By default, the datasets are compared row-by-row. To change this, use the `by=` or `by.x=` and `by.y=` arguments:

```{r}
summary(comparedf(df1, df2, by = "id"))
```

# A larger example

Let's muck up the `mockstudy` data.

```{r}
data(mockstudy)
mockstudy2 <- muck_up_mockstudy()
```

We've changed row order, so let's compare by the case ID:

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case"))
```

# Column name comparison options

It is possible to change which column names are considered "the same variable".

## Ignoring case

For example, to ignore case in variable names (so that `Arm` and `arm` are considered the same), pass `tol.vars = "case"`.

You can do this using `comparedf.control()`

```{r eval = FALSE}
summary(comparedf(mockstudy, mockstudy2, by = "case", control = comparedf.control(tol.vars = "case")))
```

or pass it through the `...` arguments.

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case", tol.vars = "case"))
```

## Treating dots and underscores the same (equivalence classes)

It is possible to treat certain characters or sets of characters as the same by
passing a character vector of equivalence classes to the `tol.vars=` argument.

In short, each string in the vector is split into single characters, and the resulting set of characters is
replaced by the first character in the string. For example, passing `c("._")` would replace all
underscores with dots in the column names of both datasets. Similarly, passing `c("aA", "BbCc")` would
replace all instances of `"A"` with `"a"` and all instances of `"b"`, `"C"`, or `"c"` with `"B"`.
This is one way to ignore case for certain letters. Otherwise, it's possible to combine
the equivalence classes with ignoring case, by passing (e.g.) `c("._", "case")`.

Passing a single character as an element this vector will replace that character with the empty string.
For example, passing c(" ", ".") would remove all spaces and dots from the column names.

For mockstudy, let's treat dots, underscores, and spaces as the same, and ignore case:

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case") # dots=underscores=spaces, ignore case
))
```

## Manually specifying columns to match together

If you pass a named vector to the `tol.vars=` argument, `comparedf()` will line up the names of that
vector to the column names of `x` and the values of that vector to the column names of `y`. In this
way, you can manually specify which non-identically-named columns to compare.

For mockstudy, let's specify our variables manually in this way:

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c(arm = "Arm", fu.stat = "fu stat", fu.time = "fu_time")
))
```

# Column comparison options

## Logical tolerance

Use the `tol.logical=` argument to change how logicals are compared. By default, they're expected to be equal to each other.

## Numeric tolerance

To allow numeric differences of a certain tolerance, use the `tol.num=` and `tol.num.val=` options.
`tol.num.val=` determines the maximum (unsigned) difference tolerated if `tol.num="absolute"` (default),
and determines the maximum (unsigned) percent difference tolerated if `tol.num="percent"`.

Also note the option `int.as.num=`, which determines whether integers and numerics should be compared despite
their class difference. If `TRUE`, the integers are coerced to numeric.
Note that `mockstudy$ast` is integer, while `mockstudy2$ast` is numeric:

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE            # compare integers and numerics
))
```

Suppose a tolerance of up to 10 is allowed for `ast`:

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10             # allow absolute differences <= 10
))
```

## Factor tolerance

By default, factors are compared to each other based on both the labels and the underlying
numeric levels. Set `tol.factor="levels"` to match only the numeric levels, or set
`tol.factor="labels"` to match only the labels.

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels"        # match only factor labels
))
```

Also note the option `factor.as.char=`, which determines whether factors and characters should be compared despite
their class difference. If `TRUE`, the factors are coerced to characters.
Note that `mockstudy$race` is a character, while `mockstudy2$race` is a factor:

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels",       # match only factor labels
                factor.as.char = TRUE        # compare factors and characters
))
```

## Character tolerance

Use the `tol.char=` argument to change how character variables are compared.
By default, they are compared as-is, but they can be compared after ignoring case
or trimming whitespace or both.

```{r}
summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels",       # match only factor labels
                factor.as.char = TRUE,       # compare factors and characters
                tol.char = "case"            # ignore case in character vectors
))
```

## Date tolerance

Use the `tol.date=` argument to change how dates are compared. By default, they're expected to be equal to each other.

## Other data type tolerances

Use the `tol.other=` argument to change how other objects are compared. By default,
they're expected to be `identical()`.

## Specifying tolerances for each variable

You can also provide a list of tolerance functions to `comparedf()`:

```{r eval=FALSE}
comparedf.control(tol.char = list(
  "none",      # the default
  x1 = "case", # be case-insensitive for the variable "x1"
  x2 = function(x, y) tol.NA(x, y, x != y | y == "NA") # a custom-defined tolerance
))
```

## User-defined tolerance functions

### Details

The `comparedf.control()` function accepts functions for any of the tolerance arguments in addition
to the short-hand character strings. This allows the user to create custom tolerance functions to suit his/her needs.

Any custom tolerance function must accept two vectors as arguments and return a logical vector of the same length. The `TRUE`s in
the results should correspond to elements which are deemed "different". Note that the numeric and date tolerance functions should also
include a third argument for tolerance size (even if it's not used).

CAUTION: the results should not include NAs, since the logical vector is used to subset the input data.frames. The `tol.NA()` function
is useful for considering any NAs in the two vectors (but not both) as differences, in addition to other criteria.

The `tol.NA()` function is used in all default tolerance functions to help handle NAs.

### Example 1

Suppose we want to ignore any dates which are later in the second dataset than the first. We define a custom tolerance function.

```{r results = 'markup'}
my.tol <- function(x, y, tol)
{
  tol.NA(x, y, x > y)
}

date.df1 <- data.frame(dt = as.Date(c("2017-09-07", "2017-08-08", "2017-07-09", NA)))
date.df2 <- data.frame(dt = as.Date(c("2017-10-01", "2017-08-08", "2017-07-10", "2017-01-01")))
n.diffs(comparedf(date.df1, date.df2)) # default finds any differences
n.diffs(comparedf(date.df1, date.df2, tol.date = my.tol)) # our function identifies only the NA as different...
n.diffs(comparedf(date.df2, date.df1, tol.date = my.tol)) # ... until we change the argument order

```

### Example 2

(Continuing our mockstudy example)

Suppose we're okay with NAs getting replaced by -9.

```{r}
tol.minus9 <- function(x, y, tol)
{
  idx1 <- is.na(x) & !is.na(y) & y == -9
  idx2 <- tol.num.absolute(x, y, tol) # find other absolute differences
  return(!idx1 & idx2)
}

summary(comparedf(mockstudy, mockstudy2, by = "case",
                tol.vars = c("._ ", "case"), # dots=underscores=spaces, ignore case
                int.as.num = TRUE,           # compare integers and numerics
                tol.num.val = 10,            # allow absolute differences <= 10
                tol.factor = "labels",       # match only factor labels
                factor.as.char = TRUE,       # compare factors and characters
                tol.char = "case",           # ignore case in character vectors
                tol.num = tol.minus9         # ignore NA -> -9 changes
))
```

# Extract Differences

Differences can be easily extracted using the `diffs()` function. If you only want to determine how many differences
were found, use the `n.diffs()` function.

```{r results = 'markup'}
cmp <- comparedf(mockstudy, mockstudy2, by = "case", tol.vars = c("._ ", "case"), int.as.num = TRUE)
n.diffs(cmp)
head(diffs(cmp))
```

Differences can also be summarized by variable.

```{r results = 'markup'}
diffs(cmp, by.var = TRUE)
```

To report differences from only a few variables, one can pass a list of variable names to `diffs()`.

```{r results = 'markup'}
diffs(cmp, vars = c("ps", "ast"), by.var = TRUE)
diffs(cmp, vars = c("ps", "ast"))
```

# Appendix

## Stucture of the Object

(This section is just as much for my use as for yours!)

```{r}
obj <- comparedf(mockstudy, mockstudy2, by = "case")
```

There are two main objects in the `"comparedf"` object, each with its own print method.

The `frame.summary` contains:

- the substituted-deparsed arguments

- information about the number of columns and rows in each dataset

- the by-variables for each dataset (which may not be the same)

- the attributes for each dataset (which get counted in the print method)

- a data.frame of by-variables and row numbers of observations not shared between datasets

- the number of shared observations

```{r results='markup'}
print(obj$frame.summary)
```

The `vars.summary` contains:

- variable name, column number, and class vector (with possibly more than one element) for each x and y.
  These are all `NA` if there isn't a match in both datasets.
  
- values, a list-column of the text string `"by-variable"` for the by-variables,
  `NULL` for columns that aren't compared, or a data.frame containing:

    - The by-variables for differences found
    
    - The values which are different for x and y
    
    - The row numbers for differences found
    
- attrs, a list-column of `NULL` if there are no attributes, or
  a data.frame containing:
  
    - The name of the attributes
    
    - The attributes for x and y, set to `NA` if non-existant
    
    - The actual attributes (if `show.attr=TRUE`).

```{r results='markup'}
print(obj$vars.summary)
```