---
title: "The Design Philosophy of Functions in sjmisc"
author: "Daniel Lüdecke"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The Design Philosophy of Functions in sjmisc}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE, 
  comment = "#>"
)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  knitr::opts_chunk$set(eval = FALSE)
}
options(max.print = 1000)
suppressPackageStartupMessages(library(sjmisc))
```

The _sjmisc_ package complements _dplyr_: it handles data transformation tasks on variables, such as recoding, dichotomizing or grouping variables, and setting or replacing missing values. The data transformation functions also support labelled data.

# The design of data transformation functions

Where appropriate, the data transformation functions in this package follow the _tidyverse approach_: the first argument of a function is always the data (either a data frame or a vector), followed by the names of the variables that the function should process. If no variables are specified, the function is applied to all of the data passed as first argument.

## The data-argument

A major difference from dplyr functions like `select()` or `filter()` is that the data-argument (the first argument of each function) may be either a _data frame_ or a _vector_. The type of the returned object _matches the type of the data-argument_:

  * If the data-argument is a vector, the function returns a vector.
  * If the data-argument is a data frame, the function returns a data frame.

```{r}
library(sjmisc)
data(efc)

# returns a vector
x <- rec(efc$e42dep, rec = "1,2=1; 3,4=2")
str(x)

# returns a data frame
rec(efc, e42dep, rec = "1,2=1; 3,4=2", append = FALSE) %>% head()
```

This design choice was made mainly for compatibility and convenience. It does not affect the usual "tidyverse workflow" or the use of pipe chains.
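
The type rule can be checked directly. The following is a minimal sketch that uses `dicho()`, another transformation function from this package, on the `efc` data; the object names `v` and `d` are only illustrative.

```{r}
library(sjmisc)
data(efc)

# vector in, vector out: the result is not a data frame
v <- dicho(efc$c12hour)
is.data.frame(v)

# data frame in, data frame out
d <- dicho(efc, c12hour, append = FALSE)
is.data.frame(d)
```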

## The ...-ellipses-argument

The selection of variables specified in the `...`-ellipses-argument is powered by dplyr's `select()` and tidyselect's `select_helpers`. This means you can use `:` to select a range of variables, or select helpers like `contains()` or `one_of()`.

```{r echo=FALSE, message=FALSE}
library(dplyr)
```
```{r collapse=TRUE}
# select all variables with "cop" in their names, and also
# the range from c161sex to c175empl
rec(
  efc, contains("cop"), c161sex:c175empl, 
  rec = "0,1=0; else=1", 
  append = FALSE
) %>% head()

# center all variables with "age" in name, variable c12hour
# and all variables from column 19 to 21
center(efc, c12hour, contains("age"), 19:21, append = FALSE) %>% head()
```
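
Other select helpers can be used in the same way. The sketch below assumes that `starts_with()` is resolved just like `contains()` above; it dichotomizes every variable whose name begins with `"c8"` and lists the resulting column names. The object name `d` is only illustrative.

```{r}
library(sjmisc)
library(dplyr)
data(efc)

# dichotomize all variables whose names start with "c8"
# and return only the new columns
d <- dicho(efc, starts_with("c8"), append = FALSE)
colnames(d)
```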

## The function-types

There are two types of function designs:

### coercing/converting functions

Functions like `to_factor()` or `to_label()`, which convert variables into other types or add information such as variable or value labels as attributes, typically _return the complete data frame_ that was given as first argument _without any new variables_. The variables specified in the `...`-ellipses argument are converted (overwritten); all other variables remain unchanged.

```{r}
x <- efc[, 3:5]

x %>% str()

to_factor(x, e42dep, e16sex) %>% str()
```
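
To make the "no new variables" behaviour explicit, the following sketch compares the column names before and after the conversion and checks which columns are factors afterwards. The object name `y` is only illustrative.

```{r}
library(sjmisc)
data(efc)

x <- efc[, 3:5]
y <- to_factor(x, e42dep, e16sex)

# same columns before and after, no new variables
identical(colnames(x), colnames(y))

# only the selected variables were converted to factors
sapply(y, is.factor)
```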

### transformation/recoding functions

Functions like `rec()` or `dicho()`, which transform or recode variables, by default add _the transformed or recoded variables_ to the data frame, so they return the new variables _and_ the original data as one combined data frame. To return _only the transformed or recoded variables_ specified in the `...`-ellipses argument, set `append = FALSE`.

```{r}
# complete data, including new columns
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = TRUE) %>% head()

# only new columns
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = FALSE) %>% head()
```

These variables usually get a suffix, so you can bind them as new columns to a data frame, for instance with `add_columns()`. This function is useful if you want to add columns _to the end_ of a data frame within a pipe chain.

```{r}
efc %>% 
  rec(c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = FALSE) %>% 
  add_columns(efc) %>% 
  head()
```
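
The resulting column order can be inspected without printing the whole data frame. This sketch repeats the pipe above step by step and looks at the last two column names; the object names `recoded` and `combined` are only illustrative.

```{r}
library(sjmisc)
data(efc)

recoded <- rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = FALSE)
combined <- add_columns(recoded, efc)

# the recoded columns come last, after all columns of efc
tail(colnames(combined), 2)
```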

If `append = TRUE` and `suffix = ""`, recoded variables will replace (overwrite) existing variables.

```{r}
# complete data, existing columns c82cop1 and c83cop2 are replaced
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = TRUE, suffix = "") %>% head()
```
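
The difference between appending and replacing also shows up in the number of columns. The following sketch compares both variants against the original `efc` data; the object names `appended` and `replaced` are only illustrative.

```{r}
library(sjmisc)
data(efc)

# default suffix: two new columns are appended
appended <- rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2")
ncol(appended) - ncol(efc)

# names of the appended columns
setdiff(colnames(appended), colnames(efc))

# empty suffix: existing columns are overwritten, column count is unchanged
replaced <- rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", suffix = "")
ncol(replaced) - ncol(efc)
```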

## sjmisc and dplyr

The functions of **sjmisc** are designed to work together seamlessly with other packages from the tidyverse, like **dplyr**. For instance, you can use the functions from **sjmisc** both within a pipe workflow to manipulate data frames and inside `mutate()` to create new variables:

```{r}
efc %>% 
  select(c82cop1, c83cop2) %>% 
  rec(rec = "1,2=0; 3:4=2") %>% 
  head()

efc %>% 
  select(c82cop1, c83cop2) %>% 
  mutate(
    c82cop1_dicho = rec(c82cop1, rec = "1,2=0; 3:4=2"),
    c83cop2_dicho = rec(c83cop2, rec = "1,2=0; 3:4=2")
  ) %>% 
  head()
```

This makes it easy to adapt the **sjmisc** functions to your own workflow.