1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
|
---
title: "The Design Philosophy of Functions in sjmisc"
author: "Daniel Lüdecke"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{The Design Philosophy of Functions in sjmisc}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
if (!requireNamespace("dplyr", quietly = TRUE)) {
knitr::opts_chunk$set(eval = FALSE)
}
options(max.print = 1000)
suppressPackageStartupMessages(library(sjmisc))
```
Basically, this package complements the _dplyr_ package in that _sjmisc_ takes over data transformation tasks on variables, like recoding, dichotomizing or grouping variables, setting and replacing missing values, etc. The data transformation functions also support labelled data.
# The design of data transformation functions
The design of data transformation functions in this package follows, where appropriate, the _tidyverse-approach_, with the first argument of a function always being the data (either a data frame or vector), followed by variable names that should be processed by the function. If no variables are specified as argument, the function applies to the complete data that was indicated as first function argument.
## The data-argument
A major difference to dplyr-functions like `select()` or `filter()` is that the data-argument (the first argument of each function), may either be a _data frame_ or a _vector_. The returned object for each function _equals the type of the data-argument_:
* If the data-argument is a vector, the function returns a vector.
* If the data-argument is a data frame, the function returns a data frame.
```{r}
library(sjmisc)
data(efc)
# returns a vector
x <- rec(efc$e42dep, rec = "1,2=1; 3,4=2")
str(x)
# returns a data frame
rec(efc, e42dep, rec = "1,2=1; 3,4=2", append = FALSE) %>% head()
```
This design-choice is mainly due to compatibility- and convenience-reasons. It does not affect the usual "tidyverse-workflow" or when using pipe-chains.
## The ...-ellipses-argument
The selection of variables specified in the `...`-ellipses-argument is powered by dplyr's `select()` and tidyselect's `select_helpers()`. This means, you can use existing functions like `:` to select a range of variables, or also use tidyselect's `select_helpers`, like `contains()` or `one_of()`.
```{r echo=FALSE, message=FALSE}
library(dplyr)
```
```{r collapse=TRUE}
# select all variables with "cop" in their names, and also
# the range from c161sex to c175empl
rec(
efc, contains("cop"), c161sex:c175empl,
rec = "0,1=0; else=1",
append = FALSE
) %>% head()
# center all variables with "age" in name, variable c12hour
# and all variables from column 19 to 21
center(efc, c12hour, contains("age"), 19:21, append = FALSE) %>% head()
```
## The function-types
There are two types of function designs:
### coercing/converting functions
Functions like `to_factor()` or `to_label()`, which convert variables into other types or add additional information like variable or value labels as attribute, typically _return the complete data frame_ that was given as first argument _without any new variables_. The variables specified in the `...`-ellipses argument are converted (overwritten), all other variables remain unchanged.
```{r}
x <- efc[, 3:5]
x %>% str()
to_factor(x, e42dep, e16sex) %>% str()
```
### transformation/recoding functions
Functions like `rec()` or `dicho()`, which transform or recode variables, by default add _the transformed or recoded variables_ to the data frame, so they return the new variables _and_ the original data as combined data frame. To return _only the transformed and recoded variables_ specified in the `...`-ellipses argument, use argument `append = FALSE`.
```{r}
# complete data, including new columns
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = TRUE) %>% head()
# only new columns
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = FALSE) %>% head()
```
These variables usually get a suffix, so you can bind these variables as new columns to a data frame, for instance with `add_columns()`. The function `add_columns()` is useful if you want to bind/add columns within a pipe-chain _to the end_ of a data frame.
```{r}
efc %>%
rec(c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = FALSE) %>%
add_columns(efc) %>%
head()
```
If `append = TRUE` and `suffix = ""`, recoded variables will replace (overwrite) existing variables.
```{r}
# complete data, existing columns c82cop1 and c83cop2 are replaced
rec(efc, c82cop1, c83cop2, rec = "1,2=0; 3:4=2", append = TRUE, suffix = "") %>% head()
```
## sjmisc and dplyr
The functions of **sjmisc** are designed to work together seamlessly with other packages from the tidyverse, like **dplyr**. For instance, you can use the functions from **sjmisc** both within a pipe-worklflow to manipulate data frames, or to create new variables with `mutate()`:
```{r}
efc %>%
select(c82cop1, c83cop2) %>%
rec(rec = "1,2=0; 3:4=2") %>%
head()
efc %>%
select(c82cop1, c83cop2) %>%
mutate(
c82cop1_dicho = rec(c82cop1, rec = "1,2=0; 3:4=2"),
c83cop2_dicho = rec(c83cop2, rec = "1,2=0; 3:4=2")
) %>%
head()
```
This makes it easy to adapt the **sjmisc** functions to your own workflow.
|