1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237
|
---
title: "Recoding Variables"
author: "Daniel Lüdecke"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Recoding Variables}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r echo = FALSE}
knitr::opts_chunk$set(collapse = TRUE, warning = FALSE, comment = "#>")
if (!requireNamespace("dplyr", quietly = TRUE)) {
knitr::opts_chunk$set(eval = FALSE)
}
suppressPackageStartupMessages(library(sjmisc))
```
Data preparation is a common task in research, which usually takes the most amount of time in the analytical process. **sjmisc** is a package with special focus on transformation of _variables_ that fits into the workflow and design-philosophy of the so-called "tidyverse".
Basically, this package complements the **dplyr** package in that **sjmisc** takes over data transformation tasks on variables, like recoding, dichotomizing or grouping variables, setting and replacing missing values, etc. A distinctive feature of **sjmisc** is the support for labelled data, which is especially useful for users who often work with data sets from other statistical software packages like _SPSS_ or _Stata_.
This vignette demonstrate some of the important recoding-functions in **sjmisc**. The examples are based on data from the EUROFAMCARE project, a survey on the situation of family carers of older people in Europe. The sample data set `efc` is part of this package.
```{r message=FALSE}
library(sjmisc)
data(efc)
```
To show the results after recoding variables, the `frq()` function is used to print frequency tables.
## Dichotomization: dividing variables into two groups
`dicho()` dichotomizes variables into "dummy" variables (with 0/1 coding). Dichotomization is either done by median, mean or a specific value (see argument `dich.by`).
Like all recoding-functions in **sjmisc**, `dicho()` returns the complete data frame _including_ the recoded variables, if the first argument is a `data.frame`. If the first argument is a vector, only the recoded variable is returned. See [this vignette](design_philosophy.html) for details about the function-design.
If `dicho()` returns a data frame, the recoded variables have the same name as the original variable, including a suffix `_d`.
```{r}
# age, ranged from 65 to 104, in this output
# grouped to get a shorter table
frq(efc, e17age, auto.grp = 5)
# splitting is done at the median by default:
median(efc$e17age, na.rm = TRUE)
# the recoded variable is now named "e17age_d"
efc <- dicho(efc, e17age)
frq(efc, e17age_d)
```
As `dicho()`, like all recoding-functions, supports [labelled data](https://cran.r-project.org/package=sjlabelled), the variable preserves it variable label (but not the value labels). You can directly define value labels inside the function:
```{r}
x <- dicho(efc$e17age, val.labels = c("young age", "old age"))
frq(x)
```
To split a variable at a different value, use the `dich.by`-argument. The value specified in `dich.by` is _inclusive_, i.e. all values from lowest to and including `dich.by` are recoded into the lower category, while all values _above_ `dich.by` are recoded into the higher category.
```{r}
# split at upper quartile
x <- dicho(
efc$e17age,
dich.by = quantile(efc$e17age, probs = .75, na.rm = TRUE),
val.labels = c("younger three quarters", "oldest quarter")
)
frq(x)
```
Since the distribution of values in a dataset may differ for different subgroups, all recoding-functions also work on grouped data frames. In the following example, first, the age-variable `e17age` is dichotomized at the median. Then, the data is grouped by gender (`c161sex`) and the dichotomization is done for each subgroup, i.e. it once relates to the median age in the subgroup of female, and once to the median age in the subgroup of male family carers.
```{r}
data(efc)
x1 <- dicho(efc$e17age)
x2 <- efc %>%
dplyr::group_by(c161sex) %>%
dicho(e17age) %>%
dplyr::pull(e17age_d)
# median age of total sample
frq(x1)
# median age of total sample, with median-split applied
# to distribution of age by subgroups of gender
frq(x2)
```
## Splitting variables into several groups
`split_var()` recodes numeric variables into equal sized groups, i.e. a variable is cut into a smaller number of groups at specific cut points. The amount of groups depends on the `n`-argument and cuts a variable into `n` quantiles.
Similar to `dicho()`, if the first argument in `split_var()` is a data frame, the complete data frame including the new recoded variable(s), with suffix `_g`, is returned.
```{r}
x <- split_var(efc$e17age, n = 3)
frq(x)
```
Unlike dplyr's `ntile()`, `split_var()` never splits a value into two different categories, i.e. you always get a "clean" separation of original categories. In other words: cases that have identical values in a variable will always be recoded into the same group. The following example demonstrates the differences:
```{r}
x <- dplyr::ntile(efc$neg_c_7, n = 3)
# for some cases, value "10" is recoded into category "1",
# for other cases into category "2". Same is true for value "13"
table(efc$neg_c_7, x)
x <- split_var(efc$neg_c_7, n = 3)
# no separation of cases with identical values.
table(efc$neg_c_7, x)
```
`split_var()`, unlike `ntile()`, does therefor not always return exactly equal-sized groups:
```{r}
x <- dplyr::ntile(efc$neg_c_7, n = 3)
frq(x)
x <- split_var(efc$neg_c_7, n = 3)
frq(x)
```
## Recode variables into equal-ranged groups
With `group_var()`, variables can be grouped into equal ranged categories, i.e. a variable is cut into a smaller number of groups, where each group has the same value range. `group_labels()` creates the related value labels.
The range of the groups is defined in the `size`-argument. At the same time, the `size`-argument also defines the _lower bound_ of one of the groups.
For instance, if the lowest value of a variable is 1 and the maximum is 10, and `size = 5`, then
a) each group will have a range of 5, and
b) one of the groups will start with the value 5.
This means, that an equal-ranged grouping will define groups from _0 to 4_, _5 to 9_ and _10-14_. Each of these groups has a range of 5, and one of the groups starts with the value 5.
The group assignment becomes clearer, when `group_labels()` is used in parallel:
```{r}
set.seed(123)
x <- round(runif(n = 150, 1, 10))
frq(x)
frq(group_var(x, size = 5))
group_labels(x, size = 5)
dummy <- group_var(x, size = 5, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 5)
frq(dummy)
dummy <- group_var(x, size = 3, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 3)
frq(dummy)
```
The argument `right.interval` can be used when `size` should indicate the _upper bound_ of a group-range.
```{r}
dummy <- group_var(x, size = 4, as.num = FALSE)
levels(dummy) <- group_labels(x, size = 4)
frq(dummy)
dummy <- group_var(x, size = 4, as.num = FALSE, right.interval = TRUE)
levels(dummy) <- group_labels(x, size = 4, right.interval = TRUE)
frq(dummy)
```
## Flexible recoding of variables
`rec()` recodes old values of variables into new values, and can be considered as a "classical" recode-function. The recode-pattern, i.e. which new values should replace the old values, is defined in the `rec`-argument. This argument has a specific "syntax":
* **recode pairs**: Each recode pair has to be separated by a ;, e.g. `rec = "1=1; 2=4; 3=2; 4=3"`
* **multiple values**: Multiple old values that should be recoded into a new single value may be separated with comma, e.g. `rec = "1,2=1; 3,4=2"`
* **value range**: A value range is indicated by a colon, e.g. `rec = "1:4=1; 5:8=2"` (recodes all values from 1 to 4 into 1, and from 5 to 8 into 2)
* **value range for doubles**: For double vectors (with fractional part), all values within the specified range are recoded; e.g. `rec = "1:2.5=1;2.6:3=2"` recodes 1 to 2.5 into 1 and 2.6 to 3 into 2, but 2.55 would not be recoded (since it's not included in any of the specified ranges)
* **"min" and "max"**: Minimum and maximum values are indicates by `min` (or `lo`) and `max` (or `hi`), e.g. `rec = "min:4=1; 5:max=2"` (recodes all values from minimum values of x to 4 into 1, and from 5 to maximum values of x into 2) You can also use `min` or `max` to recode a value into the minimum or maximum value of a variable, e.g. `rec = "min:4=1; 5:7=max"` (recodes all values from minimum values of x to 4 into 1, and from 5 to 7 into the maximum value of x).
* **"else"**: All other values, which have not been specified yet, are indicated by else, e.g. `rec = "3=1; 1=2; else=3"` (recodes 3 into 1, 1 into 2 and all other values into 3)
* **"copy"**: The `"else"`-token can be combined with `"copy"`, indicating that all remaining, not yet recoded values should stay the same (are copied from the original value), e.g. `rec = "3=1; 1=2; else=copy"` (recodes 3 into 1, 1 into 2 and all other values like 2, 4 or 5 etc. will not be recoded, but copied.
* **NA's**: `NA` values are allowed both as old and new value, e.g. `rec = "NA=1; 3:5=NA"` (recodes all `NA` into 1, and all values from 3 to 5 into NA in the new variable)
* **"rev"**: `"rev"` is a special token that reverses the value order.
* **direct value labelling**: Value labels for new values can be assigned inside the recode pattern by writing the value label in square brackets after defining the new value in a recode pair, e.g. `rec = "15:30=1 [young aged]; 31:55=2 [middle aged]; 56:max=3 [old aged]"`
* **non-captured values**: Non-matching values will be set to `NA`, unless captured by the `"else"`- or `"copy"`-token.
Here are some examples:
```{r}
frq(efc$e42dep)
# replace NA with 5
frq(rec(efc$e42dep, rec = "NA=5;else=copy"))
# recode 1 to 2 into 1 and 3 to 4 into 2
frq(rec(efc$e42dep, rec = "1,2=1; 3,4=2"))
# recode 1 to 3 into 4 into 2
frq(rec(efc$e42dep, rec = "min:3=1; 4=2"))
# recode numeric to character, and remaining values
# into the highest value (="hi") of e42dep
frq(rec(efc$e42dep, rec = "1=first;2=2nd;else=hi"))
data(iris)
frq(rec(iris, Species, rec = "setosa=huhu; else=copy", append = FALSE))
# works with mutate
efc %>%
dplyr::select(e42dep, e17age) %>%
dplyr::mutate(dependency_rev = rec(e42dep, rec = "rev")) %>%
head()
# recode multiple variables and set value labels via recode-syntax
dummy <- rec(
efc, c160age, e17age,
rec = "15:30=1 [young]; 31:55=2 [middle]; 56:max=3 [old]",
append = FALSE
)
frq(dummy)
```
## Scoped variants
Where applicable, the recoding-functions in **sjmisc** have "scoped" versions as well, e.g. `dicho_if()` or `split_var_if()`, where transformation will be applied only to those variables that match the logical condition of `predicate`.
|