---
title: "How are categorical predictors handled in `recipes`?"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Dummy Variables and Interactions}
  %\VignetteEncoding{UTF-8}
output:
  knitr:::html_vignette:
    toc: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  digits = 3,
  collapse = TRUE,
  comment = "#>"
)
options(digits = 3)
library(recipes)
```
Recipes can behave differently from base R counterparts such as `model.matrix`. This vignette describes the different methods for encoding categorical predictors, with special attention to interaction terms and contrasts.
## Creating Dummy Variables
Let's start, of course, with `iris` data. This has four numeric columns and a single factor column with three levels: `'setosa'`, `'versicolor'`, and `'virginica'`. Our initial recipe will have no outcome:
```{r iris-base-rec}
library(recipes)
# make a copy for use below
iris <- iris %>% mutate(original = Species)
iris_rec <- recipe( ~ ., data = iris)
summary(iris_rec)
```
A [contrast function](https://en.wikipedia.org/wiki/Contrast_(statistics)) in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.
The default approach is to create dummy variables using the "reference cell" parameterization. This means that, if there are _C_ levels of the factor, there will be _C_ - 1 dummy variables created and all but the first factor level are made into new columns:
```{r iris-ref-cell}
ref_cell <-
  iris_rec %>%
  step_dummy(Species) %>%
  prep(training = iris)
summary(ref_cell)
# Get a row for each factor level
juice(ref_cell, original, starts_with("Species")) %>% distinct()
```
Note that the column that was used to make the new columns (`Species`) is no longer there. See the section below on obtaining the entire set of _C_ columns.
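
For comparison, base R's `contr.treatment()` builds the same kind of reference-cell coding. The sketch below is not how `step_dummy()` works internally; it only prints the contrast matrix for the three species levels, using the built-in `datasets::iris` so that it does not depend on the objects created above.

```{r sketch-ref-cell}
# Reference-cell ("treatment") coding for the three species levels
species_levels <- levels(datasets::iris$Species)
ref_cell_matrix <- contr.treatment(species_levels)
# One row per level, one column per non-reference level;
# the first level (setosa) is the all-zero reference row
ref_cell_matrix
```
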
There are different types of contrasts that can be used for different types of factors. The defaults are:
```{r defaults}
param <- getOption("contrasts")
param
```
Looking at `?contrast`, there are other options. One alternative is the little-known Helmert contrast:
> `contr.helmert` returns Helmert contrasts, which contrast the second level with the first, the third with the average of the first two, and so on.
To get this encoding, the global option for the contrasts can be changed and saved. [`step_dummy`](https://recipes.tidymodels.org/reference/step_dummy.html) picks up on this and makes the correct calculations:
```{r iris-helmert}
# change it:
go_helmert <- param
go_helmert["unordered"] <- "contr.helmert"
options(contrasts = go_helmert)
# now make dummy variables with new parameterization
helmert <-
  iris_rec %>%
  step_dummy(Species) %>%
  prep(training = iris)
summary(helmert)
juice(helmert, original, starts_with("Species")) %>% distinct()
# Yuk; go back to the original method
options(contrasts = param)
```
Note that the column names do not reference a specific level of the species variable. This contrast function has columns that can involve multiple levels; level-specific columns wouldn't make sense.
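
The contrast matrix itself shows why. This is a minimal base R sketch, independent of the recipe above: the columns returned by `contr.helmert()` are unnamed, and each one mixes several levels.

```{r sketch-helmert-matrix}
# Helmert coding for the three species levels
helmert_matrix <- contr.helmert(levels(datasets::iris$Species))
# Rows are levels; the columns carry no level-specific names
helmert_matrix
# Each column sums to zero across the levels
colSums(helmert_matrix)
```
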
If no columns are selected (perhaps due to an earlier `step_zv()`), the `bake()` and `juice()` functions will return the data as-is (e.g. with no dummy variables).
Finally, `step_dummy()` has an option called `preserve` that can be used to keep the original columns that are used to create the dummy variables.
## Interactions with Dummy Variables
Creating interactions with recipes requires the use of a model formula, such as
```{r iris-2int}
iris_int <-
  iris_rec %>%
  step_interact( ~ Sepal.Width:Sepal.Length) %>%
  prep(training = iris)
summary(iris_int)
```
In [R model formulae](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html), using a `*` between two variables would expand to `a*b = a + b + a:b` so that the main effects are included. In [`step_interact`](https://recipes.tidymodels.org/reference/step_interact.html), you can use `*`, but only the interactions are recorded as columns that need to be created.
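
The base R expansion can be checked directly with `terms()`. In this small sketch, `a` and `b` are placeholder variable names, not columns in `iris`.

```{r sketch-star-expansion}
# `*` expands to both main effects plus the interaction
star_terms <- attr(terms(~ a * b), "term.labels")
star_terms
# `:` requests the interaction term only
colon_terms <- attr(terms(~ a:b), "term.labels")
colon_terms
```
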
One thing that `recipes` does differently than base R is to construct the design matrix in sequential iterations. This is relevant when thinking about interactions between continuous and categorical predictors.
For example, if you were to use the standard formula interface, the creation of the dummy variables happens at the same time as the interactions are created:
```{r mm-int}
model.matrix(~ Species*Sepal.Length, data = iris) %>%
  as.data.frame() %>%
  # show a few specific rows
  slice(c(1, 51, 101)) %>%
  as.data.frame()
```
With recipes, you create them sequentially. This raises an issue: do I have to type out all of the interaction effects by their specific names when using dummy variables?
```{r nope, eval = FALSE}
# Must I do this?
iris_rec %>%
  step_interact( ~ Species_versicolor:Sepal.Length +
                   Species_virginica:Sepal.Length)
```
Not only is this a pain, but it may not be obvious what dummy variables are available (especially when [`step_other`](https://recipes.tidymodels.org/reference/step_other.html) is used).
The solution is to use a selector:
```{r iris-sel}
iris_int <-
  iris_rec %>%
  step_dummy(Species) %>%
  step_interact( ~ starts_with("Species"):Sepal.Length) %>%
  prep(training = iris)
summary(iris_int)
```
What happens here is that `starts_with("Species")` is executed on the data that are available after the previous steps have been applied. That means that the dummy variable columns are present. The results of the selector are then translated into an additive expression of the matching columns. In this case, that means that
```{r sel-input, eval = FALSE}
starts_with("Species")
```
becomes
```{r sel-output, eval = FALSE}
(Species_versicolor + Species_virginica)
```
The entire interaction formula is shown here:
```{r int-form}
iris_int
```
For interactions between multiple sets of dummy variables, the formula could include multiple selectors (e.g. `starts_with("x_"):starts_with("y_")`).
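
The translation from a selector to an additive expression can be mimicked in base R. This is only a sketch of the idea, not the code that `step_interact()` runs: the column names are typed in by hand (assumed to be what exists after `step_dummy(Species)`), and `grep()` stands in for `starts_with()`.

```{r sketch-selector-expansion}
# Column names assumed to exist after step_dummy(Species)
available <- c("Sepal.Length", "Sepal.Width",
               "Species_versicolor", "Species_virginica")
# grep() plays the role of starts_with("Species")
matched <- grep("^Species", available, value = TRUE)
# Combine the matches into "(a + b)" and attach the other side of the interaction
expanded <- as.formula(
  paste0("~ (", paste(matched, collapse = " + "), "):Sepal.Length")
)
expanded
# The individual interaction terms implied by the expanded formula
attr(terms(expanded), "term.labels")
```
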
## Warning!
Would it work if I skipped `step_dummy` and used the `Species` factor directly in the interactions step?
```{r iris-dont}
iris_int <-
  iris_rec %>%
  step_interact( ~ Species:Sepal.Length) %>%
  prep(training = iris)
summary(iris_int)
```
The column `Species` isn't affected and a warning is issued. Basically, you only get half of what `model.matrix` does, and that could be really problematic in subsequent steps.
## Getting All of the Indicator Variables
As mentioned above, if there are _C_ levels of the factor, there will be _C_ - 1 dummy variables created. You might want to get all of them back.
Historically, _C_ - 1 are used so that a linear dependency is avoided in the design matrix; all _C_ dummy variables would add up row-wise to the intercept column and the inverse matrix for linear regression can't be computed. The technical term for a design matrix like this is "less than full rank".
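
The rank deficiency is easy to verify numerically. The following base R sketch builds all _C_ = 3 indicator columns by hand from `datasets::iris` (the object names are illustrative) and adds an intercept column.

```{r sketch-rank}
species <- datasets::iris$Species
# One indicator column per level (all C = 3 of them)
indicators <- sapply(levels(species), function(lvl) as.numeric(species == lvl))
full_design <- cbind(intercept = 1, indicators)
# The indicators add up row-wise to the intercept column
all(rowSums(indicators) == 1)
# Four columns, but only three of them are linearly independent
ncol(full_design)
qr(full_design)$rank
```
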
There are models (e.g. `glmnet` and others) that can avoid this issue so you might want to get all of the columns. To do this, `step_dummy` has an option called `one_hot` that will make sure that all _C_ are produced:
```{r one-hot}
iris_rec %>%
  step_dummy(Species, one_hot = TRUE) %>%
  prep(training = iris) %>%
  juice(original, starts_with("Species")) %>%
  distinct()
```
The option is named that way since this is what computer scientists call ["one-hot encoding"](https://www.google.com/search?q=one-hot+encoding).
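
A base R analogue, offered only as an illustration and not as a description of what `step_dummy()` does internally: calling `contr.treatment()` with `contrasts = FALSE` returns an identity matrix, with one indicator column per level.

```{r sketch-one-hot-matrix}
# Full indicator (one-hot) coding: no level is dropped as the reference
one_hot_matrix <- contr.treatment(levels(datasets::iris$Species),
                                  contrasts = FALSE)
one_hot_matrix
```
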
***Warning!*** (again)
The `one_hot` option is meant to give you the full set of indicators and, when you use the typical contrast function, it does. It might do some seemingly weird (but legitimate) things when used with other contrasts:
```{r one-hot-two}
hot_reference <-
  iris_rec %>%
  step_dummy(Species, one_hot = TRUE) %>%
  prep(training = iris) %>%
  juice(original, starts_with("Species")) %>%
  distinct()
hot_reference
# from above
options(contrasts = go_helmert)
hot_helmert <-
  iris_rec %>%
  step_dummy(Species, one_hot = TRUE) %>%
  prep(training = iris) %>%
  juice(original, starts_with("Species")) %>%
  distinct()
hot_helmert
# restore the original contrasts so later code is unaffected
options(contrasts = param)
```
Since the Helmert contrast doesn't make sense with all _C_ columns, the step reverts to the default encoding.
## Novel Levels
When a recipe is used with new samples, some factors may have acquired new levels that were not present when `prep` was run. If `step_dummy` encounters this situation, a warning is issued ("There are new levels in a factor") and the indicator variables that correspond to the factor are assigned missing values.
One way around this is to use `step_other`. This step can convert infrequently occurring levels to a new category (that defaults to "other"). It can also be used to convert new factor levels to "other".
Also, `step_integer` has functionality similar to [`LabelEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) and encodes new values as zero.
The [`embed`](https://github.com/tidymodels/embed) package can also handle novel factor levels within a recipe. `step_embed` and `step_tfembed` assign a common numeric score to novel levels.
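
The underlying problem, and the "other" workaround, can be illustrated with plain factors. This base R sketch only mimics the idea behind pooling unseen values; it is not the `step_other` implementation, and the value `"mystery"` is made up.

```{r sketch-novel-levels}
# Levels seen at training time, and new data with an unseen value
train_levels <- c("setosa", "versicolor", "virginica")
new_values <- c("setosa", "mystery", "virginica")
# Re-using the training levels turns the unseen value into NA
as_trained <- factor(new_values, levels = train_levels)
is.na(as_trained)
# Pooling unseen values into "other" keeps every row usable
pooled <- ifelse(new_values %in% train_levels, new_values, "other")
pooled_factor <- factor(pooled, levels = c(train_levels, "other"))
pooled_factor
```
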
## Other Steps Related to Dummy Variables
There are a bunch of steps related to going in-between factors and dummy variables:
* [`step_unknown`](https://recipes.tidymodels.org/reference/step_unknown.html) assigns missing factor values into an `'unknown'` category.
* [`step_other`](https://recipes.tidymodels.org/reference/step_other.html) can collapse infrequently occurring levels into `'other'`.
* [`step_regex`](https://recipes.tidymodels.org/reference/step_regex.html) will create a single dummy variable based on applying a regular expression to a text field. Similarly, [`step_count`](https://recipes.tidymodels.org/reference/step_count.html) does the same but counts the occurrences of the pattern in the string.
* [`step_holiday`](https://recipes.tidymodels.org/reference/step_holiday.html) creates dummy variables from date fields to capture holidays.
* [`step_lincomb`](https://recipes.tidymodels.org/reference/step_lincomb.html) can be useful if you _over-specify_ interactions and need to remove linear dependencies.
* [`step_zv`](https://recipes.tidymodels.org/reference/step_zv.html) can remove dummy variables that never show a 1 in the column (i.e. are zero-variance).
* [`step_bin2factor`](https://recipes.tidymodels.org/reference/step_bin2factor.html) takes a binary indicator and makes a factor variable. This can be useful when using naive Bayes models.
* `step_embed`, `step_lencode_glm`, `step_lencode_bayes` and others in the [`embed`](https://github.com/tidymodels/embed) package can use one or more (non-binary) values to encode factor predictors into a numeric form.
[`step_dummy`](https://recipes.tidymodels.org/reference/step_dummy.html) also works with _ordered factors_. As seen above, the default encoding is to create a series of polynomial variables. There are also a few steps for ordered factors:
* [`step_ordinalscore`](https://recipes.tidymodels.org/reference/step_ordinalscore.html) can translate the levels to a single numeric score.
* [`step_unorder`](https://recipes.tidymodels.org/reference/step_unorder.html) can convert to an unordered factor.
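
The polynomial encoding for ordered factors corresponds to base R's `contr.poly()`. As a small sketch (the three level labels are made up), an ordered factor with three levels gets a linear (`.L`) and a quadratic (`.Q`) column:

```{r sketch-poly}
# Three ordered levels: low < medium < high
sizes <- factor(c("low", "medium", "high"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
# Orthogonal polynomial contrasts for three levels
poly_matrix <- contr.poly(nlevels(sizes))
poly_matrix
```
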