% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/selections.R
\name{selections}
\alias{selections}
\alias{selection}
\title{Methods for selecting variables in step functions}
\description{
Tips for selecting columns in step functions.
}
\details{
When selecting variables or model terms in \code{step}
functions, \code{dplyr}-like tools are used. The \emph{selector} functions
can choose variables based on their name, current role, data
type, or any combination of these. The selectors are passed like
any other argument to the step. If the variables are explicitly
named in the step function, this might look like:
\preformatted{
recipe( ~ ., data = USArrests) \%>\%
  step_pca(Murder, Assault, UrbanPop, Rape, num_comp = 3)
}
The first four arguments indicate which variables should be
used in the PCA while the last argument is a specific argument
to \code{\link[=step_pca]{step_pca()}} about the number of components.
Note that:
\enumerate{
\item These arguments are not evaluated until the \code{prep}
function for the step is executed.
\item The \code{dplyr}-like syntax allows for negative signs to
exclude variables (e.g. \code{-Murder}), and the set of selectors will
be processed in order.
\item A leading exclusion in these arguments (e.g. \code{-Murder})
has the effect of adding \emph{all} variables to the list except the
excluded variable(s), ignoring role information.
}
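As a minimal sketch of the first two points, the code below defines a
step and then inspects which columns it resolved to after
\code{prep()}. It is written with the native pipe, which assumes
R >= 4.1; the object names \code{rec}, \code{prepped}, and
\code{centered} are only illustrative.
\preformatted{
library(recipes)
# Define the step; the selectors are only stored at this point.
rec <- recipe( ~ ., data = USArrests) |>
  step_center(all_numeric(), -Murder)
# prep() resolves the selectors against the training data, in order:
# all numeric columns first, then Murder is dropped.
prepped <- prep(rec, training = USArrests)
# Columns that the step actually centered.
centered <- tidy(prepped, number = 1)$terms
}
Here \code{centered} should contain the numeric columns of
\code{USArrests} other than \code{Murder}.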
Select helpers from the \code{tidyselect} package can also be used:
\code{\link[tidyselect:starts_with]{tidyselect::starts_with()}}, \code{\link[tidyselect:starts_with]{tidyselect::ends_with()}},
\code{\link[tidyselect:starts_with]{tidyselect::contains()}}, \code{\link[tidyselect:starts_with]{tidyselect::matches()}},
\code{\link[tidyselect:starts_with]{tidyselect::num_range()}}, \code{\link[tidyselect:everything]{tidyselect::everything()}},
\code{\link[tidyselect:one_of]{tidyselect::one_of()}}, \code{\link[tidyselect:all_of]{tidyselect::all_of()}}, and
\code{\link[tidyselect:all_of]{tidyselect::any_of()}}.
For example:
\preformatted{
recipe(Species ~ ., data = iris) \%>\%
  step_center(starts_with("Sepal"), -contains("Width"))
}
would only select \code{Sepal.Length}.
Columns of the design matrix that may not exist when the step
is coded can also be selected. For example, when using
\code{step_pca()}, the number of columns created by feature extraction
may not be known when subsequent steps are defined. In this
case, using \code{matches("^PC")} will select all of the columns
whose names start with "PC" \emph{once those columns are created}.
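A minimal sketch of this pattern, assuming the default \code{"PC"}
prefix of \code{step_pca()} and the native pipe (R >= 4.1); the
normalization step and the object names are illustrative choices, not
part of the text above:
\preformatted{
library(recipes)
rec <- recipe( ~ ., data = USArrests) |>
  step_normalize(all_numeric()) |>
  step_pca(all_numeric(), num_comp = 2) |>
  # The component columns do not exist yet when this line is written.
  step_center(matches("^PC"))
# The selector is resolved only once step_pca() has been prepped.
prepped <- prep(rec, training = USArrests)
pc_cols <- tidy(prepped, number = 3)$terms
}
After \code{prep()}, \code{pc_cols} holds the names of the component
columns that the last step picked up.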
There are sets of recipes-specific functions that can be used to select
variables based on their role or type: \code{\link[=has_role]{has_role()}} and
\code{\link[=has_type]{has_type()}}. For convenience, there are also functions that are
more specific. The functions \code{\link[=all_numeric]{all_numeric()}} and \code{\link[=all_nominal]{all_nominal()}} select
based on type, with nominal variables including both character and factor;
the functions \code{\link[=all_predictors]{all_predictors()}} and \code{\link[=all_outcomes]{all_outcomes()}} select based on role.
The functions \code{\link[=all_numeric_predictors]{all_numeric_predictors()}} and \code{\link[=all_nominal_predictors]{all_nominal_predictors()}}
select intersections of role and type. Any of these can be used in conjunction
with the previously described functions for selecting variables by name.
A selection like this:
\preformatted{
data(biomass)
recipe(HHV ~ ., data = biomass) \%>\%
  step_center(all_numeric(), -all_outcomes())
}
is equivalent to:
\preformatted{
data(biomass)
recipe(HHV ~ ., data = biomass) \%>\%
  step_center(all_numeric_predictors())
}
Both result in all the numeric predictors: carbon, hydrogen,
oxygen, nitrogen, and sulfur.
If a role for a variable has not been defined, it will never be
selected using role-specific selectors.
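As an illustrative sketch, a recipe created from a data frame without
a formula starts with no roles assigned, so a role-based selector only
sees the columns that have been given a role explicitly. The example
assumes the native pipe (R >= 4.1), and the object names are
hypothetical:
\preformatted{
library(recipes)
# With no formula, every column starts without a role.
rec <- recipe(iris) |>
  update_role(Sepal.Length, new_role = "predictor") |>
  step_center(all_predictors())
prepped <- prep(rec, training = iris)
# Only the column that was given the predictor role is selected.
centered <- tidy(prepped, number = 1)$terms
}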
\subsection{Interactions}{
Selectors can be used in \code{\link[=step_interact]{step_interact()}} in similar ways but
must be embedded in a model formula (as opposed to a sequence
of selectors). For example, the interaction specification
could be \code{~ starts_with("Species"):Sepal.Width}. This can be
useful if \code{Species} was converted to dummy variables
previously using \code{\link[=step_dummy]{step_dummy()}}. The implementation of
\code{step_interact()} is special, and is more restricted than
the other step functions. Only the selector functions from
recipes and tidyselect are allowed. User-defined selector functions
will not be recognized. Additionally, the tidyselect domain-specific
language is not recognized here, meaning that \code{&}, \code{|}, \code{!}, and \code{-}
will not work.
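A short sketch of the interaction specification above, assuming
\code{Species} is first converted with \code{step_dummy()} and using
the native pipe (R >= 4.1); \code{new_cols} is an illustrative name:
\preformatted{
library(recipes)
rec <- recipe(Sepal.Length ~ ., data = iris) |>
  step_dummy(Species) |>
  # The selector is embedded in a formula, not passed as a bare argument.
  step_interact( ~ starts_with("Species"):Sepal.Width)
prepped <- prep(rec, training = iris)
# Columns added by the dummy and interaction steps.
new_cols <- setdiff(names(bake(prepped, new_data = NULL)), names(iris))
}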
}
\subsection{Tips for saving recipes and filtering columns}{
When creating variable selections:
\itemize{
\item If you are using column filtering steps, such as \code{step_corr()}, try to
avoid hardcoding specific variable names in downstream steps in case
those columns are removed by the filter. Instead, use
\code{\link[dplyr:reexports]{dplyr::any_of()}} and
\code{\link[dplyr:reexports]{dplyr::all_of()}}.
\itemize{
\item \code{\link[dplyr:reexports]{dplyr::any_of()}} will be tolerant if a column
has been removed.
\item \code{\link[dplyr:reexports]{dplyr::all_of()}} will fail unless all of the
columns are present in the data.
}
\item For both of these functions, if you are going to save the recipe as a
binary object to use in another R session, try to avoid referring to a
vector in your workspace.
\itemize{
\item Preferred: \code{any_of(!!var_names)}
\item Avoid: \code{any_of(var_names)}
}
}
Some examples:
\if{html}{\out{<div class="sourceCode r">}}\preformatted{some_vars <- names(mtcars)[4:6]

# No filter steps, OK for not saving the recipe
rec_1 <-
  recipe(mpg ~ ., data = mtcars) \%>\%
  step_log(all_of(some_vars)) \%>\%
  prep()

# No filter steps, saving the recipe
rec_2 <-
  recipe(mpg ~ ., data = mtcars) \%>\%
  step_log(!!!some_vars) \%>\%
  prep()

# This fails since `wt` is not in the data
recipe(mpg ~ ., data = mtcars) \%>\%
  step_rm(wt) \%>\%
  step_log(!!!some_vars) \%>\%
  prep()
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode">}}\preformatted{## Error in `step_log()`:
## Caused by error in `prep()` at recipes/R/recipe.R:437:8:
## ! Can't subset columns that don't exist.
## x Column `wt` doesn't exist.
}\if{html}{\out{</div>}}
\if{html}{\out{<div class="sourceCode r">}}\preformatted{# Best for filters (using any_of()) and when
# saving the recipe
rec_4 <-
  recipe(mpg ~ ., data = mtcars) \%>\%
  step_rm(wt) \%>\%
  step_log(any_of(!!some_vars)) \%>\%
  # equal to step_log(any_of(c("hp", "drat", "wt")))
  prep()
}\if{html}{\out{</div>}}
}
}