% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/recipe.R
\name{recipe}
\alias{recipe}
\alias{recipe.default}
\alias{recipe.formula}
\alias{recipe.data.frame}
\alias{recipe.matrix}
\title{Create a Recipe for Preprocessing Data}
\usage{
recipe(x, ...)
\method{recipe}{default}(x, ...)
\method{recipe}{data.frame}(x, formula = NULL, ..., vars = NULL, roles = NULL)
\method{recipe}{formula}(formula, data, ...)
\method{recipe}{matrix}(x, ...)
}
\arguments{
\item{x, data}{A data frame or tibble of the \emph{template} data set
(see below).}
\item{...}{Further arguments passed to or from other methods (not currently
used).}
\item{formula}{A model formula. No inline functions should be used here
(e.g. \code{log(x)}, \code{x:y}, etc.) and minus signs are not allowed. These types of
transformations should be enacted using \code{step} functions in this package.
Dots are allowed, as are simple multivariate outcome terms (i.e. no need for
\code{cbind}; see Examples).}
\item{vars}{A character vector of column names corresponding to variables
that will be used in any context (see below).}
\item{roles}{A character vector (the same length as \code{vars}) that
describes the single role that each variable will take. This value could be
anything, but common roles are \code{"outcome"}, \code{"predictor"},
\code{"case_weight"}, or \code{"ID"}.}
}
\value{
An object of class \code{recipe} with sub-objects:
\item{var_info}{A tibble containing information about the original data
set columns}
\item{term_info}{A tibble that contains the current set of terms in the
data set. This initially defaults to the same data contained in
\code{var_info}.}
\item{steps}{A list of \code{step} or \code{check} objects that define the sequence of
preprocessing operations that will be applied to data. The default value is
\code{NULL}.}
\item{template}{A tibble of the data. This is initialized to be the same
as the data given in the \code{data} argument but can be different after
the recipe is trained.}
}
\description{
A recipe is a description of what steps should be applied to a data set in
order to get it ready for data analysis.
}
\details{
Recipes are alternative methods for creating design matrices and
for preprocessing data.

Variables in recipes can have any type of \emph{role} in subsequent analyses,
such as outcome, predictor, case weight, or stratification variable.

\code{recipe} objects can be created in several ways. If the analysis only
contains outcomes and predictors, the simplest way to create one is to use
a simple formula (e.g. \code{y ~ x1 + x2}) that does not contain inline
functions such as \code{log(x3)}. An example is given below.

Alternatively, a \code{recipe} object can be created by first specifying
which variables in a data set should be used and then sequentially
defining their roles (see the last example).

There are two different types of operations that can be
sequentially added to a recipe. \strong{Steps} can include common
operations like logging a variable, creating dummy variables or
interactions, and so on. More computationally complex actions
such as dimension reduction or imputation can also be specified.
\strong{Checks} are operations that conduct specific tests of the
data. When the test is satisfied, the data are returned without
issue or modification. Otherwise, an error is thrown.

Once a recipe has been defined, the \code{\link[=prep]{prep()}} function can be
used to estimate quantities required for the operations using a
data set (the training data). \code{\link[=prep]{prep()}} returns another
recipe.

To apply the recipe to a data set, the \code{\link[=bake]{bake()}} function is
used in the same manner as \code{predict} would be for models. This
applies the steps to any data set.

Note that the data passed to \code{recipe} need not be the complete data
that will be used to train the steps (by \code{\link[=prep]{prep()}}). The recipe
only needs to know the names and types of the variables that will be used. For
large data sets, \code{head} could be used to pass the recipe a smaller
data set to save time and memory.
}
\examples{
###############################################
# simple example:
library(modeldata)
data(biomass)
# split data
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]
# When there are only predictors and outcomes, a simplified formula can be used.
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)
# Now add preprocessing steps to the recipe.
sp_signed <- rec \%>\%
  step_normalize(all_predictors()) \%>\%
  step_spatialsign(all_predictors())
sp_signed
# now estimate required parameters
sp_signed_trained <- prep(sp_signed, training = biomass_tr)
sp_signed_trained
# apply the preprocessing to a data set
test_set_values <- bake(sp_signed_trained, new_data = biomass_te)
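# Checks are added to a recipe in the same way as steps. A minimal sketch
# (not part of the original example; `checked` and the objects derived from
# it are illustrative names): check_missing() leaves the data unchanged when
# the selected columns have no missing values and throws an error otherwise.
# This assumes the biomass predictors contain no missing values.
checked <- sp_signed \%>\%
  check_missing(all_predictors())
checked_trained <- prep(checked, training = biomass_tr)
checked_values <- bake(checked_trained, new_data = biomass_te)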
# or use pipes for the entire workflow:
rec <- biomass_tr \%>\%
  recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur) \%>\%
  step_normalize(all_predictors()) \%>\%
  step_spatialsign(all_predictors())
###############################################
# multivariate example
# no need for `cbind(carbon, hydrogen)` for left-hand side
multi_y <- recipe(carbon + hydrogen ~ oxygen + nitrogen + sulfur,
                  data = biomass_tr)
multi_y <- multi_y \%>\%
  step_center(all_outcomes()) \%>\%
  step_scale(all_predictors())
multi_y_trained <- prep(multi_y, training = biomass_tr)
results <- bake(multi_y_trained, biomass_te)
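###############################################
# Using a small template data set
# A sketch of the note in Details (`small_rec` is an illustrative name, not
# part of the original example): the recipe only needs the column names and
# types, so a few rows from head() can serve as the template, while prep()
# still estimates the step from the full training set.
small_rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
                    data = head(biomass_tr)) \%>\%
  step_normalize(all_predictors())
small_rec_trained <- prep(small_rec, training = biomass_tr)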
###############################################
# Creating a recipe manually with different roles
rec <- recipe(biomass_tr) \%>\%
  update_role(carbon, hydrogen, oxygen, nitrogen, sulfur,
              new_role = "predictor") \%>\%
  update_role(HHV, new_role = "outcome") \%>\%
  update_role(sample, new_role = "id variable") \%>\%
  update_role(dataset, new_role = "splitting indicator")
rec
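# One way to confirm the role assignments (a sketch, not part of the original
# example): summary() on a recipe returns a tibble with one row per variable
# and its current role.
summary(rec)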
}
\author{
Max Kuhn
}
\concept{model_specification}
\concept{preprocessing}
\keyword{datagen}