File: recipe.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/recipe.R
\name{recipe}
\alias{recipe}
\alias{recipe.default}
\alias{recipe.formula}
\alias{recipe.data.frame}
\alias{recipe.matrix}
\title{Create a Recipe for Preprocessing Data}
\usage{
recipe(x, ...)

\method{recipe}{default}(x, ...)

\method{recipe}{data.frame}(x, formula = NULL, ..., vars = NULL, roles = NULL)

\method{recipe}{formula}(formula, data, ...)

\method{recipe}{matrix}(x, ...)
}
\arguments{
\item{x, data}{A data frame or tibble of the \emph{template} data set
(see below).}

\item{...}{Further arguments passed to or from other methods (not currently
used).}

\item{formula}{A model formula. No in-line functions should be used here
(e.g. \code{log(x)}, \code{x:y}, etc.) and minus signs are not allowed. These types of
transformations should be enacted using \code{step} functions in this package.
Dots are allowed as are simple multivariate outcome terms (i.e. no need for
\code{cbind}; see Examples).}

\item{vars}{A character vector of column names corresponding to variables
that will be used in any context (see below).}

\item{roles}{A character vector (the same length as \code{vars}) that
gives the single role that each variable will take. The values can be
anything, but common roles are \code{"outcome"}, \code{"predictor"},
\code{"case_weight"}, or \code{"ID"}.}
}
\value{
An object of class \code{recipe} with sub-objects:
\item{var_info}{A tibble containing information about the original data
set columns.}
\item{term_info}{A tibble that contains the current set of terms in the
data set. This initially defaults to the same data contained in
\code{var_info}.}
\item{steps}{A list of \code{step} or \code{check} objects that define the sequence of
preprocessing operations that will be applied to data. The default value is
\code{NULL}.}
\item{template}{A tibble of the data. This is initialized to be the same
as the data given in the \code{data} argument but can be different after
the recipe is trained.}
}
\description{
A recipe is a description of what steps should be applied to a data set in
order to get it ready for data analysis.
}
\details{
Recipes are alternative methods for creating design matrices and
for preprocessing data.

Variables in recipes can have any type of \emph{role} in subsequent analyses
such as: outcome, predictor, case weights, stratification variables, etc.

\code{recipe} objects can be created in several ways. If the analysis only
contains outcomes and predictors, the simplest way to create one is to use
a simple formula (e.g. \code{y ~ x1 + x2}) that does not contain inline
functions such as \code{log(x3)}. An example is given below.

Alternatively, a \code{recipe} object can be created by first specifying
which variables in a data set should be used and then sequentially
defining their roles (see the last example).
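
As a minimal sketch of this second approach, the \code{vars} and \code{roles}
arguments of the data frame method can assign roles when the recipe is
created. The \code{toy} data frame and its column names are hypothetical and
serve only as an illustration.

\preformatted{
library(recipes)

# Hypothetical toy data: one outcome, one predictor, one identifier.
toy <- data.frame(
  y  = c(1.2, 3.4, 5.6),
  x1 = c(0.1, 0.2, 0.3),
  id = c("a", "b", "c")
)

# Name the columns to keep and give each one exactly one role.
rec_roles <- recipe(
  toy,
  vars  = c("y", "x1", "id"),
  roles = c("outcome", "predictor", "ID")
)

# summary() lists each variable with its type, role, and source.
summary(rec_roles)
}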

There are two different types of operations that can be
sequentially added to a recipe. \strong{Steps} can include common
operations like logging a variable, creating dummy variables or
interactions and so on. More computationally complex actions
such as dimension reduction or imputation can also be specified.
\strong{Checks} are operations that conduct specific tests of the
data. When the test is satisfied, the data are returned without
issue or modification. Otherwise, an error is thrown.
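
The difference between the two can be sketched as follows, assuming
\code{step_center()} as the step and \code{check_missing()} as the check.
The data frames are hypothetical toy data, and the string returned by the
error handler is only a placeholder.

\preformatted{
library(recipes)

# Hypothetical toy data.
train   <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
new_ok  <- data.frame(y = c(5, 6), x = c(15, 25))
new_bad <- data.frame(y = c(5, 6), x = c(15, NA))

rec <- recipe(y ~ x, data = train)
rec <- step_center(rec, all_predictors())    # a step: modifies the data
rec <- check_missing(rec, all_predictors())  # a check: tests the data
trained <- prep(rec, training = train)

# No missing values: the centered data are returned as usual.
ok <- bake(trained, new_data = new_ok)

# A missing predictor value: the check throws an error, caught here.
res <- tryCatch(
  bake(trained, new_data = new_bad),
  error = function(e) "check failed"
)
}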

Once a recipe has been defined, the \code{\link[=prep]{prep()}} function can be
used to estimate quantities required for the operations using a
data set (a.k.a. the training data). \code{\link[=prep]{prep()}} returns another
recipe.

To apply the recipe to a data set, the \code{\link[=bake]{bake()}} function is
used in the same manner as \code{predict} would be for models. This
applies the steps to any data set.
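
A minimal sketch of this two-stage workflow, using hypothetical toy data
and a single centering step:

\preformatted{
library(recipes)

# Hypothetical toy data.
train <- data.frame(y = c(1, 2, 3, 4), x = c(10, 20, 30, 40))
test  <- data.frame(y = c(5, 6), x = c(50, 60))

rec <- recipe(y ~ x, data = train)
rec <- step_center(rec, all_predictors())

# prep() estimates the mean of x from the training data (25 here)
# and returns another recipe.
trained <- prep(rec, training = train)
class(trained)

# bake() applies the trained step to new data, much as predict()
# applies a fitted model: x becomes 50 - 25 and 60 - 25.
baked <- bake(trained, new_data = test)
baked$x
}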

Note that the data passed to \code{recipe} need not be the complete data
that will be used to train the steps (by \code{\link[=prep]{prep()}}). The recipe
only needs to know the names and types of data that will be used. For
large data sets, \code{head} could be used to pass the recipe a smaller
data set to save time and memory.
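
As a sketch of this idea, the recipe below is created from the first few
rows of a hypothetical simulated data frame (\code{full}) and then trained
on all of it. The estimated mean therefore comes from the full data, not
from the template.

\preformatted{
library(recipes)

# Hypothetical simulated data standing in for a large data set.
set.seed(1)
full <- data.frame(y = rnorm(1000), x = rnorm(1000, mean = 5))

# The template only supplies column names and types.
rec_small <- recipe(y ~ x, data = head(full))
rec_small <- step_center(rec_small, all_predictors())

# The complete data are supplied when the step is estimated.
trained_full <- prep(rec_small, training = full)

# tidy() reports the estimated mean, which matches mean(full$x).
centers <- tidy(trained_full, number = 1)
centers$value
}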
}
\examples{

###############################################
# simple example:
library(modeldata)
data(biomass)

# split data
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

# When there are only predictors and outcomes, a simplified formula can be used.
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)

# Now add preprocessing steps to the recipe.

sp_signed <- rec \%>\%
  step_normalize(all_predictors()) \%>\%
  step_spatialsign(all_predictors())
sp_signed

# now estimate required parameters
sp_signed_trained <- prep(sp_signed, training = biomass_tr)
sp_signed_trained

# apply the preprocessing to a data set
test_set_values <- bake(sp_signed_trained, new_data = biomass_te)

# or use pipes for the entire workflow:
rec <- biomass_tr \%>\%
  recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur) \%>\%
  step_normalize(all_predictors()) \%>\%
  step_spatialsign(all_predictors())

###############################################
# multivariate example

# no need for `cbind(carbon, hydrogen)` for left-hand side
multi_y <- recipe(carbon + hydrogen ~ oxygen + nitrogen + sulfur,
                  data = biomass_tr)
multi_y <- multi_y \%>\%
  step_center(all_outcomes()) \%>\%
  step_scale(all_predictors())

multi_y_trained <- prep(multi_y, training = biomass_tr)

results <- bake(multi_y_trained, biomass_te)

###############################################
# Creating a recipe manually with different roles

rec <- recipe(biomass_tr) \%>\%
  update_role(carbon, hydrogen, oxygen, nitrogen, sulfur,
              new_role = "predictor") \%>\%
  update_role(HHV, new_role = "outcome") \%>\%
  update_role(sample, new_role = "id variable") \%>\%
  update_role(dataset, new_role = "splitting indicator")
rec
}
\author{
Max Kuhn
}
\concept{model_specification}
\concept{preprocessing}
\keyword{datagen}