% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/recipe.R
\name{recipe}
\alias{recipe}
\alias{recipe.default}
\alias{recipe.formula}
\alias{recipe.data.frame}
\alias{recipe.matrix}
\title{Create a Recipe for Preprocessing Data}
\usage{
recipe(x, ...)

\method{recipe}{default}(x, ...)

\method{recipe}{data.frame}(x, formula = NULL, ..., vars = NULL, roles = NULL)

\method{recipe}{formula}(formula, data, ...)

\method{recipe}{matrix}(x, ...)
}
\arguments{
\item{x, data}{A data frame or tibble of the \emph{template} data set
(see below).}

\item{...}{Further arguments passed to or from other methods (not currently
used).}

\item{formula}{A model formula. No in-line functions should be used here
(e.g. \code{log(x)}, \code{x:y}, etc.) and minus signs are not allowed. These types of
transformations should be enacted using \code{step} functions in this package.
Dots are allowed as are simple multivariate outcome terms (i.e. no need for
\code{cbind}; see Examples).}

\item{vars}{A character vector of column names corresponding to variables
that will be used in any context (see below).}

\item{roles}{A character vector (the same length as \code{vars}) that
gives the single role that each variable will take. A role can be any
string, but common roles are \code{"outcome"}, \code{"predictor"},
\code{"case_weight"}, or \code{"ID"}.}
}
\value{
An object of class \code{recipe} with sub-objects:
\item{var_info}{A tibble containing information about the original data
set columns}
\item{term_info}{A tibble that contains the current set of terms in the
data set. This initially defaults to the same data contained in
\code{var_info}.}
\item{steps}{A list of \code{step} or \code{check} objects that define the sequence of
preprocessing operations that will be applied to data. The default value is
\code{NULL}.}
\item{template}{A tibble of the data. This is initialized to be the same
as the data given in the \code{data} argument but can be different after
the recipe is trained.}
}
\description{
A recipe is a description of what steps should be applied to a data set in
order to get it ready for data analysis.
}
\details{
Recipes are alternative methods for creating design matrices and
for preprocessing data.

Variables in recipes can have any type of \emph{role} in subsequent analyses,
such as outcome, predictor, case weights, stratification variables, etc.

\code{recipe} objects can be created in several ways. If the analysis only
contains outcomes and predictors, the simplest way to create one is to use
a simple formula (e.g. \code{y ~ x1 + x2}) that does not contain inline
functions such as \code{log(x3)}. An example is given below.

Alternatively, a \code{recipe} object can be created by first specifying
which variables in a data set should be used and then sequentially
defining their roles (see the last example).

There are two different types of operations that can be
sequentially added to a recipe. \strong{Steps} can include common
operations like logging a variable, creating dummy variables or
interactions, and so on. More computationally complex actions
such as dimension reduction or imputation can also be specified.
\strong{Checks} are operations that conduct specific tests of the
data. When the test is satisfied, the data are returned without
issue or modification. Otherwise, an error is thrown.

Once a recipe has been defined, the \code{\link[=prep]{prep()}} function can be
used to estimate quantities required for the operations using a
data set (a.k.a. the training data). \code{\link[=prep]{prep()}} returns another
recipe.

To apply the recipe to a data set, the \code{\link[=bake]{bake()}} function is
used in the same manner as \code{predict} would be for models. This
applies the steps to any data set.

Note that the data passed to \code{recipe} need not be the complete data
that will be used to train the steps (by \code{\link[=prep]{prep()}}). The recipe
only needs to know the names and types of data that will be used. For
large data sets, \code{head} could be used to pass the recipe a smaller
data set to save time and memory.
}
\examples{

###############################################
# simple example:
library(modeldata)
data(biomass)

# split data
biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

# When there are only predictors and outcomes, a simplified formula can be used.
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)

# Now add preprocessing steps to the recipe.

sp_signed <- rec \%>\%
  step_normalize(all_predictors()) \%>\%
  step_spatialsign(all_predictors())
sp_signed

# now estimate required parameters
sp_signed_trained <- prep(sp_signed, training = biomass_tr)
sp_signed_trained

# apply the preprocessing to a data set
test_set_values <- bake(sp_signed_trained, new_data = biomass_te)

# or use pipes for the entire workflow:
rec <- biomass_tr \%>\%
  recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur) \%>\%
  step_normalize(all_predictors()) \%>\%
  step_spatialsign(all_predictors())
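
# Sketch (an addition to the original examples): a check can be added
# to a recipe in the same way as a step. `rec_checked` and
# `rec_checked_trained` are hypothetical names. check_missing() is
# assumed here as a representative check; prep() throws an error if any
# predictor in the training set contains missing values and otherwise
# returns the trained recipe.
rec_checked <- check_missing(rec, all_predictors())
rec_checked_trained <- prep(rec_checked, training = biomass_tr)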

###############################################
# multivariate example

# no need for `cbind(carbon, hydrogen)` for left-hand side
multi_y <- recipe(carbon + hydrogen ~ oxygen + nitrogen + sulfur,
                  data = biomass_tr)
multi_y <- multi_y \%>\%
  step_center(all_outcomes()) \%>\%
  step_scale(all_predictors())

multi_y_trained <- prep(multi_y, training = biomass_tr)

results <- bake(multi_y_trained, biomass_te)
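
###############################################
# Sketch (an addition to the original examples): as noted in Details,
# the recipe only needs the names and types of the columns, so a few
# rows are enough as the template. `rec_small` and `rec_small_trained`
# are hypothetical names.

rec_small <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
                    data = head(biomass_tr))
rec_small <- step_normalize(rec_small, all_predictors())

# the statistics are still estimated from the full training set
# because it is passed explicitly to prep()
rec_small_trained <- prep(rec_small, training = biomass_tr)
small_results <- bake(rec_small_trained, new_data = biomass_te)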

###############################################
# Creating a recipe manually with different roles

rec <- recipe(biomass_tr) \%>\%
  update_role(carbon, hydrogen, oxygen, nitrogen, sulfur,
              new_role = "predictor") \%>\%
  update_role(HHV, new_role = "outcome") \%>\%
  update_role(sample, new_role = "id variable") \%>\%
  update_role(dataset, new_role = "splitting indicator")
rec
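
# Sketch (an addition to the original examples): roles can also be
# supplied up front through the `vars` and `roles` arguments of the
# data frame method, one role per variable. `rec_roles` is a
# hypothetical name and the three columns are chosen only for
# illustration.
rec_roles <- recipe(biomass_tr,
                    vars = c("HHV", "carbon", "hydrogen"),
                    roles = c("outcome", "predictor", "predictor"))
summary(rec_roles)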
}
\author{
Max Kuhn
}
\concept{model_specification}
\concept{preprocessing}
\keyword{datagen}