% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/impute_linear.R
\name{step_impute_linear}
\alias{step_impute_linear}
\title{Impute numeric variables via a linear model}
\usage{
step_impute_linear(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  impute_with = imp_vars(all_predictors()),
  models = NULL,
  skip = FALSE,
  id = rand_id("impute_linear")
)
}
\arguments{
\item{recipe}{A recipe object. The step will be added to the
sequence of operations for this recipe.}

\item{...}{One or more selector functions to choose variables to be imputed;
these variables \strong{must} be of type \code{numeric}. When used with \code{imp_vars},
these dots indicate which variables are used to predict the missing data
in each variable. See \code{\link[=selections]{selections()}} for more details.}

\item{role}{Not used by this step since no new variables are
created.}

\item{trained}{A logical to indicate if the quantities for
preprocessing have been estimated.}

\item{impute_with}{A call to \code{imp_vars} that specifies which variables are used
to impute the selected variables. The call can include specific variable names
separated by commas or different selectors (see \code{\link[=selections]{selections()}}). If a column is
included both in the list to be imputed and in the list of imputation predictors, it
will be removed from the latter and not used to impute itself.}

\item{models}{The \code{\link[=lm]{lm()}} objects are stored here once the linear models
have been trained by \code{\link[=prep]{prep()}}.}

\item{skip}{A logical. Should the step be skipped when the
recipe is baked by \code{\link[=bake]{bake()}}? While all operations are baked
when \code{\link[=prep]{prep()}} is run, some operations may not be able to be
conducted on new data (e.g. processing the outcome variable(s)).
Care should be taken when using \code{skip = TRUE} as it may affect
the computations for subsequent operations.}

\item{id}{A character string that is unique to this step to identify it.}
}
\value{
An updated version of \code{recipe} with the new step added to the
sequence of any existing operations.
}
\description{
\code{step_impute_linear} creates a \emph{specification} of a recipe step that will
create linear regression models to impute missing data.
}
\details{
For each variable requiring imputation, a linear model is fit
where the outcome is the variable of interest and the predictors are any
other variables listed in the \code{impute_with} argument. Note that if a variable
that is to be imputed is also in \code{impute_with}, it will not be used to impute itself.

The variable(s) to be imputed must be of type \code{numeric}. The imputed values
will keep the same type as their original data (i.e., model predictions are
coerced to integer as needed).

Since this is a linear regression, each imputation model is fit using only
the training set rows that have complete cases for the predictors.
}
\section{Tidying}{
When you \code{\link[=tidy.recipe]{tidy()}} this step, a tibble with
columns \code{terms} (the selectors or variables selected) and \code{model} (the
fitted linear model object) is returned.
}

\section{Case weights}{


This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are only used when they are frequency weights. For more
information, see the documentation in \link{case_weights} and the examples on
\code{tidymodels.org}.
}

\examples{
\dontshow{if (rlang::is_installed(c("modeldata", "ggplot2"))) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
data(ames, package = "modeldata")
set.seed(393)
ames_missing <- ames
ames_missing$Longitude[sample(1:nrow(ames), 200)] <- NA

imputed_ames <-
  recipe(Sale_Price ~ ., data = ames_missing) \%>\%
  step_impute_linear(
    Longitude,
    impute_with = imp_vars(Latitude, Neighborhood, MS_Zoning, Alley)
  ) \%>\%
  prep(ames_missing)
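
# A small sketch, not part of the original example: inspect what the trained
# step stored. Assumptions: the imputation step is the first (and only) step
# in this recipe, hence number = 1, and the stored model supports coef() in
# the same way as an lm() fit.
tidied <- tidy(imputed_ames, number = 1)
tidied
coef(tidied$model[[1]])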

imputed <-
  bake(imputed_ames, new_data = ames_missing) \%>\%
  dplyr::rename(imputed = Longitude) \%>\%
  dplyr::bind_cols(ames \%>\% dplyr::select(original = Longitude)) \%>\%
  dplyr::bind_cols(ames_missing \%>\% dplyr::select(Longitude)) \%>\%
  dplyr::filter(is.na(Longitude))

library(ggplot2)
ggplot(imputed, aes(x = original, y = imputed)) +
  geom_abline(col = "green") +
  geom_point(alpha = .3) +
  coord_equal() +
  labs(title = "Imputed Values")
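
# A rough base-R sketch of the idea described in Details, added for
# illustration only. Assumptions: the toy data frame and its column names are
# made up, and this mirrors the description rather than the internals of
# recipes.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, 3.9, NA, 8.2, NA, 12.1)
)
# Fit the linear model on the complete cases only.
toy_fit <- stats::lm(y ~ x, data = toy[stats::complete.cases(toy), ])
# Replace each missing value with the model prediction for that row.
toy_missing <- is.na(toy$y)
toy$y[toy_missing] <- stats::predict(toy_fit, newdata = toy[toy_missing, ])
toy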
\dontshow{\}) # examplesIf}
}
\references{
Kuhn, M. and Johnson, K. (2019).
\emph{Feature Engineering and Selection}.
\url{https://bookdown.org/max/FES/handling-missing-data.html}
}
\seealso{
Other imputation steps: 
\code{\link{step_impute_bag}()},
\code{\link{step_impute_knn}()},
\code{\link{step_impute_lower}()},
\code{\link{step_impute_mean}()},
\code{\link{step_impute_median}()},
\code{\link{step_impute_mode}()},
\code{\link{step_impute_roll}()}
}
\concept{imputation steps}