% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/nzv.R
\name{step_nzv}
\alias{step_nzv}
\title{Near-Zero Variance Filter}
\usage{
step_nzv(
recipe,
...,
role = NA,
trained = FALSE,
freq_cut = 95/5,
unique_cut = 10,
options = list(freq_cut = 95/5, unique_cut = 10),
removals = NULL,
skip = FALSE,
id = rand_id("nzv")
)
}
\arguments{
\item{recipe}{A recipe object. The step will be added to the
sequence of operations for this recipe.}
\item{...}{One or more selector functions to choose variables
for this step. See \code{\link[=selections]{selections()}} for more details.}
\item{role}{Not used by this step since no new variables are
created.}
\item{trained}{A logical to indicate if the quantities for
preprocessing have been estimated.}
\item{freq_cut, unique_cut}{Numeric parameters for the filtering process. See
the Details section below.}
\item{options}{A list of options for the filter (see Details
below).}
\item{removals}{A character vector that contains the names of
columns that should be removed. These values are not determined
until \code{\link[=prep]{prep()}} is called.}
\item{skip}{A logical. Should the step be skipped when the
recipe is baked by \code{\link[=bake]{bake()}}? While all operations are baked
when \code{\link[=prep]{prep()}} is run, some operations may not be able to be
conducted on new data (e.g. processing the outcome variable(s)).
Care should be taken when using \code{skip = TRUE} as it may affect
the computations for subsequent operations.}
\item{id}{A character string that is unique to this step to identify it.}
}
\value{
An updated version of \code{recipe} with the new step added to the
sequence of any existing operations.
}
\description{
\code{step_nzv} creates a \emph{specification} of a recipe step
that will potentially remove variables that are highly sparse
and unbalanced.
}
\details{
This step can potentially remove columns from the data set. This may
cause issues for subsequent steps in your recipe if the missing columns are
specifically referenced by name. To avoid this, see the advice in the
\emph{Tips for saving recipes and filtering columns} section of \link{selections}.
This step diagnoses predictors that have one unique
value (i.e. are zero variance predictors) or predictors that have
both of the following characteristics:
\enumerate{
\item they have very few unique values relative to the number
of samples and
\item the ratio of the frequency of the most common value to
the frequency of the second most common value is large.
}
For example, a near-zero variance predictor is one that, for
1000 samples, has two distinct values and 999 of the samples
share a single value.
To be flagged, a predictor must meet two conditions. First, the
frequency of the most prevalent value divided by the frequency of
the second most prevalent value (called the "frequency ratio")
must be above \code{freq_cut}. Second, the "percent of unique
values", the number of unique values divided by the total number
of samples (times 100), must be below \code{unique_cut}.
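In the above example, the frequency ratio is 999 (999 / 1) and the
unique value percent is 0.2\% (100 * 2 / 1000). With the default
cutoffs (\code{freq_cut = 95/5}, i.e. 19, and \code{unique_cut = 10}),
both conditions hold, so the predictor would be flagged.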
}
\section{Tidying}{
When you \code{\link[=tidy.recipe]{tidy()}} this step, a tibble with column
\code{terms} (the columns that will be removed) is returned.
}
\section{Case weights}{
This step performs an unsupervised operation that can utilize case weights.
As a result, case weights are used only when they are frequency weights. For
more information, see the documentation in \link{case_weights} and the
examples on \code{tidymodels.org}.
}
\examples{
\dontshow{if (rlang::is_installed("modeldata")) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
data(biomass, package = "modeldata")
biomass$sparse <- c(1, rep(0, nrow(biomass) - 1))
biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]
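# A minimal sketch, not the step's internal code: the two statistics from
# the Details section, computed by hand for the 1000-sample example given
# there. `x_example` is a made-up vector, and treating both cutoffs as
# strict inequalities follows the wording in Details (an assumption).
x_example <- c(rep(0, 999), 1)
counts <- sort(table(x_example), decreasing = TRUE)
# Frequency ratio: most common count over second most common count.
# A single distinct value has no second count, so use Inf in that case.
freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
# Percent of unique values: distinct values over samples, times 100.
pct_unique <- 100 * length(counts) / length(x_example)
c(freq_ratio = freq_ratio, pct_unique = pct_unique)
# Compare against the default cutoffs shown in the usage section.
freq_ratio > 95 / 5 && pct_unique < 10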
rec <- recipe(HHV ~ carbon + hydrogen + oxygen +
nitrogen + sulfur + sparse,
data = biomass_tr
)
nzv_filter <- rec \%>\%
step_nzv(all_predictors())
filter_obj <- prep(nzv_filter, training = biomass_tr)
filtered_te <- bake(filter_obj, biomass_te)
any(names(filtered_te) == "sparse")
tidy(nzv_filter, number = 1)
tidy(filter_obj, number = 1)
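# A hedged sketch of changing the cutoffs through the `freq_cut` and
# `unique_cut` arguments from the usage section. The data frame `toy` and
# the cutoff values 50 and 5 are made up for illustration only.
toy <- data.frame(
  y = rnorm(100),
  a = rnorm(100),
  b = c(1, rep(0, 99))
)
strict_rec <- recipe(y ~ ., data = toy)
strict_rec <- step_nzv(
  strict_rec,
  all_predictors(),
  freq_cut = 50,
  unique_cut = 5
)
strict_fit <- prep(strict_rec, training = toy)
# List the columns this step selected for removal.
tidy(strict_fit, number = 1)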
\dontshow{\}) # examplesIf}
}
\seealso{
Other variable filter steps:
\code{\link{step_corr}()},
\code{\link{step_filter_missing}()},
\code{\link{step_lincomb}()},
\code{\link{step_rm}()},
\code{\link{step_select}()},
\code{\link{step_zv}()}
}
\concept{variable filter steps}