File: step_nzv.Rd

package info (click to toggle)
r-cran-recipes 0.1.15%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 2,496 kB
  • sloc: sh: 37; makefile: 2
file content (125 lines) | stat: -rw-r--r-- 3,974 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/nzv.R
\name{step_nzv}
\alias{step_nzv}
\alias{tidy.step_nzv}
\title{Near-Zero Variance Filter}
\usage{
step_nzv(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  freq_cut = 95/5,
  unique_cut = 10,
  options = list(freq_cut = 95/5, unique_cut = 10),
  removals = NULL,
  skip = FALSE,
  id = rand_id("nzv")
)

\method{tidy}{step_nzv}(x, ...)
}
\arguments{
\item{recipe}{A recipe object. The step will be added to the
sequence of operations for this recipe.}

\item{...}{One or more selector functions to choose which
variables that will be evaluated by the filtering. See
\code{\link[=selections]{selections()}} for more details. For the \code{tidy}
method, these are not currently used.}

\item{role}{Not used by this step since no new variables are
created.}

\item{trained}{A logical to indicate if the quantities for
preprocessing have been estimated.}

\item{freq_cut, unique_cut}{Numeric parameters for the filtering process. See
the Details section below.}

\item{options}{A list of options for the filter (see Details
below).}

\item{removals}{A character string that contains the names of
columns that should be removed. These values are not determined
until \code{\link[=prep.recipe]{prep.recipe()}} is called.}

\item{skip}{A logical. Should the step be skipped when the
recipe is baked by \code{\link[=bake.recipe]{bake.recipe()}}? While all operations are baked
when \code{\link[=prep.recipe]{prep.recipe()}} is run, some operations may not be able to be
conducted on new data (e.g. processing the outcome variable(s)).
Care should be taken when using \code{skip = TRUE} as it may affect
the computations for subsequent operations}

\item{id}{A character string that is unique to this step to identify it.}

\item{x}{A \code{step_nzv} object.}
}
\value{
An updated version of \code{recipe} with the new step
added to the sequence of existing steps (if any). For the
\code{tidy} method, a tibble with columns \code{terms} which
is the columns that will be removed.
}
\description{
\code{step_nzv} creates a \emph{specification} of a recipe step
that will potentially remove variables that are highly sparse
and unbalanced.
}
\details{
This step diagnoses predictors that have one unique
value (i.e. are zero variance predictors) or predictors that have
both of the following characteristics:
\enumerate{
\item they have very few unique values relative to the number
of samples and
\item the ratio of the frequency of the most common value to
the frequency of the second most common value is large.
}

For example, an example of near-zero variance predictor is one
that, for 1000 samples, has two distinct values and 999 of them
are a single value.

To be flagged, first, the frequency of the most prevalent value
over the second most frequent value (called the "frequency
ratio") must be above \code{freq_cut}. Secondly, the "percent of
unique values," the number of unique values divided by the total
number of samples (times 100), must also be below
\code{unique_cut}.

In the above example, the frequency ratio is 999 and the unique
value percent is 0.2\%.
}
\examples{
library(modeldata)
data(biomass)

biomass$sparse <- c(1, rep(0, nrow(biomass) - 1))

biomass_tr <- biomass[biomass$dataset == "Training",]
biomass_te <- biomass[biomass$dataset == "Testing",]

rec <- recipe(HHV ~ carbon + hydrogen + oxygen +
                    nitrogen + sulfur + sparse,
              data = biomass_tr)

nzv_filter <- rec \%>\%
  step_nzv(all_predictors())

filter_obj <- prep(nzv_filter, training = biomass_tr)

filtered_te <- bake(filter_obj, biomass_te)
any(names(filtered_te) == "sparse")

tidy(nzv_filter, number = 1)
tidy(filter_obj, number = 1)
}
\seealso{
\code{\link[=step_corr]{step_corr()}} \code{\link[=recipe]{recipe()}}
\code{\link[=prep.recipe]{prep.recipe()}} \code{\link[=bake.recipe]{bake.recipe()}}
}
\concept{preprocessing}
\concept{variable_filters}
\keyword{datagen}