% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/varsel.R
\name{varsel}
\alias{varsel}
\alias{varsel.default}
\alias{varsel.refmodel}
\title{Variable selection without cross-validation}
\usage{
varsel(object, ...)

\method{varsel}{default}(object, ...)

\method{varsel}{refmodel}(
  object,
  d_test = NULL,
  method = NULL,
  ndraws = NULL,
  nclusters = 20,
  ndraws_pred = 400,
  nclusters_pred = NULL,
  refit_prj = !inherits(object, "datafit"),
  nterms_max = NULL,
  verbose = TRUE,
  lambda_min_ratio = 1e-05,
  nlambda = 150,
  thresh = 1e-06,
  regul = 1e-04,
  penalty = NULL,
  search_terms = NULL,
  seed = sample.int(.Machine$integer.max, 1),
  ...
)
}
\arguments{
\item{object}{An object of class \code{refmodel} (returned by \code{\link[=get_refmodel]{get_refmodel()}} or
\code{\link[=init_refmodel]{init_refmodel()}}) or an object that can be passed to argument \code{object} of
\code{\link[=get_refmodel]{get_refmodel()}}.}

\item{...}{Arguments passed to \code{\link[=get_refmodel]{get_refmodel()}} as well as to the divergence
minimizer (during a forward search and also during the evaluation part, but
the latter only if \code{refit_prj} is \code{TRUE}).}

\item{d_test}{A \code{list} of the structure outlined in section "Argument
\code{d_test}" below, providing test data for evaluating the predictive
performance of the submodels as well as of the reference model. If \code{NULL},
the training data is used.}

\item{method}{The method for the search part. Possible options are \code{"L1"} for
L1 search and \code{"forward"} for forward search. If \code{NULL}, then internally,
\code{"L1"} is used, except if the reference model has multilevel or additive
terms or if \code{!is.null(search_terms)}. See also section "Details" below.}

\item{ndraws}{Number of posterior draws used in the search part. Ignored if
\code{nclusters} is not \code{NULL} or in case of L1 search (because L1 search always
uses a single cluster). If both (\code{nclusters} and \code{ndraws}) are \code{NULL}, the
number of posterior draws from the reference model is used for \code{ndraws}.
See also section "Details" below.}

\item{nclusters}{Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of \code{NULL}, see argument \code{ndraws}. See also
section "Details" below.}

\item{ndraws_pred}{Only relevant if \code{refit_prj} is \code{TRUE}. Number of
posterior draws used in the evaluation part. Ignored if \code{nclusters_pred} is
not \code{NULL}. If both (\code{nclusters_pred} and \code{ndraws_pred}) are \code{NULL}, the
number of posterior draws from the reference model is used for
\code{ndraws_pred}. See also section "Details" below.}

\item{nclusters_pred}{Only relevant if \code{refit_prj} is \code{TRUE}. Number of
clusters of posterior draws used in the evaluation part. For the meaning of
\code{NULL}, see argument \code{ndraws_pred}. See also section "Details" below.}

\item{refit_prj}{A single logical value indicating whether to fit the
submodels along the solution path again (\code{TRUE}) or to retrieve their fits
from the search part (\code{FALSE}) before using those (re-)fits in the
evaluation part.}

\item{nterms_max}{Maximum number of predictor terms up to which the search is
continued. If \code{NULL}, then \code{min(19, D)} is used where \code{D} is the number of
terms in the reference model (or in \code{search_terms}, if supplied). Note that
\code{nterms_max} does not count the intercept, so use \code{nterms_max = 0} for the
intercept-only model. (Correspondingly, \code{D} above does not count the
intercept.)}

\item{verbose}{A single logical value indicating whether to print out
additional information during the computations.}

\item{lambda_min_ratio}{Only relevant for L1 search. Ratio between the
smallest and largest lambda in the L1-penalized search. This parameter
essentially determines how long the search is carried out, i.e., how large
the explored submodels become. There is no need to change this unless the
program gives a warning about it.}

\item{nlambda}{Only relevant for L1 search. Number of values in the lambda
grid for L1-penalized search. There is no need to change this unless the
program gives a warning about it.}

\item{thresh}{Only relevant for L1 search. Convergence threshold when
computing the L1 path. Usually, there is no need to change this.}

\item{regul}{A number giving the amount of ridge regularization when
projecting onto (i.e., fitting) submodels which are GLMs. Usually, there is
no need for regularization, but a small amount can help to avoid numerical
problems.}

\item{penalty}{Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of \code{0} means that
those predictors have no cost and will therefore be selected first, whereas
\code{Inf} means those predictors will never be selected. If \code{NULL}, then \code{1} is
used for each predictor.}

\item{search_terms}{Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept (\code{"1"}) is always included internally via \code{union()}, so
there's no difference between including it explicitly or omitting it. The
default \code{search_terms} considers all the terms in the reference model's
formula.}

\item{seed}{Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument \code{seed} of
\code{\link[=set.seed]{set.seed()}}, but can also be \code{NA} to not call \code{\link[=set.seed]{set.seed()}} at all. Here,
this seed is used for clustering the reference model's posterior draws (if
\code{!is.null(nclusters)} or \code{!is.null(nclusters_pred)}) and for drawing new
group-level effects when predicting from a multilevel submodel (however,
not yet in case of a GAMM).}
}
\value{
An object of class \code{vsel}. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the main
vignette and \link{projpred-package}).
}
\description{
Run the \emph{search} part and the \emph{evaluation} part for a projection predictive
variable selection. The search part determines the solution path, i.e., the
best submodel for each submodel size (number of predictor terms). The
evaluation part determines the predictive performance of the submodels along
the solution path.
}
\details{
Arguments \code{ndraws}, \code{nclusters}, \code{nclusters_pred}, and \code{ndraws_pred}
are automatically truncated at the number of posterior draws in the
reference model (which is \code{1} for \code{datafit}s). Using fewer draws or clusters
in \code{ndraws}, \code{nclusters}, \code{nclusters_pred}, or \code{ndraws_pred} than posterior
draws in the reference model may result in slightly inaccurate projection
performance. Increasing these arguments affects the computation time
linearly.

For argument \code{method}, there are some restrictions: For a reference model
with multilevel or additive formula terms, only the forward search is
available. Furthermore, argument \code{search_terms} requires a forward search
to take effect.

L1 search is faster than forward search, but forward search may be more
accurate. Furthermore, forward search may find a sparser model with
comparable performance to that found by L1 search, but it may also start
overfitting when more predictors are added.

An L1 search may select interaction terms before the corresponding main
terms are selected. If this is undesired, choose the forward search
instead.

The elements of the \code{search_terms} character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the \code{+} symbol. To understand how
these building blocks work, it is important to know how \pkg{projpred}'s
forward search works: It starts with an empty vector \code{chosen} which will
later contain already selected predictor terms. Then, the search iterates
over model sizes \eqn{j \in \{1, ..., J\}}{j = 1, ..., J}. The candidate
models at model size \eqn{j} are constructed from those elements from
\code{search_terms} which yield model size \eqn{j} when combined with the
\code{chosen} predictor terms. Note that sometimes, there may be no candidate
models for model size \eqn{j}. Also note that internally, \code{search_terms} is
expanded to include the intercept (\code{"1"}), so the first step of the search
(model size 1) always consists of the intercept-only model as the only
candidate.

As a \code{search_terms} example, consider a reference model with formula \code{y ~ x1 + x2 + x3}. Then, to ensure that \code{x1} is always included in the
candidate models, specify \code{search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")}. This search would start with \code{y ~ 1} as the only
candidate at model size 1. At model size 2, \code{y ~ x1} would be the only
candidate. At model size 3, \code{y ~ x1 + x2} and \code{y ~ x1 + x3} would be the
two candidates. At the last model size of 4, \code{y ~ x1 + x2 + x3} would be
the only candidate. As another example, to exclude \code{x1} from the search,
specify \code{search_terms = c("x2", "x3", "x2 + x3")}.
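
The candidate construction described above can be imitated with a few lines
of base R. This is only an illustrative sketch: the helpers \code{split_block()}
and \code{candidates_at()} are hypothetical and are not part of \pkg{projpred},
whose internal implementation may differ. The sketch assumes that the
intercept (\code{"1"}) always counts towards the model size and that building
blocks leading to the same set of terms are treated as a single candidate.

\preformatted{
# Hypothetical helper (not a projpred function): split a building block such
# as "x1 + x2" into its individual predictor terms.
split_block <- function(block) {
  trimws(strsplit(block, "+", fixed = TRUE)[[1]])
}

# Hypothetical helper (not a projpred function): right-hand sides of the
# candidate models at a given model size, given the already chosen terms.
candidates_at <- function(search_terms, chosen, size) {
  # Mimic the internal inclusion of the intercept via union():
  blocks <- union("1", search_terms)
  rhs <- vapply(blocks, function(block) {
    terms <- union(union("1", chosen), split_block(block))
    if (length(terms) == size) {
      paste(sort(terms), collapse = " + ")
    } else {
      NA_character_
    }
  }, character(1))
  # Drop non-candidates and merge blocks that yield the same model:
  unique(unname(rhs[!is.na(rhs)]))
}

search_terms <- c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
candidates_at(search_terms, chosen = character(), size = 1)
candidates_at(search_terms, chosen = "1", size = 2)
candidates_at(search_terms, chosen = c("1", "x1"), size = 3)
candidates_at(search_terms, chosen = c("1", "x1", "x2"), size = 4)
}

In this sketch, \code{chosen} is supplied by hand for each model size, whereas an
actual forward search would add the best candidate of each step to \code{chosen}
before moving on to the next model size.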
}
\section{Argument \code{d_test}}{
If not \code{NULL}, then \code{d_test} needs to be a \code{list} with the following
elements:
\itemize{
\item \code{data}: a \code{data.frame} containing the predictor variables for the test set.
\item \code{offset}: a numeric vector containing the offset values for the test set
(if there is no offset, use a vector of zeros).
\item \code{weights}: a numeric vector containing the observation weights for the test
set (if there are no observation weights, use a vector of ones).
\item \code{y}: a numeric vector containing the response values for the test set.
}
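
A list of this structure could be assembled as in the following sketch. The
test data here are simulated and purely hypothetical (as are the names
\code{n_test}, \code{dat_test}, and \code{y_test}); in practice, the predictor names
would need to match those used in the reference model.

\preformatted{
# Hypothetical test data (simulated only for illustration):
set.seed(1)
n_test <- 20
dat_test <- data.frame(X1 = rnorm(n_test), X2 = rnorm(n_test))
y_test <- 1 + 0.5 * dat_test$X1 + rnorm(n_test)

d_test <- list(
  data = dat_test,              # predictor variables for the test set
  offset = rep(0, n_test),      # no offset: vector of zeros
  weights = rep(1, n_test),     # no observation weights: vector of ones
  y = y_test                    # response values for the test set
)
# `d_test` could then be passed on, e.g., via varsel(<object>, d_test = d_test).
}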
}

\examples{
if (requireNamespace("rstanarm", quietly = TRUE)) {
  # Data:
  dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

  # The "stanreg" fit which will be used as the reference model (with small
  # values for `chains` and `iter`, but only for technical reasons in this
  # example; this is not recommended in general):
  fit <- rstanarm::stan_glm(
    y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
    QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
  )

  # Variable selection (here without cross-validation and with small values
  # for `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the
  # sake of speed in this example; this is not recommended in general):
  vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
               seed = 5555)
  # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
  # and `?solution_terms.vsel` for possible post-processing functions.
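
  # A sketch of a forward search with custom `search_terms` (see section
  # "Details"): here, the building blocks are chosen so that `X1` is part of
  # every candidate model beyond the intercept-only model, and `X4` and `X5`
  # are excluded from the search. The small values for `nclusters` and
  # `nclusters_pred` are again only for the sake of speed; the particular
  # choice of building blocks is just an illustration:
  vs_fw <- varsel(fit, method = "forward",
                  search_terms = c("X1", "X1 + X2", "X1 + X3",
                                   "X1 + X2 + X3"),
                  nclusters = 5, nclusters_pred = 10, seed = 5555)
  # The resulting solution path can be inspected as for `vs`, for example:
  print(solution_terms(vs_fw))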
}

}
\seealso{
\code{\link[=cv_varsel]{cv_varsel()}}
}