% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/methods.R
\name{suggest_size}
\alias{suggest_size}
\alias{suggest_size.vsel}
\title{Suggest submodel size}
\usage{
suggest_size(object, ...)

\method{suggest_size}{vsel}(
  object,
  stat = "elpd",
  pct = 0,
  type = "upper",
  thres_elpd = NA,
  warnings = TRUE,
  ...
)
}
\arguments{
\item{object}{An object of class \code{vsel} (returned by \code{\link[=varsel]{varsel()}} or
\code{\link[=cv_varsel]{cv_varsel()}}).}

\item{...}{Arguments passed to \code{\link[=summary.vsel]{summary.vsel()}}, except for \code{object}, \code{stats}
(which is set to \code{stat}), \code{type}, and \code{deltas} (which is set to \code{TRUE}).
See section "Details" below for some important arguments which may be
passed here.}

\item{stat}{Performance statistic (i.e., utility or loss) used for the
decision. See argument \code{stats} of \code{\link[=summary.vsel]{summary.vsel()}} for possible choices.}

\item{pct}{A number giving the proportion (\emph{not} a percentage) of the
\emph{relative} null model utility one is willing to sacrifice. See section
"Details" below for more information.}

\item{type}{Either \code{"upper"} or \code{"lower"} determining whether the decision is
based on the upper or lower confidence interval bound, respectively. See
section "Details" below for more information.}

\item{thres_elpd}{Only relevant if \code{stat \%in\% c("elpd", "mlpd")}. The
threshold for the ELPD difference (taking the submodel's ELPD minus the
baseline model's ELPD) above which the submodel's ELPD is considered to be
close enough to the baseline model's ELPD. An equivalent rule is applied in
case of the MLPD. See section "Details" for a formalization. Supplying \code{NA}
deactivates this.}

\item{warnings}{Mainly for internal use. A single logical value indicating
whether to throw warnings if automatic suggestion fails. Usually there is
no reason to set this to \code{FALSE}.}
}
\value{
A single numeric value, giving the suggested submodel size (or \code{NA}
if the suggestion failed).

The intercept is not counted by \code{\link[=suggest_size]{suggest_size()}}, so a suggested size of
zero stands for the intercept-only model.
}
\description{
This function can suggest an appropriate submodel size based on a decision
rule described in section "Details" below. Note that this decision is quite
heuristic and should be interpreted with caution. It is recommended to
examine the results via \code{\link[=plot.vsel]{plot.vsel()}} and/or \code{\link[=summary.vsel]{summary.vsel()}} and to make the
final decision based on what is most appropriate for the problem at hand.
}
\details{
In general (beware of special extensions below), the suggested model
size is the smallest model size \eqn{k \in \{0, 1, ...,
  \texttt{nterms\_max}\}}{{k = 0, 1, ..., nterms_max}} for which either the
lower or upper bound (depending on argument \code{type}) of the
normal-approximation (or bootstrap; see argument \code{stat}) confidence
interval (with nominal coverage \code{1 - alpha}; see argument \code{alpha} of
\code{\link[=summary.vsel]{summary.vsel()}}) for \eqn{U_k - U_{\mathrm{base}}}{U_k - U_base} (with
\eqn{U_k} denoting the \eqn{k}-th submodel's true utility and
\eqn{U_{\mathrm{base}}}{U_base} denoting the baseline model's true utility)
falls above (or is equal to) \deqn{\texttt{pct} \cdot (u_0 -
  u_{\mathrm{base}})}{pct * (u_0 - u_base)} where \eqn{u_0} denotes the null
model's estimated utility and \eqn{u_{\mathrm{base}}}{u_base} the baseline
model's estimated utility. The baseline model is either the reference model
or the best submodel found (see argument \code{baseline} of \code{\link[=summary.vsel]{summary.vsel()}}).
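
The basic rule above can be sketched in a few lines of R. This is only an
illustration under stated assumptions, not the package's implementation: the
vectors \code{u_diff} and \code{se_diff} are hypothetical toy values standing
in for the estimates of \eqn{u_k - u_{\mathrm{base}}}{u_k - u_base} and their
standard errors, and the normal approximation with \code{type = "upper"} is
assumed.
\preformatted{
# Hypothetical toy estimates for k = 0, 1, 2, 3 (not from a real fit):
u_diff  <- c(-12.0, -5.0, -1.5, -0.4)  # u_k - u_base
se_diff <- c(3.0, 2.0, 1.2, 0.9)       # standard errors of the differences
alpha <- 0.32
pct <- 0

# Upper bound of the normal-approximation confidence interval:
upper <- u_diff + qnorm(1 - alpha / 2) * se_diff
# Cutoff pct * (u_0 - u_base); the first element corresponds to k = 0:
cutoff <- pct * u_diff[1]
# Smallest size whose bound reaches the cutoff (NA if there is none):
k_suggested <- which(upper >= cutoff)[1] - 1
}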

If \code{!is.na(thres_elpd)} and \code{stat = "elpd"}, the decision rule above is
extended: The suggested model size is then the smallest model size \eqn{k}
fulfilling the rule above \emph{or} \eqn{u_k - u_{\mathrm{base}} >
  \texttt{thres\_elpd}}{u_k - u_base > thres_elpd}. Correspondingly, in case
of \code{stat = "mlpd"} (and \code{!is.na(thres_elpd)}), the suggested model size is
the smallest model size \eqn{k} fulfilling the rule above \emph{or} \eqn{u_k -
  u_{\mathrm{base}} > \frac{\texttt{thres\_elpd}}{N}}{u_k - u_base >
  thres_elpd / N} with \eqn{N} denoting the number of observations.

For example (disregarding the special extensions in case of \code{stat = "elpd"}
or \code{stat = "mlpd"}), \code{alpha = 2 * pnorm(-1)}, \code{pct = 0}, and
\code{type = "upper"} means that we select the smallest model size for which the upper
bound of the 68\% confidence interval for \eqn{U_k - U_{\mathrm{base}}}{U_k
  - U_base} exceeds (or is equal to) zero, that is (if \code{stat} is a
performance statistic for which the normal approximation is used, not the
bootstrap), for which the submodel's utility estimate is at most one
standard error smaller than the baseline model's utility estimate (with
that standard error referring to the utility \emph{difference}).
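
As a sketch of that last step (writing \eqn{\mathrm{se}_k}{se_k} for the
estimated standard error of the utility difference, a symbol introduced here
only for illustration, and assuming the usual symmetric normal-approximation
interval): since \code{qnorm(1 - alpha / 2)} equals 1 for
\code{alpha = 2 * pnorm(-1)}, the condition on the upper bound reads
\deqn{(u_k - u_{\mathrm{base}}) + 1 \cdot \mathrm{se}_k \geq 0
  \quad \Longleftrightarrow \quad
  u_k \geq u_{\mathrm{base}} - \mathrm{se}_k.}{(u_k - u_base) + 1 * se_k >= 0  <=>  u_k >= u_base - se_k.}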
}
\note{
Loss statistics like the root mean squared error (RMSE) and the mean
squared error (MSE) are converted to utilities by multiplying them by \code{-1},
so a call such as \code{suggest_size(object, stat = "rmse", type = "upper")}
finds the smallest model size whose upper confidence interval bound for the
\emph{negative} RMSE or MSE exceeds the cutoff (or, equivalently, has the lower
confidence interval bound for the RMSE or MSE below the cutoff). This is
done to make the interpretation of argument \code{type} the same regardless of
argument \code{stat}.
}
\examples{
if (requireNamespace("rstanarm", quietly = TRUE)) {
  # Data:
  dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

  # The "stanreg" fit which will be used as the reference model (with small
  # values for `chains` and `iter`, but only for technical reasons in this
  # example; this is not recommended in general):
  fit <- rstanarm::stan_glm(
    y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
    QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
  )

  # Variable selection (here without cross-validation and with small values
  # for `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the
  # sake of speed in this example; this is not recommended in general):
  vs <- varsel(fit, nterms_max = 3, nclusters = 5, nclusters_pred = 10,
               seed = 5555)
  print(suggest_size(vs))
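
  # A hedged sketch of further calls; the argument values below are
  # illustrative assumptions, not recommendations.
  # Use a loss statistic (converted internally to a utility):
  print(suggest_size(vs, stat = "rmse"))
  # Change the nominal coverage; `alpha` is passed on to summary.vsel()
  # via `...`:
  print(suggest_size(vs, alpha = 0.1))
  # Activate the ELPD threshold extension (the value -4 is chosen only for
  # illustration):
  print(suggest_size(vs, thres_elpd = -4))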
}

}