---
title: "projpred Performing variable selection on Generalized Linear Multilevel Models"
date: "`r Sys.Date()`"
output: html_vignette
params:
  EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true")
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{projpred Performing variable selection on Generalized Linear Multilevel Models}
\usepackage[utf8]{inputenc}
-->
```{r, child="children/SETTINGS-knitr.txt"}
```
This vignette shows how to use `projpred` to perform variable selection in the
context of Generalized Linear Multilevel Models (GLMMs). For a general overview
of the package we refer the reader to the quickstart vignette. The method used
here is described in the preprint by Catalina et al. (2020), available on
arXiv.
## Gaussian example
Load the necessary packages. If the sampling takes more than 30 seconds and
multiple cores are available, uncomment the line setting `mc.cores` to set the
number of cores used (it is commented out here because the sampling in these
examples is fast, and to avoid possible problems when building the vignette as
part of the package installation in special environments such as computing
clusters).
```{r, results='hide', message=FALSE, warning=FALSE}
library(projpred)
library(rstanarm)
library(tidyr)
library(dplyr)
library(ggplot2)
library(bayesplot)
theme_set(theme_classic())
#options(mc.cores = 4)
```
For the first Gaussian example we borrow the popular `Orthodont` dataset from
the `nlme` package. These data come from a growth-curve study of an orthodontic
measurement and contain 108 observations of 4 variables:
- `distance`: numeric distance from the pituitary to the pterygomaxillary
fissure (mm), measured on x-ray images of the skull.
- `age`: age of the subject (years).
- `Subject`: an ordered factor indicating the subject on which the measurement
was taken.
- `Sex`: sex of the subject.
```{r}
data("Orthodont", package = "nlme")
```
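A quick look at the first rows gives a feel for the structure of the data:
```{r}
head(Orthodont)
```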
We first build the following complete `rstanarm` model including all variables:
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
fit <- stan_glmer(distance ~ age * Sex + (age | Subject),
                  chains = 2, data = Orthodont, seed = 1)
```
Since we have repeated measurements on each subject at multiple time points,
the effect of `age` is allowed to vary across subjects. To run `projpred` on
this model, we first run a simple variable selection on the full dataset. This
approach may be overconfident in some cases, so in general it is recommended to
run a cross-validated variable selection instead.
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
ref <- get_refmodel(fit)
vs <- varsel(ref)
```
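As an aside, the cross-validated variant mentioned above would be run as in the
following minimal sketch (not evaluated here, since repeating the search across
folds is considerably slower):
```{r, eval=FALSE}
# cross-validated variable selection (not run; considerably slower)
cv_vs <- cv_varsel(ref)
```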
Because our model includes group effects, the only available search method is
forward search, which is a bit slower than L1 search but tends to find smaller
models. Beware that the underlying projection fitting method may return
warnings when the optimization has not converged properly; this is more common
for generalized models (non-Gaussian families). We print the variables ordered
by relevance:
```{r, message=FALSE, warning=FALSE}
solution_terms(vs) # selection order of the variables
```
We plot some statistics computed on the training data, such as the sum of log
predictive densities (ELPD) and the root mean squared error (RMSE), as a
function of the number of variables added. By default, the statistics are shown
on the absolute scale, but with `deltas = TRUE` the plot shows results relative
to the reference model.
```{r, fig.width=5, fig.height=4}
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
```
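The same comparison relative to the reference model can be drawn with
`deltas = TRUE`, and `projpred` also provides `suggest_size()` as a rough
heuristic for the submodel size (the suggestion should always be checked
against the plot):
```{r, fig.width=5, fig.height=4}
# plot predictive performance relative to the reference model
plot(vs, stats = c('elpd', 'rmse'), deltas = TRUE)
# heuristic suggestion for the number of terms to select
suggest_size(vs)
```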
From these plots, it is clearly visible that the first two terms are probably
enough to achieve predictive performance comparable to the reference model. We
perform the projection for a submodel of the desired size using the function
`project`. The projection can also be coerced to a matrix with draws of the
selected variables and sigma. The draws can be visualized with, for example,
the `mcmc_areas` function in the `bayesplot` package. Below we compare how the
projection affects the two most relevant variables.
<!-- ```{r, fig.width=6, fig.height=2} -->
<!-- # Visualise the three most relevant variables in the full model -->
<!-- mcmc_areas(as.matrix(vs$refmodel$fit), -->
<!-- pars = c("(Intercept)", "Subject__Intercept", "age", -->
<!-- "sd_Subject__age", "sigma")) -->
<!-- ``` -->
```{r, cache=TRUE, fig.width=6, fig.height=2, message=FALSE, warning=FALSE}
# Visualise the projected two most relevant variables
proj <- project(vs, nterms = 2, ndraws = 500)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
```
We make predictions with the projected submodels. For point predictions we can
use the method `proj_linpred`. Test inputs can be provided via the argument
`newdata`. If `newdata` also contains the test targets, the function
additionally evaluates the log predictive density at these points. For
instance, the following computes the mean of the predictive distribution and
evaluates the log density at the training points using the 5 most relevant
variables.
```{r, cache=TRUE, message=FALSE, warning=FALSE}
pred <- proj_linpred(vs, newdata = Orthodont, nterms = 5, integrated = TRUE)
```
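Since `newdata` contains the observed `distance` values here, the returned
object should also carry the pointwise log predictive densities (in recent
`projpred` versions, as an `lpd` element), which can be summed into a training
ELPD for this submodel:
```{r}
# training ELPD of the 5-term submodel (assumes an `lpd` element is returned)
sum(pred$lpd)
```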
We visualize the point predictions:
```{r, fig.width=5, fig.height=3}
ggplot() +
  geom_point(aes(x = pred$pred, y = Orthodont$distance)) +
  geom_abline(slope = 1, color = "red") +
  labs(x = "prediction", y = "y")
```
We can also obtain draws from the projected predictive distribution. Here is an
example prediction for a single, randomly sampled data point using 5 terms (the
observed value is marked by the red line).
```{r, fig.height=3, fig.width=5, message=FALSE, warning=FALSE}
subset <- Orthodont %>% as_tibble() %>% dplyr::sample_n(1)
y_subset <- subset %>% dplyr::select(distance) %>% as.data.frame()
y1_rep <- proj_predict(vs, newdata = subset, nterms = 5, seed = 7560)
qplot(as.vector(y1_rep), bins = 25) +
  geom_vline(xintercept = as.numeric(y_subset), color = "red") +
  xlab("y1_rep")
```
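The predictive draws can also be summarized numerically, for example with a
median and a 90% predictive interval:
```{r}
# median and 90% predictive interval for the sampled observation
quantile(as.vector(y1_rep), probs = c(0.05, 0.5, 0.95))
```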
## Poisson example
As before, we first load the data:
```{r}
data_pois <- read.table("data_pois.csv", header = TRUE)
```
These data come from a phylogenetic dataset where the phenotype is expressed as
counts. This kind of data is relevant in evolutionary biology when data from
many species are analyzed at the same time. The model we fit here is borrowed
from *Modern Phylogenetic Comparative Methods and Their Application in
Evolutionary Biology* (de Villemereuil & Nakagawa, 2014). The necessary data
can also be downloaded from the corresponding website
(http://www.mpcm-evolution.com/). The specific formula is taken from the
*Estimating Phylogenetic Multilevel Models with brms* vignette of the brms
package
(https://cran.r-project.org/package=brms/vignettes/brms_phylogenetics.html).
```{r, cache=TRUE, message=FALSE, warning=FALSE, results="hide"}
fit <- stan_glmer(
  phen_pois ~ cofactor + (1 | phylo) + (1 | obs), data = data_pois,
  family = poisson("log"), chains = 2, iter = 2000,
  control = list(adapt_delta = 0.95)
)
```
As we did in the previous example, we can perform variable selection and look
at the projection.
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
vs <- varsel(fit)
```
As in the Gaussian example, forward search is the only available search method
because the model includes group effects, and the projection may again produce
convergence warnings. We print the variables ordered by relevance:
```{r, message=FALSE, warning=FALSE}
solution_terms(vs) # selection order of the variables
```
Again, we plot the ELPD and RMSE computed on the training data as a function of
the number of variables added:
```{r, fig.width=5, fig.height=4}
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
```
We perform the projection for a submodel of the desired size using `project`
and, as before, visualize the projected draws with `mcmc_areas`. Below we look
at the two most relevant variables.
<!-- ```{r, fig.width=6, fig.height=2} -->
<!-- # Visualise the most relevant variable in the full model -->
<!-- mcmc_areas(as.matrix(vs$refmodel$fit), -->
<!-- pars = c("b_Intercept", "b_cofactor", "sd_phylo__Intercept")) -->
<!-- ``` -->
```{r, cache=TRUE, fig.width=6, fig.height=2, message=FALSE, warning=FALSE}
# Visualise the projected two most relevant variables
proj <- project(vs, nterms = 2, ndraws = 10)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
```
As we did in the Gaussian example, we make predictions with the projected
submodels. The following computes the mean of the predictive distribution and
evaluates the log density at the training points using the previously projected
model.
```{r, cache=TRUE, message=FALSE, warning=FALSE}
pred <- proj_linpred(proj, newdata = data_pois, integrated = TRUE)
```
We can also visualize the corresponding projected predictions.
```{r, fig.width=5, fig.height=3}
xaxis <- seq(-3, 3, length.out = 1000)
y_mu <- rowMeans(vs$refmodel$mu)
ggplot() +
  geom_point(aes(x = pred$pred, y = y_mu)) +
  geom_line(aes(x = xaxis, y = exp(xaxis)), color = "red") +
  labs(x = "prediction", y = "y")
```
## Bernoulli example
For both of the previous examples, running variable selection with the full
data was enough because the problems were fairly simple. In this section we
work with a slightly more difficult dataset.
We can load the data from the popular `lme4` package:
```{r}
data("VerbAgg", package = "lme4")
```
The whole dataset consists of 7584 questionnaire answers from different
subjects. Given that the full data could take a long time to fit, we subsample
50 participants (for which sampling still takes a moment). For this we use the
`tidyverse` tools loaded above.
```{r}
## subsample 50 participants
VerbAgg_subsample <- VerbAgg %>%
  tidyr::as_tibble() %>%
  dplyr::filter(id %in% sample(levels(id), 50)) %>% # draw 50 distinct subjects
  dplyr::mutate(r2num = as.integer(r2) - 1) # binomial family needs numeric target
```
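We can verify that the subsample indeed contains the intended number of
distinct participants:
```{r}
# number of distinct subjects in the subsample (should be 50)
length(unique(VerbAgg_subsample$id))
```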
For this simple model we also add some group effects that we expect to be of
little relevance.
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
## simple bernoulli model
formula_va <- r2num ~ btype + situ + mode + (btype + situ + mode | id)
fit_va <- stan_glmer(
  formula = formula_va,
  data = VerbAgg_subsample,
  family = binomial("logit"),
  seed = 1234,
  chains = 2
)
```
As we did before, we can run the standard variable selection with
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
vs_va <- varsel(fit_va)
```
```{r}
solution_terms(vs_va)
```
```{r, fig.height=4, fig.width=5}
plot(vs_va, stats = c("elpd", "acc"))
```
Even though the ordering of the variables makes sense, the performance of the
projected models does not quite match the reference model. Different reasons
can explain this behaviour:
1. The reference model posterior may not be very narrow, in which case running
`varsel` with the default `ndraws_pred` may not be enough. Increasing
`ndraws_pred` helps but also increases the computational cost (see the sketch
after this list). Refitting the reference model so that its posterior is
narrower (usually by employing a stronger sparsifying prior) would have a
similar effect. This is usually the case when the difference in predictive
performance is not very large.
2. For non-Gaussian models, the inaccuracy may come from the fact that we are
running an approximate projection. In this case, increasing the number of
draws used for prediction (`ndraws_pred`) should also help.
3. Given that `varsel` computes the expected predictive performance of the
reference model from the in-sample mean, it may sometimes produce
overconfident results. Running `cv_varsel` with LOO cross-validation computes
the predictive performance taking PSIS-LOO weights into account, usually
reporting a more realistic ELPD.
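For the first two points, a rerun with more projected draws might look like the
following sketch (not evaluated here; `ndraws_pred` is the relevant `varsel`
argument and the value below is only illustrative):
```{r, eval=FALSE}
# not run: use more draws when evaluating the predictive performance
vs_va <- varsel(fit_va, ndraws_pred = 800)
```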
Addressing the third point: by default, `cv_varsel` runs LOO cross-validation
and tries to rerun the variable selection for every data point. Given the large
computational cost of validating the search, we first try `cv_varsel` without
it: the model is fit once and the LOO ELPD is computed from that single search
by passing `validate_search = FALSE`. Note that in general it is recommended to
run the fully validated search.
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
cv_vs_va <- cv_varsel(fit_va, validate_search = FALSE)
```
```{r, fig.height=4, fig.width=5}
plot(cv_vs_va, stats = c("elpd", "acc"))
```
Now we see that the projected models behave as expected and can proceed with the
usual analysis.
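As a sanity check on the submodel size, the `suggest_size()` heuristic can also
be applied to the cross-validated results:
```{r}
# heuristic submodel size based on the cross-validated statistics
suggest_size(cv_vs_va)
```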
As in previous examples, we can show the mean prediction with `proj_linpred`.
The following computes the mean of the predictive distribution and evaluates the
log density at the training points using the 6 most relevant variables.
```{r, cache=TRUE, message=FALSE, warning=FALSE}
pred <- proj_linpred(cv_vs_va, newdata = VerbAgg_subsample,
                     nterms = 6, integrated = TRUE, ndraws = 10)
```
We can also visualize the corresponding projected predictions.
```{r, fig.width=5, fig.height=3}
xaxis <- seq(-6, 6, length.out = 1000)
yaxis <- cv_vs_va$family$linkinv(xaxis)
y_mu <- rowMeans(cv_vs_va$refmodel$mu)
ggplot() +
  geom_point(aes(x = pred$pred, y = y_mu)) +
  geom_line(aes(x = xaxis, y = yaxis), color = "red") +
  labs(x = "prediction", y = "y")
```