---
title: "projpred Performing variable selection on Generalized Linear Multilevel Models"
date: "`r Sys.Date()`"
output:
  html_vignette
params:
  EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true")
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{projpred: Performing variable selection on Generalized Linear Multilevel Models}
\usepackage[utf8]{inputenc}
-->
```{r, child="children/SETTINGS-knitr.txt"}
```

This vignette shows how to use `projpred` to perform variable selection in the
context of Generalized Linear Multilevel Models (GLMMs). For a general overview
of the package we refer the reader to the quickstart vignette. The method used
here is described in the preprint by Catalina et al. (2020), available on
arXiv.

## Gaussian example

Load the necessary packages. If the sampling takes more than 30 seconds and
multiple cores are available, uncomment the line setting `mc.cores` to set the
number of cores used. (It is commented out here because the sampling in these
examples is fast, and to avoid possible problems when the vignette is built
during package installation in special environments such as computing
clusters.)

```{r, results='hide', message=FALSE, warning=FALSE}
library(projpred)
library(rstanarm)
library(tidyr)
library(dplyr)
library(ggplot2)
library(bayesplot)
theme_set(theme_classic())
#options(mc.cores = 4)
```

For the first Gaussian example we borrow the popular `Orthodont` dataset from
the `nlme` package. These data come from a growth-curve study of an orthodontic
measurement and contain 108 observations of 4 variables:

 - distance, the distance from the pituitary to the pterygomaxillary fissure
   (mm), measured on x-ray images of the skull.
 - age, the age of the subject (years).
 - Subject, an ordered factor indicating the subject on which the measurement
   was taken.
 - Sex, the sex of the subject.
 
```{r}
data("Orthodont", package = "nlme")
```

We first build the following complete `rstanarm` model including all variables:
 
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
fit <- stan_glmer(distance ~ age * Sex + (age | Subject),
                  chains = 2,  data = Orthodont, seed = 1)
```
 
Since we have repeated measurements on each subject at multiple time points,
the effect of `age` is allowed to vary across subjects. To run `projpred` on
this model we first run a simple variable selection on the full dataset. This
approach can be overconfident in some cases, so in general it is recommended to
run a cross-validated variable selection instead.
    
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
ref <- get_refmodel(fit)
vs <- varsel(ref)
```
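
If one wants the safer, cross-validated selection mentioned above, the call is
analogous; a minimal sketch (not evaluated here; K-fold with `K = 5` is assumed
as a cheaper alternative to the default LOO):

```{r, eval=FALSE}
# Cross-validated variable selection (sketch); K-fold refits the search
# for every fold but is cheaper than the default LOO validation
cvs <- cv_varsel(ref, cv_method = "kfold", K = 5)
plot(cvs, stats = "elpd", deltas = TRUE)
```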

Because our model includes group effects, the only available search method is
forward search, which is somewhat slower than L1 search but tends to find
smaller models. Beware that the underlying projection fitting method may return
warnings when the optimization has not converged properly; this is more common
for generalized models (non-Gaussian families). We print the variables ordered
by relevance:

```{r, message=FALSE, warning=FALSE}
solution_terms(vs) # selection order of the variables
```

We plot some statistics computed on the training data, such as the sum of log
predictive densities (ELPD) and the root mean squared error (RMSE), as a
function of the number of terms added. By default the statistics are shown on
the absolute scale, but with `deltas = TRUE` the plot shows results relative to
the reference model.

```{r, fig.width=5, fig.height=4}
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
```
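
The same plot relative to the reference model often makes the comparison
easier, and `suggest_size` gives a heuristic suggestion for the submodel size;
a minimal sketch:

```{r, eval=FALSE}
# Show differences to the reference model instead of absolute values
plot(vs, stats = c('elpd', 'rmse'), deltas = TRUE)
# Heuristic suggestion for a sufficient number of terms
suggest_size(vs)
```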

From this plot it is clearly visible that the first two terms are probably
enough to achieve predictive performance comparable to the reference model. We
perform the projection for a submodel of the desired size using the function
`project`. The projection can be coerced to a matrix containing draws of the
selected variables and sigma, and the draws can be visualized with, for
example, the `mcmc_areas` function in the `bayesplot` package. Below we inspect
how the projection affects the two most relevant terms.

<!-- ```{r, fig.width=6, fig.height=2} -->
<!--  # Visualise the three most relevant variables in the full model -\-> -->
<!--  mcmc_areas(as.matrix(vs$refmodel$fit), -->
<!--             pars = c("(Intercept)", "Subject__Intercept", "age", -->
<!--                      "sd_Subject__age", "sigma")) -->
<!-- ``` -->

```{r, cache=TRUE, fig.width=6, fig.height=2, message=FALSE, warning=FALSE}
# Visualise the two most relevant projected terms
proj <- project(vs, nterms = 2, ndraws = 500)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
```

We make predictions with the projected submodels. For point estimates we can
use the method `proj_linpred`. Test inputs can be provided via the argument
`newdata`. If the test targets `ynew` are also provided, the function evaluates
the log predictive density at these points. For instance, the following
computes the mean of the predictive distribution and evaluates the log density
at the training points using the 5 most relevant terms.

```{r, cache=TRUE, message=FALSE, warning=FALSE}
pred <- proj_linpred(vs, newdata = Orthodont, nterms = 5, integrated = TRUE)
```
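
With `integrated = TRUE` the projected draws are averaged over, and because the
response is available in `newdata`, the returned list carries both the
predictions and the pointwise log densities; a minimal sketch of inspecting it
(assuming the list elements are named `pred` and `lpd`, with `pred` also used
in the plot below):

```{r, eval=FALSE}
# Inspect the averaged predictions and pointwise log predictive
# densities (assumed element names: pred and lpd)
str(pred, max.level = 1)
sum(pred$lpd)  # in-sample ELPD of the 5-term projection
```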

Visualize the predictions:
```{r, fig.width=5, fig.height=3}
ggplot() +
  geom_point(aes(x = pred$pred, y = Orthodont$distance)) +
  geom_abline(slope = 1, color = "red") +
  labs(x = "prediction", y = "y")
```

We can also obtain draws from the projected predictive distribution. Here is an
example prediction for a randomly drawn data point using 5 terms (the observed
value is marked by the red line).

```{r, fig.height=3, fig.width=5, message=FALSE, warning=FALSE}
subset <- Orthodont %>% as_tibble() %>% dplyr::sample_n(1)
y_subset <- subset %>% dplyr::select(distance) %>% as.data.frame()
y1_rep <- proj_predict(vs, newdata = subset, nterms = 5, seed = 7560)
qplot(as.vector(y1_rep), bins = 25) +
  geom_vline(xintercept = as.numeric(y_subset), color = "red") +
  xlab("y1_rep")
```
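
The same mechanism supports a quick posterior predictive check of the projected
submodel over the whole data, for example with `bayesplot`'s density overlay; a
minimal sketch:

```{r, eval=FALSE}
# Posterior predictive check of the 2-term projection (sketch)
y_rep <- proj_predict(vs, newdata = Orthodont, nterms = 2, seed = 7560)
ppc_dens_overlay(Orthodont$distance, y_rep[1:25, ])
```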

## Poisson example

As before, we first load the data:

```{r}
data_pois <- read.table("data_pois.csv", header = TRUE)
```

These data correspond to a phylogenetic dataset in which the phenotype is
expressed as counts. This kind of data is relevant in evolutionary biology when
data from many species are analyzed at the same time. The model we fit here is
borrowed from *Modern Phylogenetic Comparative Methods and Their Application in
Evolutionary Biology* (de Villemereuil & Nakagawa, 2014). The necessary data can
also be downloaded from the corresponding website
(http://www.mpcm-evolution.com/). The specific formula is taken from the
*Estimating Phylogenetic Multilevel Models with brms* vignette of the brms
package
(https://cran.r-project.org/package=brms/vignettes/brms_phylogenetics.html).

```{r, cache=TRUE, message=FALSE, warning=FALSE, results="hide"}
fit <- stan_glmer(
  phen_pois ~ cofactor + (1 | phylo) + (1 | obs), data = data_pois,
  family = poisson("log"), chains = 2, iter = 2000,
  control = list(adapt_delta = 0.95)
)
```

As we did in the previous example, we can perform variable selection and look
at the projection.
 
```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
vs <- varsel(fit)
```

As in the Gaussian example, forward search is the only available search method
because the model includes group effects, and the projection may again produce
convergence warnings, which are more common for non-Gaussian families. We print
the variables ordered by relevance:

```{r, message=FALSE, warning=FALSE}
solution_terms(vs) # selection order of the variables
```

Again we plot the ELPD and RMSE computed on the training data as a function of
the number of terms added; as before, `deltas = TRUE` would show the results
relative to the reference model.

```{r, fig.width=5, fig.height=4}
# plot predictive performance on training data
plot(vs, stats = c('elpd', 'rmse'))
```

As before, we perform the projection for a submodel of the desired size with
`project` and visualize the draws of the most relevant terms with `mcmc_areas`.

<!-- ```{r, fig.width=6, fig.height=2} -->
<!--  # Visualise the most relevant variable in the full model -\-> -->
<!--  mcmc_areas(as.matrix(vs$refmodel$fit), -->
<!--             pars = c("b_Intercept", "b_cofactor", "sd_phylo__Intercept")) -->
<!-- ``` -->

```{r, cache=TRUE, fig.width=6, fig.height=2, message=FALSE, warning=FALSE}
# Visualise the projected two most relevant variables
proj <- project(vs, nterms = 2, ndraws = 10)
mcmc_areas(as.matrix(proj), pars = solution_terms(vs)[1:2])
```

As we did in the Gaussian example, we make predictions with the projected
submodel. The following computes the mean of the predictive distribution and
evaluates the log density at the training points using the previously projected
model.

```{r, cache=TRUE, message=FALSE, warning=FALSE}
pred <- proj_linpred(proj, newdata = data_pois, integrated = TRUE)
```

We can also visualize the corresponding projected predictions.

```{r, fig.width=5, fig.height=3}
xaxis <- seq(-3, 3, length.out = 1000)
y_mu <- rowMeans(vs$refmodel$mu)
ggplot() +
  geom_point(aes(x = pred$pred, y = y_mu)) +
  geom_line(aes(x = xaxis, y = exp(xaxis)), color = "red") +
  labs(x = "prediction", y = "y")
```
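
For count outcomes, a rootogram is a natural check of the projected predictive
distribution; a minimal sketch using `proj_predict` and `bayesplot`:

```{r, eval=FALSE}
# Predictive draws from the projected submodel, checked with a rootogram
# (sketch)
yrep <- proj_predict(proj, newdata = data_pois, seed = 1)
ppc_rootogram(data_pois$phen_pois, yrep[1:25, ])
```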

## Bernoulli example

For both of the previous cases, running variable selection with the full data
was enough because the problems were fairly simple. In this section we work
with a slightly more difficult dataset.

We can load the data from the popular `lme4` package:

```{r}
data("VerbAgg", package = "lme4")
```

The whole dataset consists of 7584 questionnaire answers from different
subjects. Given that the full data could take a long time to fit, we subsample
50 individuals (for which sampling still takes a little while). For this we use
the great `tidyverse` environment.

```{r}
## subsample 50 participants
VerbAgg_subsample <- VerbAgg %>%
  tidyr::as_tibble() %>%
  dplyr::filter(id %in% sample(levels(id), 50)) %>%
  dplyr::mutate(r2num = as.integer(r2) - 1) # binomial family needs numeric target
```

For this simple model we add some group effects that we know are not
particularly relevant.

```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
## simple bernoulli model
formula_va <- r2num ~ btype + situ + mode + (btype + situ + mode | id)
fit_va <- stan_glmer(
  formula = formula_va,
  data = VerbAgg_subsample,
  family = binomial("logit"),
  seed = 1234,
  chains = 2
)
```

As we did before, we can run the standard variable selection with

```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
vs_va <- varsel(fit_va)
```

```{r}
solution_terms(vs_va)
```

```{r, fig.height=4, fig.width=5}
plot(vs_va, stats = c("elpd", "acc"))
```

Even though the ordering of the variables makes sense, the performance of the
projected models does not quite match the reference model. There are several
possible reasons for this behaviour:

1. The reference model posterior may not be very narrow, in which case running
   `varsel` with the default `ndraws_pred` may not be enough. Increasing
   `ndraws_pred` helps but also increases the computational cost (see the
   sketch after this list). Refitting the reference model so that its posterior
   is narrower (usually by employing a stronger sparsifying prior) would have a
   similar effect. This is usually the case when the difference in predictive
   performance is not very large.
2. For non-Gaussian models the inaccuracy may also come from the fact that we
   are running an approximate projection. In this case too, increasing the
   number of draws used for prediction, `ndraws_pred`, should help.
3. Because `varsel` estimates the expected predictive performance of the
   reference model from the in-sample mean, the results can be overconfident.
   Running `cv_varsel` with LOO cross-validation computes the predictive
   performance taking PSIS-LOO weights into account, usually reporting a more
   realistic ELPD.
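
As an illustration of points 1 and 2, the selection can be rerun with more
projected draws; a minimal sketch (the value 400 is an arbitrary choice and
increases the computational cost accordingly):

```{r, eval=FALSE}
# Rerun the selection with more draws for the predictive performance
# evaluation (sketch; 400 is an arbitrary, heavier-than-default choice)
vs_va_dense <- varsel(fit_va, ndraws_pred = 400)
plot(vs_va_dense, stats = c("elpd", "acc"))
```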

By default, `cv_varsel` runs LOO cross-validation and tries to validate the
variable search for every data point. Given the large computational cost of
running more draws, we first try `cv_varsel` without validating the search: by
passing `validate_search = FALSE` the search is run only once on the full data,
and only the predictive performance is evaluated with LOO. Note that in general
it is recommended to run the full search validation.

```{r, cache=TRUE, results="hide", message=FALSE, warning=FALSE}
cv_vs_va <- cv_varsel(fit_va, validate_search = FALSE)
```

```{r, fig.height=4, fig.width=5}
plot(cv_vs_va, stats = c("elpd", "acc"))
```

Now we see that the projected models behave as expected and can proceed with the
usual analysis.
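
For example, one can let `projpred` suggest a submodel size from the
cross-validated results and project onto it; a minimal sketch:

```{r, eval=FALSE}
# Heuristic size suggestion based on the cross-validated results
nsel <- suggest_size(cv_vs_va)
# Project the reference model onto a submodel of the suggested size
proj_va <- project(cv_vs_va, nterms = nsel, ndraws = 100)
```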

As in the previous examples, we can compute the mean prediction with
`proj_linpred`. The following computes the mean of the predictive distribution
and evaluates the log density at the training points using the 6 most relevant
terms.

```{r, cache=TRUE, message=FALSE, warning=FALSE}
pred <- proj_linpred(cv_vs_va, newdata = VerbAgg_subsample,
                     nterms = 6, integrated = TRUE, ndraws = 10)
```

We can also visualize the corresponding projected predictions.

```{r, fig.width=5, fig.height=3}
xaxis <- seq(-6, 6, length.out = 1000)
yaxis <- cv_vs_va$family$linkinv(xaxis)

y_mu <- rowMeans(cv_vs_va$refmodel$mu)
ggplot() +
  geom_point(aes(x = pred$pred, y = y_mu)) +
  geom_line(aes(x = xaxis, y = yaxis), color = "red") +
  labs(x = "prediction", y = "y")
```