---
title: "projpred Quick Start"
date: "`r Sys.Date()`"
output:
  html_vignette
params:
  EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true")
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{projpred Quick Start}
\usepackage[utf8]{inputenc}
-->
```{r, child="children/SETTINGS-knitr.txt"}
```

This vignette shows how to use the main functionalities of the `projpred` package, which implements projective variable selection (Goutis and Robert, 1998; Dupuis and Robert, 2003) for generalized linear models. The package is compatible with reference models fitted with `rstanarm` and `brms`, but other reference models can also be used (see the section on custom reference models). The methods implemented in the package are described in detail in Piironen et al. (2020) and evaluated in comparison to many other methods in Piironen and Vehtari (2017a).


## Gaussian example
Load the necessary packages. If the sampling takes more than 10 seconds and multiple cores are available, uncomment the line setting `mc.cores` to set the number of cores used. (It is commented out here because the sampling in this example is fast, and to avoid possible problems when building the vignette during package installation in special environments such as computing clusters.)
```{r, results='hide', message=FALSE, warning=FALSE}
library(brms)
library(projpred)
library(dplyr)
library(ggplot2)
library(bayesplot)
theme_set(theme_classic())
#options(mc.cores = parallel::detectCores())
```

The package contains a simple Gaussian example dataset, accessible with the `data()` command. This dataset is one of the example cases from the `glmnet` package. The following loads a data frame `df_gaussian`, with the predictor matrix `x` and the corresponding targets `y`, into the workspace.
```{r}
data('df_gaussian', package = 'projpred')
```
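
To get oriented, we can take a quick look at the data; the predictors are stored as a single matrix column `x` (this check is purely illustrative and not required for the workflow):
```{r}
dim(df_gaussian$x) # 100 observations, 20 predictors
summary(df_gaussian$y)
```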

We first construct a model with all the variables and a regularized horseshoe prior (Piironen and Vehtari, 2017c) on the regression coefficients. This gives us the full Bayesian solution to the problem. We specify the prior on the number of relevant variables using the approach of Piironen and Vehtari (2017b,c): the prior for the global shrinkage parameter is defined based on the prior guess for the number of relevant variables.
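
Concretely, for a prior guess $p_0$ of the number of relevant variables among $D$ candidates and $n$ observations, Piironen and Vehtari (2017c) recommend setting the scale of the global shrinkage parameter to
$$
\tau_0 = \frac{p_0}{D - p_0} \frac{\sigma}{\sqrt{n}},
$$
which is what the code below computes. The factor $\sigma$ is dropped in the code because the software scales the prior by the residual standard deviation internally (see the code comment).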

Before building the model we call `break_up_matrix_term`.
This is a convenience function that automatically splits matrix variables into individual linear terms. For example, in `y ~ x`, `x` can be a matrix.
If this function is not used, `projpred` considers `x` a single term to be jointly included or excluded in the variable selection.

```{r, results='hide', message=FALSE, warning=FALSE}
split_structure <- break_up_matrix_term(y ~ x, data = df_gaussian)
df_gaussian <- split_structure$data
formula <- split_structure$formula
d <- df_gaussian
n <- nrow(df_gaussian) # 100
D <- ncol(df_gaussian[, -1]) # 20
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0 / (D - p0) * 1 / sqrt(n) # scale for tau (brms multiplies this by sigma internally)
fit <- brm(formula, family = gaussian(), data = df_gaussian,
           prior = prior(horseshoe(scale_global = tau0, scale_slab = 1), class = b),
           ## To make this vignette build quickly, we use only 2 chains and
           ## 500 iterations. In practice, at least 4 chains should be
           ## used and around 2000 iterations might be required for
           ## reliable inference.
           seed = 1, chains = 2, iter = 500)
```
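
Before running the variable selection, it is prudent to check the usual MCMC diagnostics for the reference model; for instance (not evaluated here to keep the vignette output short):
```{r, eval=FALSE}
# Rhat values close to 1 and large effective sample sizes indicate convergence
summary(fit)
# trace plots for a couple of parameters
mcmc_trace(as.array(fit), pars = c("b_Intercept", "sigma"))
```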

Thanks to its general interface, `projpred` can be applied to a very wide range of models. To achieve this, it internally defines the following set of functions:

  - a divergence minimizer for the projection, which in this case simply corresponds to a maximum likelihood estimator;
  - a prediction function for the projections;
  - a prediction function for the reference model that outputs the linear predictor;
  - a function to fetch data from the original dataset.

For most models, `projpred` is able to generate these itself with no user input beyond the reference model object. Nonetheless, for custom or unsupported models, the user can provide them (more details in the documentation on reference models). For illustration, such functions could look roughly as follows (not evaluated here):

```{r, eval=FALSE}
# Prediction function for the reference model: posterior draws of the
# linear predictor, with observations in rows.
ref_predfun <- function(fit, newdata = NULL)
  t(posterior_linpred(fit, newdata = newdata))

# Prediction function for the projected submodels.
proj_predfun <- function(fit, newdata = NULL) {
  newdata <- as.data.frame(fetch_data(xnew = newdata))
  predict(fit, newdata = newdata)
}

# Divergence minimizer: for a Gaussian model, a maximum likelihood fit.
mle <- function(form, dat)
  lm(form, data = dat)

# Fetch (subsets of) the data; used e.g. during cross-validation.
fetch_data <- function(obs = NULL, xnew = NULL) {
  if (is.null(obs)) {
    if (is.null(xnew)) d else xnew
  } else {
    if (is.null(xnew)) d[obs, ] else xnew[obs, ]
  }
}
```

The first function, `ref_predfun`, is the prediction function for the reference model; here it wraps `posterior_linpred` from `brms`.
For each of the submodels we use `proj_predfun`, which calls the standard `predict` method for linear models.
To fit the projections we use `lm`.
Finally, `fetch_data` simply extracts a set of data points from the dataset; it is used during the cross-validation stage and needs to handle the different data structures required in different contexts inside `projpred`.

Next we construct the reference model structure `refmodel`.

```{r, results='hide', warning=FALSE, message=FALSE}
refmodel <- get_refmodel(fit)
```

The function `cv_varsel` runs a cross-validated variable selection and returns an object containing the relevant information about the procedure, such as the solution path of the terms. The search heuristic can be specified with the argument `method`. For a simple linear model (without multilevel structure), the fastest method is L1 search; here we use forward search, which is slower but tends to be more accurate.

```{r, results='hide', message=FALSE, warning=FALSE}
vs <- cv_varsel(refmodel, method = "forward")
```

```{r, message=FALSE, warning=FALSE}
solution_terms(vs) # selection order of the variables
```
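
A reasonable submodel size can be obtained with `suggest_size`, a heuristic based on the ELPD; treat its output as a starting point rather than a definitive answer:
```{r}
suggest_size(vs)
```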

We plot some statistics computed by the cross-validation, such as the sum of log predictive densities (ELPD) and the root mean squared error (RMSE), as a function of the number of variables added. By default, the statistics are shown on the absolute scale, but with `deltas = TRUE` the plot shows the results relative to the full model.
```{r, fig.width=5, fig.height=4}
plot(vs, stats = c('elpd', 'rmse'))
```

```{r, fig.width=5, fig.height=4}
# plot the validation results, this time relative to the full model
plot(vs, stats = c('elpd', 'rmse'), deltas = TRUE)
```

We perform the projection for a submodel of the desired size using the function `project`. The projection can also be coerced to a matrix containing draws of the selected variables and `sigma`. The draws can be visualized with, for example, the `mcmc_areas` function from the `bayesplot` package. Below we compare how the projection affects the three most relevant variables.

```{r, fig.width=6, fig.height=2}
# Visualise the three most relevant variables in the full model
mcmc_areas(as.matrix(refmodel$fit),
           pars = c("b_Intercept", paste0("b_", solution_terms(vs)[1:3]),
                    "sigma")) +
  coord_cartesian(xlim = c(-2, 2))
```

```{r, fig.width=6, fig.height=2}
# Visualise the projected three most relevant variables
proj <- project(vs, nterms = 3, ns = 500)
mcmc_areas(as.matrix(proj)) +
  coord_cartesian(xlim = c(-2, 2))
```
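
The projected draws can also be summarized numerically; for instance, the posterior means of the projected coefficients, using the same matrix coercion as above:
```{r, eval=FALSE}
# posterior means of the projected coefficients and sigma
colMeans(as.matrix(proj))
```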

We make predictions with the projected submodels. For point estimates we can use the function `proj_linpred`. Test inputs can be provided via the argument `newdata`. If `newdata` also contains the test targets (here the column `y`), the function evaluates the log predictive density at these points. For instance, the following computes the mean of the predictive distribution and evaluates the log density at the training points using the 6 most relevant variables.
```{r}
pred <- proj_linpred(vs, newdata = df_gaussian, nterms = 6, integrated = TRUE)
```
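
Assuming (as in recent `projpred` versions) that the returned list stores the pointwise log predictive densities in an element `lpd`, their average can be computed directly:
```{r, eval=FALSE}
# mean log predictive density over the training points
mean(pred$lpd)
```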

Visualize the predictions:
```{r, fig.width=5, fig.height=3}
ggplot() +
  geom_point(aes(x = pred$pred, y = df_gaussian$y)) +
  geom_abline(slope = 1, color = "red") +
  labs(x = "prediction", y = "y")
```

We can also obtain draws from the projected predictive distribution using `proj_predict`. Here is an example prediction for a random data point using 6 variables (the observed value is marked by the red line).
```{r, fig.height=3, fig.width=5}
subset <- df_gaussian %>% dplyr::sample_n(1)
y_subset <- subset %>% dplyr::select(y)
y1_rep <- proj_predict(vs, newdata = subset, nterms = 6, seed = 7560)
qplot(as.vector(y1_rep), bins = 25) +
  geom_vline(xintercept = as.numeric(y_subset), color = "red") +
  xlab("y1_rep")
```
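
To check calibration for all observations at once, the projected draws can be fed into a posterior predictive check; a minimal sketch, assuming `proj_predict` returns a draws-by-observations matrix:
```{r, eval=FALSE}
# overlay densities of a few projected predictive draws against the data
y_rep <- proj_predict(vs, newdata = df_gaussian, nterms = 6, seed = 7560)
ppc_dens_overlay(y = df_gaussian$y, yrep = y_rep[1:25, ])
```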

Forward search finds a sparser model with performance comparable to the Lasso (Tibshirani, 1996), but starts to overfit when more features are added.




### References

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. _Biometrika_, 97(2):465–480.

Dupuis, J. A. and Robert, C. P. (2003). Variable selection in qualitative models via an entropic explanatory power. _Journal of Statistical Planning and Inference_, 111(1–2):77–94.

Goutis, C. and Robert, C. P. (1998). Model choice in generalised linear models: a Bayesian approach via Kullback–Leibler projections. _Biometrika_, 85(1):29–37.

Piironen, Juho and Vehtari, Aki (2017a). Comparison of Bayesian predictive methods for model selection. _Statistics and Computing_, 27(3):711–735. DOI 10.1007/s11222-016-9649-y. [Online](https://link.springer.com/article/10.1007/s11222-016-9649-y)

Piironen, Juho and Vehtari, Aki (2017b). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In _Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)_, PMLR 54:905–913. [Online](http://proceedings.mlr.press/v54/piironen17a.html)

Piironen, Juho and Vehtari, Aki (2017c). Sparsity information and regularization in the horseshoe and other shrinkage priors. _Electronic Journal of Statistics_, 11(2):5018–5051. [Online](https://projecteuclid.org/euclid.ejs/1513306866#info)

Piironen, Juho, Paasiniemi, Markus, and Vehtari, Aki (2020). Projective inference in high-dimensional problems: prediction and feature selection. _Electronic Journal of Statistics_, 14(1):2155–2197. [Online](https://projecteuclid.org/euclid.ejs/1589335310)

Tibshirani, Robert (1996). Regression shrinkage and selection via the Lasso. _Journal of the Royal Statistical Society, Series B (Methodological)_, 58(1):267–288.

Tran, M. N., Nott, D. J., and Leng, C. (2012). The predictive Lasso. _Statistics and Computing_, 22(5):1069–1084.