File: Intervals.Rmd

---
title: "Bootstrap confidence intervals"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Bootstrap confidence intervals}
output:
  knitr:::html_vignette:
    toc: yes
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(
  eval = rlang::is_installed("ggplot2") && rlang::is_installed("modeldata")
)
```

```{r package_setup, include = FALSE}
library(tidymodels)
library(nlstools)
library(GGally)
theme_set(theme_bw())
```


The bootstrap was originally intended for estimating confidence intervals for complex statistics whose variance properties are difficult to derive analytically. Davison and Hinkley's [_Bootstrap Methods and Their Application_](https://www.cambridge.org/core/books/bootstrap-methods-and-their-application/ED2FD043579F27952363566DC09CBD6A) is a great resource for these methods. `rsample` contains a few functions to compute the most common types of intervals. 

## A nonlinear regression example

To demonstrate the computations for the different types of intervals, we'll use a nonlinear regression example from [Baty _et al_ (2015)](https://www.jstatsoft.org/article/view/v066i05). They showed data that monitored oxygen uptake in a patient with rest and exercise phases (in the data frame `O2K`). 

```{r O2K-dat}
library(tidymodels)
library(nlstools)
library(GGally)

data(O2K)

ggplot(O2K, aes(x = t, y = VO2)) + 
  geom_point()
```

The authors fit a segmented regression model where the transition point was known (this is the time when exercise commenced). Their model was:

```{r O2K-fit}
nonlin_form <-  
  as.formula(
    VO2 ~ (t <= 5.883) * VO2rest + 
      (t > 5.883) * 
      (VO2rest + (VO2peak - VO2rest) * (1 - exp(-(t - 5.883) / mu)))
    )

# Starting values from visual inspection
start_vals <- list(VO2rest = 400, VO2peak = 1600, mu = 1)

res <- nls(nonlin_form, start = start_vals, data = O2K) 

tidy(res)
```


`broom::tidy()` returns our analysis object in a standardized way. The column names shown here are used for most types of objects, which allows us to work with the results more easily. 

For `rsample`, we'll rely on the `tidy()` method to work with bootstrap estimates when we need confidence intervals. An example of a univariate statistic that isn't automatically formatted by `tidy()` appears at the end of this vignette. 

To run our model over different bootstraps, we'll write a function that uses the `split` object as input and produces a tidy data frame:

```{r model-info}
# Will be used to fit the models to different bootstrap data sets:
fit_fun <- function(split, ...) {
  # We could check for convergence, make new parameters, etc.
  nls(nonlin_form, data = analysis(split), ...) %>%
    tidy()
}
```

First, let's create a set of resamples and fit separate models to each. The option `apparent = TRUE` will be set. This adds a final resample that is a copy of the original (unsampled) data set, which is required for some of the interval methods. 

```{r resample}
set.seed(462)
nlin_bt <-
  bootstraps(O2K, times = 2000, apparent = TRUE) %>%
  mutate(models = map(splits, ~ fit_fun(.x, start = start_vals)))
nlin_bt

nlin_bt$models[[1]]
```

Let's look at the data and see if there are any outliers or aberrant results:

```{r extract}
library(tidyr)
nls_coef <- 
  nlin_bt %>%
  dplyr::select(-splits) %>%
  # Turn it into a tibble by stacking the `models` column
  unnest(cols = models) %>%
  # Get rid of unneeded columns
  dplyr::select(id, term, estimate) 

head(nls_coef)
```
Now let's create a scatterplot matrix:

```{r splom} 
nls_coef %>%
  # Put different parameters in columns
  tidyr::pivot_wider(names_from = term, values_from = estimate) %>% 
  # Keep only numeric columns
  dplyr::select(-id) %>% 
  ggscatmat(alpha = .25)
```

There is one potential outlier on the right for `VO2peak`, but we'll leave it in. 

The univariate distributions are:

```{r hists}
nls_coef %>% 
  ggplot(aes(x = estimate)) + 
  geom_histogram(bins = 20, col = "white") + 
  facet_wrap(~ term, scales = "free_x")
```

### Percentile intervals

The most basic type of interval uses _percentiles_ of the resampling distribution. To get the percentile intervals, the `rset` object is passed as the first argument and the second argument is the list column of tidy results: 

```{r pctl}
p_ints <- int_pctl(nlin_bt, models)
p_ints
```
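Conceptually, in the standard percentile-method notation (a textbook formulation, not code from `rsample`'s internals): if $\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B$ are the bootstrap estimates of a parameter, the two-sided $1 - \alpha$ percentile interval is simply

$$
\left[\; \hat{\theta}^*_{(\alpha/2)},\ \hat{\theta}^*_{(1 - \alpha/2)} \;\right]
$$

where $\hat{\theta}^*_{(q)}$ denotes the $q$ quantile of the bootstrap distribution of the estimates.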

When overlaid with the univariate distributions: 

```{r pctl-plot}
nls_coef %>% 
  ggplot(aes(x = estimate)) + 
  geom_histogram(bins = 20, col = "white") + 
  facet_wrap(~ term, scales = "free_x") + 
  geom_vline(data = p_ints, aes(xintercept = .lower), col = "red") + 
  geom_vline(data = p_ints, aes(xintercept = .upper), col = "red")
```

How do these intervals compare to the parametric asymptotic values?

```{r int-compare}
parametric <- 
  tidy(res, conf.int = TRUE) %>% 
  dplyr::select(
    term,
    .lower = conf.low,
    .estimate = estimate,
    .upper = conf.high
  ) %>% 
  mutate(
    .alpha = 0.05,
    .method = "parametric"
  )

intervals <- 
  bind_rows(parametric, p_ints) %>% 
  arrange(term, .method)
intervals %>% split(intervals$term)
```

The percentile intervals are wider than the parametric intervals (which assume asymptotic normality). Do the estimates appear to be normally distributed? We can look at quantile-quantile plots:  

```{r qqplot}
nls_coef %>% 
  ggplot(aes(sample = estimate)) + 
  stat_qq() +
  stat_qq_line(alpha = .25) + 
  facet_wrap(~ term, scales = "free") 
```

### t-intervals

Bootstrap _t_-intervals are estimated by computing intermediate statistics that are _t_-like in structure. To use these, we require the estimated variance _for each individual resampled estimate_. In our example, this comes along with the fitted model object, so we can extract the standard errors of the parameters. Luckily, most `tidy()` methods provide this in a column named `std.error`. 
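In the standard studentized-bootstrap formulation (per Davison and Hinkley; shown for intuition, not as `rsample`'s literal code), each resample $b$ contributes a _t_-like statistic, and quantiles of these statistics replace the normal-theory quantiles:

$$
z^*_b = \frac{\hat{\theta}^*_b - \hat{\theta}}{\widehat{se}^{\,*}_b},
\qquad
\left[\; \hat{\theta} - \widehat{se}\, z^*_{(1-\alpha/2)},\ \hat{\theta} - \widehat{se}\, z^*_{(\alpha/2)} \;\right]
$$

where $\hat{\theta}$ and $\widehat{se}$ come from the original (apparent) fit and $\widehat{se}^{\,*}_b$ is the `std.error` value for resample $b$.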

The arguments for these intervals are the same:

```{r t-ints}
t_stats <- int_t(nlin_bt, models)
intervals <- 
  bind_rows(intervals, t_stats) %>% 
  arrange(term, .method)
intervals %>% split(intervals$term)
```


### Bias-corrected and accelerated intervals

For bias-corrected and accelerated (BCa) intervals, an additional argument is required. The `.fn` argument is a function that computes the statistic of interest. Its first argument should be the `rsplit` object, and other arguments can be passed in using the ellipses. 
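In Efron's standard formulation (sketched here for context; `rsample`'s internals may differ in detail), the BCa interval reads off bootstrap percentiles at _adjusted_ levels. For a nominal level $\alpha$, the lower endpoint uses

$$
\alpha_1 = \Phi\!\left( \hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}\,(\hat{z}_0 + z_{\alpha/2})} \right)
$$

(and analogously for the upper endpoint with $z_{1-\alpha/2}$), where $\hat{z}_0 = \Phi^{-1}\bigl(\#\{\hat{\theta}^*_b < \hat{\theta}\}/B\bigr)$ is the bias correction and the acceleration $\hat{a}$ is estimated from leave-one-out (jackknife) values of the statistic. Estimating $\hat{a}$ is why `.fn` must be supplied: the statistic has to be recomputed on the jackknife resamples.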

These intervals use an internal leave-one-out resample to compute the jackknife statistic and will recompute the statistic for _every bootstrap resample_. If the statistic is expensive to compute, this may take some time. For those calculations, we use the `furrr` package, so they can be computed in parallel if you have set up a parallel processing plan (see `?future::plan`). 
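As a minimal sketch (assuming the `future` package is installed; the worker count is purely illustrative), a parallel plan can be registered before calling the interval functions:

```{r future-plan, eval = FALSE}
library(future)

# Use several background R sessions; furrr picks this plan up automatically
plan(multisession, workers = 4)

# ... call int_bca() (or other expensive computations) here ...

# Revert to sequential processing when done
plan(sequential)
```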

The user-facing function takes an argument for the function and the ellipses. 

```{r bca-comp}
bias_corr <- int_bca(nlin_bt, models, .fn = fit_fun, start = start_vals)
intervals <- 
  bind_rows(intervals, bias_corr) %>% 
  arrange(term, .method)
intervals %>% split(intervals$term)
```


## No existing tidy method

In this case, your function can emulate the minimum results: 

 * a character column called `term`, 
 * a numeric column called `estimate`, and, optionally, 
 * a numeric column called `std.error`. 

The last column is only needed for `int_t()`. 

Suppose we just want to estimate the fold-increase in the outcome between the 90th and 10th percentiles over the course of the experiment. Our function might look like:

```{r fold-foo}
fold_incr <- function(split, ...) {
  dat <- analysis(split)
  quants <- quantile(dat$VO2, probs = c(.1, .9))
  tibble(
    term = "fold increase",
    estimate = unname(quants[2]/quants[1]),
    # We don't know the analytical formula for this 
    std.error = NA_real_
  )
}
```

Everything else works the same as before:

```{r fold-ci}
nlin_bt <-
  nlin_bt %>%
  mutate(folds = map(splits, fold_incr))

int_pctl(nlin_bt, folds)
int_bca(nlin_bt, folds, .fn = fold_incr)
```

## Intervals for linear(ish) parametric models

`rsample` also contains the `reg_intervals()` function that can be used for linear regression (via `lm()`), generalized linear models (`glm()`), or log-linear survival models (`survival::survreg()` or `survival::coxph()`). This function makes it easier to get intervals for these models. 

A simple example is a logistic regression using the dementia data from the `modeldata` package: 

```{r ad-data}
data(ad_data, package = "modeldata")
```

Let's fit a model with a few predictors: 

```{r ad-model}
lr_mod <- glm(Class ~ male + age + Ab_42 + tau, data = ad_data,
              family = binomial)
glance(lr_mod)
tidy(lr_mod)
```

Let's use this model with Student-_t_ intervals: 

```{r ad-t-int}
set.seed(29832)
lr_int <- 
  reg_intervals(Class ~ male + age + Ab_42 + tau, 
                data = ad_data,
                model_fn = "glm",
                family = binomial)
lr_int
```

We can also save the resamples for plotting: 

```{r ad-t-int-plot}
set.seed(29832)
lr_int <- 
  reg_intervals(Class ~ male + age + Ab_42 + tau, 
                data = ad_data,
                keep_reps = TRUE,
                model_fn = "glm",
                family = binomial)
lr_int
```

Now we can unnest the data to use in a ggplot: 

```{r ad-plot}
lr_int %>% 
  select(term, .replicates) %>% 
  unnest(cols = .replicates) %>% 
  ggplot(aes(x = estimate)) + 
  geom_histogram(bins = 30) + 
  facet_wrap(~ term, scales = "free_x") + 
  geom_vline(data = lr_int, aes(xintercept = .lower), col = "red") + 
  geom_vline(data = lr_int, aes(xintercept = .upper), col = "red") + 
  geom_vline(xintercept = 0, col = "green")
```