1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295
|
---
title: "Examples of using survtab"
author: "Joonas Miettinen"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_depth: 2
fig_width: 6
fig_height: 6
vignette: >
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{survtab examples}
%\usepackage[utf8]{inputenc}
---
# Overview
This vignette aims to clarify the usage of the `survtab_ag` and `survtab` functions included in this package. `survtab_ag` estimates various survival functions and cumulative incidence functions (CIFs) non-parametrically using aggregated data, and `survtab` is a wrapper for `survtab_ag`, to which `Lexis` data is supplied.
Two methods (`surv.method`) are currently supported: The `"lifetable"` (actuarial) method only makes use of counts when estimating any of the supported survival time functions. The default method (`"hazard"`}) estimates appropriate hazards and transforms them into survival function or CIF estimates.
For relative survival estimation we need also to enumerate the expected hazard levels for the subjects in the data. This is done by merging expected hazards to individuals' subintervals (which divide their survival time lines to a number of small intervals). For Pohar-Perme-weighted analyses one must additionally compute various weighted figures at the level of split subject data.
If one has subject-level data, the simplest way of computing survival function estimates with `popEpi` is by defining a `Lexis` object and using `survtab`, which will do the rest. For pre-aggregated data one may use the `survtab_ag` function instead. One can also use the `lexpand` function to split, merge population hazards, and aggregate in a single function call and then use `survtab_ag` if that is convenient.
# Using `survtab`
It is straightforward to estimate various survival time functions with `survtab` once a `Lexis` object has been defined (see `?Lexis` in package `Epi` for details):
```{r pkgs, eval = TRUE, echo = TRUE, message = FALSE}
library(popEpi)
library(Epi)
```
```{r}
data(sire)
## NOTE: recommended to use factor status variable
x <- Lexis(entry = list(FUT = 0, AGE = dg_age, CAL = get.yrs(dg_date)),
exit = list(CAL = get.yrs(ex_date)),
data = sire[sire$dg_date < sire$ex_date, ],
exit.status = factor(status, levels = 0:2,
labels = c("alive", "canD", "othD")),
merge = TRUE)
## pretend some are male
set.seed(1L)
x$sex <- rbinom(nrow(x), 1, 0.5)
## observed survival - explicit method
st <- survtab(Surv(time = FUT, event = lex.Xst) ~ sex, data = x,
surv.type = "surv.obs",
breaks = list(FUT = seq(0, 5, 1/12)))
## observed survival - easy method (assumes lex.Xst in x is the status variable)
st <- survtab(FUT ~ sex, data = x,
surv.type = "surv.obs",
breaks = list(FUT = seq(0, 5, 1/12)))
## printing gives the used settings and
## estimates at the middle and end of the estimated
## curves; more information available using summary()
st
```
Plotting by strata (men = blue, women = red):
```{r}
plot(st, col = c("blue", "red"))
```
Note that the correct usage of the `formula` argument in `survtab` specifies the time scale in the `Lexis` object over which survival is computed (here `"FUT"` for follow-up time). This is used to identify the appropriate time scale in the data. When only supplying the survival time scale as the right-hand-side of the formula, the column `lex.Xst` in the supplied `Lexis` object is assumed to be the (correctly formatted!) status variable. When using `Surv()` to be explicit, we effectively (and exceptionally) pass the starting times to the `time` argument in `Surv()`, and `time2` is ignored entirely. The function will fail if `time` does not match exactly with a time scale in data.
When using `Surv()`, one must also pass the status variable, which can be something other than the `lex.Xst` variable created by `Lexis()`, though usually ``lex.Xst` is what you want to use (especially if the data has already been split using e.g. `splitLexis` or `splitMulti`, which is allowed). It is recommended to use a factor status variable to pass to `Surv()`, though a numeric variable will work in simple cases (0 = alive, 1 = dead; also `FALSE` = alive, `TRUE` = dead). Using `Surv()` also allows easy passing of transformations of `lex.Xst`, e.g. `Surv(FUT, lex.Xst %in% 1:2)`.
The argument `breaks` must be a named list of breaks by which to split the `Lexis` data (see `?splitMulti`). It is mandatory to assign breaks at least to the survival time scale (`"FUT"` in our example) so that `survtab` knows what intervals to use to estimate the requested survival time function(s). The breaks also determine the window used: It is therefore easy to compute so called period estimates by defining the roof and floor along the calendar time scale, e.g.
`breaks = list(FUT = seq(0, 5, 1/12), CAL = c(2000, 2005))`
would cause `survtab` to compute period estimates for 2000-2004 (breaks given here as fractional years, so 2005 is effectively 2004.99999...).
## Relative/net survival
Relative/net survival estimation requires knowledge of the expected hazard levels for the individuals in the data. In `survtab` this is accomplished by passing a long-format `data.frame` of population hazards via the `pophaz` argument. E.g. the `popmort` dataset included in `popEpi` (Finnish overall mortality rates for men and women).
```{r popmort}
data(popmort)
pm <- data.frame(popmort)
names(pm) <- c("sex", "CAL", "AGE", "haz")
head(pm)
```
The `data.frame` should contain a variable named `"haz"` indicating the population hazard at the level of one subject-year. Any other variables are considered to be variables, by which to merge population hazards to the (split) subject-level data within `survtab`. These merging variables may correspond to the time scales in the used `Lexis` object. This allows for e.g. merging in different population hazards for the same subject as they get older.
The following causes `survtab` to estimate EdererII relative survival:
```{r survtab_e2}
st.e2 <- survtab(Surv(time = FUT, event = lex.Xst) ~ sex, data = x,
surv.type = "surv.rel", relsurv.method = "e2",
breaks = list(FUT = seq(0, 5, 1/12)),
pophaz = pm)
```
```{r}
plot(st.e2, y = "r.e2", col = c("blue", "red"))
```
Note that the curves diverge due to merging in the "wrong" population hazards for some individuals which we randomized earlier to be male though all the individuals in data are actually female. Pohar-Perme-weighted estimates can be computed by
```{r survtab_pp}
st.pp <- survtab(Surv(time = FUT, event = lex.Xst) ~ sex, data = x,
surv.type = "surv.rel", relsurv.method = "pp",
breaks = list(FUT = seq(0, 5, 1/12)),
pophaz = pm)
```
Compare with EdererII estimates:
```{r}
plot(st.e2, y = "r.e2", col = c("blue", "red"), lty = 1)
lines(st.pp, y = "r.pp", col = c("blue", "red"), lty = 2)
```
## Adjusting estimates
`survtab` also allows for adjusting the survival curves by categorical variables --- typically by age groups. The following demonstrates how:
```{r survtab_adjust}
## an age group variable
x$agegr <- cut(x$dg_age, c(0, 60, 70, 80, Inf), right = FALSE)
## using "internal weights" - see ?ICSS for international weights standards
w <- table(x$agegr)
w
w <- list(agegr = as.numeric(w))
```
```{r survtab_adjust_2}
st.as <- survtab(Surv(time = FUT, event = lex.Xst) ~ sex + adjust(agegr),
data = x, weights = w,
surv.type = "surv.rel", relsurv.method = "e2",
breaks = list(FUT = seq(0, 5, 1/12)),
pophaz = pm)
```
```{r}
plot(st.as, y = "r.e2.as", col = c("blue", "red"))
```
We now have age-adjusted EdererII relative/net survival estimates. The `weights` argument allows for either a list of weights (with one or multiple variables to adjust by) or a `data.frame` of weights. Examples:
```{r weights_examples, eval = TRUE}
list(sex = c(0.4, 0.6), agegr = c(0.2, 0.2, 0.4, 0.2))
wdf <- merge(0:1, 1:4)
names(wdf) <- c("sex", "agegr")
wdf$weights <- c(0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.1, 0.1)
wdf
```
The weights do not have to sum to one when supplied as they are internally forced to do so within each stratum. In the `data.frame` of weights, the column of actual weights to use must be named "weights". When there are more than one variable to adjust by, and a list of weights has been supplied, the variable-specific weights are first multiplied together (cumulatively) and then scaled to sum to one.
This adjusting can be done to any survival time function that `survtab` (and `survtab_ag`) estimates. One can also supply adjusting variables via the `adjust` argument if convenient:
```{r survtab_adjust_3}
st.as <- survtab(Surv(time = FUT, event = lex.Xst) ~ sex,
adjust = "agegr",
data = x, weights = w,
surv.type = "surv.rel", relsurv.method = "e2",
breaks = list(FUT = seq(0, 5, 1/12)),
pophaz = pm)
```
Where `adjust` could also be `adjust = agegr`, `adjust = list(agegr)` or
`adjust = list(agegr = cut(dg_age, c(0, 60, 70, 80, Inf), right = FALSE))`
for exactly the same results. When adjusting by multiple variables, one must supply a vector of variable names in data or a list of multiple elements (as in the base function `aggregate`).
## Other survival time functions
One can also estimate cause-specific survival functions, cumulative incidence functions (CIFs, a.k.a. crude risk a.k.a. absolute risk functions), and CIFs based on the excess numbers of events. Cause-specific survival is close to net survival as they are philosophically highly similar concepts:
```{r survtab_cause}
st.ca <- survtab(Surv(time = FUT, event = lex.Xst) ~ 1,
data = x,
surv.type = "surv.cause",
breaks = list(FUT = seq(0, 5, 1/12)))
st.pp <- survtab(Surv(time = FUT, event = lex.Xst) ~ 1, data = x,
surv.type = "surv.rel", relsurv.method = "pp",
breaks = list(FUT = seq(0, 5, 1/12)),
pophaz = pm)
plot(st.ca, y = "surv.obs.canD", col = "blue")
lines(st.pp, y = "r.pp", col = "red")
```
Absolute risk:
```{r survtab_cif}
st.cif <- survtab(Surv(time = FUT, event = lex.Xst) ~ 1,
data = x,
surv.type = "cif.obs",
breaks = list(FUT = seq(0, 5, 1/12)))
plot(st.cif, y = "CIF_canD", conf.int = FALSE)
lines(st.cif, y = "CIF_othD", conf.int = FALSE, col = "red")
```
The "relative CIF" attempts to be close to the true CIF without using knowledge about the types of events, e.g. causes of death:
```{r survtab_relcif}
st.cir <- survtab(Surv(time = FUT, event = lex.Xst) ~ 1,
data = x,
surv.type = "cif.rel",
breaks = list(FUT = seq(0, 5, 1/12)),
pophaz = pm)
plot(st.cif, y = "CIF_canD", conf.int = FALSE, col = "blue")
lines(st.cir, y = "CIF.rel", conf.int = FALSE, col = "red")
```
# Using `survtab_ag`
Arguments concerning the types and methods of estimating of survival time functions work the same in `survtab_ag` as in `survtab` (the latter uses the former). However, with aggregated data one must explicitly supply the various count and person-time variables. Also, usage of the `formula` argument is different.
For demonstration purposes we form an aggregated data set using `lexpand`; see `?lexpand` for more information on that function.
```{r}
sire$sex <- rbinom(nrow(sire), size = 1, prob = 0.5)
ag <- lexpand(sire, birth = "bi_date", entry = "dg_date", exit = "ex_date",
status = "status", breaks = list(fot = seq(0, 5, 1/12)),
aggre = list(sex, fot))
head(ag)
```
Now simply do:
```{r survtab_ag_example1}
st <- survtab_ag(fot ~ sex, data = ag, surv.type = "surv.obs",
surv.method = "hazard",
d = c("from0to1", "from0to2"), pyrs = "pyrs")
```
Or:
```{r survtab_ag_example2}
st <- survtab_ag(fot ~ sex, data = ag, surv.type = "surv.obs",
surv.method = "lifetable",
d = c("from0to1", "from0to2"), n = "at.risk",
n.cens = "from0to0")
```
Note that e.g. argument `d` could also have been supplied as
`list(from0to1, from0to2)`
or
`list(canD = from0to1, othD = from0to2)`
for identical results. The last is convenient for e.g. `surv.cause` computations:
```{r survtab_ag_cause}
st.ca <- survtab_ag(fot ~ sex, data = ag, surv.type = "surv.cause",
surv.method = "hazard",
d = list(canD = from0to1, othD = from0to2), pyrs = "pyrs")
plot(st.ca, y = "surv.obs.canD", col = c("blue", "red"))
```
One has to supply the most variables when computing Pohar-Perme estimates (though it is probably rare to have third-source aggregated data with Pohar-Perme weighted figures, it is implemented here to be used as a workhorse for `survtab`). For this we must aggregate again to get the Pohar-Perme weighted counts and subject-times:
```{r}
ag <- lexpand(sire, birth = "bi_date", entry = "dg_date", exit = "ex_date",
status = "status", breaks = list(fot = seq(0, 5, 1/12)),
pophaz = popmort, pp = TRUE,
aggre = list(sex, fot))
st.pp <- survtab_ag(fot ~ sex, data = ag, surv.type = "surv.rel",
surv.method = "hazard", relsurv.method = "pp",
d = list(from0to1 + from0to2), pyrs = "pyrs",
d.pp = list(from0to1.pp + from0to2.pp),
d.pp.2 = list(from0to1.pp.2 + from0to2.pp.2),
pyrs.pp = "ptime.pp", d.exp.pp = "d.exp.pp")
plot(st.pp, y = "r.pp", col = c("blue", "red"))
```
Here it is best to supply only one column to each argument since Pohar-Perme estimates will not be computed for several types of events at the same time.
|