---
title: "Model based Imputation Methods"
author: Gregor de Cillia
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Model based Imputation Methods}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.align = "center"
)
```
This vignette showcases the functions `regressionImp()` and `rangerImpute()`,
which can both be used to generate imputations for several variables in a
dataset using a formula interface.
## Data
The examples use a subset of the `sleep` dataset. The columns were selected
deliberately so that the missing values of different variables interact.
`BodyWgt` and `Span` are log-transformed.
```{r setup, message = FALSE}
library(VIM)
library(magrittr)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)
str(dataset)
```
## Imputation
In order to invoke the imputation methods, a formula is used to specify which
variables are to be estimated and which variables should be used as regressors.
We will start by imputing `NonD` based on `BodyWgt` and `Span`.
```{r}
imp_regression <- regressionImp(NonD ~ BodyWgt + Span, dataset)
imp_ranger <- rangerImpute(NonD ~ BodyWgt + Span, dataset)
aggr(imp_regression, delimiter = "_imp")
```
We can see that for `regressionImp()` there are still missings in `NonD` for
all observations where `Span` is unobserved. This is because the regression
model could not be applied to those observations. The same is true for the
values imputed via `rangerImpute()`.
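The mechanism can be illustrated outside of VIM with a small base R sketch. The data frame `toy` and its values are made up for illustration, and the code is an assumption about the general principle (a fitted model cannot predict for rows with missing regressors), not a copy of what `regressionImp()` does internally.

```{r}
# Made-up toy data: y is the variable to impute, x1 and x2 are regressors
toy <- data.frame(
  y  = c(1.0, 2.1, 2.9, NA, 5.2, NA, 7.1, 7.9),
  x1 = 1:8,
  x2 = c(0.5, 1.5, NA, 2.0, 2.2, NA, 4.1, 3.0)
)

# lm() drops incomplete rows by default, so the fit uses rows 1, 2, 5, 7, 8
fit <- lm(y ~ x1 + x2, data = toy)

# Predict for the rows where y is missing (rows 4 and 6)
to_fill <- is.na(toy$y)
pred <- predict(fit, newdata = toy[to_fill, ])

# Row 4 has both regressors observed and gets a prediction;
# row 6 has x2 missing, so its prediction is NA
pred

# After filling in the predictions, row 6 is still missing
toy$y[to_fill] <- pred
is.na(toy$y)
```

Rows like row 6 would need either a model without `x2` or a prior imputation of `x2` before they could be filled.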
## Diagnosing the results
As we can see in the next two plots, the correlation structure of `NonD` and
`BodyWgt` is preserved by both imputation methods. In the case of
`regressionImp()`, the imputed values lie almost on a straight line. This
suggests that the variable `Span` had little to no effect on the model.
```{r, fig.height=5}
imp_regression[, c("NonD", "BodyWgt", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
```
For `rangerImpute()`, on the other hand, `Span` played an important role in the
generation of the imputed values.
```{r, fig.height=5}
imp_ranger[, c("NonD", "BodyWgt", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
imp_ranger[, c("NonD", "Span", "NonD_imp")] %>%
  marginplot(delimiter = "_imp")
```
## Imputing multiple variables
To impute several variables at once, the formula in `rangerImpute()` and
`regressionImp()` can be specified with more than one column name on the
left-hand side.
```{r}
imp_regression <- regressionImp(Dream + NonD ~ BodyWgt + Span, dataset)
imp_ranger <- rangerImpute(Dream + NonD ~ BodyWgt + Span, dataset)
aggr(imp_regression, delimiter = "_imp")
```
Again, there are missings left for both `Dream` and `NonD`.
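One way to read a formula with several variables on the left-hand side is as shorthand for one model per response. The base R sketch below splits such a formula into single-response formulas. It is only an illustration of that reading, and the names `f`, `lhs_vars` and `per_response` are hypothetical; how VIM parses the formula internally may differ.

```{r}
# The multi-response formula from the chunk above (symbols are not evaluated)
f <- Dream + NonD ~ BodyWgt + Span

# f[[2]] is the left-hand side, f[[3]] the right-hand side
lhs_vars <- all.vars(f[[2]])
rhs_text <- deparse(f[[3]])

# Build one formula per response, sharing the same regressors
per_response <- lapply(lhs_vars, function(v) reformulate(rhs_text, response = v))
per_response
```

Under this reading, each variable is imputed from `BodyWgt` and `Span` only, which is consistent with missings remaining wherever `Span` is unobserved.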
## Performance of method
In order to validate the performance of `regressionImp()`, the `iris` dataset is used. First, some values are randomly set to `NA`.
```{r}
library(reactable)
data(iris)
df <- iris
colnames(df) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Species")
# randomly set some cells of the four numeric columns to NA
set.seed(1)
nbr_missing <- 50
y <- data.frame(
  row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
  col = sample(ncol(iris) - 1, size = nbr_missing, replace = TRUE)
)
# drop duplicated (row, col) pairs so that each cell is listed only once
y <- y[!duplicated(y), ]
df[as.matrix(y)] <- NA
aggr(df)
# number of missing values per column
sapply(df, function(x) sum(is.na(x)))
```
We can see that there are missings in all four numeric variables and that some observations have missing values in more than one variable. In the next step we perform a multiple variable imputation in which `Species` serves as the regressor.
```{r}
imp_regression <- regressionImp(S.Length + S.Width + P.Length + P.Width ~ Species, df)
aggr(imp_regression, delimiter = "_imp")
```
The plot indicates that all missing values have been imputed by the `regressionImp()` algorithm. The following table shows, for each variable, the first five imputed values (rounded to two decimals) next to the true values.
```{r, echo = FALSE, warning = FALSE}
# true vs. imputed values for each of the four numeric columns (first five rows)
results <- cbind(
  "TRUE1" = as.numeric(iris[as.matrix(y[which(y$col == 1), ])]),
  "IMPUTED1" = round(as.numeric(imp_regression[as.matrix(y[which(y$col == 1), ])]), 2),
  "TRUE2" = as.numeric(iris[as.matrix(y[which(y$col == 2), ])]),
  "IMPUTED2" = round(as.numeric(imp_regression[as.matrix(y[which(y$col == 2), ])]), 2),
  "TRUE3" = as.numeric(iris[as.matrix(y[which(y$col == 3), ])]),
  "IMPUTED3" = round(as.numeric(imp_regression[as.matrix(y[which(y$col == 3), ])]), 2),
  "TRUE4" = as.numeric(iris[as.matrix(y[which(y$col == 4), ])]),
  "IMPUTED4" = round(as.numeric(imp_regression[as.matrix(y[which(y$col == 4), ])]), 2)
)[1:5, ]
reactable(
  results,
  columns = list(
    TRUE1 = colDef(name = "True"),
    IMPUTED1 = colDef(name = "Imputed"),
    TRUE2 = colDef(name = "True"),
    IMPUTED2 = colDef(name = "Imputed"),
    TRUE3 = colDef(name = "True"),
    IMPUTED3 = colDef(name = "Imputed"),
    TRUE4 = colDef(name = "True"),
    IMPUTED4 = colDef(name = "Imputed")
  ),
  columnGroups = list(
    colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")),
    colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")),
    colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")),
    colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4"))
  ),
  striped = TRUE,
  highlight = TRUE,
  bordered = TRUE
)
```
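A single error measure can complement a side-by-side table of true and imputed values. The following base R sketch hides some values of `Sepal.Length`, imputes them from a linear model with `Species` as the only regressor, and computes the root mean squared error. The helper `rmse()` and the objects `masked`, `hidden` and `imputed` are hypothetical and not part of VIM, and the plain `lm()` fit only stands in for a regression-based imputation; it is not the `regressionImp()` implementation.

```{r}
# Hypothetical helper, not part of VIM: root mean squared error
rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))

set.seed(1)
masked <- iris
hidden <- sample(nrow(iris), 20)        # rows whose Sepal.Length is hidden
masked$Sepal.Length[hidden] <- NA

# With a factor as the only regressor, the model predicts the species mean
fit <- lm(Sepal.Length ~ Species, data = masked)
imputed <- predict(fit, newdata = masked[hidden, ])

# Compare the imputed values with the values that were hidden
rmse(iris$Sepal.Length[hidden], imputed)
```

Because `Species` is the only regressor, every imputed value equals the mean of the observed values within the same species, so the error reflects the within-species spread of the variable.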