File: donorImp.Rmd

package info (click to toggle)
r-cran-vim 6.2.2%2Bdfsg-1
links: PTS, VCS
area: main
in suites: bookworm, forky, sid, trixie
size: 1,556 kB
sloc: cpp: 141; sh: 12; makefile: 2
file content (150 lines) | stat: -rw-r--r-- 5,752 bytes
parent folder | download | duplicates (4)
---
title: "Donor based Imputation Methods"
author: Wolfgang Rannetbauer
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Donor based Imputation Methods}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.align = "center"
)
```

## Overview

In addition to Model based Imputation Methods (see `vignette("modelImp")`) the `VIM` package also presents donor based imputation methods, namely Hot-Deck Imputation, k-Nearest Neighbour Imputation and fast matching/imputation based on categorical variable.
  
This vignette showcases the functions `hotdeck()` and `kNN()`, which can both be used to generate imputations for several variables in a
dataset. Moreover, the function `matchImpute()` is presented, which is in contrast a imputation method based on categorical variables.

## Data

The following example demonstrates the functionality of `hodeck()` and `kNN()` using a subset of `sleep`. The columns have been selected deliberately to include some interactions between the missing values. 

```{r setup, message=F}
library(VIM)
library(magrittr)
dataset <- sleep[, c("Dream", "NonD", "BodyWgt", "Span")]
dataset$BodyWgt <- log(dataset$BodyWgt)
dataset$Span <- log(dataset$Span)
aggr(dataset)
```

The plot indicates several missing values in `Dream`, `NonD`, and `Span. `

```{r}
sapply(dataset, function(x)sum(is.na(x)))
```

## Imputation

The call of the functions is straightforward. We will start by just imputing `NonD` based on the other variables. Besides imputing missing variables for a single variable, these functions also support imputation of multiple variables.
For `matchImpute()` suitable donors are searched based on matching of the categorical variables. 


```{r}
imp_hotdeck <- hotdeck(dataset, variable = "NonD")  # hotdeck imputation
imp_knn <- kNN(dataset, variable = "NonD") # kNN imputation
imp_match <- matchImpute(dataset, variable = "NonD", match_var = c("BodyWgt","Span")) # match imputation
aggr(imp_knn, delimiter = "_imp")
aggr(imp_match, delimiter = "_imp")
```

We can see that `kNN()` imputed all missing values for `NonD` in our dataset. The same is true for the values imputed via `hotdeck()`.
The specified variables in `matchImpute()` serve as a donor and enable imputation for `NonD`.

### Diagnosing the results

As we can see in the next two plots, the origninal data structure of `NonD` and `Span` is preserved by `hotdeck()`. `kNN()` reveals the typically procedure of methods, which are based on similar data points weighted by the distance. 

```{r, fig.height=5}
imp_hotdeck[, c("NonD", "Span", "NonD_imp")] %>% 
  marginplot(delimiter = "_imp")
imp_knn[, c("NonD", "Span", "NonD_imp")] %>% 
  marginplot(delimiter = "_imp")
```

`matchImpute()` works by sampling values from the suitable donors and also provides reasonable results.
```{r, fig.height=5}
imp_match[, c("NonD", "Span", "NonD_imp")] %>% 
  marginplot(delimiter = "_imp")
```


## Performance of method

In order to validate the performance of `kNN()` and to highlight the ability to impute different datatypes the `iris` dataset is used. Firstly, some values are randomly set to `NA`. 

```{r}
data(iris)
df <- iris
colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(iris), size = nbr_missing, replace = TRUE),
                col = sample(ncol(iris), size = nbr_missing, replace = TRUE))
y<-y[!duplicated(y), ]
df[as.matrix(y)] <- NA

aggr(df)
sapply(df, function(x) sum(is.na(x)))
```

We can see that there are missings in all variables and some observations reveal missing values on several points. 

```{r}
imp_knn <- kNN(df)
aggr(imp_knn, delimiter = "imp")
```

The plot indicates that all missing values have been imputed by `kNN()`. The following table displays the rounded first five results of the imputation for all variables.  

```{r echo=F,warning=F}
library(reactable)

results <- cbind("TRUE1" = as.numeric(iris[as.matrix(y[which(y$col==1),])]),
                 "IMPUTED1" = round(as.numeric(imp_knn[as.matrix(y[which(y$col==1),])]),2),
                 "TRUE2" = as.numeric(iris[as.matrix(y[which(y$col==2),])]),
                 "IMPUTED2" = round(as.numeric(imp_knn[as.matrix(y[which(y$col==2),])]),2),
                 "TRUE3" = as.numeric(iris[as.matrix(y[which(y$col==3),])]),
                 "IMPUTED3" = round(as.numeric(imp_knn[as.matrix(y[which(y$col==3),])]),2),
                 "TRUE4" = as.numeric(iris[as.matrix(y[which(y$col==4),])]),
                 "IMPUTED4" = round(as.numeric(imp_knn[as.matrix(y[which(y$col==4),])]),2),
                 "TRUE5" = (iris[as.matrix(y[which(y$col==5),])]),
                 "IMPUTED5" = (imp_knn[as.matrix(y[which(y$col==5),])]))[1:5,]

reactable(results, columns = list(
    TRUE1 = colDef(name = "True"),
    IMPUTED1 = colDef(name = "Imputed"),
    TRUE2 = colDef(name = "True"),
    IMPUTED2 = colDef(name = "Imputed"),
    TRUE3 = colDef(name = "True"),
    IMPUTED3 = colDef(name = "Imputed"),
    TRUE4 = colDef(name = "True"),
    IMPUTED4 = colDef(name = "Imputed"),
    TRUE5 = colDef(name = "True"),
    IMPUTED5 = colDef(name = "Imputed")
  ),
    columnGroups = list(
    colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")),
    colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")),
    colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")),
    colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4")),
    colGroup(name = "Species", columns = c("TRUE5", "IMPUTED5"))
  ),
  striped = TRUE,
  highlight = TRUE,
  bordered = TRUE
)

```