# Imputation

mlpack provides functionality for handling missing values in a dataset: they can
be replaced with imputed values or user-specified values, or the points that
contain them can be removed from the dataset.

Imputing or removing missing values is an important part of the data science
pipeline, as mlpack's machine learning techniques do not support learning on
data that contain missing values.

 * [`Imputer`](#imputer): the class that is used for imputation.

 * [Imputation strategies](#imputation-strategies): the set of imputation
   strategies supported by `Imputer`.
   - [`MeanImputation`](#meanimputation): replace missing values with the mean
     value in a row or column.
   - [`MedianImputation`](#medianimputation): replace missing values with the
     median value in a row or column.
   - [`ListwiseDeletion`](#listwisedeletion): remove any data points that
     contain missing values.
   - [`CustomImputation<>`](#customimputation): replace missing values with a
     user-specified custom value.
   - [Custom imputation strategies](#custom-imputation-strategies): use a fully
     custom strategy to impute missing values.

 * [Simple examples](#simple-examples) of imputing missing values with
   `Imputer`.

## `Imputer`

The `Imputer` class offers a simple interface to impute missing values in a
single dimension of a data matrix.

### Constructor

 * `imp = Imputer()`
   - Construct an imputer using the default imputation strategy
     ([`MeanImputation`](#meanimputation)).

 * `imp = Imputer<Strategy>()`
 * `imp = Imputer<Strategy>(strategy)`
   - Construct an imputer, specifying the imputation strategy manually.
   - `Strategy` should be [`MeanImputation`](#meanimputation),
     [`MedianImputation`](#medianimputation),
     [`ListwiseDeletion`](#listwisedeletion),
     [`CustomImputation<>`](#customimputation), or a
     [custom imputation class](#custom-imputation-strategies).
   - Optionally, specify an instantiated imputation strategy (`strategy`); this
     is only useful with [`CustomImputation<>`](#customimputation) or a
     [custom imputation class](#custom-imputation-strategies).

### `Impute()`

Once an `Imputer` object is constructed, the `Impute()` function can be used to
replace missing values.

 * `imp.Impute(data, missingValue, dimension)`
 * `imp.Impute(data, missingValue, dimension, columnMajor=true)`
   - Replace instances of `missingValue` in row `dimension` (a `size_t`) of the
     given matrix `data`.
   - `data` should be a
     [column-major data matrix](../matrices.md#representing-data-in-mlpack);
     if `columnMajor` is set to `false`, then missing values in *column*
     `dimension` (instead of row `dimension`) will be replaced.
   - `missingValue` should be the same type as the element type of `data` (e.g.
     if `data` is `arma::mat`, then `missingValue` should be `double`).

***Notes***:

 * Any value can be chosen for `missingValue`, but using
   [NaN](https://cplusplus.com/reference/cmath/nan-function/) is a good choice.

 * If imputation in the entire data matrix is desired, using
   [`data.replace()`](https://arma.sourceforge.net/docs.html#replace) is likely
   an easier and faster approach.  `Imputer` does not support this because most
   imputation strategies depend on data specific to a single dimension, not the
   entire matrix.

<!-- TODO: when MissingToNan() support is fully documented, we can update the
first bullet point to mention that it works well in a pipeline where you load
with MissingIsNan() to get NaNs, then impute -->

## Imputation strategies

mlpack provides four imputation strategies that can be used with `Imputer`.  It
is also possible to write a
[fully custom imputation strategy](#custom-imputation-strategies).

### `MeanImputation`

 * `MeanImputation` computes the mean of non-missing elements in a dimension and
   uses this value to replace missing elements.

 * The constructor of `MeanImputation` has no parameters, and so passing an
   instantiated `strategy` to the [constructor of `Imputer`](#constructor) is
   not necessary.

 * `MeanImputation` does not support imputation on sparse matrices (e.g.
   `arma::sp_mat`); use dense matrices (e.g. `arma::mat`) instead.

 * For more details, see
   [the source code](/src/mlpack/core/data/imputation_methods/mean_imputation.hpp).
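
The behavior described above can be sketched without mlpack or Armadillo.  The
following standalone example uses only the standard library; `Rows`,
`IsMissing()`, and `MeanImputeRow()` are hypothetical names introduced for
illustration and are not part of mlpack.  Leaving an all-missing row unchanged
is an assumption of this sketch, not documented mlpack behavior.

```c++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for a dense matrix: rows[d][i] is the value of dimension d for
// point i.  This is an illustrative layout, not mlpack's matrix type.
using Rows = std::vector<std::vector<double>>;

// True if `value` matches the sentinel `missingValue`.  NaN != NaN, so a NaN
// sentinel has to be detected with std::isnan().
inline bool IsMissing(const double value, const double missingValue)
{
  return std::isnan(missingValue) ? std::isnan(value) : (value == missingValue);
}

// Hypothetical helper: replace missing entries of row `dimension` with the
// mean of the non-missing entries in that row.
inline void MeanImputeRow(Rows& rows,
                          const double missingValue,
                          const std::size_t dimension)
{
  assert(dimension < rows.size());
  std::vector<double>& row = rows[dimension];

  // First pass: accumulate the non-missing entries.
  double sum = 0.0;
  std::size_t count = 0;
  for (const double v : row)
  {
    if (!IsMissing(v, missingValue))
    {
      sum += v;
      ++count;
    }
  }

  // Assumption: if every entry is missing, leave the row unchanged.
  if (count == 0)
    return;

  // Second pass: overwrite the missing entries with the mean.
  const double mean = sum / count;
  for (double& v : row)
  {
    if (IsMissing(v, missingValue))
      v = mean;
  }
}
```

Only the requested dimension is modified; missing values in other rows are left
in place, mirroring the fact that `Impute()` works on one dimension at a time.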

### `MedianImputation`

 * `MedianImputation` computes the median of non-missing elements in a dimension
   and uses this value to replace missing elements.

 * The constructor of `MedianImputation` has no parameters, and so passing an
   instantiated `strategy` to the [constructor of `Imputer`](#constructor) is
   not necessary.

 * `MedianImputation` does not support imputation on sparse matrices (e.g.
   `arma::sp_mat`); use dense matrices (e.g. `arma::mat`) instead.

 * For more details, see
   [the source code](/src/mlpack/core/data/imputation_methods/median_imputation.hpp).
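
As a rough illustration of the idea, the sketch below computes the median of
the non-missing entries in one dimension using only the standard library.
`Rows`, `IsMissing()`, and `MedianImputeRow()` are hypothetical names that are
not part of mlpack.  Averaging the two middle values for an even count and
leaving an all-missing row unchanged are assumptions of this sketch.

```c++
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for a dense matrix: rows[d][i] is the value of dimension d for
// point i.  This is an illustrative layout, not mlpack's matrix type.
using Rows = std::vector<std::vector<double>>;

// True if `value` matches the sentinel `missingValue`.  NaN != NaN, so a NaN
// sentinel has to be detected with std::isnan().
inline bool IsMissing(const double value, const double missingValue)
{
  return std::isnan(missingValue) ? std::isnan(value) : (value == missingValue);
}

// Hypothetical helper: replace missing entries of row `dimension` with the
// median of the non-missing entries in that row.
inline void MedianImputeRow(Rows& rows,
                            const double missingValue,
                            const std::size_t dimension)
{
  assert(dimension < rows.size());
  std::vector<double>& row = rows[dimension];

  // Collect the non-missing entries.
  std::vector<double> present;
  for (const double v : row)
  {
    if (!IsMissing(v, missingValue))
      present.push_back(v);
  }

  // Assumption: if every entry is missing, leave the row unchanged.
  if (present.empty())
    return;

  // Sort and take the middle element (or the average of the two middle
  // elements when the count is even).
  std::sort(present.begin(), present.end());
  const std::size_t n = present.size();
  const double median = (n % 2 == 1) ?
      present[n / 2] : 0.5 * (present[n / 2 - 1] + present[n / 2]);

  for (double& v : row)
  {
    if (IsMissing(v, missingValue))
      v = median;
  }
}
```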

### `ListwiseDeletion`

 * `ListwiseDeletion` removes any data points that contain missing elements in
   the dimension passed to [`Impute()`](#impute).

 * The constructor of `ListwiseDeletion` has no parameters, and so passing an
   instantiated `strategy` to the [constructor of `Imputer`](#constructor) is
   not necessary.

 * `ListwiseDeletion` does not support imputation on sparse matrices (e.g.
   `arma::sp_mat`); use dense matrices (e.g. `arma::mat`) instead.

 * For more details, see
   [the source code](/src/mlpack/core/data/imputation_methods/listwise_deletion.hpp).
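
To make the "remove the whole point" semantics concrete, here is a standalone
sketch using only the standard library.  `Rows`, `IsMissing()`, and
`DeletePointsWithMissing()` are hypothetical names, not mlpack functions, and
the row-of-vectors layout is only a stand-in for a real matrix type.

```c++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for a dense matrix: rows[d][i] is the value of dimension d for
// point i.  This is an illustrative layout, not mlpack's matrix type.
using Rows = std::vector<std::vector<double>>;

// True if `value` matches the sentinel `missingValue`.  NaN != NaN, so a NaN
// sentinel has to be detected with std::isnan().
inline bool IsMissing(const double value, const double missingValue)
{
  return std::isnan(missingValue) ? std::isnan(value) : (value == missingValue);
}

// Hypothetical helper: remove every point whose entry in row `dimension` is
// missing.  Entries in other dimensions are not inspected.
inline void DeletePointsWithMissing(Rows& rows,
                                    const double missingValue,
                                    const std::size_t dimension)
{
  assert(dimension < rows.size());
  const std::size_t numPoints = rows[dimension].size();

  // Mark which points to keep, based only on the chosen dimension.
  std::vector<bool> keep(numPoints);
  for (std::size_t i = 0; i < numPoints; ++i)
    keep[i] = !IsMissing(rows[dimension][i], missingValue);

  // Compact every dimension so that only the kept points remain.
  for (std::vector<double>& row : rows)
  {
    assert(row.size() == numPoints);
    std::size_t out = 0;
    for (std::size_t i = 0; i < numPoints; ++i)
    {
      if (keep[i])
        row[out++] = row[i];
    }
    row.resize(out);
  }
}
```

Note that a point is dropped only when the chosen dimension is missing; a point
with a missing value in some other dimension survives, so handling several
dimensions means calling the routine once per dimension.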

### `CustomImputation<>`

 * `CustomImputation<>` replaces any missing elements in a dimension with a
   specified value.

 * `c = CustomImputation<>(value)` creates a `CustomImputation` object for use
   with the [constructor of `Imputer`](#constructor), which will replace missing
   values with `value` (specified as a `double`) when [`Impute()`](#impute) is
   called.

 * `value` will be cast to the element type of the data matrix when
   [`Impute()`](#impute) is called.

 * If a different element type is used for the data matrix,
   `c = CustomImputation<T>(value)` can be used to specify a `value` of the
   desired type `T` (e.g. `float`, `int`, etc.).  In this situation, the class
   `Imputer<CustomImputation<T>>` will need to be used.

 * For more details, see
   [the source code](/src/mlpack/core/data/imputation_methods/custom_imputation.hpp).
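
The casting rule above can be illustrated with a small standalone class.  This
is only a sketch under stated assumptions: `FixedValueImputation` and its
`ImputeRow()` method are hypothetical names, it operates on a `std::vector`
instead of a matrix, and it is not how `CustomImputation<>` is implemented.

```c++
#include <cmath>
#include <vector>

// Hypothetical stand-in for a fixed-value strategy: stores `value` as type T
// (double by default) and casts it to the element type of the data when
// imputing.
template<typename T = double>
class FixedValueImputation
{
 public:
  explicit FixedValueImputation(const T value) : value(value) { }

  // Replace every entry of `row` equal to `missingValue` with the stored
  // value, cast to the element type of `row`.
  template<typename ElemType>
  void ImputeRow(std::vector<ElemType>& row, const ElemType missingValue) const
  {
    const ElemType replacement = static_cast<ElemType>(value);
    // NaN != NaN, so a NaN sentinel must be detected with std::isnan().
    const bool nanMissing = std::isnan(missingValue);
    for (ElemType& v : row)
    {
      const bool missing = nanMissing ? std::isnan(v) : (v == missingValue);
      if (missing)
        v = replacement;
    }
  }

 private:
  T value;
};
```

With the default `T = double`, a stored value of `2.5` is narrowed to `2.5f`
when applied to `float` data, which is the same kind of conversion described
for `CustomImputation<>` above.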

### Custom imputation strategies

Implementing a fully custom imputation strategy requires a class with only one
method, matching the following API:

```c++
class FullyCustomImputation
{
 public:
  // This function should replace values in `data` in row `dimension` that have
  // value `missingValue` with whatever the custom imputation strategy value
  // should be.
  //
  // If `columnMajor` is `false`, then values should be replaced in the column
  // `dimension` instead.
  //
  // Remember when checking for missing values that NaN != NaN---use
  // `std::isnan()` instead to check if `missingValue` is NaN, and to check if
  // values in `data` are NaN.
  //
  // `data` will be an Armadillo matrix type or equivalent type matching the
  // Armadillo API.
  template<typename MatType>
  void Impute(MatType& data,
              const typename MatType::elem_type missingValue,
              const size_t dimension,
              const bool columnMajor = true);
};
```
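
As a hedged illustration of that API, the sketch below defines a hypothetical
`MinImputation` strategy that replaces missing values with the smallest
non-missing value in the chosen dimension.  It relies only on the
Armadillo-style members `n_rows`, `n_cols`, `elem_type`, and `operator()(row,
col)`.  `TinyMat` is a minimal stand-in type, introduced here only so that the
example compiles without Armadillo; neither name is part of mlpack.

```c++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical strategy: replace missing values with the smallest non-missing
// value in the same dimension.
class MinImputation
{
 public:
  template<typename MatType>
  void Impute(MatType& data,
              const typename MatType::elem_type missingValue,
              const std::size_t dimension,
              const bool columnMajor = true)
  {
    typedef typename MatType::elem_type ElemType;

    // Number of entries in the chosen row (or column, if !columnMajor).
    const std::size_t n = columnMajor ? data.n_cols : data.n_rows;

    // Access entry i of the chosen row or column.
    auto at = [&](const std::size_t i) -> ElemType&
    {
      return columnMajor ? data(dimension, i) : data(i, dimension);
    };

    // NaN != NaN, so a NaN sentinel must be detected with std::isnan().
    const bool nanMissing = std::isnan(missingValue);
    auto isMissing = [&](const ElemType v)
    {
      return nanMissing ? std::isnan(v) : (v == missingValue);
    };

    // First pass: find the smallest non-missing value.
    bool found = false;
    ElemType minValue = ElemType();
    for (std::size_t i = 0; i < n; ++i)
    {
      const ElemType v = at(i);
      if (!isMissing(v) && (!found || v < minValue))
      {
        minValue = v;
        found = true;
      }
    }

    // Assumption: if every entry is missing, leave the data unchanged.
    if (!found)
      return;

    // Second pass: overwrite the missing entries.
    for (std::size_t i = 0; i < n; ++i)
    {
      if (isMissing(at(i)))
        at(i) = minValue;
    }
  }
};

// Minimal stand-in exposing the few Armadillo-style members used above.
// Storage is column-major.
struct TinyMat
{
  typedef double elem_type;

  std::size_t n_rows;
  std::size_t n_cols;
  std::vector<double> mem;

  TinyMat(const std::size_t rows, const std::size_t cols, const double fill) :
      n_rows(rows), n_cols(cols), mem(rows * cols, fill) { }

  double& operator()(const std::size_t r, const std::size_t c)
  {
    assert(r < n_rows && c < n_cols);
    return mem[c * n_rows + r];
  }
};
```

Following the [constructor](#constructor) section, a class like this would be
supplied as the `Strategy` template parameter of `Imputer`.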

## Simple examples

Replace all NaNs in dimension 2 of a random matrix with the mean in that
dimension.

```c++
// Create a random matrix with integer values that are either 0, 1, or NaN.
arma::mat data = arma::randi<arma::mat>(10, 20, arma::distr_param(0, 2));
// Replace the value 2 with NaN.
data.replace(2, std::nan(""));

mlpack::Imputer<mlpack::MeanImputation> imputer;

std::cout << "Dimension 2 before imputation:" << std::endl;
std::cout << data.row(2);

imputer.Impute(data, std::nan(""), 2);

std::cout << "Dimension 2 after imputation:" << std::endl;
std::cout << data.row(2);
```

---

Replace all NaNs in dimension 3 of a random matrix with the median in that
dimension.

```c++
// Create a random matrix with values in [0, 1].
arma::mat data(10, 20, arma::fill::randu);
// Replace anything at or below 0.5 with NaN, leaving values in (0.5, 1].
data.transform([](double val) { return (val <= 0.5) ? std::nan("") : val; });

mlpack::Imputer<mlpack::MedianImputation> imputer;

std::cout << "Dimension 3 before imputation:" << std::endl;
std::cout << data.row(3);

imputer.Impute(data, std::nan(""), 3);

std::cout << "Dimension 3 after imputation:" << std::endl;
std::cout << data.row(3);
```

---

Remove any columns where dimension 4 contains a NaN value from a random matrix.

```c++
// Create a random matrix with values in [0, 1].  In dimension 4, any value less
// than 0.3 will be turned into a NaN.
arma::mat data(10, 1000, arma::fill::randu);
data.row(4).transform(
    [](double val) { return (val < 0.3) ? std::nan("") : val; });

mlpack::Imputer<mlpack::ListwiseDeletion> imputer;

std::cout << "Dataset contains " << data.n_cols << " points before removing "
    << "points that have NaN in dimension 4." << std::endl;

imputer.Impute(data, std::nan(""), 4);

std::cout << "Dataset contains " << data.n_cols << " points after removing "
    << "points that have NaN in dimension 4." << std::endl;
```

---

Replace the value 0.0 in dimension 0 of a random matrix with the value 2.5.  Use
a 32-bit floating point matrix as the data type.

```c++
// Create a random matrix with integer values in [0, 5].
arma::fmat data = arma::randi<arma::fmat>(5, 20, arma::distr_param(0, 5));

mlpack::CustomImputation<> c(2.5); // Replace missing values with 2.5.
mlpack::Imputer<mlpack::CustomImputation<>> imputer(c);

std::cout << "Dimension 0 before imputation:" << std::endl;
std::cout << data.row(0);

imputer.Impute(data, 0.0, 0);

std::cout << "Dimension 0 after imputation:" << std::endl;
std::cout << data.row(0);
```