File: split.md

# Dataset splitting

mlpack provides simple functions for splitting a dataset into a training set
and a test set.

 * [`data::Split()`](#datasplit): split a dataset into a training set
   and test set, optionally with labels and weights.

 * [`data::StratifiedSplit()`](#datastratifiedsplit): perform a stratified
   split, ensuring that the training and test set have the same ratios of each
   label.

---

## `data::Split()`

 * `data::Split(input, trainData, testData, testRatio, shuffleData=true)`
 * `data::Split(input, inputLabels, trainData, testData, trainLabels, testLabels, testRatio, shuffleData=true)`
 * `data::Split(input, inputLabels, inputWeights, trainData, testData, trainLabels, testLabels, trainWeights, testWeights, testRatio, shuffleData=true)`
   - Perform a standard train/test split, with a factor of `testRatio` of the
     data stored in the test set.
   - If `shuffleData` is `false`, the first points in the dataset will be used
     for the training set, and the last points for the test set.
   - `inputLabels` can be specified if there are also labels that should be
     split; `inputLabels.n_cols` should be equal to `input.n_cols`.
   - `inputWeights` can be specified if there are also instance weights that
     should be split; `inputWeights.n_cols` should be equal to `input.n_cols`.

See more details on each parameter in the [Parameters](#parameters) section.

---

## `data::StratifiedSplit()`

 * `data::StratifiedSplit(input, inputLabels, trainData, testData, trainLabels,
   testLabels, testRatio, shuffleData=true)`
   - Perform a stratified train/test split, with a factor of `testRatio` of the
     dataset stored in the test set.
   - A stratified split ensures that the ratio of classes in the training and
     test sets matches the ratio of classes in the original dataset.  This can
     be useful for highly imbalanced datasets.
   - If `shuffleData` is `false`, the first points in the dataset for each class
     will be used for the training set, and the last points for the test set.

See more details on each parameter in the [Parameters](#parameters) section.

---

## Parameters

| **name** | **type** | **description** | **default** |
|----------|----------|-----------------|-------------|
| `input` | [`arma::mat`](../matrices.md) | [Column-major](../matrices.md#representing-data-in-mlpack) data matrix. | _(N/A)_ |
| `inputLabels` | [`arma::Row<size_t>`](../matrices.md) | Labels for data matrix.  Should have length `input.n_cols`. | _(N/A)_ |
| `inputWeights` | [`arma::rowvec`](../matrices.md) | Weights for data matrix. Should have length `input.n_cols`. | _(N/A)_ |
| `trainData` | [`arma::mat&`](../matrices.md) | Matrix to store training points in.  Will be set to size `input.n_rows` x `((1.0 - testRatio) * input.n_cols)`. | _(N/A)_ |
| `testData` | [`arma::mat&`](../matrices.md) | Matrix to store test points in. Will be set to size `input.n_rows` x `testRatio * input.n_cols`. | _(N/A)_ |
| `trainLabels` | [`arma::Row<size_t>&`](../matrices.md) | Vector to store training labels in.  Will be set to length `trainData.n_cols`. | _(N/A)_ |
| `testLabels` | [`arma::Row<size_t>&`](../matrices.md) | Vector to store test labels in.  Will be set to length `testData.n_cols`. | _(N/A)_ |
| `trainWeights` | [`arma::rowvec&`](../matrices.md) | Vector to store training weights in.  Will be set to length `trainData.n_cols`. | _(N/A)_ |
| `testWeights` | [`arma::rowvec&`](../matrices.md) | Vector to store test weights in.  Will be set to length `testData.n_cols`. | _(N/A)_ |
| `testRatio` | `double` | Fraction of columns in `input` to use for test set. Typically between 0.1 and 0.25. | _(N/A)_ |
| `shuffleData` | `bool` | If `true`, then training and test sets are sampled randomly from `input`. | `true` |

***Notes:***

 - Any matrix or cube type matching the Armadillo API can be used for `input`,
   `trainData`, and `testData` (e.g. `arma::fmat`, `arma::sp_mat`, `arma::cube`,
   etc.).
    * All three matrices must have the same type.
    * [`arma::field<>`](https://arma.sourceforge.net/docs.html#field) types can
      also be used.

 - Any row vector, matrix, or cube type matching the Armadillo API can be used
   for `inputLabels`, `trainLabels`, and `testLabels` (e.g. `arma::urowvec`,
   `arma::Row<unsigned short>`, `arma::cube`, etc.).
    * All three label parameters must have the same type.
    * [`arma::field<>`](https://arma.sourceforge.net/docs.html#field) types may
      also be used, so long as the given `field` has the same number of columns
      as `input`.

 - Any row vector, matrix, or cube type matching the Armadillo API can be used
   for `inputWeights`, `trainWeights`, and `testWeights` (e.g. `arma::frowvec`,
   `arma::fmat`, `arma::cube`, etc.).
    * All three weight parameters must have the same type.
    * [`arma::field<>`](https://arma.sourceforge.net/docs.html#field) types may
      also be used, so long as the given `field` has the same number of columns
      as `input`.

 - Splitting is done on *columns*, so if vector types are given, they should be
   row vectors (e.g. `arma::rowvec`, `arma::Row<size_t>`, etc.).

## Example usage

Split the unlabeled `cloud` dataset, using 20% of the dataset for the test set.

```c++
// See https://datasets.mlpack.org/cloud.csv.
arma::mat dataset;
mlpack::data::Load("cloud.csv", dataset, true);

arma::mat trainData, testData;

// Split the data, using 20% of the data for the test set.
mlpack::data::Split(dataset, trainData, testData, 0.2);

// Print the size of each matrix.
std::cout << "Full data size:     " << dataset.n_rows << " x " << dataset.n_cols
    << "." << std::endl;

std::cout << "Training data size: " << trainData.n_rows << " x "
    << trainData.n_cols << "." << std::endl;
std::cout << "Test data size:     " << testData.n_rows << " x "
    << testData.n_cols << "." << std::endl;
```

---

Split the mixed categorical `telecom_churn` dataset and associated responses for
regression, using 25% of the dataset for the test set.  Use 32-bit floating
point elements to represent both the data and responses.

```c++
// See https://datasets.mlpack.org/telecom_churn.arff.
arma::fmat dataset;
mlpack::data::DatasetInfo info; // Holds which dimensions are categorical.
mlpack::data::Load("telecom_churn.arff", dataset, info, true);

// See https://datasets.mlpack.org/telecom_churn.responses.csv.
arma::frowvec labels;
mlpack::data::Load("telecom_churn.responses.csv", labels, true);

arma::fmat trainData, testData;
arma::frowvec trainLabels, testLabels;

// Split the data, using 25% of the data for the test set.
// Note that Split() can accept many different types for the data and the
// labels---here we pass arma::frowvec instead of arma::Row<size_t>.
mlpack::data::Split(dataset, labels, trainData, testData, trainLabels,
    testLabels, 0.25);

// Print the size of each matrix.
std::cout << "Full data size:       " << dataset.n_rows << " x "
    << dataset.n_cols << "." << std::endl;
std::cout << "Full labels size:     " << labels.n_elem << "." << std::endl;

std::cout << std::endl;

std::cout << "Training data size:   " << trainData.n_rows << " x "
    << trainData.n_cols << "." << std::endl;
std::cout << "Training labels size: " << trainLabels.n_elem << "." << std::endl;
std::cout << "Test data size:       " << testData.n_rows << " x "
    << testData.n_cols << "." << std::endl;
std::cout << "Test labels size:     " << testLabels.n_elem << "." << std::endl;
```

---

Split the `movielens` dataset, which is a sparse matrix.  Don't shuffle when
splitting.

```c++
// See https://datasets.mlpack.org/movielens-100k.csv.
arma::sp_mat dataset;
mlpack::data::Load("movielens-100k.csv", dataset, true);

arma::sp_mat trainData, testData;

// Split the dataset without shuffling.
mlpack::data::Split(dataset, trainData, testData, 0.2, false);

// Print the first point of the dataset and the training set (these will be the
// same because we did not shuffle during splitting).
std::cout << "First point of full dataset:" << std::endl;
std::cout << dataset.col(0).t() << std::endl;

std::cout << "First point of training set:" << std::endl;
std::cout << trainData.col(0).t() << std::endl;

// Print the last point of the dataset and the test set (these will also be the
// same).
std::cout << "Last point of full dataset:" << std::endl;
std::cout << dataset.col(dataset.n_cols - 1).t() << std::endl;

std::cout << "Last point of test set:" << std::endl;
std::cout << testData.col(testData.n_cols - 1).t() << std::endl;
```

---

Perform a stratified sampling of the `covertype` dataset, printing the
percentage of each class in the original dataset and in the split datasets.

```c++
// See https://datasets.mlpack.org/covertype.data.csv.
arma::mat dataset;
mlpack::data::Load("covertype.data.csv", dataset, true);

// See https://datasets.mlpack.org/covertype.labels.csv.
arma::Row<size_t> labels;
mlpack::data::Load("covertype.labels.csv", labels, true);

arma::mat trainData, testData;
arma::Row<size_t> trainLabels, testLabels;

// Perform a stratified split, keeping 15% of the data as a test set.
mlpack::data::StratifiedSplit(dataset, labels, trainData, testData, trainLabels,
    testLabels, 0.15);

// Now compute the percentage of each label in the dataset.
const size_t numClasses = arma::max(labels) + 1;
arma::vec classPercentages(numClasses, arma::fill::zeros);
for (size_t i = 0; i < labels.n_elem; ++i)
  ++classPercentages[(size_t) labels[i]];
classPercentages /= labels.n_elem;

std::cout << "Percentages of each class in the full dataset:" << std::endl;
for (size_t i = 0; i < numClasses; ++i)
{
  std::cout << " - Class " << i << ": " << 100.0 * classPercentages[i] << "%."
      << std::endl;
}

// Now compute the percentage of each label in the training set.
classPercentages.zeros();
for (size_t i = 0; i < trainLabels.n_elem; ++i)
  ++classPercentages[(size_t) trainLabels[i]];
classPercentages /= trainLabels.n_elem;

std::cout << "Percentages of each class in the training set:" << std::endl;
for (size_t i = 0; i < numClasses; ++i)
{
  std::cout << " - Class " << i << ": " << 100.0 * classPercentages[i] << "%."
      << std::endl;
}

// Finally compute the percentage of each label in the test set.
classPercentages.zeros();
for (size_t i = 0; i < testLabels.n_elem; ++i)
  ++classPercentages[(size_t) testLabels[i]];
classPercentages /= testLabels.n_elem;

std::cout << "Percentages of each class in the test set:" << std::endl;
for (size_t i = 0; i < numClasses; ++i)
{
  std::cout << " - Class " << i << ": " << 100.0 * classPercentages[i] << "%."
      << std::endl;
}
```