File: matrices.md

package info (click to toggle)
mlpack 4.6.2-1
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 31,272 kB
  • sloc: cpp: 226,039; python: 1,934; sh: 1,198; lisp: 414; makefile: 85
file content (391 lines) | stat: -rw-r--r-- 13,573 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
# Matrices in mlpack

mlpack uses Armadillo matrices for linear algebra support.  Armadillo is a fast
C++ matrix library which uses advanced template metaprogramming techniques to
provide the fastest possible linear algebra operations.

<center><p><img src="https://arma.sourceforge.net/img/armadillo_logo2.png" alt="Armadillo logo"></p></center>

Detailed documentation on Armadillo can be found on [the Armadillo
website](https://arma.sourceforge.net/docs.html).

Nonetheless, there are a few further caveats for mlpack Armadillo usage.

 * [An Armadillo primer](#an-armadillo-primer)
 * [Representing data in mlpack](#representing-data-in-mlpack)
 * [Loading data](#loading-data)
 * [Loading and using categorical data](#loading-and-using-categorical-data)
 * [Alternate matrix types](#alternate-matrix-types)
 * [Adapting from other toolkits (Eigen, etc.)](#adapting-from-other-toolkits-eigen-etc)

## An Armadillo primer

The Armadillo syntax is straightforward and is aimed at ease-of-use and
readability.  To give a flavor of what a linear algebra program using Armadillo
looks like, see the trivial (contrived) program below that performs some basic
matrix operations.

```c++
// Create a 10x15 matrix with random elements.
arma::mat m(10, 15, arma::fill::randu);

std::cout << "Size of m: " << m.n_rows << " x " << m.n_cols << "." << std::endl;

// Sum all elements in the matrix.
const double sumVal = arma::accu(m);
std::cout << "Sum of all elements: " << sumVal << "." << std::endl;

// Sum the elements in each column.
arma::rowvec sums = arma::sum(m, 0);
std::cout << "Sums in each column: " << sums;

// Add 1 to all elements.
m += 1;

// Subtract sums from each row.
m.each_row() -= sums;

// Print an individual element.
std::cout << "m(3, 4) is: " << m(3, 4) << "." << std::endl;
```

For more information on Armadillo, see the following resources:

 * [Armadillo documentation](https://arma.sourceforge.net/docs.html)
 * [Armadillo example
    program](https://arma.sourceforge.net/docs.html#example_prog)
 * [Armadillo/MATLAB syntax conversion
   table](https://arma.sourceforge.net/docs.html#syntax)

## Representing data in mlpack

Armadillo matrices, unlike numpy and some other toolkits, store data in a
***column-major*** format.  This means that each column is located in contiguous
memory; i.e., `x(0, 0)` is adjacent to `x(1, 0)` in memory.

This means that, for the vast majority of machine learning methods, it is faster
to store ***observations as columns*** and ***dimensions as rows***.  This is
counter to most standard machine learning texts!  It also has some implications
for linear algebra operations; for instance, computing the Gram matrix of a
matrix `X` is typically expressed as `X^T X`, but when using column-major
matrices, the expression must be `X X^T`.

In general, the following Armadillo types are commonly used inside mlpack:

 * `arma::mat`: datasets and general-purpose matrices
 * `arma::Row<size_t>`: integer response data, e.g., labels for classification
   datasets
 * `arma::rowvec`: floating-point response data, e.g., responses for regression
   datasets
 * `arma::vec`: general-purpose column vectors
 * `arma::sp_mat`, `arma::fmat`: alternate types for representing data; see
   [Alternate matrix types](#alternate-matrix-types)

## Loading data

mlpack provides two simple functions for loading and saving data matrices in a
column-major form:

 * `data::Load(filename, matrix, fatal=false, transpose=true, type=FileType::AutoDetect)` ([full documentation](load_save.md#numeric-data))
 * `data::Save(filename, matrix, fatal=false, transpose=true, type=FileType::AutoDetect)` ([full documentation](load_save.md#numeric-data))

As an example, consider the following CSV file:

```sh
$ cat data.csv
3,3,3,3,0
3,4,4,3,0
3,4,4,3,0
3,3,4,3,0
3,6,4,3,0
2,4,4,3,0
2,4,4,1,0
3,3,3,2,0
3,4,4,2,0
3,4,4,2,0
3,3,4,2,0
3,6,4,2,0
2,4,4,2,0
```

The following program will load the data, print information about it, and save a
modified dataset to disk.

```c++
// Load data from `data.csv` into `m`.  Throw an exception on failure (i.e. set
// `fatal` to `true`).
arma::mat m;
mlpack::data::Load("data.csv", m, true);

// Since mlpack uses column-major data,
//
//  - each column corresponds to a data point!
//  - each row corresponds to a dimension!
//
std::cout << "The matrix in 'data.csv' has: " << std::endl;
std::cout << " - " << m.n_cols << " points." << std::endl;
std::cout << " - " << m.n_rows << " dimensions." << std::endl;

std::cout << "The second point in the dataset: " << std::endl;
std::cout << m.col(1).t();

// Now modify the matrix and save to a different format (space-separated
// values).
m += 3;
mlpack::data::Save("data-mod.txt", m);
```

Although Armadillo does provide a `.load()` and `.save()` member function for
matrices, the `data::Load()` and `data::Save()` functions offer additional
flexibility, and ensure that data is saved and loaded in a column-major format.

## Loading and using categorical data

Some mlpack techniques support mixed categorical data, e.g., data where some
dimensions take only categorical values (e.g. `0`, `1`, `2`, etc.).  String data
and other non-numerical data can be represented as categorical values, and
mlpack has support to load mixed categorical data:

 * The `data::DatasetInfo` auxiliary class stores information about whether each
   dimension is numeric or categorical.  ([full
   documentation](load_save.md#datadatasetinfo))
 * `data::Load(filename, matrix, info, fatal=false, transpose=true)` ([full
   documentation](load_save.md#loading-categorical-data))

For example, consider the following CSV file that contains strings:

```sh
$ cat mixed_string_data.csv
3,"hello",3,"f",0
3,"goodbye",4,"f",0
3,"goodbye",4,"e",0
3,"hello",4,"d",0
3,"hello",4,"d",0
2,"hello",4,"d",0
2,"hello",4,"d",0
3,"goodbye",3,"f",0
3,"goodbye",4,"f",0
3,"hello",4,"f",0
3,"hello",4,"c",0
3,"hello",4,"f",0
2,"hello",4,"c",0
```

The following program will load the data file, print information about
categorical dimensions, and prepare the data for use with an mlpack algorithm
that supports mixed categorical data.

```c++
// Load data from `mixed_string_data.csv` into `m`.  Throw an exception on
// failure (i.e. set `fatal` to `true`).  This populates the `info` object.
arma::mat m;
mlpack::data::DatasetInfo info;
mlpack::data::Load("mixed_string_data.csv", m, info, true);

// Print information about the data.
std::cout << "The matrix in 'mixed_string_data.csv' has: " << std::endl;
std::cout << " - " << m.n_cols << " points." << std::endl;
std::cout << " - " << info.Dimensionality() << " dimensions." << std::endl;

// Print which dimensions are categorical.
for (size_t d = 0; d < info.Dimensionality(); ++d)
{
  if (info.Type(d) == mlpack::data::Datatype::categorical)
  {
    std::cout << " - Dimension " << d << " is categorical with "
        << info.NumMappings(d) << " distinct categories." << std::endl;
  }
}

// Modify the third point to be 4,"wonderful",1,"c",0.
// Note that we manually map the string values; MapString() returns the category
// for a given value.
m(0, 2) = 4;
m(1, 2) = info.MapString<double>("wonderful", 1); // Create new third category.
m(2, 2) = 1;
m(3, 2) = info.MapString<double>("c", 1);
m(4, 2) = 0;

// `m` can now be used with any mlpack algorithm that supports categorical data.
```

Not every mlpack method supports categorical data.  Below are the list of
methods that do have categorical data support:

 * [`DecisionTree`](methods/decision_tree.md)
 * [`DecisionTreeRegressor`](methods/decision_tree_regressor.md)
 * [`RandomForest`](methods/random_forest.md)
 * [`HoeffdingTree`](methods/hoeffding_tree.md)

## Alternate matrix types

mlpack's documentation focuses on the use of the `arma::mat`, `arma::vec`, and
`arma::rowvec` types, with an underlying `double` numeric type, e.g., 64-bit
floating-point.  But, many mlpack algorithms and support utilities have support
for alternate matrix types and element types:

 * Many methods, such as
   [`LogisticRegression`](methods/logistic_regression.md#advanced-functionality-different-element-types),
   allow specifying the matrix types as a template parameter.

 * Some methods, such as
   [`DecisionTree`](methods/decision_tree.md#using-different-element-types),
   accept different matrix types to `Train()`, `Classify()`, or `Predict()`,
   without needing to specify an explicit template parameter.

 * In general, any matrix type that supports the Armadillo API can be used; this
   includes:
    - Single-precision floating point matrices (`arma::fmat`, `arma::frowvec`,
      `arma::fvec`)
    - Sparse matrices (`arma::sp_mat`, `arma::sp_fmat`)
    - GPU matrices via [Bandicoot](https://coot.sourceforge.io) (`coot::mat`,
      `coot::fmat`) --- ***(note: support is under development and still
       experimental)***

---

A simple example of using single-precision floating point data to train an
[`AdaBoost`](methods/adaboost.md) model is below.

```c++
// 1000 random points in 10 dimensions, using 32-bit precision (float).
arma::fmat dataset(10, 1000, arma::fill::randu);
// Random labels for each point, totaling 5 classes.
arma::Row<size_t> labels =
    arma::randi<arma::Row<size_t>>(1000, arma::distr_param(0, 4));

// Train in the constructor, using floating-point data.
// The weak learner type is now a floating-point Perceptron.
using PerceptronType = mlpack::Perceptron<
    mlpack::SimpleWeightUpdate,
    mlpack::ZeroInitialization,
    arma::fmat>;
mlpack::AdaBoost<PerceptronType, arma::fmat> ab(dataset, labels, 5);

// Create test data (500 points).
arma::fmat testDataset(10, 500, arma::fill::randu);
arma::Row<size_t> predictions;
ab.Classify(testDataset, predictions);
// Now `predictions` holds predictions for the test dataset.

// Print some information about the test predictions.
std::cout << arma::accu(predictions == 3) << " test points classified as class "
    << "3." << std::endl;
```

---

A simple example of using sparse 32-bit floating point data to train a
[`LogisticRegression`](methods/logistic_regression.md) model is below.

```c++
// Create random, sparse 100-dimensional data.
arma::sp_fmat dataset;
dataset.sprandu(100, 5000, 0.3);
arma::Row<size_t> labels =
    arma::randi<arma::Row<size_t>>(5000, arma::distr_param(0, 1));

// Train with L2 regularization penalty parameter of 0.1.
mlpack::LogisticRegression<arma::sp_fmat> lr(dataset, labels, 0.1);

// Now classify a test point.
arma::sp_fvec point;
point.sprandu(100, 1, 0.3);

size_t prediction;
arma::fvec probabilitiesVec;
lr.Classify(point, prediction, probabilitiesVec);

std::cout << "Prediction for random test point: " << prediction << "."
    << std::endl;
std::cout << "Class probabilities for random test point: "
    << probabilitiesVec.t();
```

<!-- TODO: a simple Bandicoot example! -->

## Adapting from other toolkits (Eigen, etc.)

In general, C++ linear algebra toolkits store data in a column-major
representation, and transitioning between toolkits is a matter of getting access
to the underlying memory.

---

Copy an [Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) matrix
into an Armadillo matrix.

```c++
// Note: this will only work if the Eigen matrix is stored in column-major
// order.  See https://eigen.tuxfamily.org/dox/group__TopicStorageOrders.html
// for more details.
Eigen::MatrixXd m;
const size_t rows = 10;
const size_t cols = 20;
m.setRandom(rows, cols); // 10x20 random matrix.

// Copy into an Armadillo matrix.
arma::mat mCopy(&m(0, 0), rows, cols);
```

---

Copy an [XTensor](https://xtensor.readthedocs.io/en/latest/) matrix into an
Armadillo matrix.

```c++
// Note: this will only work correctly if the layout_type of the XTensor matrix
// is column-major (i.e. xt::layout_type::column_major).  See
// https://xtensor.readthedocs.io/en/latest/container.html for more details.
const size_t rows = 10;
const size_t cols = 20;

// Create a random 10 x 20 matrix with normally distributed values.
// Note that we must ensure that the matrix is laid out in column-major form.
xt::xarray<double, xt::layout_type::column_major> m =
    xt::random::randn<double>({ rows, cols });

// Copy into an Armadillo matrix.
arma::mat mCopy(m.data(), rows, cols);
```

---

Make an Armadillo matrix that is an alias of an
[Eigen](https://eigen.tuxfamily.org/index.php?title=Main_Page) matrix.  Note
that changes to the Eigen matrix will be reflected in the Armadillo matrix, and
vice versa.

*If the Eigen matrix is deallocated, the Armadillo matrix will become invalid.
Be careful!  [More details
here](https://arma.sourceforge.net/docs.html#adv_constructors_mat).*

```c++
// Note: this will only work if the Eigen matrix is stored in column-major
// order.  See https://eigen.tuxfamily.org/dox/group__TopicStorageOrders.html
// for more details.
Eigen::MatrixXd m;
const size_t rows = 10;
const size_t cols = 20;
m.setRandom(rows, cols); // 10x20 random matrix.

// Make an Armadillo matrix that is an alias of the Eigen matrix.  This avoids
// the copy, but is potentially dangerous: be careful that the Eigen matrix is
// not deleted while the Armadillo matrix is in use!
//
// See https://arma.sourceforge.net/docs.html#adv_constructors_mat
arma::mat mAlias(&m(0, 0), rows, cols, false, true);
```

---

Copy an Armadillo matrix to an Eigen matrix.

```c++
const size_t rows = 10;
const size_t cols = 20;
arma::mat m(10, 20, arma::fill::randu);

// Construct the Eigen matrix by using a map of the Armadillo memory.
Eigen::MatrixXd eigenM(Eigen::Map<Eigen::MatrixXd>(m.memptr(), rows, cols));
```