# mlpack in C++ quickstart

This page describes how you can quickly get started using mlpack in C++ and
gives a few examples of usage, and pointers to deeper documentation.

Keep in mind that mlpack also has interfaces to other languages, and quickstart
guides for those other languages are available too.  If that is what you are
looking for, see the quickstarts for [Python](python.md),
[the command line](cli.md), [Julia](julia.md), [R](r.md), or [Go](go.md).

## Installing mlpack

To use mlpack in C++, you only need the header files associated with the
libraries, and the dependencies Armadillo and ensmallen (detailed in the
[main README](../../README.md)).

The headers may already be pre-packaged for your distribution;
for instance, for Ubuntu and Debian you can simply run the command

```sh
sudo apt-get install libmlpack-dev
```

and on Fedora or Red Hat:

```sh
sudo dnf install mlpack-devel
```

You can also use a Docker image from Docker Hub,
which has mlpack headers already installed:

```sh
docker run -it mlpack/mlpack /bin/bash
```

If you prefer to build mlpack from scratch, see the
[main README](../../README.md).

***Note for Ubuntu LTS Users***: The libmlpack-dev version in the Ubuntu LTS
repositories may not always be the latest. This can lead to issues, such as
missing header files (e.g., `mlpack.hpp` missing in versions prior to 4.0).  To
ensure compatibility with the latest mlpack features and examples, we recommend
building mlpack from source, as explained in the [main README](../../README.md).

***Warning:*** on Ubuntu and Debian systems, older versions of OpenBLAS (0.3.26
and older) can over-use the number of cores on your system, causing slow
execution of mlpack programs, especially mlpack's test suite.  To prevent this,
set `OMP_NUM_THREADS` as detailed [in the test build
guide](../user/install.md#build-tests), or install the `libopenblas-openmp-dev`
package on Ubuntu or Debian and remove `libopenblas-pthread-dev`.  Ubuntu 24.04,
Debian bookworm, and older are all affected by this issue.

## Installing mlpack from vcpkg

The mlpack port in [vcpkg](https://github.com/Microsoft/vcpkg) is kept up to date by Microsoft team members and community contributors. You can download and install mlpack using the vcpkg dependency manager:

```shell
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh  # ./bootstrap-vcpkg.bat for Windows
./vcpkg integrate install
./vcpkg install mlpack
```

If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.

## Installing mlpack from Conan

The mlpack recipe in [Conan](https://conan.io/) is kept up to date by Conan team members and community contributors.

To set up Conan, follow the instructions on the [Conan downloads page](https://conan.io/downloads).

Install mlpack:

```shell
conan install --requires="mlpack/[*]" --build=missing
```

If the version is outdated or a new release is available, please [create an issue or pull request](https://github.com/conan-io/conan-center-index) on the conan-center-index repository.

## Simple quickstart example

As a really simple example of how to use mlpack in C++, let's do some simple
classification on a subset of the standard machine learning `covertype` dataset.
We'll first split the dataset into a training set and a test set, then we'll
train an mlpack random forest on the training data, and finally we'll print the
accuracy of the random forest on the test dataset.

The first step is to download the covertype dataset onto your system so that it
is available for the program.  The shell commands below do this:

```sh
# Get the dataset and unpack it.
wget https://www.mlpack.org/datasets/covertype-small.data.csv.gz
wget https://www.mlpack.org/datasets/covertype-small.labels.csv.gz
gunzip covertype-small.data.csv.gz covertype-small.labels.csv.gz
```

With that in place, let's write a C++ program to split the data and perform the
classification:

```c++
// Define these to print extra informational output and warnings.
#define MLPACK_PRINT_INFO
#define MLPACK_PRINT_WARN
#include <mlpack.hpp>

using namespace arma;
using namespace mlpack;
using namespace mlpack::tree;
using namespace std;

int main()
{
  // Load the datasets.
  mat dataset;
  Row<size_t> labels;
  if (!data::Load("covertype-small.data.csv", dataset))
    throw std::runtime_error("Could not read covertype-small.data.csv!");
  if (!data::Load("covertype-small.labels.csv", labels))
    throw std::runtime_error("Could not read covertype-small.labels.csv!");

  // Labels are 1-7, but we want 0-6 (we are 0-indexed in C++).
  labels -= 1;

  // Now split the dataset into a training set and test set, using 30% of the
  // dataset for the test set.
  mat trainDataset, testDataset;
  Row<size_t> trainLabels, testLabels;
  data::Split(dataset, labels, trainDataset, testDataset, trainLabels,
      testLabels, 0.3);

  // Create the RandomForest object and train it on the training data.
  RandomForest<> r(trainDataset,
                   trainLabels,
                   7 /* number of classes */,
                   10 /* number of trees */,
                   3 /* minimum leaf size */);

  // Compute and print the training error.
  Row<size_t> trainPredictions;
  r.Classify(trainDataset, trainPredictions);
  const double trainError =
      arma::accu(trainPredictions != trainLabels) * 100.0 / trainLabels.n_elem;
  cout << "Training error: " << trainError << "%." << endl;

  // Now compute predictions on the test points.
  Row<size_t> testPredictions;
  r.Classify(testDataset, testPredictions);
  const double testError =
      arma::accu(testPredictions != testLabels) * 100.0 / testLabels.n_elem;
  cout << "Test error: " << testError << "%." << endl;
}
```

Now, you can compile the program with your favorite C++ compiler; here's an
example command that uses `g++`, and assumes the file above is saved as
`cpp_quickstart_1.cpp`.

```sh
g++ -O3 -std=c++17 -o cpp_quickstart_1 cpp_quickstart_1.cpp -larmadillo -fopenmp
```

Then, you can run the program easily:

```sh
./cpp_quickstart_1
```

We can see by looking at the output that we achieve reasonably good accuracy on
the test dataset (about 76% here; exact numbers vary from run to run because
the split and the forest are randomized):

```
Training error: 19.4329%.
Test error: 24.17%.
```

It's easy to modify the code above to do more complex things, or to use
different mlpack learners, or to interface with other machine learning toolkits.
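
The percentages printed above are error rates, so accuracy is simply 100 minus
the error (a test error of 24.17% corresponds to 75.83% accuracy).  As a rough
sketch of that arithmetic, the helpers below reproduce the
`accu(predictions != labels) * 100.0 / labels.n_elem` computation using only
the standard library.  `ErrorPercent` and `AccuracyPercent` are hypothetical
names chosen for illustration; they are not part of mlpack or Armadillo.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an mlpack function): the percentage of positions
// where the prediction differs from the true label.
double ErrorPercent(const std::vector<std::size_t>& predictions,
                    const std::vector<std::size_t>& labels)
{
  // Both vectors must describe the same, non-empty set of points.
  assert(predictions.size() == labels.size() && !labels.empty());

  // Count the mismatches, as `accu(predictions != labels)` does.
  std::size_t mistakes = 0;
  for (std::size_t i = 0; i < labels.size(); ++i)
  {
    if (predictions[i] != labels[i])
      ++mistakes;
  }

  // Scale to a percentage of the total number of points.
  return mistakes * 100.0 / labels.size();
}

// Hypothetical helper: accuracy is the complement of the error.
double AccuracyPercent(const std::vector<std::size_t>& predictions,
                       const std::vector<std::size_t>& labels)
{
  return 100.0 - ErrorPercent(predictions, labels);
}
```

For instance, with predictions `{0, 1, 2, 3}` and true labels `{0, 1, 2, 4}`,
one of four points is wrong, so `ErrorPercent` returns 25 and `AccuracyPercent`
returns 75.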

## Using mlpack for movie recommendations

In this example, we'll train a collaborative filtering model using mlpack's `CF`
class.  We'll train it on the
[MovieLens dataset](https://grouplens.org/datasets/movielens/), and then use the
trained model to give recommendations.

First, download the MovieLens dataset:

```sh
wget https://www.mlpack.org/datasets/ml-20m/ratings-only.csv.gz
wget https://www.mlpack.org/datasets/ml-20m/movies.csv.gz
gunzip ratings-only.csv.gz movies.csv.gz
```

Next, we can use the following C++ code:

```cpp
// Define these to print extra informational output and warnings.
#define MLPACK_PRINT_INFO
#define MLPACK_PRINT_WARN
#include <mlpack.hpp>

using namespace arma;
using namespace mlpack;
using namespace mlpack::cf;
using namespace std;

int main()
{
  // Load the ratings.
  mat ratings;
  if (!data::Load("ratings-only.csv", ratings))
    throw std::runtime_error("Could not load ratings-only.csv!");
  // Now, load the names of the movies as a single-feature categorical dataset.
  // We can use `moviesInfo.UnmapString(i, 0)` to get the i'th string.
  data::DatasetInfo moviesInfo;
  mat movies; // This will be unneeded.
  if (!data::Load("movies.csv", movies, moviesInfo))
    throw std::runtime_error("Could not load movies.csv!");

  // Split the ratings into a training set and a test set, using 10% of the
  // dataset for the test set.
  mat trainRatings, testRatings;
  data::Split(ratings, trainRatings, testRatings, 0.1);

  // Train the CF model using RegularizedSVD as the decomposition algorithm.
  // Here we use a rank of 10 for the decomposition.
  CFType<RegSVDPolicy> cf(
      trainRatings,
      RegSVDPolicy(),
      5, /* number of users to use for similarity computations */
      10 /* rank of decomposition */);

  // Now compute the RMSE for the test set user and item combinations.  To do
  // this we must assemble the list of users and items.
  Mat<size_t> combinations(2, testRatings.n_cols);
  for (size_t i = 0; i < testRatings.n_cols; ++i)
  {
    combinations(0, i) = size_t(testRatings(0, i)); // (user)
    combinations(1, i) = size_t(testRatings(1, i)); // (item)
  }
  vec predictions;
  cf.Predict(combinations, predictions);
  const double rmse = norm(predictions - testRatings.row(2).t(), 2) /
      sqrt((double) testRatings.n_cols);
  std::cout << "RMSE of trained model is " << rmse << "." << endl;

  // Compute the top 10 movies for user 1.
  Col<size_t> users = { 1 };
  Mat<size_t> recommendations;
  cf.GetRecommendations(10, recommendations, users);

  // Now print each movie.
  cout << "Recommendations for user 1:" << endl;
  for (size_t i = 0; i < recommendations.n_elem; ++i)
  {
    cout << "  " << (i + 1) << ": "
        << moviesInfo.UnmapString(recommendations[i], 2) << endl;
  }
}
```

This can be compiled the same way as before, assuming the code is saved as
`cpp_quickstart_2.cpp`:

```sh
g++ -O3 -std=c++17 -o cpp_quickstart_2 cpp_quickstart_2.cpp -fopenmp -larmadillo
```

And then it can be easily run:

```sh
./cpp_quickstart_2
```

Here is some example output, showing that user 1 seems to have good taste in
movies:

```
RMSE of trained model is 0.795323.
Recommendations for user 1:
  1: Casablanca (1942)
  2: Pan's Labyrinth (Laberinto del fauno, El) (2006)
  3: Godfather, The (1972)
  4: Answer This! (2010)
  5: Life Is Beautiful (La Vita è bella) (1997)
  6: Adventures of Tintin, The (2011)
  7: Dark Knight, The (2008)
  8: Out for Justice (1991)
  9: Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
  10: Schindler's List (1993)
```
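
The RMSE on the first line of that output comes from the expression
`norm(predictions - actual, 2) / sqrt(n)` in the program: the Euclidean norm of
the prediction errors divided by the square root of the number of test ratings,
which is the same as the square root of the mean squared error.  A minimal
sketch of that formula using only the standard library is given below; `Rmse`
is a hypothetical helper name for illustration, not an mlpack or Armadillo
function.

```c++
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an mlpack function): root mean squared error
// between predicted and actual ratings.
double Rmse(const std::vector<double>& predictions,
            const std::vector<double>& actual)
{
  // Both vectors must describe the same, non-empty set of ratings.
  assert(predictions.size() == actual.size() && !actual.empty());

  // Accumulate the squared differences; the square root of this sum is the
  // 2-norm of the error vector.
  double sumSquares = 0.0;
  for (std::size_t i = 0; i < actual.size(); ++i)
  {
    const double diff = predictions[i] - actual[i];
    sumSquares += diff * diff;
  }

  // norm(diff, 2) / sqrt(n), written out explicitly.
  return std::sqrt(sumSquares) / std::sqrt((double) actual.size());
}
```

As a quick check, predictions `{1, 2, 3, 4}` against actual ratings
`{2, 3, 4, 5}` are each off by exactly one, so `Rmse` returns 1.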

## Next steps with mlpack

Now that you have done some simple work with mlpack, you have seen how it can
easily plug into a data science production workflow in C++.  But these two
examples have only shown a little bit of the functionality of mlpack.  Lots of
other functionality is available.

Some of this functionality is demonstrated in the
[examples repository](https://github.com/mlpack/examples).

A full list of all classes and functions that mlpack implements can be found by
browsing the well-commented source code.