# mlpack in C++ quickstart
This page describes how you can quickly get started using mlpack in C++, gives
a few usage examples, and points to deeper documentation.
Keep in mind that mlpack also has interfaces to other languages, and quickstart
guides for those other languages are available too. If that is what you are
looking for, see the quickstarts for [Python](python.md),
[the command line](cli.md), [Julia](julia.md), [R](r.md), or [Go](go.md).
## Installing mlpack
To use mlpack in C++, you only need mlpack's header files and its dependencies
Armadillo and ensmallen (detailed in the [main README](../../README.md)).
The headers may already be pre-packaged for your distribution;
for instance, for Ubuntu and Debian you can simply run the command
```sh
sudo apt-get install libmlpack-dev
```
and on Fedora or Red Hat:
```sh
sudo dnf install mlpack-devel
```
You can also use a Docker image from Docker Hub,
which has mlpack headers already installed:
```sh
docker run -it mlpack/mlpack /bin/bash
```
If you prefer to build mlpack from scratch, see the
[main README](../../README.md).
***Note for Ubuntu LTS Users***: The libmlpack-dev version in the Ubuntu LTS
repositories may not always be the latest. This can lead to issues, such as
missing header files (e.g., `mlpack.hpp` missing in versions prior to 4.0). To
ensure compatibility with the latest mlpack features and examples, we recommend
building mlpack from source, as explained in the [main README](../../README.md).
***Warning:*** on Ubuntu and Debian systems, OpenBLAS versions 0.3.26 and older
can oversubscribe your system's CPU cores, causing slow execution of mlpack
programs, especially mlpack's test suite. To prevent this,
set `OMP_NUM_THREADS` as detailed [in the test build
guide](../user/install.md#build-tests), or install the `libopenblas-openmp-dev`
package on Ubuntu or Debian and remove `libopenblas-pthread-dev`. Ubuntu 24.04,
Debian bookworm, and older are all affected by this issue.
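Once the headers and dependencies are installed, a quick way to check that
everything is set up correctly is to compile a tiny program that just prints the
mlpack version. This is only a sanity-check sketch; it assumes mlpack 4 or
newer, where the single `<mlpack.hpp>` header and `mlpack::util::GetVersion()`
are available.
```c++
#include <mlpack.hpp>
#include <iostream>

int main()
{
  // If this compiles, links, and runs, the mlpack headers and their
  // dependencies (Armadillo and ensmallen) are installed correctly.
  std::cout << "Using " << mlpack::util::GetVersion() << "." << std::endl;
}
```
Compile it the same way as the examples later on this page (e.g.
`g++ -O3 -std=c++17 -o check check.cpp -larmadillo -fopenmp`, assuming the file
is saved as `check.cpp`).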
## Installing mlpack from vcpkg
The mlpack port in [vcpkg](https://github.com/Microsoft/vcpkg) is kept up to date by Microsoft team members and community contributors. You can download and install mlpack using the vcpkg dependency manager:
```shell
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh # ./bootstrap-vcpkg.bat for Windows
./vcpkg integrate install
./vcpkg install mlpack
```
If the version is out of date, please [create an issue or pull request](https://github.com/Microsoft/vcpkg) on the vcpkg repository.
## Installing mlpack from Conan
The mlpack recipe in [Conan](https://conan.io/) is kept up to date by Conan team members and community contributors.
Follow the instructions on [this page](https://conan.io/downloads) to set up Conan.
Then install mlpack:
```shell
conan install --requires="mlpack/[*]" --build=missing
```
If the recipe is out of date or a new release is missing, please [create an issue or pull request](https://github.com/conan-io/conan-center-index) on the conan-center-index repository.
## Simple quickstart example
As a simple example of how to use mlpack in C++, let's do some basic
classification on a subset of the standard machine learning `covertype` dataset.
We'll first split the dataset into a training set and a test set, then we'll
train an mlpack random forest on the training data, and finally we'll print the
accuracy of the random forest on the test dataset.
The first step is to download the covertype dataset onto your system so that it
is available to the program. The shell commands below will do this:
```sh
# Get the dataset and unpack it.
wget https://www.mlpack.org/datasets/covertype-small.data.csv.gz
wget https://www.mlpack.org/datasets/covertype-small.labels.csv.gz
gunzip covertype-small.data.csv.gz covertype-small.labels.csv.gz
```
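If you want to double-check that the files downloaded and unpacked correctly, a
minimal loading sketch like the one below can be used (the filenames match the
commands above; everything else is just illustrative). Note that mlpack stores
data in column-major form, so each column of the matrix is one data point and
each row is one dimension.
```c++
#include <mlpack.hpp>
#include <iostream>

int main()
{
  arma::mat dataset;
  arma::Row<size_t> labels;

  // Load() returns false if a file could not be read.
  if (!mlpack::data::Load("covertype-small.data.csv", dataset) ||
      !mlpack::data::Load("covertype-small.labels.csv", labels))
  {
    std::cerr << "Could not load the covertype files!" << std::endl;
    return 1;
  }

  // mlpack matrices are column-major: each column is one data point and each
  // row is one dimension (feature).
  std::cout << "Loaded " << dataset.n_cols << " points with " << dataset.n_rows
            << " dimensions and " << labels.n_elem << " labels." << std::endl;
}
```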
With that in place, let's write a C++ program to split the data and perform the
classification:
```c++
// Define these to print extra informational output and warnings.
#define MLPACK_PRINT_INFO
#define MLPACK_PRINT_WARN
#include <mlpack.hpp>
using namespace arma;
using namespace mlpack;
using namespace std;
int main()
{
// Load the datasets.
mat dataset;
Row<size_t> labels;
if (!data::Load("covertype-small.data.csv", dataset))
throw std::runtime_error("Could not read covertype-small.data.csv!");
if (!data::Load("covertype-small.labels.csv", labels))
throw std::runtime_error("Could not read covertype-small.labels.csv!");
// Labels are 1-7, but we want 0-6 (we are 0-indexed in C++).
labels -= 1;
// Now split the dataset into a training set and test set, using 30% of the
// dataset for the test set.
mat trainDataset, testDataset;
Row<size_t> trainLabels, testLabels;
data::Split(dataset, labels, trainDataset, testDataset, trainLabels,
testLabels, 0.3);
// Create the RandomForest object and train it on the training data.
RandomForest<> r(trainDataset,
trainLabels,
7 /* number of classes */,
10 /* number of trees */,
3 /* minimum leaf size */);
// Compute and print the training error.
Row<size_t> trainPredictions;
r.Classify(trainDataset, trainPredictions);
const double trainError =
arma::accu(trainPredictions != trainLabels) * 100.0 / trainLabels.n_elem;
cout << "Training error: " << trainError << "%." << endl;
// Now compute predictions on the test points.
Row<size_t> testPredictions;
r.Classify(testDataset, testPredictions);
const double testError =
arma::accu(testPredictions != testLabels) * 100.0 / testLabels.n_elem;
cout << "Test error: " << testError << "%." << endl;
}
```
Now, you can compile the program with your favorite C++ compiler; here's an
example command that uses `g++`, and assumes the file above is saved as
`cpp_quickstart_1.cpp`.
```sh
g++ -O3 -std=c++17 -o cpp_quickstart_1 cpp_quickstart_1.cpp -larmadillo -fopenmp
```
Then, you can run the program easily:
```sh
./cpp_quickstart_1
```
We can see from the output that we achieve reasonably good accuracy on the
test dataset (roughly 75%):
```
Training error: 19.4329%.
Test error: 24.17%.
```
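If you would like a more detailed breakdown than the overall error, a confusion
matrix can be built from the test predictions with a couple of lines of
Armadillo. This is just a sketch that assumes it is added at the end of `main()`
in `cpp_quickstart_1.cpp`, after `testPredictions` has been computed:
```c++
// Rows are true classes, columns are predicted classes.
arma::Mat<size_t> confusion(7, 7, arma::fill::zeros);
for (size_t i = 0; i < testLabels.n_elem; ++i)
  ++confusion(testLabels[i], testPredictions[i]);
confusion.print("Confusion matrix (rows: true class, columns: predicted):");
```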
It's easy to modify the code above to do more complex things, use different
mlpack learners, or interface with other machine learning toolkits.
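For example, swapping in a different classifier usually only requires changing
the model construction; the `Classify()` calls stay the same. The sketch below
trains a single `DecisionTree<>` instead of the random forest and then saves it
to disk with `data::Save()`; the filename and object name used here are
arbitrary examples.
```c++
// Train a single decision tree on the same training data; Classify() can then
// be used exactly as with the random forest above.
DecisionTree<> tree(trainDataset,
                    trainLabels,
                    7 /* number of classes */,
                    3 /* minimum leaf size */);

// Trained mlpack models can be serialized to disk and reloaded later.
if (!data::Save("tree.bin", "tree", tree))
  throw std::runtime_error("Could not save model to tree.bin!");
```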
## Using mlpack for movie recommendations
In this example, we'll train a collaborative filtering model using mlpack's `CF`
class. We'll train it on the
[MovieLens dataset](https://grouplens.org/datasets/movielens/), and then use the
trained model to give recommendations.
First, download the MovieLens dataset:
```sh
wget https://www.mlpack.org/datasets/ml-20m/ratings-only.csv.gz
wget https://www.mlpack.org/datasets/ml-20m/movies.csv.gz
gunzip ratings-only.csv.gz movies.csv.gz
```
Next, we can use the following C++ code:
```cpp
// Define these to print extra informational output and warnings.
#define MLPACK_PRINT_INFO
#define MLPACK_PRINT_WARN
#include <mlpack.hpp>
using namespace arma;
using namespace mlpack;
using namespace std;
int main()
{
// Load the ratings.
mat ratings;
if (!data::Load("ratings-only.csv", ratings))
throw std::runtime_error("Could not load ratings-only.csv!");
// Now, load the names of the movies as a single-feature categorical dataset.
// We can use `moviesInfo.UnmapString(i, 0)` to get the i'th string.
data::DatasetInfo moviesInfo;
mat movies; // This will be unneeded.
if (!data::Load("movies.csv", movies, moviesInfo))
throw std::runtime_error("Could not load movies.csv!");
// Split the ratings into a training set and a test set, using 10% of the
// dataset for the test set.
mat trainRatings, testRatings;
data::Split(ratings, trainRatings, testRatings, 0.1);
// Train the CF model using RegularizedSVD as the decomposition algorithm.
// Here we use a rank of 10 for the decomposition.
CFType<RegSVDPolicy> cf(
trainRatings,
RegSVDPolicy(),
5, /* number of users to use for similarity computations */
10 /* rank of decomposition */);
// Now compute the RMSE for the test set user and item combinations. To do
// this we must assemble the list of users and items.
Mat<size_t> combinations(2, testRatings.n_cols);
for (size_t i = 0; i < testRatings.n_cols; ++i)
{
combinations(0, i) = size_t(testRatings(0, i)); // (user)
combinations(1, i) = size_t(testRatings(1, i)); // (item)
}
vec predictions;
cf.Predict(combinations, predictions);
const double rmse = norm(predictions - testRatings.row(2).t(), 2) /
sqrt((double) testRatings.n_cols);
std::cout << "RMSE of trained model is " << rmse << "." << endl;
// Compute the top 10 movies for user 1.
Col<size_t> users = { 1 };
Mat<size_t> recommendations;
cf.GetRecommendations(10, recommendations, users);
// Now print each movie.
cout << "Recommendations for user 1:" << endl;
for (size_t i = 0; i < recommendations.n_elem; ++i)
{
cout << " " << (i + 1) << ". "
<< moviesInfo.UnmapString(recommendations[i], 2) << "." << endl;
}
}
```
This can be compiled the same way as before, assuming the code is saved as
`cpp_quickstart_2.cpp`:
```sh
g++ -O3 -std=c++17 -o cpp_quickstart_2 cpp_quickstart_2.cpp -fopenmp -larmadillo
```
And then it can be easily run:
```sh
./cpp_quickstart_2
```
Here is some example output, showing that user 1 seems to have good taste in
movies:
```
RMSE of trained model is 0.795323.
Recommendations for user 1:
  1. Casablanca (1942).
  2. Pan's Labyrinth (Laberinto del fauno, El) (2006).
  3. Godfather, The (1972).
  4. Answer This! (2010).
  5. Life Is Beautiful (La Vita è bella) (1997).
  6. Adventures of Tintin, The (2011).
  7. Dark Knight, The (2008).
  8. Out for Justice (1991).
  9. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964).
  10. Schindler's List (1993).
```
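Recommendations can also be generated for several users at once:
`GetRecommendations()` fills one column of the output matrix per requested user.
As a sketch, the single-user recommendation step at the end of `main()` in
`cpp_quickstart_2.cpp` could be replaced with something like this:
```cpp
// Compute the top 5 movies for users 1, 2, and 3; column u of
// `recommendations` holds the recommendations for users[u].
Col<size_t> users = { 1, 2, 3 };
Mat<size_t> recommendations;
cf.GetRecommendations(5, recommendations, users);

for (size_t u = 0; u < users.n_elem; ++u)
{
  cout << "Recommendations for user " << users[u] << ":" << endl;
  for (size_t i = 0; i < recommendations.n_rows; ++i)
  {
    cout << "  " << (i + 1) << ". "
         << moviesInfo.UnmapString(recommendations(i, u), 2) << "." << endl;
  }
}
```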
## Next steps with mlpack
Now that you have done some simple work with mlpack, you have seen how it can
easily plug into a data science production workflow in C++. These two examples
have shown only a small part of mlpack's functionality; much more is available.
Some of this functionality is demonstrated in the
[examples repository](https://github.com/mlpack/examples).
A full list of all classes and functions that mlpack implements can be found by
browsing the well-commented source code.