\documentclass{article}
%\VignetteEngine{knitr::knitr}
%\VignetteIndexEntry{Dimensionality Reduction}
%\VignetteKeyword{Dimensionality Reduction}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{tikz}
\usetikzlibrary{trees}
\usepackage[sectionbib,round]{natbib}
\title{\pkg{dimRed} and \pkg{coRanking}---Unifying Dimensionality Reduction in R}
\author{Guido Kraemer \and Markus Reichstein \and Miguel D.\ Mahecha}
% these are taken from RJournal.sty:
\makeatletter
\DeclareRobustCommand\code{\bgroup\@noligs\@codex}
\def\@codex#1{\texorpdfstring%
{{\normalfont\ttfamily\hyphenchar\font=-1 #1}}%
{#1}\egroup}
\newcommand{\kbd}[1]{{\normalfont\texttt{#1}}}
\newcommand{\key}[1]{{\normalfont\texttt{\uppercase{#1}}}}
\DeclareRobustCommand\samp{`\bgroup\@noligs\@sampx}
\def\@sampx#1{{\normalfont\texttt{#1}}\egroup'}
\newcommand{\var}[1]{{\normalfont\textsl{#1}}}
\let\env=\code
\newcommand{\file}[1]{{`\normalfont\textsf{#1}'}}
\let\command=\code
\let\option=\samp
\newcommand{\dfn}[1]{{\normalfont\textsl{#1}}}
% \acronym is effectively disabled since not used consistently
\newcommand{\acronym}[1]{#1}
\newcommand{\strong}[1]{\texorpdfstring%
{{\normalfont\fontseries{b}\selectfont #1}}%
{#1}}
\let\pkg=\strong
\newcommand{\CRANpkg}[1]{\href{https://CRAN.R-project.org/package=#1}{\pkg{#1}}}%
\let\cpkg=\CRANpkg
\newcommand{\ctv}[1]{\href{https://CRAN.R-project.org/view=#1}{\emph{#1}}}
\newcommand{\BIOpkg}[1]{\href{https://www.bioconductor.org/packages/release/bioc/html/#1.html}{\pkg{#1}}}
\makeatother
\begin{document}
\maketitle
\abstract{ %
This document is based on the manuscript of \citet{kraemer_dimred_2018} which
was published in the R-Journal and has been modified and extended to fit the
format of a package vignette and to match the extended functionality of the
\pkg{dimRed} package.
``Dimensionality reduction'' (DR) is a widely used approach to find low
dimensional and interpretable representations of data that are natively
embedded in high-dimensional spaces. %
DR can be realized by a plethora of methods with different properties,
objectives, and, hence, (dis)advantages. The resulting low-dimensional data
embeddings are often difficult to compare with objective criteria. %
Here, we introduce the \CRANpkg{dimRed} and \CRANpkg{coRanking} packages for
the R language. %
These open source software packages enable users to easily access multiple
classical and advanced DR methods using a common interface. %
The packages also provide quality indicators for the embeddings and easy
visualization of high dimensional data. %
The \pkg{coRanking} package provides the functionality for assessing DR methods in the
co-ranking matrix framework. %
In tandem, these packages allow for uncovering complex structures in high
dimensional data. %
Currently 15 DR methods are available in the package, some of which were not
previously available to R users. %
Here, we outline the \pkg{dimRed} and \pkg{coRanking} packages and
make the implemented methods understandable to the interested reader. %
}
\section{Introduction}
\label{sec:intro}
Dimensionality Reduction (DR) essentially aims to find low dimensional
representations of data while preserving their key properties. %
Many methods exist in the literature, optimizing different criteria: %
maximizing the variance or the statistical independence of the projected data, %
minimizing the reconstruction error under different constraints, %
or optimizing for different error metrics, %
just to name a few. %
Choosing an inadequate method may imply that much of the underlying structure
remains undiscovered. %
Often the structures of interest in a data set can be well represented by fewer
dimensions than exist in the original data. %
Data compression of this kind has the additional benefit of making the encoded
information easier to interpret in further analysis tasks
like classification or regression problems. %
For example, the morphology of a plant's leaves, stems, and seeds reflects the
environmental conditions the species usually grow in (e.g.,\ plants with large
soft leaves will never grow in a desert but might have an advantage in a humid
and shadowy environment). %
Because the morphology of the entire plant depends on the environment, many
morphological combinations will never occur in nature and the morphological
space of all plant species is tightly constrained. %
\citet{diaz_global_2016} found that out of six observed morphological characteristics
only two embedding dimensions were enough to represent three quarters of the total
observed variability. %
DR is a widely used approach for the detection of structure in multivariate
data, and has applications in a variety of fields. %
In climatology, DR is used to find the modes of some phenomenon, e.g.,\ the first
Empirical Orthogonal Function of monthly mean sea surface temperature of a given
region over the Pacific is often linked to the El Ni\~no Southern
Oscillation or
ENSO \citep[e.g.,\ ][]{hsieh_nonlinear_2004}. %
In ecology the comparison of sites with different species abundances is a
classical multivariate problem: each observed species adds an extra dimension,
and because species are often bound to certain habitats, there is a lot of
redundant information. Using DR is a popular technique to represent the sites in
few dimensions, e.g.,\ \citet{aart_distribution_1972} matches wolf spider
communities to habitat and \citet{morrall_soil_1974} matches soil fungi data to
soil types. (In ecology the general name for DR is ordination or indirect
gradient analysis.) %
Today, hyperspectral satellite imagery collects so many bands that it is very
difficult to analyze and interpret the data directly. %
Summarizing the data into a set of few, yet independent, components is one way to
reduce complexity \citep[e.g.,\ see][]{laparra_dimensionality_2015}. %
DR can also be used to visualize the interiors of deep neural networks
\citep[e.g.,\ see ][]{han_deep_2016}, where the high dimensionality comes from
the large number of weights used in a neural network and convergence can be
visualized by means of DR\@. %
We could find many more example applications here but this is not the main focus
of this publication. %
The difficulty in applying DR is that each DR method is designed to maintain
certain aspects of the original data and therefore may be appropriate for one
task and inappropriate for another. %
Most methods also have parameters to tune and follow different assumptions. The
quality of the outcome may strongly depend on their tuning, which adds
additional complexity. %
DR methods can be modeled after physical models with attracting and repelling
forces (Force Directed Methods), projections onto low dimensional planes (PCA,
ICA), divergence of statistical distributions (SNE family), or the reconstruction
of local spaces or points by their neighbors (LLE). %
As an example for how changing internal parameters of a method can have a great
impact, the breakthrough for Stochastic Neighborhood Embedding (SNE) methods
came when a Student's $t$-distribution was used instead of a normal distribution
to model probabilities in low dimensional space to avoid the ``crowding
problem'', that is,\ a sphere in high dimensional space has a much larger volume
than in low dimensional space and may contain too many points to be represented
accurately in few dimensions. %
The $t$-distribution allows medium distances to be accurately represented in
few dimensions by larger distances due to its heavier tails. %
The result is called $t$-SNE, which is especially good at preserving local
structures in very few dimensions; this feature made $t$-SNE useful for a wide
array of data visualization tasks and the method became much more popular than
standard SNE (around six times more citations of
\citet{van_der_maaten_visualizing_2008} compared to
\citet{hinton_stochastic_2003} in Scopus \citep{noauthor_scopus_nodate}). %
There are a number of software packages for other languages providing collections of methods: In
Python there is scikit-learn \citep{scikit-learn}, which contains a module for
DR. In Julia we currently find ManifoldLearning.jl for nonlinear and
MultivariateStats.jl for linear DR methods. %
There are several toolboxes for DR implemented in Matlab
\citep{van_der_maaten_dimensionality_2009,
arenas-garcia_kernel_2013}. The Shogun
toolbox \citep{soeren_sonnenburg_2017_1067840} implements a variety of methods for
dimensionality reduction in C++ and offers bindings for many common high level
languages (including R, but the installation is anything but simple, as
there is no CRAN package). %
However, there is no comprehensive package for R and none of the aforementioned
software packages provides means to consistently compare the quality
of different methods for DR. %
For many applications it can be difficult to objectively find the right method
or parameterization for the DR task. %
This paper presents the \pkg{dimRed} and \pkg{coRanking} packages for
the popular programming language R. Together, they
provide a standardized interface to various dimensionality reduction methods and quality
metrics for embeddings. They are implemented using the S4 class system
of R, making the packages
both easy to use and to extend.
The design goal for these packages is to enable researchers, who may not necessarily be experts in DR, to
apply the methods in their own work and to objectively identify the
most suitable
methods for their data. %
This paper provides an overview of the methods collected in the
packages and contains examples as to how
to use the packages. %
The notation in this paper will be as follows: the data matrix is $X = [x_i]_{1\leq i \leq n}^T \in
\mathbb{R}^{n\times p}$ with observations $x_i \in \mathbb{R}^p$. %
These observations may be transformed prior to the dimensionality reduction
step (e.g.,\ centering and/or standardization) resulting in $X' = [x'_i]_{1\leq i
\leq n}^T \in \mathbb{R}^{n\times p}$. %
A DR method then embeds each vector in $X'$ onto a vector in $Y = [y_i]_{1\leq i
\leq n}^T \in \mathbb{R}^{n\times q}$ with $y_i \in \mathbb{R}^q$, ideally
with $q \ll p$. %
Some methods provide an explicit mapping $f(x'_i) = y_i$. Some even offer an
inverse mapping $f^{-1}(y_{i}) = \hat x'_{i}$, such that one can reconstruct a
(usually approximate) sample from the low-dimensional representation. %
For some methods, pairwise distances between points are needed; we set $d_{ij} =
d(x_{i}, x_{j})$ and $\hat{d}_{ij} = d(y_i, y_j)$, where $d$ is some appropriate
distance function.
When referring to \code{functions} in the \pkg{dimRed} package or base R, only the
function name is mentioned; functions from other packages are
referenced with their namespace, as in \code{package::function}.
\begin{figure}[htbp]
\centering
\input{classification_tree.tex}
\caption{%
Classification of dimensionality reduction methods. Methods
in bold face are implemented in \pkg{dimRed}.
Modified from \citet{van_der_maaten_dimensionality_2009}.
}\label{fig:classification}
\end{figure}
\section{Dimensionality Reduction Methods}
\label{sec:dimredtec}
In the following section we do not aim for an exhaustive explanation of every
method in \pkg{dimRed} but rather to provide a general idea of how the
methods work. %
An overview and classification of the most commonly used DR methods can be found
in Figure~\ref{fig:classification}.
In all methods, parameters have to be optimized or decisions have to be made,
even if it is just about the preprocessing steps of data. %
The \pkg{dimRed} package tries to make the optimization process for parameters as easy as
possible, but, where feasible, the parameter space should be narrowed down using
prior knowledge. %
Often decisions can be made based on theoretical knowledge. For example,\ sometimes an
analysis requires data to be kept in their original scales and sometimes this is
exactly what has to be avoided, e.g.,\ when comparing different physical
units. %
Sometimes decisions based on the experience of others can be made, e.g.,\ the
Gaussian kernel is probably the most universal kernel and therefore should be
tested first if there is a choice. %
All methods presented here have the embedding dimensionality, $q$, as a
parameter (or \code{ndim} as a parameter for \code{embed}). %
For methods based on eigenvector decomposition, the result generally does not
depend on the number of dimensions, i.e.,\ the first dimension will be the same,
no matter if we decide to calculate only two dimensions or more. %
If more dimensions are added, more information is maintained, the first
dimension is the most important and higher dimensions are successively less
important. %
This means that a method based on eigenvalue decomposition only has to be run
once if one wishes to compare the embedding in different dimensions. %
In optimization based methods this is generally not the case: the number of
dimensions has to be chosen a priori, embeddings of 2 and 3 dimensions may
vary significantly, and there is no ordered importance of dimensions. %
This means that comparing dimensions of optimization-based methods is
computationally much more expensive. %
We try to give the computational complexity of the methods, although computation
times may differ greatly depending on the actual implementation. %
R is an interpreted language, so all parts of an algorithm that are implemented
in R will tend to be slow compared to methods that call efficient
implementations in a compiled language. %
Methods where most of the computing time is spent on eigenvalue decomposition
have very efficient implementations, as R uses optimized linear algebra
libraries, although eigenvalue decomposition itself does not scale very well in
naive implementations ($\mathcal{O}(n^3)$).
\subsection{PCA}
\label{sec:pca}
Principal Component Analysis (PCA) is the most basic technique for reducing
dimensions. It dates back to \citet{pearson_lines_1901}. PCA finds a linear
projection ($U$) of the high dimensional space into a low dimensional space $Y =
XU$, maintaining maximum variance of the data. It is based on solving the
following eigenvalue problem:
\begin{equation}
(C_{XX}-\lambda_k I)u_k=0\label{eq:pca}
\end{equation}
where $C_{XX} = \frac 1 n X^TX$ is the covariance matrix, $\lambda_k$ and $u_k$
are the $k$-th eigenvalue and eigenvector, and $I$ is the identity matrix. %
The equation has several solutions for different values of $\lambda_k$ (leaving
aside the trivial solution $u_k = 0$). %
PCA can be efficiently applied to large data sets, because it computationally
scales as $\mathcal{O}(np^2 + p^3)$, that is, it scales linearly with the number of
samples, and R uses specialized linear algebra libraries for such
computations.
PCA is a rotation around the origin and forward and inverse mappings exist. %
PCA may suffer from a scale problem, i.e.,\ when one variable dominates the
variance simply because it is on a larger scale. To remedy this, the data can be
scaled to zero mean and unit variance, depending on the use case and on whether this is
necessary or desired. %
Base R implements PCA in the functions \code{prcomp} and \code{princomp}; but
several other implementations exist, e.g.,\ \BIOpkg{pcaMethods} from Bioconductor
which implements versions of PCA that can deal with missing data. %
The \pkg{dimRed} package wraps \code{prcomp}.
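As a minimal sketch of how this looks in practice (not evaluated here), the
following call embeds a test data set with PCA; the parameters \code{center} and
\code{scale.} are assumed to be passed through to \code{prcomp}:
<<eval=FALSE>>=
library(dimRed)
## load an example data set; scaling to unit variance is optional and
## depends on whether the variables are on comparable scales
s_curve <- loadDataSet("3D S Curve", n = 1000)
emb_pca <- embed(s_curve, "PCA", ndim = 2, center = TRUE, scale. = FALSE)
plot(emb_pca, type = "2vars")
@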
\subsection{kPCA}
\label{sec:kpca}
Kernel Principal Component Analysis (kPCA) extends PCA to deal with nonlinear
dependencies among variables. %
The idea behind kPCA is to map the data into a high dimensional space using a
possibly non-linear function $\phi$ and then to perform a PCA in this high
dimensional space. %
Some mathematical tricks are used for efficient computation. %
If the columns of $X$ are centered around $0$, then the principal components can
also be computed from the inner product matrix $K = XX^T$. %
Due to this way of calculating a PCA, we do not need to explicitly map all points
into the high dimensional space and do the calculations there; it is enough to
obtain the inner product matrix or kernel matrix $K \in \mathbb{R}^{n\times n}$
of the mapped points \citep{scholkopf_nonlinear_1998}. %
Here is an example calculating the kernel matrix using a Gaussian kernel:
\begin{equation}\label{eq:gauss}
K_{ij} = \phi(x_i)^T \phi(x_j) = \kappa(x_i, x_j) = \exp\left(
-\frac{\| x_i- x_j\|^2}{2 \sigma^2}
\right),
\end{equation}
where $\sigma$ is a length scale parameter accounting for the width of the
kernel. %
The other trick used is known as the ``representer theorem.'' The interested
reader is referred to \citet{scholkopf_generalized_2001}.
The kPCA method is very flexible and there exist many kernels for special
purposes. The most common kernel function is the Gaussian kernel
(Equation\ \ref{eq:gauss}). %
The flexibility comes at the price that the method has to be finely
tuned for the data set because some parameter combinations are simply
unsuitable for certain data. %
The method is not suitable for very large data sets, because memory
scales with $\mathcal{O}(n^2)$ and computation time with
$\mathcal{O}(n^3)$. %
Diffusion Maps, Isomap, Locally Linear Embedding, and some other techniques can
be seen as special cases of kPCA; in these cases, an out-of-sample extension using the Nyström
formula can be applied \citep{bengio_learning_2004}. %
This also enables applications to larger data sets, where an embedding is trained
on a sub-sample of the data and the remaining data are then embedded using the Nyström
formula.
Kernel PCA in R is implemented in the \CRANpkg{kernlab} package using the function
\code{kernlab::kpca}, and supports a number of kernels and
user defined functions. For details see the help page for \code{kernlab::kpca}.
The \pkg{dimRed} package wraps \code{kernlab::kpca} but additionally
provides forward and inverse methods \citep{bakir_learning_2004} which can be
used to fit out-of-sample data or to visualize the transformation of the data
space. %
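As an illustrative sketch (not evaluated), kPCA can be called through the common
interface; the arguments \code{kernel} and \code{kpar} are assumed to be passed
on to \code{kernlab::kpca}:
<<eval=FALSE>>=
## a Gaussian kernel; sigma has to be tuned to the length scale of the data
emb_kpca <- embed(loadDataSet("3D S Curve", n = 500), "kPCA",
                  kernel = "rbfdot", kpar = list(sigma = 0.1))
plot(emb_kpca, type = "2vars")
@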
\subsection{Classical Scaling}
\label{sec:classscale}
What today is called Classical Scaling was first introduced by
\citet{torgerson_multidimensional_1952}. It uses an eigenvalue decomposition of
a transformed distance matrix to find an embedding that maintains the distances
of the distance matrix. %
The method works for the same reason that kPCA works, i.e.,\ classical
scaling can be seen as a kPCA with kernel $x^Ty$. %
A matrix of Euclidean distances can be transformed into an inner product matrix
by some simple transformations and therefore yields the same result as a
PCA\@. %
Classical scaling is conceptually more general than PCA in that arbitrary
distance matrices can be used, i.e.,\ the method does not even need the original
coordinates, just a distance matrix $D$. %
Then it tries to find an embedding $Y$ so that $\hat d_{ij}$ is as similar to
$d_{ij}$ as possible.
The disadvantage is that it is computationally much more demanding, i.e.,\
an eigenvalue decomposition of an $n\times n$ matrix has to be computed.
This step requires $\mathcal{O}(n^2)$ memory and $\mathcal{O}(n^3)$
computation time, while PCA requires only the eigenvalue decomposition
of a $p\times p$ matrix and usually $n \gg p$. %
R implements classical scaling in the \code{cmdscale}
function. %
The \pkg{dimRed} package wraps \code{cmdscale} and allows the specification
of arbitrary distance functions for calculating the distance matrix. Additionally
a forward method is implemented.
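As a sketch (not evaluated), classical scaling can be combined with a user
supplied distance function; the argument name \code{d} for the distance
function is an assumption here:
<<eval=FALSE>>=
## classical scaling on Manhattan distances instead of the default
## Euclidean distances
emb_mds <- embed(loadDataSet("3D S Curve", n = 500), "MDS",
                 d = function(x) dist(x, method = "manhattan"))
plot(emb_mds, type = "2vars")
@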
\subsection{Isomap}
\label{sec:isomap}
As Classical Scaling can deal with arbitrarily defined distances,
\citet{tenenbaum_global_2000} suggested to approximate the
structure of the manifold by using geodesic distances. %
In practice, a graph is created by either keeping only
the connections between every point and its $k$ nearest neighbors to
produce a $k$-nearest neighbor graph ($k$-NNG), or simply by keeping all
distances smaller than a value $\varepsilon$ producing an
$\varepsilon$-neighborhood graph ($\varepsilon$-NNG). %
Geodesic distances are approximated by shortest path distances on the
graph, and classical scaling is used to find an embedding in fewer
dimensions. This leads to an ``unfolding'' of possibly convoluted
structures (see Figure~\ref{fig:knn}).
Isomap's computational cost is dominated by the eigenvalue decomposition and
therefore scales with $\mathcal{O}(n^3)$. %
Other related techniques can use more efficient algorithms
because the distance matrix becomes sparse due to a different preprocessing.
In R, Isomap is implemented in the \CRANpkg{vegan} package: the function
\code{vegan::isomap} calculates an Isomap embedding and \code{vegan::isomapdist}
calculates a geodesic distance matrix. %
The \pkg{dimRed} package uses its own implementation. This implementation is
faster mainly due to using a KD-tree for the nearest neighbor search (from the
\CRANpkg{RANN} package) and to a faster implementation for the shortest path
search in the $k$-NNG (from the \CRANpkg{igraph} package). %
The implementation in \pkg{dimRed} also includes a forward method that can be
used to train the embedding on a subset of data points and then use these points
to approximate an embedding for the remaining points. This technique is
generally referred to as landmark Isomap \citep{de_silva_sparse_2004}. %
\subsection{Locally Linear Embedding}
\label{sec:lle}
Points that lie on a manifold in a high dimensional space can be
reconstructed through linear combinations of their neighborhoods if the
manifold is well sampled and the neighborhoods lie on a locally linear
patch. %
These reconstruction weights, $W$, are the same in the high dimensional
space as in the internal coordinates of the manifold. %
Locally Linear Embedding \citep[LLE; ][]{roweis_nonlinear_2000} is a
technique that constructs a weight matrix
$W \in \mathbb{R}^{n\times n}$ with elements $w_{ij}$ so that
\begin{equation}
\sum_{i=1}^n \bigg\| x_i-
\sum_{j=1}^{n} w_{ij}x_j \bigg\|^2\label{eq:lle}
\end{equation}
is minimized under the constraint that $w_{ij} = 0 $ if $x_j$ does not belong
to the neighborhood and the constraint that $\sum_{j=1}^n w_{ij} = 1$. %
Finally the embedding is made in such a way that the following cost function is
minimized for $Y$,
\begin{equation}
\sum_{i=1}^n\bigg\| y_i - \sum_{j=1}^n w_{ij}y_j
\bigg\|^2.\label{eq:lle2}
\end{equation}
This can be solved using an eigenvalue decomposition.
Conceptually the method is similar to Isomap but it is
computationally much nicer because the weight matrix is
sparse and there exist efficient solvers. %
In R, LLE is implemented by the package \CRANpkg{lle}; the embedding can be
calculated with \code{lle::lle}.
Unfortunately the implementation does not make use of the sparsity of the weight matrix
$W$. %
The manifold must be well sampled and the neighborhood size must be
chosen appropriately for LLE to give good results. %
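A minimal sketch (not evaluated); the neighborhood size parameter is assumed to
be called \code{knn}, as for Isomap:
<<eval=FALSE>>=
## LLE is sensitive to the neighborhood size, so knn should be tuned
emb_lle <- embed(loadDataSet("3D S Curve", n = 1000), "LLE", knn = 25)
plot(emb_lle, type = "2vars")
@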
\subsection{Laplacian Eigenmaps}
\label{sec:laplaceigenmaps}
Laplacian Eigenmaps were originally developed under the name spectral clustering
to separate non-convex clusters. %
Later it was also used for graph embedding and DR
\citep{belkin_laplacian_2003}. %
A number of variants have been proposed. %
First, a graph is constructed, usually from a distance matrix; the graph can be
made sparse by keeping only the $k$ nearest neighbors, or by specifying an
$\varepsilon$ neighborhood. %
Then, a similarity matrix $W$ is calculated using a Gaussian kernel (see Equation
\ref{eq:gauss}); if $c = 2 \sigma^2 = \infty$, then all distances are treated
equally, and the smaller $c$ is, the more emphasis is given to differences in
distance. %
The degree of vertex $i$ is $d_i = \sum_{j=1}^n w_{ij}$ and the degree
matrix, $D$, is the diagonal matrix with entries $d_i$. %
From these we can form the graph Laplacian $L = D - W$; there are then several ways
to proceed, an overview of which can be found in \citet{luxburg_tutorial_2007}. %
The \pkg{dimRed} package implements the algorithm from
\citet{belkin_laplacian_2003}. Analogously to LLE, Laplacian Eigenmaps
avoid high computational cost by creating a sparse matrix and not
having to estimate the distances between all pairs of points. %
Then the eigenvectors corresponding to the lowest eigenvalues larger
than $0$ of either the matrix $L$ or the normalized Laplacian
$D^{-1/2}LD^{-1/2}$ are computed and form the embedding.
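A minimal sketch with default parameters (not evaluated), assuming the method
name \code{"LaplacianEigenmaps"} as listed by \code{dimRedMethodList()}:
<<eval=FALSE>>=
emb_leim <- embed(loadDataSet("3D S Curve", n = 1000), "LaplacianEigenmaps")
plot(emb_leim, type = "2vars")
@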
\subsection{Diffusion Maps}
\label{sec:isodiffmaplle}
Diffusion Maps \citep{coifman_diffusion_2006} take a distance matrix
as input and calculate the transition probability matrix $P$ of a
diffusion process between the points to approximate the manifold. %
Then the embedding is done by an eigenvalue decomposition of $P$ to
calculate the coordinates of the embedding. %
The algorithm for calculating Diffusion Maps shares some elements with
the way Laplacian Eigenmaps are calculated. %
Both algorithms depart from the same weight matrix; Diffusion Maps
calculate the transition probability on the graph after $t$ time steps
and do the embedding on this probability matrix.
The idea is to simulate a diffusion process between the nodes of the
graph, which is more robust to short-circuiting than the $k$-NNG from
Isomap (see the bottom right panel of Figure~\ref{fig:knn}). %
Diffusion maps in R are accessible via the
\code{diffusionMap::diffuse()} function, which is available in the
\CRANpkg{diffusionMap} package. %
Additional points can be approximated into an existing embedding using
the Nyström formula \citep{bengio_learning_2004}. %
The implementation in \pkg{dimRed} is based on the
\code{diffusionMap::diffuse} function.
% , which does not contain an
% approximation for unequally sampled manifolds
% \citep{coifman_geometric_2005}. %
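A minimal sketch with default parameters (not evaluated), assuming the method
name \code{"DiffusionMaps"}:
<<eval=FALSE>>=
emb_diff <- embed(loadDataSet("3D S Curve", n = 1000), "DiffusionMaps")
plot(emb_diff, type = "2vars")
@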
\subsection{non-Metric Dimensional Scaling}
\label{sec:nmds}
While Classical Scaling and derived methods (see section
\nameref{sec:classscale}) use eigenvector decomposition to embed the data in
such a way that the given distances are maintained, non-Metric Dimensional
Scaling \citep[nMDS, ][]{kruskal_multidimensional_1964,kruskal_nonmetric_1964}
uses optimization methods to reach the same goal. %
Therefore a stress function,
\begin{equation}
\label{eq:stress}
S = \sqrt{\frac{\sum_{i<j} {( d_{ij} - \hat{d}_{ij} )}^2}
{\sum_{i<j}d_{ij}^2}},
\end{equation}
is used, and the algorithm tries to embed $y_i$ in such a way that the order of
the $d_{ij}$ is the same as the order of the $\hat d_{ij}$. Because optimization
methods can fit a wide variety of problems, there are very loose limits set on
the form of the error or stress function. %
For instance \citet{mahecha_nonlinear_2007} found that nMDS using geodesic
distances can be almost as powerful as Isomap for embedding biodiversity
patterns. %
Because of the flexibility of nMDS, there is a whole package in R devoted to
Multidimensional Scaling, \CRANpkg{smacof} \citep{leeuw_multidimensional_2009}. %
Several packages provide implementations for nMDS in R, for example
\CRANpkg{MASS} and \pkg{vegan} with the functions \code{MASS::isoMDS} and
\code{vegan::monoMDS}. Related methods include Sammon's Mapping, which
can be found as \code{MASS::sammon}. %
The \pkg{dimRed} package wraps \code{vegan::monoMDS}.
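A minimal sketch with default parameters (not evaluated), assuming the method
name \code{"nMDS"}:
<<eval=FALSE>>=
emb_nmds <- embed(loadDataSet("3D S Curve", n = 500), "nMDS")
plot(emb_nmds, type = "2vars")
@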
\subsection{Force Directed Methods}
\label{sec:graph}
The data $X$ can be considered as a graph with weighted edges, where the weights
are the distances between points. %
Force directed algorithms treat the edges of the graph as springs, or the nodes
as electrically charged particles, resulting in attractive or repulsive forces
between the nodes; the algorithms then try to minimize the overall energy of the
graph,
\begin{equation}
  \label{eq:graphenergy}
  E = \sum_{i<j}k_{ij} {( d_{ij} - \hat{d}_{ij} )}^2,
\end{equation}
where $k_{ij}$ is the spring constant for the spring connecting points
$i$ and $j$.
Graph embedding algorithms generally suffer from long running times
(though compared to
other methods presented here they do not scale as badly) and many local
optima. %
This is why a number of methods have been developed that try to deal with some
of the shortcomings, for example the Kamada-Kawai \citep{kamada_algorithm_1989},
Fruchterman-Reingold \citep{fruchterman_graph_1991}, and DrL
\citep{martin_dr.l:_2007} algorithms. %
A number of graph embedding algorithms are included in the
\CRANpkg{igraph} package; they can be accessed using the
\code{igraph::layout\_with\_*} function family. The \pkg{dimRed} package
only wraps the three algorithms mentioned above; there are many others
which are of little interest for dimensionality reduction. %
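A minimal sketch (not evaluated); the method names \code{"KamadaKawai"},
\code{"FruchtermanReingold"}, and \code{"DrL"} are assumed to be those listed by
\code{dimRedMethodList()}:
<<eval=FALSE>>=
## force directed embedding using the Kamada-Kawai layout from igraph
emb_kk <- embed(loadDataSet("3D S Curve", n = 500), "KamadaKawai")
plot(emb_kk, type = "2vars")
@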
\subsection{$t$-SNE}
\label{sec:tsne}
Stochastic Neighbor Embedding \citep[SNE; ][]{hinton_stochastic_2003} is a
technique that minimizes the Kullback-Leibler divergence of scaled
similarities of the points $i$ and $j$ in a high dimensional space,
$p_{ij}$, and a low
dimensional space, $q_{ij}$:
\begin{equation}\label{eq:kldivergence}
KL(P \| Q) = \sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}}.
\end{equation}
SNE uses a Gaussian kernel (see Equation~\ref{eq:gauss}) to compute similarities in
a high and a low dimensional space. The $t$-Distributed Stochastic Neighborhood
Embedding \citep[$t$-SNE; ][]{van_der_maaten_visualizing_2008} improves on SNE by
using a $t$-Distribution as a kernel in low dimensional space. %
Because of the heavy-tailed $t$-distribution, $t$-SNE maintains local
neighborhoods of the data better and penalizes wrong embeddings of dissimilar
points. %
This property makes it especially suitable to represent clustered data and
complex structures in few dimensions.
The $t$-SNE method has one parameter, perplexity, to tune. This determines the
neighborhood size of the kernels used.
The general runtime of $t$-SNE is $\mathcal{O}(n^2)$, but an efficient
implementation using tree search algorithms that scales as $\mathcal{O}(n\log n)$
exists and can be found in the \CRANpkg{Rtsne} package in R. %
The $t$-SNE implementation in \pkg{dimRed} wraps the \pkg{Rtsne}
package.
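A minimal sketch (not evaluated); the \code{perplexity} argument is assumed to
be passed on to \code{Rtsne::Rtsne}:
<<eval=FALSE>>=
## perplexity controls the effective neighborhood size of the kernels
emb_tsne <- embed(loadDataSet("3D S Curve", n = 1000), "tSNE", perplexity = 30)
plot(emb_tsne, type = "2vars")
@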
There exist a number of derived techniques for dimensionality reduction, e.g.,\
NeRV \citep{venna_information_2010} and JNE \citep{lee_type_2013}, that improve
results but for which there do not yet exist packages on CRAN implementing them.
% \begin{figure}[tb]
% \centering
% \includegraphics[width=.95\textwidth]{plots/kpca_grid}
% \caption[kPCA data space transformation.]{Transformation of a
% two-dimensional data space by kPCA\@. It can be observed that kPCA
% does not extrapolate well outside of the training range}
% \label{fig:kpca}
% \end{figure}
\subsection{ICA}
\label{sec:ica}
Independent Component Analysis (ICA) interprets the data $X$ as a mixture
of independent signals, e.g.,\ a number of sound sources recorded by
several microphones, and tries to ``un-mix'' them to find the original
signals in the recorded signals. %
ICA is a linear rotation of the data, just as PCA, but instead of
recovering the maximum variance, it recovers statistically independent
components. %
% To do so a pre-whitening is performed on $X$, so that $x_{\cdot i}$
% are uncorrelated and have variance $1$.
A signal matrix $S$ and a mixing matrix $A$ are estimated so that
$X = AS$.
There are a number of algorithms for ICA, the most widely used is
fastICA \citep{hyvarinen_fast_1999} because it provides a fast and robust
way to estimate $A$ and $S$.
% It considers two signals to be
% independent if they are non-gaussian, the algorithm uses a fast
% approximation of the negentropy \citep{comon_independent_1994} measure to make two
% variables independent.
FastICA maximizes a measure for non-Gaussianity called negentropy $J$
\citep{comon_independent_1994}. This is equivalent to minimizing
mutual information between the resulting components. %
Negentropy $J$ is defined as follows:
%mutual information $i$ is defined as:
\begin{align}\label{eq:negentropy}
H(u) &= - \int f(u) \log f(u) \, \mathrm{d}u, \\
J(u) &= H(u_{\text{gauss}}) - H(u),
\end{align}
% $$I(y_1, y_2, \ldots, y_n) = J(\mathbf{y}) - \sum_i J(y_i)$$
where $u = (u_1, \ldots, u_n)^T$ is a random vector with density $f(\cdot)$ and
$u_{\text{gauss}}$ is a Gaussian random variable with the same covariance
structure as $u$. FastICA uses a very efficient approximation to calculate
negentropy. Because ICA can be translated into a simple linear projection, a
forward and an inverse method can be supplied.
There are a number of packages in R that implement algorithms for
ICA, the \pkg{dimRed} package wraps the
\texttt{fastICA::fastICA()} function from \CRANpkg{fastICA}.
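A minimal sketch (not evaluated), assuming the method name \code{"FastICA"}:
<<eval=FALSE>>=
## extract two statistically independent components
emb_ica <- embed(loadDataSet("3D S Curve", n = 1000), "FastICA", ndim = 2)
plot(emb_ica, type = "2vars")
@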
\subsection{DRR}
\label{sec:drr}
Dimensionality Reduction via Regression is a very recent technique extending PCA
\citep{laparra_dimensionality_2015}. %
Starting from a rotated (PCA) solution $X'= XU$, it predicts redundant
information from the remaining components using non-linear regression.
\begin{equation}\label{eq:drr}
y_{\cdot i} = x'_{\cdot i} - f_i(x'_{\cdot 1}, x'_{\cdot 2}, \ldots, x'_{\cdot i-1})
\end{equation}
with $x'_{\cdot i}$ and $y_{\cdot i}$ being the loadings of the observations on the
$i$-th axis. %
In theory, any kind of regression can be used. The
authors of the original paper chose Kernel Ridge Regression
\citep[KRR; ][]{saunders_ridge_1998} because it is a flexible nonlinear
regression technique and computational optimizations for a fast
calculation exist. DRR has another advantage over other techniques
presented here: it provides an exact forward and inverse function.
The usage of KRR also has the advantage of making the method convex; here we list
it under non-convex methods, because other types of regression may make it non-convex.
Mathematically, functions are limited to map one input to a single output point.
Therefore, DRR reduces to PCA if manifolds are too complex;
but it seems very useful for slightly curved manifolds. The initial rotation is
important, because the result strongly depends on the order of dimensions in
high dimensional space.
DRR is implemented in the package \CRANpkg{DRR}. The package provides
forward and inverse functions which can be used to train on a
subset.
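A minimal sketch (not evaluated), assuming the method name \code{"DRR"}; the
method can be considerably slower than PCA because of the kernel ridge
regressions involved:
<<eval=FALSE>>=
emb_drr <- embed(loadDataSet("3D S Curve", n = 500), "DRR", ndim = 2)
plot(emb_drr, type = "2vars")
@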
\section{Quality criteria}
\label{sec:quality}
The advantage of unsupervised learning is that one does not need to
specify classes or a target variable for the data under
scrutiny. Instead, the chosen algorithm arranges the input data, for
example into clusters or into a lower dimensional
representation. %
In contrast to a supervised problem, there is no natural way to
directly measure the quality of any output or to compare two methods
by an objective measure like for instance modeling efficiency or
classification error. %
The reason is that every method optimizes a different error function,
and it would be unfair to compare $t$-SNE and PCA by means of either
recovered variance or KL-Divergence. %
One fair measure would be the reconstruction error, i.e.,\
reconstructing the original data from a limited number of dimensions,
but as discussed above not many methods provide forward and inverse
mappings. %
However, there is a series of independent estimators of the quality of a
low-dimensional embedding. %
The \pkg{dimRed} package provides a number of quality measures which have
been proposed in the literature to measure performance of dimensionality
reduction techniques.
\subsection{Co-ranking matrix based measures}
\label{sec:coranking}
The co-ranking matrix \citep{lee_quality_2009} is a way to capture the
changes in ordinal distance. As before, let $d_{ij} = d(x_i,x_j)$
be the distances between $x_i$ and $x_j$, i.e.,\ in high dimensional
space and $\hat d_{ij} = d(y_i,y_j)$ the distances in low dimensional
space, then we can define the rank of $y_j$ with respect to $y_i$
\begin{equation}
\label{eq:rankmatrix-low-dim}
\hat r_{ij} = |\{k : \hat d_{ik} < \hat d_{ij} \text{ or } (\hat
d_{ik} = \hat d_{ij} \text{ and } 1 \leq k < j \leq n)\}|,
\end{equation}
and, analogously, the rank in high-dimensional space as:
\begin{equation}
\label{eq:rankmatrix-high-dim}
r_{ij} = |\{k : d_{ik} < d_{ij} \text{ or } (d_{ik} = d_{ij} \text{
and } 1 \leq k < j \leq n)\}|,
\end{equation}
where the notation $|A|$ denotes the number of elements in a set $A$. %
This means that we simply replace the distances in a distance matrix
column-wise by their ranks. %
Therefore $r_{ij}$ is an integer which indicates that $x_i$ is the
$r_{ij}$-th closest neighbor of $x_j$ in the set $X$.
The co-ranking matrix $Q$ then has elements
\begin{equation}
\label{eq:coranking}
q_{kl} = |\{(i,j) : \hat r_{ij} = k \text{ and } r_{ij} = l\}|,
\end{equation}
which is the 2d-histogram of the ranks. That is, $q_{ij}$ is an integer which counts
how many points of distance rank $j$ became rank $i$. %
In a perfect DR, this matrix will only have non-zero entries in the diagonal; if
most of the non-zero entries are in the lower triangle, then the DR collapsed
far away points onto each other; if most of the non-zero entries are in the
upper triangle, then the DR tore close points apart. %
For a detailed description of the properties of the co-ranking matrix the reader
is referred to \citet{lueks_how_2011}. %
The co-ranking matrix can be computed using function
\code{coRanking::coranking()} and can be visualized using
\code{coRanking::imageplot()}. %
A good embedding should scatter the values around the diagonal of the matrix. If
the values are predominantly in the lower triangle, then the embedding collapses the original
structure causing far away points to be much closer; if the values are
predominantly in the upper triangle the points from the original structure are
torn apart. %
Nevertheless this method requires visual inspection of the matrix. %
For an automated assessment of quality, a scalar value that assigns a quality to
an embedding is needed. %
A number of metrics can be computed from the co-ranking matrix. For example:
\begin{equation}\label{eq:qnx}
Q_{N\!X}(k) = \frac{1}{kn} \sum_{i=1}^k\sum_{j=1}^k q_{ij},
\end{equation}
which is the number of points that belong to the $k$-th nearest neighbors in
both high- and low-dimensional space, normalized to give a maximum of $1$
\citep{lee_quality_2009}. %
This quantity can be adjusted for random embeddings, giving the Local
Continuity Meta Criterion \citep{chen_local_2006}:
\begin{equation}\label{eq:lcmc}
\text{LCMC}(k) = Q_{N\!X}(k) - \frac{k}{n-1}
\end{equation}
The above measures still depend on $k$, but LCMC has a well defined
maximum; the $k$ at which it occurs is denoted $k_{\max}$. %
Two measures without parameters are then defined:
\begin{align}\label{eq:qlocalglobal}
Q_{\text{local}} &= \frac{1}{k_{\max}} \sum_{k=1}^{k_{\max}}
Q_{N\!X}(k) \text{ and} \\
Q_{\text{global}} &= \frac{1}{n - k_{\max}} \sum_{k=k_{\max}}^{n-1}
Q_{N\!X}(k).
\end{align}
These measure the preservation of local and global distances respectively. The
original authors advised using $Q_{\text{local}}$ over $Q_{\text{global}}$, but
this depends on the application.
LCMC($k$) can be normalized to a maximum of $1$, yielding the
following measure for the quality of an embedding \citep{lee_type_2013}:
\begin{equation}\label{eq:rnx}
R_{NX}(k) = \frac{(n-1) Q_{NX}(k) - k}{n-1-k},
\end{equation}
where a value of 0 corresponds to a random embedding and a value of 1
to a perfect embedding into the $k$-ary neighborhood. %
To transform $R_{NX}(k)$ into a parameterless measure, the area under
the curve can be used:
\begin{equation}\label{eq:auclnk}
\text{AUC}_{\ln k}\left(R_{NX}(k)\right) = \left( \sum_{k=1}^{n-2}
   R_{NX}(k)/k \right) \Bigg/ \left( \sum_{k=1}^{n-2} 1/k \right).
\end{equation}
This measure is normalized to a maximum of one and weights $k$ on a log-scale. %
Therefore it prefers methods that preserve local distances. %
In R, the co-ranking matrix can be calculated using the
\code{coRanking::coranking} function. %
The \pkg{dimRed} package contains the functions \code{Q\_local},
\code{Q\_global}, \code{Q\_NX}, \code{LCMC}, and \code{R\_NX} to
calculate the above quality measures in addition to \code{AUC\_lnK\_R\_NX}. %
Calculating the co-ranking matrix is a relatively expensive operation
because it requires sorting every row of the distance matrix twice.
It therefore scales with $\mathcal{O}(n^2\log n)$. %
There is also a plotting function \code{plot\_R\_NX}, which plots the
$R_{NX}$ values against log-scaled $k$ and adds the $\text{AUC}_{\ln k}$ to the
legend (see Figure \ref{fig:plotexample}). %
There are a number of other measures that can be computed from a co-ranking
matrix, e.g.,\ see \citet{lueks_how_2011,lee_quality_2009}, or
\citet{babaee_assessment_2013}.
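As a sketch of how these measures can be computed (not evaluated); the quality
names are assumed to be among those returned by \code{dimRedQualityList()}, and
\code{coRanking::coranking} is assumed to take the high dimensional data as its
first and the embedded data as its second argument:
<<eval=FALSE>>=
library(dimRed); library(coRanking)
ss  <- loadDataSet("3D S Curve", n = 500)
emb <- embed(ss, "Isomap", knn = 10)
## scalar quality scores derived from the co-ranking matrix
quality(emb, "Q_local")
quality(emb, "AUC_lnK_R_NX")
## the co-ranking matrix itself, visualized with imageplot
Q <- coranking(ss@data, emb@data@data)
imageplot(Q)
@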
\subsection{Cophenetic correlation}
\label{sec:discor}
An old measure originally developed to compare clustering methods in the field
of phylogenetics is cophenetic correlation \citep{sokal_comparison_1962}. %
This method consists simply of the correlation between the upper or lower
triangles of the distance matrices (in dendrograms they are called cophenetic
matrices, hence the name) in a high and low dimensional space. Additionally the
distance measure and correlation method can be varied. %
In the \pkg{dimRed} package this is implemented in the
\code{cophenetic\_correlation} function. %
Some studies use a measure called ``residual variance''
\citep{tenenbaum_global_2000,mahecha_nonlinear_2007}, which is defined as
$$ 1 - r^2(D, \hat D), $$
where $r$ is the Pearson correlation and $D$, $\hat D$ are the distance
matrices consisting of elements $d_{ij}$ and $\hat d_{ij}$ respectively. %
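A sketch (not evaluated), assuming that \code{"cophenetic_correlation"} is
accepted as a quality name by \code{quality()}:
<<eval=FALSE>>=
emb <- embed(loadDataSet("3D S Curve", n = 500), "Isomap", knn = 10)
quality(emb, "cophenetic_correlation")
@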
\subsection{Reconstruction error}
\label{sec:reconerror}
The fairest and most common way to assess the quality of a dimensionality
reduction when the method provides an inverse mapping is
the reconstruction error.
The \pkg{dimRed} package includes a function to calculate the root mean
squared error which is defined as:
\begin{equation}\label{eq:rmse}
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n {d(x'_i, x_i)}^2}
\end{equation}
with $x'_i = f^{-1}(y_{i})$, $f^{-1}$ being the
function that maps an embedded value back to feature space. %
The \pkg{dimRed} package provides the \code{reconstruction\_rmse} and
\code{reconstruction\_error} functions.
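A sketch (not evaluated); because PCA provides an inverse mapping, the
reconstruction error can be computed, assuming \code{"reconstruction_rmse"} is
accepted as a quality name by \code{quality()}:
<<eval=FALSE>>=
emb_pca <- embed(loadDataSet("3D S Curve", n = 500), "PCA", ndim = 2)
quality(emb_pca, "reconstruction_rmse")
@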
\section{Test data sets}
\label{sec:testdata}
There are a number of test data sets that are often used to showcase a
dimensionality reduction technique, common ones being the 3D S-curve and the Swiss
roll, among others. %
These data sets have in common that they usually have three dimensions, and well
defined manifolds. %
Real world examples usually have more dimensions and are often much noisier; the
manifolds may not be well sampled and may exhibit holes, and large pieces may be
missing. %
Additionally, we cannot be sure if we can observe all the relevant variables. %
The \pkg{dimRed} package implements a number of test datasets that are used
in the literature to benchmark methods; they are accessible via the function
\code{dimRed::loadDataSet()}. %
For artificial datasets the number of points and the noise level can be
adjusted; the function also returns the internal coordinates.
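A sketch (not evaluated); the helper \code{dataSetList()} is assumed to list the
available data set names, and the noise argument is assumed to be called
\code{sigma}:
<<eval=FALSE>>=
## list available example data sets and load one of them with added noise
dataSetList()
swiss_roll <- loadDataSet("Swiss Roll", n = 1000, sigma = 0.05)
plot(swiss_roll, type = "3vars")
@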
\section{The \pkg{dimRed} Package}
The \pkg{dimRed} package collects DR methods readily implemented in R,
implements missing methods and offers means to compare the quality of
embeddings. %
The package is open source and available under the GPL3
license. Released versions of the package are
available through CRAN
(\url{https://cran.r-project.org/package=dimRed}) and development
versions are hosted on GitHub (\url{https://github.com/gdkrmr/dimRed}). %
The \pkg{dimRed} package provides a common interface and convenience functions for a variety of
different DR methods, making it easier to use and compare different
methods. An overview of the package's main functions can be found in Table~\ref{tab:interfacefun}. %
\begin{table}[h]
\centering
\begin{tabular}{l p{8cm}}
\toprule
Function & Description \\
\midrule
\code{embed} & Embed data using a DR method. \\
\code{quality} & Calculate a quality score from
the result of \code{embed}. \\
\code{plot} & Plot a \code{"dimRedData"} or
\code{"dimRedResult"} object, colors
the points automatically, for
exploring the data. \\
\code{plot\_R\_NX} & Compares the quality of various
embeddings. \\
\code{dimRedMethodList} & Returns a character vector that
contains all implemented DR
methods. \\
\code{dimRedQualityList} & Returns a character vector that
contains all implemented quality
measures. \\
\bottomrule
\end{tabular}
\caption[Main interface functions.]{The main interface functions
of the \pkg{dimRed} package.}\label{tab:interfacefun}
\end{table}
Internally, the package uses S4 classes, but for normal usage the user
does not need to have any knowledge of the inner workings of the S4 class
system in R (cf.\ Table~\ref{tab:s4classes}). %
The package contains simple conversion functions from and to standard
R-objects like a data.frame or a matrix. %
The \code{"dimRedData"} class provides a container for the data to be
processed. %
The slot \code{data} contains a matrix with dimensions in columns and
observations in rows, the slot \code{meta} may contain a data frame with
additional information, e.g.,\ categories or other information of the data
points. %
\begin{table}[h]
\centering
\begin{tabular}{l p{9cm}}
\toprule
Class Name & Function \\
\midrule
\code{"dimRedData"} & Holds the data for a DR.
Fed to \code{embed()}.
An \code{as.dimRedData()} methods exists for
\code{"data.frame"},
\code{"matrix"}, and
\code{"formula"} exist.\\
\code{"dimRedMethod"} & Virtual class, ancestor of all DR
methods. \\
\code{"dimRedResult"} & The result of \code{embed()},
the embedded data. \\
\bottomrule
\end{tabular}
\caption[S4 Classes]{The S4 classes used in the
\pkg{dimRed} package.}\label{tab:s4classes}
\end{table}
Each embedding method is a class which inherits from \code{"dimRedMethod"} which
means that it contains a function to generate \code{"dimRedResult"} objects and a
list of standard parameters. %
The class \code{"dimRedResult"} contains the data in reduced dimensions, the
original meta information along with the original data, and, if possible,
functions for the forward and inverse mapping. %
From a user-perspective the central function of the package is \code{embed},
which is called in the form \code{embed(data, method, \ldots)}; \code{data} can
be a standard R object such as an instance of \code{"data.frame"}, \code{"matrix"}, or \code{"formula"}. %
The \code{method} is given as a character string. All available methods can be
listed by calling \samp{dimRedMethodList()}. %
Method-specific parameters can be passed through \code{...}; when no method-specific
parameters are given, defaults are chosen. %
The \code{embed} function returns an object of class \code{"dimRedResult"}.
For comparing different embeddings, \pkg{dimRed} contains the function
\code{quality} which relies on the output of \code{embed} and a method name. %
This function returns a scalar quality score; a vector that contains the names
of all quality functions is returned by calling \samp{dimRedQualityList()}. %
For easy visual examination, the package contains \code{plot} methods for
\code{"dimRedData"} and \code{"dimRedResult"} objects in order to plot high
dimensional data using parallel plots and pairwise scatter plots. %
Automatic coloring of data points is done using the available metadata. %
\section{Examples}
\label{sec:examples}
The comparison of different DR methods, choosing the right parameters for a
method, and the inspection of the results is simplified by
\pkg{dimRed}. %
This section contains a number of examples to highlight the usage of the package. %
% Example 1:
To compare methods of dimensionality reduction, first a test data set is loaded
using \code{loadDataSet}, then the \code{embed} function is used for DR
(\code{embed} can also handle standard R types like \code{matrix} and
\code{data.frame}). %
This makes it very simple to apply different methods of DR to the same data
e.g.,\ by defining a character vector of method names and then
iterating over these, say with
\code{lapply}. %
For inspection, \pkg{dimRed} provides methods for the \code{plot} function to
visualize the resulting embedding (Figure~\ref{fig:plotexample} b and d); internal
coordinates of the manifold are represented by color gradients. %
To visualize how well embeddings represent different neighborhood sizes, the
function \code{plot\_R\_NX} is used on a list of embedding results
(Figure~\ref{fig:plotexample}~c). %
The plots in figure~\ref{fig:plotexample} are produced by the following code: %
\begin{figure}[htp]
\centering
<<"pca_isomap_example",include=FALSE,fig.width=4,fig.height=4>>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
library(dimRed); library(ggplot2); #library(dplyr); library(tidyr)
## define which methods to apply
embed_methods <- c("Isomap", "PCA")
## load test data set
data_set <- loadDataSet("3D S Curve", n = 1000)
## apply dimensionality reduction
data_emb <- lapply(embed_methods, function(x) embed(data_set, x))
names(data_emb) <- embed_methods
## plot data set, embeddings, and quality analysis
## plot(data_set, type = "3vars")
## lapply(data_emb, plot, type = "2vars")
## plot_R_NX(data_emb)
add_label <- function(label)
grid::grid.text(label, 0.2, 1, hjust = 0, vjust = 1,
gp = grid::gpar(fontface = "bold",
cex = 1.5))
## pdf('~/phd/text/dimRedPackage/plots/plot_example.pdf', width = 4, height = 4)
## plot the results
plot(data_set, type = "3vars", angle = 15, mar = c(3, 3, 0, 0), box = FALSE, grid = FALSE, pch = 16)
add_label("a")
par(mar = c(4, 4, 0, 0) + 0.1, bty = "n", las = 1)
plot(data_emb$Isomap, type = "2vars", pch = 16)
add_label("b")
plot(data_emb$PCA, type = "2vars", pch = 16)
add_label("d")
## calculate quality scores
print(
plot_R_NX(data_emb) +
theme(legend.title = element_blank(),
legend.position = c(0.5, 0.1),
legend.justification = c(0.5, 0.1))
)
add_label("c")
} else {
# These cannot all be plot(1:10)!!! It's a mystery to me.
plot(1:10)
barplot(1:10)
hist(1:10)
plot(1:10)
}
@
\includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-1.pdf}
\includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-2.pdf}
\includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-3.pdf}
\includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-4.pdf}
\caption[dimRed example]{%
Comparing PCA and Isomap: %
(a) An S-shaped manifold, colors represent the internal coordinates of the
manifold. %
(b) Isomap embedding, the S-shaped manifold is unfolded. %
(c) $R_{NX}$ plotted against neighborhood sizes; Isomap is much better at
preserving local distances and PCA is better at preserving global Euclidean
distances. %
The numbers in the legend are the $\text{AUC}_{\ln k}$ values.
(d) PCA projection of the data, the directions of maximum variance are preserved. %
}\label{fig:plotexample}
\end{figure}
<<eval=FALSE>>=
## define which methods to apply
embed_methods <- c("Isomap", "PCA")
## load test data set
data_set <- loadDataSet("3D S Curve", n = 1000)
## apply dimensionality reduction
data_emb <- lapply(embed_methods, function(x) embed(data_set, x))
names(data_emb) <- embed_methods
## figure \ref{fig:plotexample}a, the data set
plot(data_set, type = "3vars")
## figures \ref{fig:plotexample}b (Isomap) and \ref{fig:plotexample}d (PCA)
lapply(data_emb, plot, type = "2vars")
## figure \ref{fig:plotexample}c, quality analysis
plot_R_NX(data_emb)
@
The function \code{plot\_R\_NX} produces a figure that plots the neighborhood
size ($K$, on a logarithmic scale) against the quality measure $\text{R}_{NX}(K)$ (see
Equation~\ref{eq:rnx}). %
This gives an overview of the general behavior of a method: if $\text{R}_{NX}$ is
high for low values of $K$, then local neighborhoods are maintained well; if
$\text{R}_{NX}$ is high for large values of $K$, then global gradients are
maintained well. %
The figure also allows methods to be compared directly by plotting several
$\text{R}_{NX}$ curves, and the area under each curve (see
Equation~\ref{eq:auclnk}), shown as a number in the legend, serves as an
indicator for the overall quality of the embedding. %
From Figure~\ref{fig:plotexample}c we can therefore see that Isomap is very good at
maintaining close and medium distances for the given data set, whereas PCA is only
better at maintaining the very large distances. %
The large distances are dominated by the overall bent shape of the S in 3D
space, while the close distances are not affected by this bending. %
This is reflected in the properties recovered by the different methods: the PCA
embedding recovers the S-shape, while Isomap ignores the S-shape and recovers
the inner structure of the manifold. %
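The area under the $\text{R}_{NX}$ curve is also available as an ordinary
quality score, so corresponding values can be computed numerically as well; a
small sketch, reusing the \code{data\_emb} list from above: %
<<eval=FALSE>>=
## sketch: the AUC underlying the legend of plot_R_NX() as a quality score
sapply(data_emb, quality, "AUC_lnK_R_NX")
@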
% Example 2:
Often the quality of an embedding strongly depends on the choice of parameters;
the interface of \pkg{dimRed} can be used to facilitate searching the
parameter space. %
Isomap has one parameter $k$ which determines
the number of neighbors used to construct the $k$-NNG\@. %
If this number is too large, Isomap will resemble an MDS
(Figure~\ref{fig:knn}e); if it is too small, the resulting embedding
contains holes (Figure~\ref{fig:knn}c). %
The following code finds the optimal value $k_{\text{max}}$ for $k$ using the
$Q_{\text{local}}$ criterion; the results are visualized in
Figure~\ref{fig:knn}a:
\begin{figure}[htp]
\centering
<<include=FALSE>>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
library(dimRed)
library(cccd)
## Load data
ss <- loadDataSet("3D S Curve", n = 500)
## Parameter space
kk <- floor(seq(5, 100, length.out = 40))
## Embedding over parameter space
emb <- lapply(kk, function(x) embed(ss, "Isomap", knn = x))
## Quality over embeddings
qual <- sapply(emb, function(x) quality(x, "Q_local"))
## Find best value for K
ind_max <- which.max(qual)
k_max <- kk[ind_max]
add_label <- function(label){
par(xpd = TRUE)
b = par("usr")
text(b[1], b[4], label, adj = c(0, 1), cex = 1.5, font = 2)
par(xpd = FALSE)
}
names(qual) <- kk
}
@
<<"select_k",include=FALSE,fig.width=11,fig.height=5>>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
par(mfrow = c(1, 2),
mar = c(5, 4, 0, 0) + 0.1,
oma = c(0, 0, 0, 0))
plot(kk, qual, type = "l", xlab = "k", ylab = expression(Q[local]), bty = "n")
abline(v = k_max, col = "red")
add_label("a")
plot(ss, type = "3vars", angle = 15, mar = c(3, 3, 0, 0), box = FALSE, grid = FALSE, pch = 16)
add_label("b")
} else {
plot(1:10)
plot(1:10)
}
@
<<"knngraphs",include=FALSE,fig.width=8,fig.height=3>>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
par(mfrow = c(1, 3),
mar = c(5, 4, 0, 0) + 0.1,
oma = c(0, 0, 0, 0))
add_knn_graph <- function(ind) {
nn1 <- nng(ss@data, k = kk[ind])
el <- get.edgelist(nn1)
segments(x0 = emb[[ind]]@data@data[el[, 1], 1],
y0 = emb[[ind]]@data@data[el[, 1], 2],
x1 = emb[[ind]]@data@data[el[, 2], 1],
y1 = emb[[ind]]@data@data[el[, 2], 2],
col = "#00000010")
}
plot(emb[[2]]@data@data, type = "n", bty = "n")
add_knn_graph(2)
points(emb[[2]]@data@data, col = dimRed:::colorize(ss@meta),
pch = 16)
add_label("c")
plot(emb[[ind_max]]@data@data, type = "n", bty = "n")
add_knn_graph(ind_max)
points(emb[[ind_max]]@data@data, col = dimRed:::colorize(ss@meta),
pch = 16)
add_label("d")
plot(emb[[length(emb)]]@data@data, type = "n", bty = "n")
add_knn_graph(length(emb))
points(emb[[length(emb)]]@data@data, col = dimRed:::colorize(ss@meta),
pch = 16)
add_label("e")
} else {
plot(1:10)
plot(1:10)
plot(1:10)
}
@
\includegraphics[width=.95\textwidth]{figure/select_k-1.pdf}
\includegraphics[width=.95\textwidth]{figure/knngraphs-1.pdf}
\caption[Estimating $k$ using $Q_\text{local}$]{%
Using \pkg{dimRed} and the $Q_\text{local}$ indicator to estimate a
good value for the parameter $k$ in Isomap. %
(a) $Q_\text{local}$ for different values of $k$; the vertical red
line indicates the maximum, $k_{\text{max}}$. %
(b) The original data set, a two-dimensional manifold bent into an
S-shape in three-dimensional space. %
Bottom row: embeddings and $k$-NNGs for different values of $k$. %
(c) $k = 5$ is too small, resulting in holes in the embedding, although the
manifold itself is still unfolded correctly. %
(d) $k = k_\text{max}$ gives the best representation of the original
manifold in two dimensions achievable with Isomap. %
(e) $k = 100$ is too large; the $k$-NNG no longer approximates the manifold. %
}\label{fig:knn}
\end{figure}
<<eval=FALSE>>=
## Load data
ss <- loadDataSet("3D S Curve", n = 500)
## Parameter space
kk <- floor(seq(5, 100, length.out = 40))
## Embedding over parameter space
emb <- lapply(kk, function(x) embed(ss, "Isomap", knn = x))
## Quality over embeddings
qual <- sapply(emb, function(x) quality(x, "Q_local"))
## Find best value for K
ind_max <- which.max(qual)
k_max <- kk[ind_max]
@
Figure~\ref{fig:knn}a shows how the $Q_{\text{local}}$ criterion changes when
the neighborhood size $k$ for Isomap is varied; the gray lines in
Figure~\ref{fig:knn} represent the edges of the $k$-NNG\@. %
If the value for $k$ is too low, the inner structure of the manifold is still
recovered, but imperfectly (Figure~\ref{fig:knn}c; note that the holes
appear in places that are not covered by the edges of the $k$-NNG);
therefore the $Q_{\text{local}}$ score is lower than optimal. %
If $k$ is too large, the error of the embedding is much larger due to
short-circuiting, and we observe a very steep drop in the $Q_{\text{local}}$ score. %
The short-circuiting can be observed in Figure~\ref{fig:knn}e, where edges
cross the gap between the tips and the center of the S-shape. %
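The gray $k$-NNG edges shown in Figure~\ref{fig:knn}c--e can be drawn with a
few lines of base graphics; the following sketch (assuming the \pkg{cccd} and
\pkg{igraph} packages and reusing \code{ss} from the code above) overlays the
graph for a single value of $k$ on the corresponding embedding: %
<<eval=FALSE>>=
## sketch: overlay the k-NNG (computed in the original space) on an embedding
library(cccd)                              # provides nng()
k     <- 5
emb_k <- embed(ss, "Isomap", knn = k)
pts   <- emb_k@data@data                   # embedded coordinates
el    <- igraph::get.edgelist(nng(ss@data, k = k))
plot(pts, type = "n", bty = "n")
segments(pts[el[, 1], 1], pts[el[, 1], 2],
         pts[el[, 2], 1], pts[el[, 2], 2], col = "#00000010")
points(pts, col = dimRed:::colorize(ss@meta), pch = 16)
@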
% Example 3:
It is also very easy to compare several methods across different quality scores. %
The following code produces a matrix of quality scores with one row per method
and one column per quality measure; \code{dimRedMethodList} returns a character
vector with the names of all available methods. %
A visualization of the matrix can be found in Figure~\ref{fig:qualityexample}. %
\begin{figure}[htp]
\centering
<<"plot_quality",include=FALSE>>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
embed_methods <- dimRedMethodList()
quality_methods <- c("Q_local", "Q_global", "AUC_lnK_R_NX",
"cophenetic_correlation")
iris_data <- loadDataSet("Iris")
quality_results <- matrix(
NA, length(embed_methods), length(quality_methods),
dimnames = list(embed_methods, quality_methods)
)
embedded_data <- list()
for (e in embed_methods) {
try(embedded_data[[e]] <- embed(iris_data, e))
for (q in quality_methods)
try(quality_results[e,q] <- quality(embedded_data[[e]], q))
}
quality_results <- quality_results[order(rowMeans(quality_results)), ]
palette(c("#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e"))
col_hsv <- rgb2hsv(col2rgb(palette()))
## col_hsv["v", ] <- col_hsv["v", ] * 3 / 1
palette(hsv(col_hsv["h",], col_hsv["s",], col_hsv["v",]))
par(mar = c(2, 8, 0, 0) + 0.1)
barplot(t(quality_results), beside = TRUE, col = 1:4,
legend.text = quality_methods, horiz = TRUE, las = 1,
cex.names = 0.85,
args.legend = list(x = "topleft", bg = "white", cex = 0.8))
} else {
plot(1:10)
}
@
\includegraphics[width=.5\textwidth]{figure/plot_quality-1.pdf}
\caption[Quality comparison]{%
A visualization of the \code{quality\_results} matrix. %
The methods are ordered by mean quality score. %
The reconstruction error was omitted because a higher value means
a worse embedding, whereas for the measures shown a higher score
means a better embedding. %
Parameters were not tuned for this example; therefore the figure should not
be read as a general quality assessment of the methods. %
}\label{fig:qualityexample}
\end{figure}
<<eval=FALSE>>=
embed_methods <- dimRedMethodList()
quality_methods <- c("Q_local", "Q_global", "AUC_lnK_R_NX",
"cophenetic_correlation")
scurve <- loadDataSet("3D S Curve", n = 2000)
quality_results <- matrix(
NA, length(embed_methods), length(quality_methods),
dimnames = list(embed_methods, quality_methods)
)
embedded_data <- list()
for (e in embed_methods) {
  try(embedded_data[[e]] <- embed(scurve, e))
  for (q in quality_methods)
    try(quality_results[e, q] <- quality(embedded_data[[e]], q))
}
@
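A bar plot of the score matrix, similar to Figure~\ref{fig:qualityexample},
can then be produced with base graphics, for instance as in the following
sketch (methods ordered by their mean score): %
<<eval=FALSE>>=
## sketch: visualize quality_results as a grouped bar plot,
## ordering the methods by their mean score
quality_results <- quality_results[order(rowMeans(quality_results, na.rm = TRUE)), ]
par(mar = c(2, 8, 0, 0) + 0.1)
barplot(t(quality_results), beside = TRUE, horiz = TRUE, las = 1,
        legend.text = quality_methods,
        args.legend = list(x = "topleft", bg = "white"))
@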
This example showcases the simplicity with which different methods and quality criteria
can be combined. %
Because the quality of the results depends strongly on the parameters, it is not
advisable to apply this kind of analysis without tuning the parameters for each
method separately. %
There is no automated way to tune parameters in \pkg{dimRed}. %
\section{Conclusion}
\label{sec:conc}
This paper presents the \pkg{dimRed} and \pkg{coRanking} packages and
provides a brief overview of the methods implemented therein. %
The \pkg{dimRed} package is written in R, one of the most popular
languages for data analysis, and is freely available from CRAN. %
The package is object oriented and completely open source, which makes it
easily accessible and extensible. %
Although most of the DR methods already had implementations in R,
\pkg{dimRed} adds some new methods for dimensionality reduction, and
\pkg{coRanking} adds methods for an independent quality control of DR
methods to the R ecosystem. %
DR is a widely used technique; however, due to the lack of easily usable tools,
choosing the right method for a given task remains complex and depends on a
variety of factors. %
The \pkg{dimRed} package aims to facilitate experimentation with different
techniques, parameters, and quality measures so that choosing the right method
becomes easier. %
The \pkg{dimRed} package is designed to let the user objectively compare methods
that rely on very different algorithmic approaches. %
It also makes the programmer's life easier, because all methods are collected in
one place and can be accessed through a single interface with standardized
classes. %
\section{Acknowledgments}
\label{sec:ack}
We thank Dr.\ G.\ Camps-Valls and an anonymous reviewer for many useful
comments. %
This study was supported by the European Space Agency (ESA) via the Earth System
Data Lab project (\url{http://earthsystemdatacube.org}) and the EU via the H2020
project BACI, grant agreement No 640176. %
\bibliographystyle{abbrvnat}
\bibliography{bibliography}
\end{document}
|