\def\fastclusterversion{1.3.0}
\documentclass[fontsize=10pt,paper=letter,BCOR=-6mm,DIV=8]{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\normalfont
\usepackage[T1]{fontenc}
\usepackage{textcomp}
\newcommand*\q{\textquotesingle}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{ifpdf}
  \ifpdf
    \newcommand*\driver{}
  \else
    \newcommand*\driver{dvipdfmx}
  \fi
\usepackage[%
  pdftitle={fastcluster manual},
  pdfauthor={Daniel Müllner},
%  pdfsubject={},
  pdfdisplaydoctitle=true,
%  pdfduplex=DuplexFlipLongEdge,
  pdfstartview=FitH,
  colorlinks=True,
  pdfhighlight=/I,
%  pdfborder={0 0 1},
%  linkbordercolor={1 .8 .8},
%  citebordercolor={.5 .9 .5},
%  urlbordercolor={.5 .7 1},
%  linkcolor={blue},
%  citecolor={blue},
  urlcolor={blue!80!black},
  linkcolor={red!80!black},
%  runcolor={blue},
%  filecolor={blue},
  pdfpagemode=UseOutlines,
  bookmarksopen=true,
  bookmarksopenlevel=1,
  bookmarksdepth=2,
  breaklinks=true,
  unicode=true,
  \driver
]{hyperref}
% Optimize the PDF targets and make the PDF file smaller
\ifpdf\RequirePackage{hypdestopt}\fi

\renewcommand*\sectionautorefname{Section}

\usepackage{typearea}

\DeclareMathOperator\size{size}
\DeclareMathOperator\Var{Var}

\newcommand*\linkage{\href{https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html}{\texttt{linkage}}}
\newcommand*\hierarchy{\href{https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html}{\texttt{scipy.\hskip0pt cluster.\hskip0pt hierarchy}}}
\newcommand*\hclust{\href{https://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html}{\texttt{hclust}}}
\newcommand*\stats{\href{https://stat.ethz.ch/R-manual/R-patched/library/stats/html/00Index.html}{\texttt{stats}}}
\newcommand*\flashClustPack{\href{https://CRAN.R-project.org/package=flashClust}{\texttt{flashClust}}}
\newcommand*\dist{\href{https://stat.ethz.ch/R-manual/R-patched/library/stats/html/dist.html}{\texttt{dist}}}
\newcommand*\print{\href{https://stat.ethz.ch/R-manual/R-patched/library/base/html/print.html}{\texttt{print}}}
\newcommand*\plot{\href{https://stat.ethz.ch/R-manual/R-patched/library/graphics/html/plot.html}{\texttt{plot}}}
\newcommand*\identify{\href{https://stat.ethz.ch/R-manual/R-patched/library/stats/html/identify.hclust.html}{\texttt{identify}}}
\newcommand*\rect{\href{https://stat.ethz.ch/R-manual/R-patched/library/stats/html/rect.hclust.html}{\texttt{rect.hclust}}}
\newcommand*\NA{\href{https://stat.ethz.ch/R-manual/R-patched/library/base/html/NA.html}{\texttt{NA}}}
\newcommand*\double{\href{https://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html}{\texttt{double}}}
\newcommand*\matchcall{\href{https://stat.ethz.ch/R-manual/R-patched/library/base/html/match.call.html}{\texttt{match.call}}}
\newcommand*\NumPyarray{\href{https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html}{NumPy array}}
\newcommand*\maskedarrays{\href{https://docs.scipy.org/doc/numpy/reference/maskedarray.html}{masked arrays}}
\newcommand*\pdist{\href{https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html}{\texttt{scipy.spatial.distance.pdist}}}

%\usepackage{showframe}

\makeatletter
\newenvironment{methods}{%
  \list{}{\labelwidth\z@
    \itemindent-\leftmargin
    \let\makelabel\methodslabel}%
}{%
  \endlist
}
\newcommand*{\methodslabel}[1]{%
  %\hspace{\labelsep}%
  \hbox to \textwidth{\hspace{\labelsep}%
  \normalfont\bfseries\ttfamily
  #1\hskip-\labelsep\hfill}%
}

\makeatother
\setkomafont{descriptionlabel}{\normalfont\ttfamily\bfseries}


\begin{document}
%\VignetteIndexEntry{User's manual}
\title{The \textit{fastcluster} package: User's manual}
\author{\href{https://danifold.net}{Daniel Müllner}}
\date{May 4, 2025}
\subtitle{Version \fastclusterversion}
\maketitle

\makeatletter
\renewenvironment{quotation}{%
  \list{}{\listparindent 1em%
    \itemindent    \listparindent
    \leftmargin2.5em
    \rightmargin   \leftmargin
    \parsep        \z@ \@plus\p@
  }%
  \item\relax
}{%
  \endlist
}
\makeatother
\begin{abstract}\noindent\small
The fastcluster package is a C++ library for hierarchical, agglomerative
clustering. It efficiently implements the
seven most widely used clustering schemes: single, complete, average, weighted/\hskip0pt mcquitty, Ward, centroid and median linkage. The library currently has interfaces to two languages: R and Python/SciPy. Part of the functionality is designed as a drop-in replacement for existing routines: \linkage{} in the SciPy package \hierarchy{}, \hclust{} in R's \stats{} package, and the \flashClustPack{} package. Once the fastcluster library is loaded at the beginning of the code, every program that uses hierarchical clustering can benefit immediately and effortlessly from the performance gain. Moreover, there are memory-saving routines for clustering of vector data, which go beyond what the existing packages provide.
\end{abstract}

\noindent
This document describes the usage of the two interfaces, for R and Python, and is meant as the reference document for the end user. Installation instructions are given in the file INSTALL in the source distribution and are not repeated here. The sections about the two interfaces are independent and hence somewhat redundant, so that users who need a reference for one interface need to consult only one section.

If you use the fastcluster package for scientific work, please cite it as:
\begin{quote}
Daniel Müllner, \textit{fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python}, Journal of Statistical Software, \textbf{53} (2013), no.~9, 1--18, \url{https://doi.org/10.18637/jss.v053.i09}.
\end{quote}

The \hyperref[yule]{“Yule” distance function} changed in the Python interface of fastcluster version 1.2.0. This is following a \href{https://github.com/scipy/scipy/commit/3b22d1da98dc1b5f64bc944c21f398d4ba782bce}{change in SciPy 1.6.3}. The \hyperref[jaccard]{“Jaccard” distance function} changed in the Python interface of fastcluster version 1.3.0. This is following a \href{https://github.com/scipy/scipy/commit/ecf3ff0ff12666cbeaa5c61d5726fd0141657b54}{change in SciPy 1.15.0}. Therefore, the following pairings of SciPy and fastcluster versions are recommended:
\[
\begin{tabular}{ll}\toprule
  SciPy version $v$ & Recommended fastcluster version \\\midrule
  $v<\textrm{1.6.3}$ & 1.1.28 \\
  $\textrm{1.6.3} \leq v < \textrm{1.15.0}$ & 1.2.6 \\
  $v \geq \textrm{1.15.0}$ & latest ($\geq$ 1.3.0)\\\bottomrule
\end{tabular}
\]
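
To check which versions of SciPy and fastcluster are installed in a given Python environment, one possibility is the following (a minimal sketch; \texttt{importlib.metadata} is part of the Python standard library):
\begin{verbatim}
    from importlib.metadata import version

    # Compare the installed versions against the pairings in the table above.
    print("SciPy:", version("scipy"))
    print("fastcluster:", version("fastcluster"))
\end{verbatim}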

The R interface does not have the “Yule” and “Jaccard” distance functions, hence is not affected by these changes.

The fastcluster package is considered stable and will undergo few changes from now on. If there have not been any updates for some years, this does not necessarily mean that the package is unmaintained; it may simply mean that nothing needed to be corrected. Of course, please still report potential bugs and incompatibilities to \texttt{daniel@danifold.net}.

\tableofcontents

\section{The R interface}

Load the package with the following command:
\begin{quote}
  \texttt{library(\q fastcluster\q)}
\end{quote}

The package overwrites the function \hclust{} from the \stats{} package (in the same
way as the \flashClustPack{} package does). Please remove any references to the
\flashClustPack{} package from your R files so that the \hclust{} function is not
accidentally overwritten with the \flashClustPack{} version.

The \hyperref[hclust]{new \texttt{hclust} function} has exactly the same calling conventions as the old one. You may just load the package and immediately and effortlessly enjoy the performance improvements. The function is also an improvement over the \texttt{flashClust} function from the \flashClustPack{} package. Just replace every call to \texttt{flashClust} by \hyperref[hclust]{\texttt{hclust}} and expect your code to work as before, only faster.\footnote{If you are using flashClust prior to version 1.01, update it! See the change log for \flashClustPack{} at \url{https://cran.r-project.org/web/packages/flashClust/ChangeLog}.} In case the data includes infinite or NaN values, see \autoref{sec:infnan}.

If you need to access the old function or make sure that the right function is
called, specify the package as follows:
\begin{quote}
    \texttt{\hyperref[hclust]{fastcluster::hclust}(…)}\\
    \texttt{flashClust::hclust(…)}\\
    \texttt{stats::hclust(…)}
\end{quote}

Vector data can be clustered with a memory-saving algorithm with the command:
\begin{quote}
    \texttt{\hyperref[hclust.vector]{hclust.vector}(…)}
 \end{quote}

The following sections contain comprehensive descriptions of these methods.

\begin{methods}
\item [\normalfont\texttt{\textbf{hclust}}\,(\textit{d, method=\q complete\q, members=NULL})]
\phantomsection\label{hclust}
\addcontentsline{toc}{subsection}{\texttt{hclust}}

Hierarchical, agglomerative clustering on a condensed dissimilarity matrix.

This method has the same specifications as the method \hclust{} in the package \stats{} and as \texttt{hclust} (alias \texttt{flashClust}) in the package \flashClustPack{}. In particular, the \print{}, \plot{}, \rect{} and \identify{} methods work as expected.

The argument $d$ is a condensed distance matrix, as it is produced by \dist.

The argument \textit{method} is one of the strings \textit{\q single\q}, \textit{\q complete\q}, \textit{\q average\q}, \textit{\q mcquitty\q}, \textit{\q centroid\q}, \textit{\q median\q}, \textit{\q ward.D\q}, \textit{\q ward.D2\q} or an unambiguous abbreviation thereof.

The argument \textit{members} specifies the sizes of the initial nodes, ie.\ the number of observations in the initial clusters. The default value \texttt{NULL} says that all initial nodes are singletons, ie.\ have size 1. Otherwise, \textit{members} must be a vector whose size is the number of input points. The vector is processed as a \double{} array, so that not only integer cardinalities of nodes but also weighted nodes with real weights can be accounted for.

The general scheme of the agglomerative clustering procedure is as follows (a small worked example follows the list):
\begin{enumerate}
\item Start with $N$ singleton clusters (nodes) labeled $-1,\ldots, -N$, which represent the input points.
\item Find a pair of nodes with minimal distance among all pairwise distances.
\item Join the two nodes into a new node and remove the two old nodes. The new nodes are labeled consecutively $1,2,\ldots$
\item The distances from the new node to all other nodes are determined by the \textit{method} parameter (see below).
\item Repeat $N-1$ times from step 2, until there is one big node, which contains all original input points.
\end{enumerate}
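
As a small worked example (illustrative only, not part of the original specification), take $N=3$ points with pairwise dissimilarities $d(-1,-2)=1$, $d(-1,-3)=3$ and $d(-2,-3)=2$, and use single linkage. Step 2 selects the pair $(-1,-2)$ with minimal distance $1$; step 3 joins it into node $1$; step 4 yields $d(1,-3)=\min(3,2)=2$; repeating from step 2 joins node $1$ and node $-3$ at distance $2$ into node $2$, which contains all three input points.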

The output of \texttt{hclust} is an object of class \texttt{\q hclust\q} and represents a \emph{stepwise dendrogram}. It contains the following fields:
\begin{description}
\item[\normalfont\textit{merge}]
This is an $(N-1)\times 2$ array. Row $i$ specifies the labels of the nodes which are joined in step $i$ of the clustering.

\item[\normalfont\textit{height}]
This is a vector of length $N-1$. It contains the sequence of dissimilarities at which every pair of nearest nodes is joined.

\item[\normalfont\textit{order}]
This is a vector of length $N$. It contains a permutation of the numbers $1,\ldots,N$ for the \plot{} method. When the dendrogram is plotted, this is the order in which the singleton nodes are plotted as the leaves of a rooted tree. The order is computed so that the dendrogram is plotted without intersections (except when there are inversions for the \textit{\q centroid\q} and \textit{\q median\q} methods). The choice of the \textit{\q order\q} sequence follows the same scheme as in the \texttt{stats} package, only with a faster algorithm. Note that there are many valid choices to order the nodes in a dendrogram without intersections. Also, subsequent points in the \textit{\q order\q} field are not always close in the ultrametric given by the dendrogram.

\item[\normalfont\textit{labels}]
This copies the attribute \textit{\q Labels\q} from the first input parameter $d$. It contains the labels for the objects being clustered.

\item[\normalfont\textit{method}]
The (unabbreviated) string for the \textit{\q method\q} parameter. See below for a specification of all available methods.

\item[\normalfont\textit{call}]
The full command that produced the result. See \matchcall.

\item[\normalfont\textit{dist.method}]
This is the \textit{\q method\q} attribute of the first input parameter $d$. It specifies which metric was used in the \texttt{dist} method which generated the first argument.
\end{description}

The parameter \textit{method} specifies which clustering scheme to use. The clustering scheme determines the distance from a new node to the other nodes. Denote the dissimilarities by $d$, the nodes to be joined by $I,J$, the new node by $K$ and any other node by $L$. The symbol $|I|$ denotes the size of the cluster $I$.

\begin{description}
\item [\normalfont\textit{method=\q single\q}:]
$\displaystyle d(K,L) = \min(d(I,L), d(J,L))$

The distance between two clusters $A,B$ is the closest distance between any two points in each cluster:
\[
 d(A,B)=\min_{a\in A, b\in B}d(a,b)
\]

\item [\normalfont\textit{method=\q complete\q}:]
$\displaystyle d(K,L) = \max(d(I,L), d(J,L))$

The distance between two clusters $A,B$ is the maximal distance between any two points in each cluster:
\[
 d(A,B)=\max_{a\in A, b\in B}d(a,b)
\]

\item [\normalfont\textit{method=\q average\q}:]
$\displaystyle d(K,L) = \frac{|I|\cdot d(I,L)+|J|\cdot d(J,L)}{|I|+|J|}$

The distance between two clusters $A,B$ is the average distance between the points in the two clusters:
\[
 d(A,B)=\frac1{|A||B|}\sum_{a\in A, b\in B}d(a,b)
\]

\item [\normalfont\textit{method=\q mcquitty\q}:]
$\displaystyle d(K,L) = \tfrac12(d(I,L)+d(J,L))$

There is no global description for the distance between clusters since the distance depends on the order of the merging steps.
\end{description}

The following methods are intended for Euclidean data only, ie.\ when $d$ contains the pairwise \textbf{squared} distances between vectors in Euclidean space (the last method, \textit{\q ward.D2\q}, expects non-squared distances instead; see below). The algorithm will work on any input, however, and it is up to the user to make sure that applying the methods makes sense.

\begin{description}
\item [\normalfont\textit{method=\q centroid\q}:]
$\displaystyle d(K,L) = \frac{|I|\cdot d(I,L)+|J|\cdot d(J,L)}{|I|+|J|}-\frac{|I|\cdot|J|\cdot d(I,J)}{(|I|+|J|)^2}$

There is a geometric interpretation: $d(A,B)$ is the distance between the centroids (ie.\ barycenters) of the clusters in Euclidean space:
\[
  d(A,B) = \|\vec c_A-\vec c_B\|^2,
\]
where $\vec c_A$ denotes the centroid of the points in cluster $A$.

\item [\normalfont\textit{method=\q median\q}:]
$\displaystyle d(K,L) = \tfrac12 d(I,L)+\tfrac12 d(J,L)-\tfrac14 d(I,J)$

Define the midpoint $\vec w_K$ of a cluster $K$ iteratively as $\vec w_K=k$ if $K=\{k\}$ is a singleton and as the midpoint $\frac12(\vec w_I+\vec w_J)$ if $K$ is formed by joining $I$ and $J$. Then we have
\[
 d(A,B)=\|\vec w_A-\vec w_B\|^2
\]
in Euclidean space for all nodes $A,B$. Notice however that this distance depends on the order of the merging steps.

\item [\normalfont\textit{method=\q ward.D\q}:]
$\displaystyle d(K,L) = \frac{(|I|+|L|)\cdot d(I,L)+(|J|+|L|)\cdot d(J,L)-|L|\cdot d(I,J)}{|I|+|J|+|L|}$

The global cluster dissimilarity can be expressed as
\[
d(A,B)=\frac{2|A||B|}{|A|+|B|}\cdot\|\vec c_A-\vec c_B\|^2,
\]
where $\vec c_A$ again denotes the centroid of the points in cluster $A$.

\item [\normalfont\textit{method=\q ward.D2\q}:]
This is the equivalent of \textit{\q ward.D\q}, but for input consisting of untransformed (in particular: \textbf{non-squared}) Euclidean distances. Internally, all distances are squared first, then method \textit{ward.D} is applied, and finally the square root of all heights in the dendrogram is taken. Thus, global cluster dissimilarity can be expressed as the square root of that for \textit{ward.D}, namely
\[
d(A,B)=\sqrt{\frac{2|A||B|}{|A|+|B|}}\cdot\|\vec c_A-\vec c_B\|.
\]
\end{description}
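
As a quick plausibility check of the \textit{\q centroid\q} update (an illustrative calculation, not part of the original manual), place three points at $0$, $2$ and $5$ on the real line, so that the squared distances between the singletons $I=\{0\}$, $J=\{2\}$ and $L=\{5\}$ are $d(I,J)=4$, $d(I,L)=25$ and $d(J,L)=9$. The update formula gives
\[
d(K,L)=\frac{25+9}{2}-\frac44=16=(5-1)^2,
\]
which is indeed the squared Euclidean distance from the centroid $\vec c_K=1$ of $K=\{0,2\}$ to the point $5$.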

\item [\normalfont\texttt{\textbf{hclust.vector}}\,(\textit{X, method=\q single\q, members=NULL, metric=\q euclidean\q, p=NULL})]
\phantomsection\label{hclust.vector}
\addcontentsline{toc}{subsection}{\texttt{hclust.vector}}
This performs hierarchical, agglomerative clustering on vector data with memory-saving algorithms. While the \hyperref[hclust]{\texttt{hclust}} method requires $\Theta(N^2)$ memory for clustering of $N$ points, this method needs $\Theta(ND)$ for $N$ points in $\mathbb R^D$, which is usually much smaller.

The argument $X$ must be a two-dimensional matrix with \double{} precision values. It describes $N$ data points in $\mathbb R^D$ as an $(N\times D)$ matrix.

The parameter \textit{\q members\q} is the same as for \hyperref[hclust]{\texttt{hclust}}.

The parameter \textit{\q method\q} is one of the strings \textit{\q single\q}, \textit{\q centroid\q}, \textit{\q median\q}, \textit{\q ward\q}, or an unambiguous abbreviation thereof.

If \textit{method} is \textit{\q single\q}, single linkage clustering is performed on the data points with the metric which is specified by the \textit{metric} parameter. The choices are the same as in the \dist{} method: \textit{\q euclidean\q}, \textit{\q maximum\q}, \textit{\q manhattan\q}, \textit{\q canberra\q}, \textit{\q binary\q} and \textit{\q minkowski\q}. Any unambiguous substring can be given. The parameter \textit{p} is used for the \textit{\q minkowski\q} metric only.

The call
\begin{quote}
\texttt{hclust.vector(X, method=\q single\q, metric=[...])}
\end{quote}
is equivalent to
\begin{quote}
\texttt{hclust(dist(X, metric=[...]), method=\q single\q)}
\end{quote}
but uses less memory and is equally fast. Ties may be resolved differently, ie.\ if two pairs of nodes have equal, minimal dissimilarity values at some point (in the specific computer's floating-point representation), either pair may be chosen for the next merging step in the dendrogram.

Note that the formula for the \textit{\q canberra\q} metric changed in R 3.5.0: Before R version 3.5.0, the \textit{\q canberra\q} metric was computed as
\[
 d(u,v) = \sum_j\frac{|u_j-v_j|}{|u_j+v_j|}.
\]
Starting with R version 3.5.0, the formula was corrected to
\[
 d(u,v) = \sum_j\frac{|u_j-v_j|}{|u_j|+|v_j|}.
\]
Summands with $u_j=v_j=0$ always contribute 0 to the sum. The second, newer formula equals SciPy's definition.

The fastcluster package detects the R version at runtime and chooses the formula accordingly, so that fastcluster and the \dist{} method always use the same formula for a given R version.

If \textit{method} is one of \textit{\q centroid\q}, \textit{\q median\q}, \textit{\q ward\q}, clustering is performed with respect to Euclidean distances. In this case, the parameter \textit{metric} must be \textit{\q euclidean\q}. Notice that \texttt{hclust.vector} operates on Euclidean distances for compatibility with the \dist{} method, while \hyperref[hclust]{\texttt{hclust}} assumes \textbf{squared} Euclidean distances for compatibility with the \href{https://stat.ethz.ch/R-manual/R-patched/library/stats/html/hclust.html}{\texttt{stats::hclust}} method! Hence, the call
\phantomsection\label{squared}
\begin{quote}
\texttt{hc = hclust.vector(X, method=\q centroid\q)}
\end{quote}
is, aside from the lesser memory requirements, equivalent to
\begin{quote}
\texttt{d = dist(X)}\\
\texttt{hc = hclust(d\textasciicircum 2, method=\q centroid\q)}\\
\texttt{hc\$height = sqrt(hc\$height)}
\end{quote}
The same applies to the \textit{\q median\q} method. The \textit{\q ward\q} method in \hyperref[hclust.vector]{\texttt{hclust.vector}} is equivalent to \hyperref[hclust]{\texttt{hclust}} with method \textit{\q ward.D2\q}, but to method \textit{\q ward.D\q} only after squaring as above. Differences in these algebraically equivalent methods may arise only from floating-point inaccuracies and the resolution of ties (which may, however, in extreme cases affect the entire clustering result due to the inherently unstable nature of the clustering schemes).
\end{methods}
\section{The Python interface}

The fastcluster package is imported as usual by:
\begin{quote}
  \texttt{import fastcluster}
\end{quote}

It provides the following functions:
\begin{quote}
    \hyperref[linkage]{\texttt{linkage}}\,(\textit{X, method=\q single\q, metric=\q euclidean\q, preserve\_input=True})\\
    \hyperref[single]{\texttt{single}}\,($X$)\\
    \hyperref[complete]{\texttt{complete}}\,($X$)\\
    \hyperref[average]{\texttt{average}}\,($X$)\\
    \hyperref[weighted]{\texttt{weighted}}\,($X$)\\
    \hyperref[ward]{\texttt{ward}}\,($X$)\\
    \hyperref[centroid]{\texttt{centroid}}\,($X$)\\
    \hyperref[median]{\texttt{median}}\,($X$)\\
    \hyperref[linkage_vector]{\texttt{linkage\_vector}}\,(\textit{X, method=\q single\q, metric=\q euclidean\q, extraarg=None})
\end{quote}

The following sections contain comprehensive descriptions of these methods.

\begin{methods}
\item [\normalfont\texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q single\q, metric=\q euclidean\q, preserve\_input=\q True\q})]
\phantomsection\label{linkage}
\addcontentsline{toc}{subsection}{\texttt{linkage}}

Hierarchical, agglomerative clustering on a condensed dissimilarity matrix or on vector data.

Apart from the argument \textit{preserve\_input}, the method has the same input parameters and output format as the function of the same name in the module \hierarchy.

The argument $X$ is preferably a \NumPyarray{} with floating point entries (\texttt{X.dtype\hskip0pt==\hskip0pt numpy.double}). Any other data format will be converted before it is processed. NumPy's \maskedarrays{} are not treated as special, and the mask is simply ignored.

If $X$ is a one-dimensional array, it is considered a condensed matrix of pairwise dissimilarities in the format which is returned by \pdist. It contains the flattened, upper-triangular part of a pairwise dissimilarity matrix. That is, if there are $N$ data points and the matrix $d$ contains the dissimilarity between the $i$-th and $j$-th observation at position $d_{i,j}$, the vector $X$ has length $\binom N2$ and is ordered as follows:
\[
d
= \begin{pmatrix}
0&d_{0,1}&d_{0,2}&\ldots&d_{0,N-1}\\
 & 0&d_{1,2} & \ldots\\
&&0&\ldots\\
&&&\ddots\\
&&&&0
 \end{pmatrix}
= \begin{pmatrix}
0&X[0] &X[1]&\ldots&X[N-2]\\
 & 0&X[N-1] & \ldots\\
&&0&\ldots\\
&&&\ddots\\
&&&&0
 \end{pmatrix}
\]
The \textit{metric} argument is ignored in case of dissimilarity input.
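
The indexing convention can be illustrated as follows (a minimal sketch; \texttt{squareform} is SciPy's helper for converting between the condensed and the square format):
\begin{verbatim}
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Four points in the plane; pdist returns the 6 upper-triangular
    # entries d(0,1), d(0,2), d(0,3), d(1,2), d(1,3), d(2,3) in this order.
    points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    X = pdist(points)
    assert X.shape == (6,)
    d = squareform(X)              # full, symmetric (4 x 4) matrix
    assert d[0, 1] == X[0] and d[2, 3] == X[5]
\end{verbatim}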

The optional argument \textit{preserve\_input} specifies whether the method makes a working copy of the dissimilarity vector or writes temporary data into the existing array. If the dissimilarities are generated for the clustering step only and are not needed afterward, approximately half the memory can be saved by specifying \textit{preserve\_input=False}. Note that the input array $X$ contains unspecified values after this procedure. It is therefore safer to write
\begin{verbatim}
    linkage(X, method="...", preserve_input=False)
    del X
\end{verbatim}
to make sure that the matrix $X$ is not accessed accidentally  after it has been
used as scratch memory. (The single linkage algorithm does not write to the distance matrix or its copy anyway, so the \textit{preserve\_input} flag has no effect in this case.)

If $X$ contains vector data, it must be a two-dimensional array with $N$ observations in $D$ dimensions as an $(N\times D)$ array. The  \textit{preserve\_input} argument is ignored in this case. The specified \textit{metric} is used to generate pairwise distances from the input. The following two function calls yield equivalent output:
\begin{verbatim}
    linkage(pdist(X, metric), method="...", preserve_input=False)
    linkage(X, metric=metric, method="...")
\end{verbatim}

The two results are identical in most cases, but differences occur if ties are resolved differently: if the minimum in step 2 below is attained for more than one pair of nodes, either pair may be chosen. It is not guaranteed that both \texttt{linkage} variants choose the same pair in this case.

The general scheme of the agglomerative clustering procedure is as follows (an illustrative code sketch follows the list):
\begin{enumerate}
\item Start with $N$ singleton clusters (nodes) labeled $0,\ldots, N-1$, which represent the input points.
\item Find a pair of nodes with minimal distance among all pairwise distances.
\item Join the two nodes into a new node and remove the two old nodes. The new nodes are labeled consecutively $N,N+1,\ldots$
\item The distances from the new node to all other nodes are determined by the \textit{method} parameter (see below).
\item Repeat $N-1$ times from step 2, until there is one big node, which contains all original input points.
\end{enumerate}
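
The following deliberately naive sketch (for illustration only; it is not the algorithm that fastcluster uses, which is far more efficient) implements this scheme for single linkage on a full distance matrix:
\begin{verbatim}
    import numpy as np

    def naive_single_linkage(d):
        # d: full, symmetric (N x N) dissimilarity matrix.
        N = len(d)
        dist = {(i, j): d[i][j] for i in range(N) for j in range(i + 1, N)}
        sizes = {i: 1 for i in range(N)}    # step 1: N singletons 0,...,N-1
        Z = []
        for new in range(N, 2 * N - 1):
            i, j = min(dist, key=dist.get)            # step 2: closest pair
            h = dist.pop((i, j))
            Z.append([i, j, h, sizes[i] + sizes[j]])
            sizes[new] = sizes.pop(i) + sizes.pop(j)  # step 3: join i and j
            for k in list(sizes):                     # step 4: new distances
                if k == new:
                    continue
                dik = dist.pop(tuple(sorted((i, k))))
                djk = dist.pop(tuple(sorted((j, k))))
                dist[(k, new)] = min(dik, djk)        # single linkage rule
        return np.array(Z)                            # N-1 merging steps
\end{verbatim}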

The output of \texttt{linkage} is a \emph{stepwise dendrogram}, which is represented as an $(N-1)\times 4$ \NumPyarray{} with floating point entries (\texttt{dtype=numpy.double}). The first two columns contain the node indices which are joined in each step. The input nodes are labeled $0,\ldots,N-1$, and the newly generated nodes have the labels $N,\ldots, 2N-2$. The third column contains the distance between the two nodes at each step, ie.\ the current minimal distance at the time of the merge. The fourth column counts the number of points which comprise each new node.
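
For example (a minimal sketch; ties are resolved as discussed above):
\begin{verbatim}
    import numpy as np
    import fastcluster

    # Condensed dissimilarities for N = 3 points:
    # d(0,1) = 1, d(0,2) = 3, d(1,2) = 2.
    X = np.array([1.0, 3.0, 2.0])
    Z = fastcluster.linkage(X, method='single')
    # First row: nodes 0 and 1 merge at distance 1.0 into node 3 (2 points).
    # Second row: node 3 and node 2 merge at distance 2.0 (3 points).
    print(Z)
\end{verbatim}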

The parameter \textit{method} specifies which clustering scheme to use. The clustering scheme determines the distance from a new node to the other nodes. Denote the dissimilarities by $d$, the nodes to be joined by $I,J$, the new node by $K$ and any other node by $L$. The symbol $|I|$ denotes the size of the cluster $I$.

\begin{description}
\item [\normalfont\textit{method=\q single\q}:]
$\displaystyle d(K,L) = \min(d(I,L), d(J,L))$

The distance between two clusters $A,B$ is the closest distance between any two points in each cluster:
\[
 d(A,B)=\min_{a\in A, b\in B}d(a,b)
\]

\item [\normalfont\textit{method=\q complete\q}:]
$\displaystyle d(K,L) = \max(d(I,L), d(J,L))$

The distance between two clusters $A,B$ is the maximal distance between any two points in each cluster:
\[
 d(A,B)=\max_{a\in A, b\in B}d(a,b)
\]

\item [\normalfont\textit{method=\q average\q}:]
$\displaystyle d(K,L) = \frac{|I|\cdot d(I,L)+|J|\cdot d(J,L)}{|I|+|J|}$

The distance between two clusters $A,B$ is the average distance between the points in the two clusters:
\[
 d(A,B)=\frac1{|A||B|}\sum_{a\in A, b\in B}d(a,b)
\]

\item [\normalfont\textit{method=\q weighted\q}:]
$\displaystyle d(K,L) = \tfrac12(d(I,L)+d(J,L))$

There is no global description for the distance between clusters since the distance depends on the order of the merging steps.
\end{description}

The following three methods are intended for Euclidean data only, ie.\ when $X$ contains the pairwise (non-squared!)\ distances between vectors in Euclidean space. The algorithm will work on any input, however, and it is up to the user to make sure that applying the methods makes sense.

\begin{description}
\item [\normalfont\textit{method=\q centroid\q}:]
$\displaystyle d(K,L) = \sqrt{\frac{|I|\cdot d(I,L)^2+|J|\cdot d(J,L)^2}{|I|+|J|}-\frac{|I|\cdot|J|\cdot d(I,J)^2}{(|I|+|J|)^2}}$

There is a geometric interpretation: $d(A,B)$ is the distance between the centroids (ie.\ barycenters) of the clusters in Euclidean space:
\[
  d(A,B) = \|\vec c_A-\vec c_B\|,
\]
where $\vec c_A$ denotes the centroid of the points in cluster $A$.

\item [\normalfont\textit{method=\q median\q}:]
$\displaystyle d(K,L) = \sqrt{\tfrac12 d(I,L)^2+\tfrac12 d(J,L)^2-\tfrac14 d(I,J)^2}$

Define the midpoint $\vec w_K$ of a cluster $K$ iteratively as $\vec w_K=k$ if $K=\{k\}$ is a singleton and as the midpoint $\frac12(\vec w_I+\vec w_J)$ if $K$ is formed by joining $I$ and $J$. Then we have
\[
 d(A,B)=\|\vec w_A-\vec w_B\|
\]
in Euclidean space for all nodes $A,B$. Notice however that this distance depends on the order of the merging steps.

\item [\normalfont\textit{method=\q ward\q}:] \raggedright
$\displaystyle d(K,L) = \sqrt{\frac{(|I|+|L|)\cdot d(I,L)^2+(|J|+|L|)\cdot d(J,L)^2-|L|\cdot d(I,J)^2}{|I|+|J|+|L|}}$

The global cluster dissimilarity can be expressed as
\[
d(A,B)=\sqrt{\frac{2|A||B|}{|A|+|B|}}\cdot\|\vec c_A-\vec c_B\|,
\]
where $\vec c_A$ again denotes the centroid of the points in cluster $A$.
\end{description}
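
As an illustrative numerical check (not part of the original manual): for three collinear points, the height of the final \textit{\q centroid\q} merge equals the Euclidean distance between the centroid of the first merged pair and the remaining point.
\begin{verbatim}
    import numpy as np
    from scipy.spatial.distance import pdist
    import fastcluster

    # Points 0, 2 and 5 on the real line; pdist yields non-squared distances.
    X = np.array([[0.0], [2.0], [5.0]])
    Z = fastcluster.centroid(pdist(X))
    # 0 and 2 merge first; their centroid is 1, so the final merge height
    # should be |5 - 1| = 4.
    assert np.isclose(Z[-1, 2], 4.0)
\end{verbatim}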

\item [\normalfont\texttt{fastcluster.\textbf{single}}\,(\textit{X})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{single}}\label{single}
Alias for \texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q single\q}).

\item [\normalfont\texttt{fastcluster.\textbf{complete}}\,(\textit{X})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{complete}}\label{complete}
Alias for \texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q complete\q}).

\item [\normalfont\texttt{fastcluster.\textbf{average}}\,(\textit{X})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{average}}\label{average}
Alias for \texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q average\q}).

\item [\normalfont\texttt{fastcluster.\textbf{weighted}}\,(\textit{X})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{weighted}}\label{weighted}
Alias for \texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q weighted\q}).

\item [\normalfont\texttt{fastcluster.\textbf{centroid}}\,(\textit{X})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{centroid}}\label{centroid}
Alias for \texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q centroid\q}).

\item [\normalfont\texttt{fastcluster.\textbf{median}}\,(\textit{X})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{median}}\label{median}
Alias for \texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q median\q}).

\item [\normalfont\texttt{fastcluster.\textbf{ward}}\,(\textit{X})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{ward}}\label{ward}
Alias for \texttt{fastcluster.\textbf{linkage}}\,(\textit{X, method=\q ward\q}).

\item [\normalfont\texttt{fastcluster.\textbf{linkage\_vector}}\,(\textit{X, method=\q single\q, metric=\q euclidean\q, extraarg=\q None\q})]
\phantomsection\addcontentsline{toc}{subsection}{\texttt{linkage\_vector}}\label{linkage_vector}
This performs hierarchical, agglomerative clustering on vector data with memory-saving algorithms. While the \hyperref[linkage]{\texttt{linkage}} method requires $\Theta(N^2)$ memory for clustering of $N$ points, this method needs $\Theta(ND)$ for $N$ points in $\mathbb R^D$, which is usually much smaller.

The argument $X$ has the same format as before, when $X$ describes vector data, ie.\ it is an $(N\times D)$ array. The output array also has the same format. The parameter \textit{method} must be one of \textit{\q single\q}, \textit{\q centroid\q}, \textit{\q median\q}, \textit{\q ward\q}, ie.\ memory-saving algorithms currently exist only for these methods. If \textit{method} is one of \textit{\q centroid\q}, \textit{\q median\q}, \textit{\q ward\q}, the \textit{metric} must be \textit{\q euclidean\q}.

Like the \texttt{linkage} method, \texttt{linkage\_vector} does not treat NumPy's \maskedarrays{} as special and simply ignores the mask.

For single linkage clustering, any dissimilarity function may be chosen. Basically, every metric which is implemented in the method \pdist{} is reimplemented here. However, the metrics differ in some instances since a number of mistakes and typos (both in the code and in the documentation) were corrected in the \textit{fastcluster} package.\footnote{Hopefully, the SciPy metric will be corrected in future versions and some day coincide with the \textit{fastcluster} definitions. See the bug reports at \url{https://github.com/scipy/scipy/issues/2009}, \url{https://github.com/scipy/scipy/issues/2011}.}

Therefore, the available metrics with their definitions are listed below as a reference. The symbols $u$ and $v$ mostly denote vectors in $\mathbb R^D$ with coordinates $u_j$ and $v_j$ respectively. See below for additional metrics for Boolean vectors. Unless otherwise stated, the input array $X$ is converted to a floating point array (\texttt{X.dtype==\allowbreak numpy.double}) if it does not already have the required data type. Some metrics accept Boolean input; where this is the case, it is stated explicitly below.

\begin{description}
\item[\normalfont\textit{\q euclidean\q}:] Euclidean metric, $L_2$ norm
\[
 d(u,v) = \| u-v\|_2 = \sqrt{\sum_j (u_j-v_j)^2}
\]

\item[\normalfont\textit{\q sqeuclidean\q}:] squared Euclidean metric
\[
 d(u,v) = \| u-v\|^2_2 = \sum_j (u_j-v_j)^2
\]

\item[\normalfont\textit{\q seuclidean\q}:] standardized Euclidean metric
\[
 d(u,v) = \sqrt{\sum_j (u_j-v_j)^2 /V_j}
\]

The vector $V=(V_0,\ldots,V_{D-1})$ is given as the \textit{extraarg} argument. If no \textit{extraarg} is given, $V_j$ is by default the unbiased sample variance of all observations in the $j$-th coordinate, $V_j = \Var_i(X_{i,j})=\frac1{N-1}\sum_i(X_{i,j}^2-\mu(X_j)^2)$. (Here, $\mu(X_j)$ denotes as usual the mean of $X_{i,j}$ over all rows $i$.)
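
For instance, the default behavior can be reproduced by passing the per-coordinate sample variances explicitly (a sketch under the assumptions above):
\begin{verbatim}
    import numpy as np
    import fastcluster

    X = np.random.rand(10, 3)
    V = X.var(axis=0, ddof=1)     # unbiased sample variance per coordinate
    Z = fastcluster.linkage_vector(X, method='single',
                                   metric='seuclidean', extraarg=V)
\end{verbatim}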

\item[\normalfont\textit{\q mahalanobis\q}:] Mahalanobis distance
\[
 d(u,v) = \sqrt{(u-v)^{\mkern-3mu\top}V (u-v)}
\]
Here, $V=\textit{extraarg}$, a $(D\times D)$-matrix. If $V$ is not specified, the inverse of the covariance matrix \texttt{numpy.linalg.inv(numpy.\allowbreak cov(\allowbreak X, rowvar=False))} is used:
\[
 (V^{-1})_{j,k} = \frac1{N-1} \sum_i (X_{i,j}-\mu(X_j))(X_{i,k}-\mu(X_k))
\]

\item[\normalfont\textit{\q cityblock\q}:] the Manhattan distance, $L_1$ norm
\[
 d(u,v) = \sum_j |u_j-v_j|
\]

\item[\normalfont\textit{\q chebychev\q}:] the supremum norm, $L_\infty$ norm
\[
 d(u,v) = \max_j |u_j-v_j|
\]

\item[\normalfont\textit{\q minkowski\q}:] the $L_p$ norm
\[
 d(u,v) = \left(\sum_j |u_j-v_j|^p\right)^{1/p}
\]
This metric coincides with the \textit{cityblock}, \textit{euclidean} and \textit{chebychev} metrics for $p=1$, $p=2$ and $p=\infty$ (\texttt{numpy.inf}), respectively. The parameter $p$ is given as the \textit{\q extraarg\q} argument.

\item[\normalfont\textit{\q cosine\q}]
\[
 d(u,v) = 1 - \frac{\langle u,v\rangle}{\|u\|\cdot\|v\|} = 1 - \frac{\sum_j u_jv_j}{\sqrt{\sum_j u_j^2\cdot \sum_j v_j^2}}
\]

\item[\normalfont\textit{\q correlation\q}:] This method first mean-centers the rows of $X$ and then applies the \textit{cosine} distance:
\[
 d(u,v) = 1 - \frac{\langle u-\mu(u),v-\mu(v)\rangle}{\|u-\mu(u)\|\cdot\|v-\mu(v)\|}.
\]
Equivalently, the \textit{correlation} distance is $1-{}$(Pearson's correlation coefficient).

\item[\normalfont\textit{\q canberra\q}]
\[
 d(u,v) = \sum_j\frac{|u_j-v_j|}{|u_j|+|v_j|}
\]
Summands with $u_j=v_j=0$ contribute 0 to the sum.

\item[\normalfont\textit{\q braycurtis\q}]
\[
 d(u,v) = \frac{\sum_j |u_j-v_j|}{\sum_j |u_j+v_j|}
\]

\item[\textnormal{(user function):}] The parameter \textit{metric} may also be a function which accepts two NumPy floating point vectors and returns a number. Eg.\ the Euclidean distance could be emulated with
\begin{quote}
\texttt{fn = lambda u, v: numpy.sqrt(((u-v)*(u-v)).sum())}\\
\texttt{linkage\_vector(X, method=\q single\q, metric=fn)}
\end{quote}
This method, however, is much slower than the built-in function.

\item[\normalfont\textit{\q hamming\q}:] The Hamming distance accepts a Boolean array (\texttt{X.\allowbreak dtype\allowbreak ==\allowbreak bool}) for efficient storage. Any other data type is converted to \texttt{numpy.double}.
\[
 d(u,v) = \frac{|\{j\mid u_j\neq v_j\}|}{D}
\]

\end{description}
The following metrics are designed for Boolean vectors. The input array is converted to the \texttt{bool} data type if it is not Boolean already. Use the following abbreviations for the entries of a contingency table:
\begin{align*}
a &= |\{j\mid u_j\land v_j \}| & b &= |\{j\mid u_j\land(\lnot v_j)\}|\\
c &= |\{j\mid (\lnot u_j)\land v_j \}| & d &= |\{j\mid (\lnot u_j)\land(\lnot v_j)\}|
\end{align*}
Recall that $D$ denotes the number of dimensions, hence $D=a+b+c+d$.
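
For two example vectors, the four counts can be computed as follows (a minimal sketch):
\begin{verbatim}
    import numpy as np

    u = np.array([True, True, False, False])
    v = np.array([True, False, True, False])
    a = np.sum(u & v)        # both True
    b = np.sum(u & ~v)       # u True, v False
    c = np.sum(~u & v)       # u False, v True
    d = np.sum(~u & ~v)      # both False
    assert a + b + c + d == u.size   # D = 4
\end{verbatim}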


\begin{description}
\item[\normalfont\textit{\q jaccard\q}]
\phantomsection\label{jaccard}
\[
 d(u,v) = \frac{b+c}{a+b+c}
\]
\[
 d(0,0) = 0
\]

\item[\normalfont\textit{\q yule\q}]
\phantomsection\label{yule}
\begin{align*}
  d(u,v) &= \frac{2bc}{ad+bc} && \text{if $bc \neq 0$}\\
  d(u,v) &= 0 && \text{if $bc = 0$}
\end{align*}
Note that the second clause $d(u,v)=0$ if $bc = 0$ was introduced in fastcluster version 1.2.0. Before, the result was NaN if the denominator in the formula was zero. fastcluster is following a \href{https://github.com/scipy/scipy/commit/3b22d1da98dc1b5f64bc944c21f398d4ba782bce}{change in SciPy 1.6.3} here.

\item[\normalfont\textit{\q dice\q}]
\begin{gather*}
 d(u,v) = \frac{b+c}{2a+b+c}\\
 d(0,0) = 0
\end{gather*}

\item[\normalfont\textit{\q rogerstanimoto\q}]
\[
 d(u,v) = \frac{2(b+c)}{b+c+D}
\]

\item[\normalfont\textit{\q russellrao\q}]
\[
 d(u,v) = \frac{b+c+d}{D}
\]

\item[\normalfont\textit{\q sokalsneath\q}]
\begin{gather*}
 d(u,v) = \frac{2(b+c)}{a+2(b+c)}\\
 d(0,0) = 0
\end{gather*}

\item[\normalfont\textit{\q kulsinski\q}]
\[
 d(u,v) = \frac 12\cdot\left(\frac b{a+b} + \frac c{a+c}\right)
\]

\item[\normalfont\textit{\q matching\q}]
\[
 d(u,v) = \frac{b+c}{D}
\]
Notice that when given a Boolean array, the \textit{matching} and \textit{hamming} distances are the same. The \textit{matching} distance formula, however, converts every input to Boolean first. Hence, the vectors $(0,1)$ and $(0,2)$ have zero \textit{matching} distance since they are both converted to $(\mathrm{False}, \mathrm{True})$, but the \textit{hamming} distance is $0.5$.
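
This difference can be observed directly (a small sketch):
\begin{verbatim}
    import numpy as np
    import fastcluster

    X = np.array([[0.0, 1.0], [0.0, 2.0]])
    # 'matching' converts to Boolean first: both rows become (False, True).
    Zm = fastcluster.linkage_vector(X, metric='matching')
    # 'hamming' compares the floating point coordinates directly.
    Zh = fastcluster.linkage_vector(X, metric='hamming')
    assert Zm[0, 2] == 0.0 and Zh[0, 2] == 0.5
\end{verbatim}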

\item[\normalfont\textit{\q sokalmichener\q}] is an alias for \textit{\q matching\q}.

\end{description}

\end{methods}

\section{Behavior for NaN and infinite values}\label{sec:infnan}

Whenever the fastcluster package encounters a NaN value as the distance between nodes, either as the initial distance or as an updated distance after some merging steps, it raises an error. This is intentional, even though there might be ways to propagate NaNs through the algorithms in a more or less sensible way. Indeed, since the clustering result depends on every single distance value, the presence of NaN values usually indicates a dubious clustering result, and therefore NaN values should be eliminated in preprocessing.
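
For instance, a NaN value in the input is rejected rather than propagated (a sketch; the exact error type and message are not specified here):
\begin{verbatim}
    import numpy as np
    import fastcluster

    X = np.array([1.0, np.nan, 2.0])   # condensed distances with a NaN
    try:
        fastcluster.linkage(X, method='average')
    except Exception as err:           # fastcluster raises an error here
        print('input rejected:', err)
\end{verbatim}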

In the R interface for vector input, coordinates with {\NA} value are interpreted as missing data and treated in the same way as R's {\dist} function does. This results in valid output whenever the resulting distances are not NaN. The Python interface does not provide any way of handling missing coordinates; in this case, the data should be preprocessed accordingly and given to the clustering algorithms as pairwise distances.

The fastcluster package handles node distances and coordinates with infinite values correctly, as long as the formulas for the distance updates and the metric (in case of vector input) make sense. In concordance with the statement above, an error is produced if a NaN value results from performing arithmetic with infinity. Also, the usual proviso applies: internal formulas in the code are mathematically equivalent to the formulas as stated in the documentation only for finite, real numbers but might produce different results for $\pm\infty$. Apart from obvious cases like single or complete linkage, it is therefore recommended that users think about how they want infinite values to be treated by the distance update and metric formulas and then check whether the fastcluster code does exactly what they want in these special cases.

\section{Differences between the two interfaces}
\begin{itemize}
\item The \textit{\q mcquitty\q} method in R is called \textit{\q weighted\q} in Python.
\item R and SciPy use different conventions for the ``Euclidean'' methods \textit{\q centroid\q}, \textit{\q median\q}! R assumes that the dissimilarity matrix consists of squared Euclidean distances, while SciPy expects non-squared Euclidean distances. The fastcluster package respects these conventions and uses different formulas in the two interfaces.

The \textit{\q ward\q} method in the Python interface is identical to \textit{\q ward.D2\q} in the R interface.

If the same results are to be obtained from both interfaces, the \hyperref[hclust]{\texttt{hclust}} function in R must be given the entry-wise square of the distance matrix, \verb!d^2!, as input for the \textit{\q ward.D\q}, \textit{\q centroid\q} and \textit{\q median\q} methods, and afterwards the square root of the height field in the dendrogram must be taken. The \hyperref[hclust.vector]{\texttt{hclust.vector}} method calculates non-squared Euclidean distances, like R's \dist{} method and identically to the Python interface. See the \hyperref[squared]{example} in the \hyperref[hclust.vector]{\texttt{hclust.vector}} documentation above.

For the \textit{\q average\q} and \textit{\q weighted\q} (alias \textit{\q mcquitty\q}) methods, the same non-squared distance matrix \texttt{d} as in the Python interface must be used to obtain the same results. The \textit{\q single\q} and \textit{\q complete\q} methods depend only on the relative order of the distances, hence it does not make a difference whether the method operates on the distances or the squared distances.

The code example in the R documentation (enter \texttt{?hclust} or \texttt{ex\-am\-ple(hclust)} in R) contains another instance where the squared distance matrix is generated from Euclidean data.

\item The Python interface is not designed to deal with missing values, and NaN values in the vector data raise an error message. The \hyperref[hclust.vector]{\texttt{hclust.vector}} method in the R interface, in contrast, deals with NaN and the (R-specific) {\NA} values in the same way as the \dist{} method does. See the documentation of \dist{} for details.
\end{itemize}

\section{References}
\begin{trivlist}
\item
\textit{NumPy: Scientific computing tools for Python}, \url{https://numpy.org/}.

\item
Eric Jones, Travis Oliphant, Pearu Peterson et al., \textit{SciPy: Open Source Scientific Tools for Python}, 2001, \url{https://www.scipy.org}.

\item
\textit{R: A Language and Environment for Statistical Computing}, R Foundation for Statistical Computing, Vienna, 2011, \url{https://www.r-project.org}.
\end{trivlist}
\end{document}
%%% Local variables:
%%% mode: latex
%%% TeX-master: "fastcluster.Rtex"
%%% TeX-PDF-mode: t
%%% End: