File: chapter_cluster.rst

.. _`chapter:cluster`:

Cluster analysis
================

Cluster analysis is the grouping of items into clusters based on the
similarity of the items to each other. In bioinformatics, clustering is
widely used in gene expression data analysis to find groups of genes
with similar gene expression profiles. This may identify functionally
related genes, as well as suggest the function of presently unknown
genes.

The Biopython module ``Bio.Cluster`` provides commonly used clustering
algorithms and was designed with the application to gene expression data
in mind. However, this module can also be used for cluster analysis of
other types of data. ``Bio.Cluster`` and the underlying C Clustering
Library are described by De Hoon *et al.* [DeHoon2004]_.

The following four clustering approaches are implemented in
``Bio.Cluster``:

-  Hierarchical clustering (pairwise centroid-, single-, complete-, and
   average-linkage);

-  :math:`k`-means, :math:`k`-medians, and :math:`k`-medoids clustering;

-  Self-Organizing Maps;

-  Principal Component Analysis.

Data representation
~~~~~~~~~~~~~~~~~~~

The data to be clustered are represented by an :math:`n \times m`
Numerical Python array ``data``. Within the context of gene expression
data clustering, typically the rows correspond to different genes
whereas the columns correspond to different experimental conditions. The
clustering algorithms in ``Bio.Cluster`` can be applied both to rows
(genes) and to columns (experiments).

Missing values
~~~~~~~~~~~~~~

The :math:`n \times m` Numerical Python integer array ``mask`` indicates
if any of the values in ``data`` are missing. If ``mask[i, j] == 0``,
then ``data[i, j]`` is missing and is ignored in the analysis.
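
For example, if missing values are stored as ``NaN`` in the data array,
a suitable mask can be built with NumPy (a minimal sketch; the variable
names are illustrative):

.. code:: pycon

   >>> import numpy as np
   >>> data = np.array([[1.0, 2.0, np.nan],
   ...                  [4.0, 5.0, 6.0]])
   >>> mask = (~np.isnan(data)).astype(int)  # 1 = value present, 0 = value missing
   >>> mask
   array([[1, 1, 0],
          [1, 1, 1]])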

Random number generator
~~~~~~~~~~~~~~~~~~~~~~~

The :math:`k`-means/medians/medoids clustering algorithms and
Self-Organizing Maps (SOMs) include the use of a random number
generator. The uniform random number generator in ``Bio.Cluster`` is
based on the algorithm by L’Ecuyer [Lecuyer1988]_,
while random numbers following the binomial distribution are generated
using the BTPE algorithm by Kachitvichyanukul and Schmeiser
[Kachitvichyanukul1988]_. The random number generator
is initialized automatically during its first call. As this random
number generator uses a combination of two multiplicative linear
congruential generators, two (integer) seeds are needed for
initialization, for which we use the system-supplied random number
generator ``rand`` (in the C standard library). We initialize this
generator by calling ``srand`` with the epoch time in seconds, and use
the first two random numbers generated by ``rand`` as seeds for the
uniform random number generator in ``Bio.Cluster``.

.. _`sec:distancefunctions`:

Distance functions
------------------

In order to cluster items into groups based on their similarity, we
should first define what exactly we mean by *similar*. ``Bio.Cluster``
provides eight distance functions, indicated by a single character, to
measure similarity, or conversely, distance:

-  ``'e'``: Euclidean distance;

-  ``'b'``: City-block distance;

-  ``'c'``: Pearson correlation coefficient;

-  ``'a'``: Absolute value of the Pearson correlation coefficient;

-  ``'u'``: Uncentered Pearson correlation (equivalent to the cosine of
   the angle between two data vectors);

-  ``'x'``: Absolute uncentered Pearson correlation;

-  ``'s'``: Spearman’s rank correlation;

-  ``'k'``: Kendall’s :math:`\tau`.

The first two are true distance functions that satisfy the triangle
inequality:

.. math:: d\left(\underline{u},\underline{v}\right) \leq d\left(\underline{u},\underline{w}\right) + d\left(\underline{w},\underline{v}\right) \textrm{ for all } \underline{u}, \underline{v}, \underline{w},

and are therefore referred to as *metrics*. In everyday language, this
means that the shortest distance between two points is a straight line.

The remaining six distance measures are related to the correlation
coefficient, where the distance :math:`d` is defined in terms of the
correlation :math:`r` by :math:`d=1-r`. Note that these distance
functions are *semi-metrics* that do not satisfy the triangle
inequality. For example, for

.. math:: \underline{u}=\left(1,0,-1\right);

.. math:: \underline{v}=\left(1,1,0\right);

.. math:: \underline{w}=\left(0,1,1\right);

we find a Pearson distance
:math:`d\left(\underline{u},\underline{w}\right) = 1.8660`, while
:math:`d\left(\underline{u},\underline{v}\right)+d\left(\underline{v},\underline{w}\right) = 1.6340`.
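
This can be checked numerically; the following sketch uses plain NumPy
(``pearson_distance`` is an illustrative helper, not part of
``Bio.Cluster``) to reproduce the numbers quoted above:

.. code:: pycon

   >>> import numpy as np
   >>> def pearson_distance(x, y):
   ...     return 1.0 - np.corrcoef(x, y)[0, 1]  # d = 1 - r
   ...
   >>> u = np.array([1.0, 0.0, -1.0])
   >>> v = np.array([1.0, 1.0, 0.0])
   >>> w = np.array([0.0, 1.0, 1.0])
   >>> print(round(pearson_distance(u, w), 4))
   1.866
   >>> print(round(pearson_distance(u, v) + pearson_distance(v, w), 4))
   1.634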

Euclidean distance
~~~~~~~~~~~~~~~~~~

In ``Bio.Cluster``, we define the Euclidean distance as

.. math:: d = {1 \over n} \sum_{i=1}^{n} \left(x_i-y_i\right)^{2}.

Only those terms are included in the summation for which both
:math:`x_i` and :math:`y_i` are present, and the denominator :math:`n`
is chosen accordingly. As the expression data :math:`x_i` and
:math:`y_i` are subtracted directly from each other, we should make sure
that the expression data are properly normalized when using the
Euclidean distance.

City-block distance
~~~~~~~~~~~~~~~~~~~

The city-block distance, alternatively known as the Manhattan distance,
is related to the Euclidean distance. Whereas the Euclidean distance
corresponds to the length of the shortest path between two points, the
city-block distance is the sum of distances along each dimension. As
gene expression data tend to have missing values, in ``Bio.Cluster`` we
define the city-block distance as the sum of distances divided by the
number of dimensions:

.. math:: d = {1 \over n} \sum_{i=1}^n \left|x_i-y_i\right|.

This is equal to the distance you would have to walk between two points
in a city, where you have to walk along city blocks. As for the
Euclidean distance, the expression data are subtracted directly from
each other, and we should therefore make sure that they are properly
normalized.

The Pearson correlation coefficient
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Pearson correlation coefficient is defined as

.. math:: r = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i -\bar{x}}{\sigma_x} \right) \left(\frac{y_i -\bar{y}}{\sigma_y} \right),

in which :math:`\bar{x}, \bar{y}` are the sample mean of :math:`x` and
:math:`y` respectively, and :math:`\sigma_x, \sigma_y` are the sample
standard deviation of :math:`x` and :math:`y`. The Pearson correlation
coefficient is a measure for how well a straight line can be fitted to a
scatterplot of :math:`x` and :math:`y`. If all the points in the
scatterplot lie on a straight line, the Pearson correlation coefficient
is either +1 or -1, depending on whether the slope of the line is positive
or negative. If the Pearson correlation coefficient is equal to zero,
there is no correlation between :math:`x` and :math:`y`.

The *Pearson distance* is then defined as

.. math:: d_{\textrm{P}} \equiv 1 - r.

As the Pearson correlation coefficient lies between -1 and 1, the
Pearson distance lies between 0 and 2.

Absolute Pearson correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By taking the absolute value of the Pearson correlation, we find a
number between 0 and 1. If the absolute value is 1, all the points in
the scatter plot lie on a straight line with either a positive or a
negative slope. If the absolute value is equal to zero, there is no
correlation between :math:`x` and :math:`y`.

The corresponding distance is defined as

.. math:: d_{\textrm A} \equiv 1 - \left|r\right|,

where :math:`r` is the Pearson correlation coefficient. As the absolute
value of the Pearson correlation coefficient lies between 0 and 1, the
corresponding distance lies between 0 and 1 as well.

In the context of gene expression experiments, the absolute correlation
is equal to 1 if the gene expression profiles of two genes are either
exactly the same or exactly opposite. The absolute correlation
coefficient should therefore be used with care.

Uncentered correlation (cosine of the angle)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some cases, it may be preferable to use the *uncentered correlation*
instead of the regular Pearson correlation coefficient. The uncentered
correlation is defined as

.. math:: r_{\textrm U} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i}{\sigma_x^{(0)}} \right) \left(\frac{y_i}{\sigma_y^{(0)}} \right),

where

.. math::

   \begin{aligned}
   \sigma_x^{(0)} & = & \sqrt{{\frac{1}{n}} \sum_{i=1}^{n}x_i^2}; \nonumber \\
   \sigma_y^{(0)} & = & \sqrt{{\frac{1}{n}} \sum_{i=1}^{n}y_i^2}. \nonumber
   \end{aligned}

This is the same expression as for the regular Pearson correlation
coefficient, except that the sample means :math:`\bar{x}, \bar{y}` are
set equal to zero. The uncentered correlation may be appropriate if
there is a zero reference state. For instance, in the case of gene
expression data given in terms of log-ratios, a log-ratio equal to zero
corresponds to the green and red signal being equal, which means that
the experimental manipulation did not affect the gene expression.

The distance corresponding to the uncentered correlation coefficient is
defined as

.. math:: d_{\mbox{U}} \equiv 1 - r_{\mbox{U}},

where :math:`r_{\mbox{U}}` is the uncentered correlation. As the
uncentered correlation coefficient lies between -1 and 1, the
corresponding distance lies between 0 and 2.

The uncentered correlation is equal to the cosine of the angle between
the two data vectors in :math:`n`-dimensional space, and is often referred
to as such.

Absolute uncentered correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As for the regular Pearson correlation, we can define a distance measure
using the absolute value of the uncentered correlation:

.. math:: d_{\mbox{AU}} \equiv 1 - \left|r_{\mbox{U}}\right|,

where :math:`r_{\mbox{U}}` is the uncentered correlation coefficient. As
the absolute value of the uncentered correlation coefficient lies
between 0 and 1, the corresponding distance lies between 0 and 1 as
well.

Geometrically, the absolute value of the uncentered correlation is equal
to the cosine of the angle between the supporting lines of the two data
vectors (i.e., the angle without taking the direction of the vectors
into consideration).

Spearman rank correlation
~~~~~~~~~~~~~~~~~~~~~~~~~

The Spearman rank correlation is an example of a non-parametric
similarity measure, and tends to be more robust against outliers than
the Pearson correlation.

To calculate the Spearman rank correlation, we replace each data value
by its rank, obtained by sorting the values within each data vector.
We then calculate the Pearson correlation between the two rank vectors
instead of the data vectors.

As in the case of the Pearson correlation, we can define a distance
measure corresponding to the Spearman rank correlation as

.. math:: d_{\mbox{S}} \equiv 1 - r_{\mbox{S}},

where :math:`r_{\mbox{S}}` is the Spearman rank correlation.

Kendall’s :math:`\tau`
~~~~~~~~~~~~~~~~~~~~~~

Kendall’s :math:`\tau` is another example of a non-parametric similarity
measure. It is similar to the Spearman rank correlation, but instead of
the ranks themselves only the relative ranks are used to calculate
:math:`\tau` (see Snedecor & Cochran [Snedecor1989]_).

We can define a distance measure corresponding to Kendall’s :math:`\tau`
as

.. math:: d_{\mbox{K}} \equiv 1 - \tau.

As Kendall’s :math:`\tau` is always between -1 and 1, the corresponding
distance will be between 0 and 2.

Weighting
~~~~~~~~~

For most of the distance functions available in ``Bio.Cluster``, a
weight vector can be applied. The weight vector contains weights for the
items in the data vector. If the weight for item :math:`i` is
:math:`w_i`, then that item is treated as if it occurred :math:`w_i`
times in the data. The weights do not have to be integers.
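
As a minimal sketch (using the ``distancematrix`` function described in
the next section, with made-up data), a weight of 2 makes the third
column count twice in the weighted Euclidean distance:

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import distancematrix
   >>> data = np.array([[0.0, 0.0, 0.0],
   ...                  [1.0, 1.0, 2.0]])
   >>> weight = np.array([1.0, 1.0, 2.0])  # the third column counts twice
   >>> distances = distancematrix(data, weight=weight, dist="e")
   >>> print(distances[1])  # (1*1 + 1*1 + 2*4) / (1 + 1 + 2)
   [2.5]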

.. _`sec:distancematrix`:

Calculating the distance matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The distance matrix is a square matrix with all pairwise distances
between the items in ``data``, and can be calculated by the function
``distancematrix`` in the ``Bio.Cluster`` module:

.. code:: pycon

   >>> from Bio.Cluster import distancematrix
   >>> matrix = distancematrix(data)

where the following arguments are defined:

-  | ``data`` (required)
   | Array containing the data for the items.

-  | ``mask`` (default: ``None``)
   | Array of integers showing which data are missing. If
     ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
     ``None``, then all data are present.

-  | ``weight`` (default: ``None``)
   | The weights to be used when calculating distances. If ``weight`` is
     ``None``, then equal weights are assumed.

-  | ``transpose`` (default: ``0``)
   | Determines if the distances between the rows of ``data`` are to be
     calculated (``transpose`` is ``False``), or between the columns of
     ``data`` (``transpose`` is ``True``).

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

To save memory, the distance matrix is returned as a list of 1D arrays.
The number of columns in each row is equal to the row number. Hence, the
first row has zero elements. For example,

.. code:: pycon

   >>> from numpy import array
   >>> from Bio.Cluster import distancematrix
   >>> data = array([[0, 1,  2,  3],
   ...               [4, 5,  6,  7],
   ...               [8, 9, 10, 11],
   ...               [1, 2,  3,  4]])  # fmt: skip
   ...
   >>> distances = distancematrix(data, dist="e")

yields a distance matrix

.. code:: pycon

   >>> distances
   [array([], dtype=float64), array([ 16.]), array([ 64.,  16.]), array([  1.,   9.,  49.])]

which can be rewritten as

.. code:: python

   [array([], dtype=float64), array([16.0]), array([64.0, 16.0]), array([1.0, 9.0, 49.0])]

This corresponds to the distance matrix:

.. math::

   \left(
   \begin{array}{cccc}
   0  & 16 & 64 &  1  \\
   16 &  0 & 16 &  9  \\
   64 & 16 &  0 & 49  \\
    1 &  9 & 49 &  0
   \end{array}
   \right).

Calculating cluster properties
------------------------------

.. _`sec:clustercentroids`:

Calculating the cluster centroids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The centroid of a cluster can be defined either as the mean or as the
median of each dimension over all cluster items. The function
``clustercentroids`` in ``Bio.Cluster`` can be used to calculate either:

.. code:: pycon

   >>> from Bio.Cluster import clustercentroids
   >>> cdata, cmask = clustercentroids(data)

where the following arguments are defined:

-  | ``data`` (required)
   | Array containing the data for the items.

-  | ``mask`` (default: ``None``)
   | Array of integers showing which data are missing. If
     ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
     ``None``, then all data are present.

-  | ``clusterid`` (default: ``None``)
   | Vector of integers showing to which cluster each item belongs. If
     ``clusterid`` is ``None``, then all items are assumed to belong to
     the same cluster.

-  | ``method`` (default: ``'a'``)
   | Specifies whether the arithmetic mean (``method=='a'``) or the
     median (``method=='m'``) is used to calculate the cluster center.

-  | ``transpose`` (default: ``0``)
   | Determines if the centroids of the rows of ``data`` are to be
     calculated (``transpose`` is ``False``), or the centroids of the
     columns of ``data`` (``transpose`` is ``True``).

This function returns the tuple ``(cdata, cmask)``. The centroid data
are stored in the 2D Numerical Python array ``cdata``, with missing data
indicated by the 2D Numerical Python integer array ``cmask``. The
dimensions of these arrays are
:math:`\left(\textrm{number of clusters}, \textrm{number of columns}\right)`
if ``transpose`` is ``0``, or
:math:`\left(\textrm{number of rows}, \textrm{number of clusters}\right)`
if ``transpose`` is ``1``. Each row (if ``transpose`` is ``0``) or
column (if ``transpose`` is ``1``) contains the averaged data
corresponding to the centroid of each cluster.
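
As an illustration, the following sketch (with made-up data and cluster
assignments) computes the mean centroids of two clusters of rows:

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import clustercentroids
   >>> data = np.array([[0.0, 1.0],
   ...                  [2.0, 3.0],
   ...                  [8.0, 9.0]])
   >>> clusterid = np.array([0, 0, 1])  # rows 0 and 1 form cluster 0; row 2 forms cluster 1
   >>> cdata, cmask = clustercentroids(data, clusterid=clusterid, method="a")
   >>> print(cdata)
   [[1. 2.]
    [8. 9.]]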

Calculating the distance between clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Given a distance function between *items*, we can define the distance
between two *clusters* in several ways. The distance between the
arithmetic means of the two clusters is used in pairwise
centroid-linkage clustering and in :math:`k`-means clustering. In
:math:`k`-medoids clustering, the distance between the medians of the
two clusters is used instead. The shortest pairwise distance between
items of the two clusters is used in pairwise single-linkage clustering,
while the longest pairwise distance is used in pairwise maximum-linkage
clustering. In pairwise average-linkage clustering, the distance between
two clusters is defined as the average over the pairwise distances.

To calculate the distance between two clusters, use

.. code:: pycon

   >>> from Bio.Cluster import clusterdistance
   >>> distance = clusterdistance(data)

where the following arguments are defined:

-  | ``data`` (required)
   | Array containing the data for the items.

-  | ``mask`` (default: ``None``)
   | Array of integers showing which data are missing. If
     ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
     ``None``, then all data are present.

-  | ``weight`` (default: ``None``)
   | The weights to be used when calculating distances. If ``weight`` is
     ``None``, then equal weights are assumed.

-  | ``index1`` (default: ``0``)
   | A list containing the indices of the items belonging to the first
     cluster. A cluster containing only one item :math:`i` can be
     represented either as a list ``[i]``, or as an integer ``i``.

-  | ``index2`` (default: ``0``)
   | A list containing the indices of the items belonging to the second
     cluster. A cluster containing only one item :math:`i` can be
     represented either as a list ``[i]``, or as an integer ``i``.

-  | ``method`` (default: ``'a'``)
   | Specifies how the distance between clusters is defined:

   -  ``'a'``: Distance between the two cluster centroids (arithmetic
      mean);

   -  ``'m'``: Distance between the two cluster centroids (median);

   -  ``'s'``: Shortest pairwise distance between items in the two
      clusters;

   -  ``'x'``: Longest pairwise distance between items in the two
      clusters;

   -  ``'v'``: Average over the pairwise distances between items in the
      two clusters.

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

-  | ``transpose`` (default: ``0``)
   | If ``transpose`` is ``False``, calculate the distance between the
     rows of ``data``. If ``transpose`` is ``True``, calculate the
     distance between the columns of ``data``.
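
As an example, the following sketch (with made-up data) calculates the
distance between a two-item cluster and a single item, using the
arithmetic-mean centroids and the Euclidean distance defined above:

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import clusterdistance
   >>> data = np.array([[0.0, 0.0],
   ...                  [1.0, 1.0],
   ...                  [9.0, 9.0]])
   >>> # distance between the cluster {0, 1} and the singleton cluster {2}
   >>> distance = clusterdistance(data, index1=[0, 1], index2=2, method="a", dist="e")
   >>> print(distance)  # ((9 - 0.5)**2 + (9 - 0.5)**2) / 2
   72.25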

Partitioning algorithms
-----------------------

Partitioning algorithms divide items into :math:`k` clusters such that
the sum of distances over the items to their cluster centers is minimal.
The number of clusters :math:`k` is specified by the user. Three
partitioning algorithms are available in ``Bio.Cluster``:

-  :math:`k`-means clustering

-  :math:`k`-medians clustering

-  :math:`k`-medoids clustering

These algorithms differ in how the cluster center is defined. In
:math:`k`-means clustering, the cluster center is defined as the mean
data vector averaged over all items in the cluster. Instead of the mean,
in :math:`k`-medians clustering the median is calculated for each
dimension in the data vector. Finally, in :math:`k`-medoids clustering
the cluster center is defined as the item which has the smallest sum of
distances to the other items in the cluster. This clustering algorithm
is suitable for cases in which the distance matrix is known but the
original data matrix is not available, for example when clustering
proteins based on their structural similarity.

The expectation-maximization (EM) algorithm is used to find this
partitioning into :math:`k` groups. In the initialization of the EM
algorithm, we randomly assign items to clusters. To ensure that no empty
clusters are produced, we use the binomial distribution to randomly
choose the number of items in each cluster to be one or more. We then
randomly permute the cluster assignments to items such that each item
has an equal probability to be in any cluster. Each cluster is thus
guaranteed to contain at least one item.

We then iterate:

-  Calculate the centroid of each cluster, defined as either the mean,
   the median, or the medoid of the cluster;

-  Calculate the distances of each item to the cluster centers;

-  For each item, determine which cluster centroid is closest;

-  Reassign each item to its closest cluster, or stop the iteration if
   no further item reassignments take place.

To avoid clusters becoming empty during the iteration, in
:math:`k`-means and :math:`k`-medians clustering the algorithm keeps
track of the number of items in each cluster, and prohibits the last
remaining item in a cluster from being reassigned to a different
cluster. For :math:`k`-medoids clustering, such a check is not needed,
as the item that functions as the cluster centroid has a zero distance
to itself, and will therefore never be closer to a different cluster.

As the initial assignment of items to clusters is done randomly, usually
a different clustering solution is found each time the EM algorithm is
executed. To find the optimal clustering solution, the :math:`k`-means
algorithm is repeated many times, each time starting from a different
initial random clustering. The sum of distances of the items to their
cluster center is saved for each run, and the solution with the smallest
value of this sum will be returned as the overall clustering solution.

How often the EM algorithm should be run depends on the number of items
being clustered. As a rule of thumb, we can consider how often the
optimal solution was found; this number is returned by the partitioning
algorithms as implemented in this library. If the optimal solution was
found many times, it is unlikely that better solutions exist than the
one that was found. However, if the optimal solution was found only
once, there may well be other solutions with a smaller within-cluster
sum of distances. If the number of items is large (more than several
hundred), it may be difficult to find the globally optimal solution.

The EM algorithm terminates when no further reassignments take place. We
noticed that for some sets of initial cluster assignments, the EM
algorithm fails to converge due to the same clustering solution
reappearing periodically after a small number of iteration steps. We
therefore check for the occurrence of such periodic solutions during the
iteration. After a given number of iteration steps, the current
clustering result is saved as a reference. By comparing the clustering
result after each subsequent iteration step to the reference state, we
can determine if a previously encountered clustering result is found. In
such a case, the iteration is halted. If after a given number of
iterations the reference state has not yet been encountered, the current
clustering solution is saved to be used as the new reference state.
Initially, ten iteration steps are executed before resaving the
reference state. This number of iteration steps is doubled each time, to
ensure that periodic behavior with longer periods can also be detected.

:math:`k`-means and :math:`k`-medians
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :math:`k`-means and :math:`k`-medians algorithms are implemented as
the function ``kcluster`` in ``Bio.Cluster``:

.. code:: pycon

   >>> from Bio.Cluster import kcluster
   >>> clusterid, error, nfound = kcluster(data)

where the following arguments are defined:

-  | ``data`` (required)
   | Array containing the data for the items.

-  | ``nclusters`` (default: ``2``)
   | The number of clusters :math:`k`.

-  | ``mask`` (default: ``None``)
   | Array of integers showing which data are missing. If
     ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
     ``None``, then all data are present.

-  | ``weight`` (default: ``None``)
   | The weights to be used when calculating distances. If ``weight`` is
     ``None``, then equal weights are assumed.

-  | ``transpose`` (default: ``0``)
   | Determines if rows (``transpose`` is ``0``) or columns
     (``transpose`` is ``1``) are to be clustered.

-  | ``npass`` (default: ``1``)
   | The number of times the :math:`k`-means/-medians clustering
     algorithm is performed, each time with a different (random) initial
     condition. If ``initialid`` is given, the value of ``npass`` is
     ignored and the clustering algorithm is run only once, as it
     behaves deterministically in that case.

-  | ``method`` (default: ``'a'``)
   | Describes how the center of a cluster is found:

   -  ``method=='a'``: arithmetic mean (:math:`k`-means clustering);

   -  ``method=='m'``: median (:math:`k`-medians clustering).

   For other values of ``method``, the arithmetic mean is used.

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`). Whereas all eight distance
     measures are accepted by ``kcluster``, from a theoretical viewpoint
     it is best to use the Euclidean distance for the :math:`k`-means
     algorithm, and the city-block distance for :math:`k`-medians.

-  | ``initialid`` (default: ``None``)
   | Specifies the initial clustering to be used for the EM algorithm.
     If ``initialid`` is ``None``, then a different random initial
     clustering is used for each of the ``npass`` runs of the EM
     algorithm. If ``initialid`` is not ``None``, then it should be
     equal to a 1D array containing the cluster number (between ``0``
     and ``nclusters-1``) for each item. Each cluster should contain at
     least one item. With the initial clustering specified, the EM
     algorithm is deterministic.

This function returns a tuple ``(clusterid, error, nfound)``, where
``clusterid`` is an integer array containing the number of the cluster
to which each row or column was assigned, ``error`` is the
within-cluster sum of distances for the optimal clustering solution, and
``nfound`` is the number of times this optimal solution was found.
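
A minimal sketch (with made-up data) clustering four rows into two
groups, using ten EM passes:

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import kcluster
   >>> data = np.array([[0.0, 0.1],
   ...                  [0.2, 0.0],
   ...                  [9.0, 9.1],
   ...                  [9.2, 9.0]])
   >>> clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)
   >>> len(clusterid)  # one cluster number per row
   4

With such well-separated data, rows 0 and 1 will typically end up in one
cluster and rows 2 and 3 in the other, and ``nfound`` will be close to
``npass``.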

:math:`k`-medoids clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``kmedoids`` routine performs :math:`k`-medoids clustering on a
given set of items, using the distance matrix and the number of clusters
passed by the user:

.. code:: pycon

   >>> from Bio.Cluster import kmedoids
   >>> clusterid, error, nfound = kmedoids(distance)

where the following arguments are defined:

-  | ``distance`` (required)
   | The matrix containing the distances between the items; this matrix
     can be specified in three ways:

   -  as a 2D Numerical Python array (in which only the left-lower part
      of the array will be accessed):

      .. code:: python

         distance = array([[0.0, 1.1, 2.3], [1.1, 0.0, 4.5], [2.3, 4.5, 0.0]])

   -  as a 1D Numerical Python array containing consecutively the
      distances in the left-lower part of the distance matrix:

      .. code:: python

         distance = array([1.1, 2.3, 4.5])

   -  as a list containing the rows of the left-lower part of the
      distance matrix:

      .. code:: python

         distance = [array([]), array([1.1]), array([2.3, 4.5])]

   These three expressions correspond to the same distance matrix.

-  | ``nclusters`` (default: ``2``)
   | The number of clusters :math:`k`.

-  | ``npass`` (default: ``1``)
   | The number of times the :math:`k`-medoids clustering algorithm is
     performed, each time with a different (random) initial condition.
     If ``initialid`` is given, the value of ``npass`` is ignored, as
     the clustering algorithm behaves deterministically in that case.

-  | ``initialid`` (default: ``None``)
   | Specifies the initial clustering to be used for the EM algorithm.
     If ``initialid`` is ``None``, then a different random initial
     clustering is used for each of the ``npass`` runs of the EM
     algorithm. If ``initialid`` is not ``None``, then it should be
     equal to a 1D array containing the cluster number (between ``0``
     and ``nclusters-1``) for each item. Each cluster should contain at
     least one item. With the initial clustering specified, the EM
     algorithm is deterministic.

This function returns a tuple ``(clusterid, error, nfound)``, where
``clusterid`` is an array containing the number of the cluster to which
each item was assigned, ``error`` is the within-cluster sum of distances
for the optimal :math:`k`-medoids clustering solution, and ``nfound`` is
the number of times the optimal solution was found. Note that the
cluster number in ``clusterid`` is defined as the item number of the
item representing the cluster centroid.
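
A sketch (with a small hand-made distance matrix for four items)
clustering them into two medoid-centered groups:

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import kmedoids
   >>> distance = [np.array([]),
   ...             np.array([1.0]),
   ...             np.array([8.0, 7.5]),
   ...             np.array([8.5, 8.0, 0.5])]  # lower-triangular rows
   >>> clusterid, error, nfound = kmedoids(distance, nclusters=2, npass=10)
   >>> len(clusterid)  # one cluster number per item
   4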

Hierarchical clustering
-----------------------

Hierarchical clustering methods are inherently different from the
:math:`k`-means clustering method. In hierarchical clustering, the
similarity in the expression profile between genes or experimental
conditions is represented in the form of a tree structure. This tree
structure can be shown graphically by programs such as Treeview and Java
Treeview, which has contributed to the popularity of hierarchical
clustering in the analysis of gene expression data.

The first step in hierarchical clustering is to calculate the distance
matrix, specifying all the distances between the items to be clustered.
Next, we create a node by joining the two closest items. Subsequent
nodes are created by pairwise joining of items or nodes based on the
distance between them, until all items belong to the same node. A tree
structure can then be created by retracing which items and nodes were
merged. Unlike the EM algorithm, which is used in :math:`k`-means
clustering, the complete process of hierarchical clustering is
deterministic.

Several flavors of hierarchical clustering exist, which differ in how
the distance between subnodes is defined in terms of their members. In
``Bio.Cluster``, pairwise single, maximum, average, and centroid linkage
are available.

-  In pairwise single-linkage clustering, the distance between two nodes
   is defined as the shortest distance among the pairwise distances
   between the members of the two nodes.

-  In pairwise maximum-linkage clustering, alternatively known as
   pairwise complete-linkage clustering, the distance between two nodes
   is defined as the longest distance among the pairwise distances
   between the members of the two nodes.

-  In pairwise average-linkage clustering, the distance between two
   nodes is defined as the average over all pairwise distances between
   the items of the two nodes.

-  In pairwise centroid-linkage clustering, the distance between two
   nodes is defined as the distance between their centroids. The
   centroids are calculated by taking the mean over all the items in a
   cluster. As the distance from each newly formed node to existing
   nodes and items needs to be calculated at each step, the computing
   time of pairwise centroid-linkage clustering may be significantly
   longer than for the other hierarchical clustering methods. Another
   peculiarity is that (for a distance measure based on the Pearson
   correlation), the distances do not necessarily increase when going up
   in the clustering tree, and may even decrease. This is caused by an
   inconsistency between the centroid calculation and the distance
   calculation when using the Pearson correlation: Whereas the Pearson
   correlation effectively normalizes the data for the distance
   calculation, no such normalization occurs for the centroid
   calculation.

For pairwise single-, complete-, and average-linkage clustering, the
distance between two nodes can be found directly from the distances
between the individual items. Therefore, the clustering algorithm does
not need access to the original gene expression data, once the distance
matrix is known. For pairwise centroid-linkage clustering, however, the
centroids of newly formed subnodes can only be calculated from the
original data and not from the distance matrix.

The implementation of pairwise single-linkage hierarchical clustering is
based on the SLINK algorithm [Sibson1973]_, which is
much faster and more memory-efficient than a straightforward
implementation of pairwise single-linkage clustering. The clustering
result produced by this algorithm is identical to the clustering
solution found by the conventional single-linkage algorithm. The
single-linkage hierarchical clustering algorithm implemented in this
library can be used to cluster large gene expression data sets, for
which conventional hierarchical clustering algorithms fail due to
excessive memory requirements and running time.

Representing a hierarchical clustering solution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The result of hierarchical clustering consists of a tree of nodes, in
which each node joins two items or subnodes. Usually, we are not only
interested in which items or subnodes are joined at each node, but also
in their similarity (or distance) as they are joined. To store one node
in the hierarchical clustering tree, we make use of the class ``Node``,
which is defined in ``Bio.Cluster``. An instance of ``Node`` has three
attributes:

-  ``left``

-  ``right``

-  ``distance``

Here, ``left`` and ``right`` are integers referring to the two items or
subnodes that are joined at this node, and ``distance`` is the distance
between them. The items being clustered are numbered from 0 to
:math:`\left(\textrm{number of items} - 1\right)`, while clusters are
numbered from -1 to :math:`-\left(\textrm{number of items}-1\right)`.
Note that the number of nodes is one less than the number of items.

To create a new ``Node`` object, we need to specify ``left`` and
``right``; ``distance`` is optional.

.. doctest . lib:numpy

.. code:: pycon

   >>> from Bio.Cluster import Node
   >>> Node(2, 3)
   (2, 3): 0
   >>> Node(2, 3, 0.91)
   (2, 3): 0.91

The attributes ``left``, ``right``, and ``distance`` of an existing
``Node`` object can be modified directly:

.. cont-doctest

.. code:: pycon

   >>> node = Node(4, 5)
   >>> node.left = 6
   >>> node.right = 2
   >>> node.distance = 0.73
   >>> node
   (6, 2): 0.73

An error is raised if ``left`` and ``right`` are not integers, or if
``distance`` cannot be converted to a floating-point value.

The Python class ``Tree`` represents a full hierarchical clustering
solution. A ``Tree`` object can be created from a list of ``Node``
objects:

.. doctest . lib:numpy

.. code:: pycon

   >>> from Bio.Cluster import Node, Tree
   >>> nodes = [Node(1, 2, 0.2), Node(0, 3, 0.5), Node(-2, 4, 0.6), Node(-1, -3, 0.9)]
   >>> tree = Tree(nodes)
   >>> print(tree)
   (1, 2): 0.2
   (0, 3): 0.5
   (-2, 4): 0.6
   (-1, -3): 0.9

The ``Tree`` initializer checks if the list of nodes is a valid
hierarchical clustering result:

.. cont-doctest

.. code:: pycon

   >>> nodes = [Node(1, 2, 0.2), Node(0, 2, 0.5)]
   >>> Tree(nodes)
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   ValueError: Inconsistent tree

Individual nodes in a ``Tree`` object can be accessed using square
brackets:

.. cont-doctest

.. code:: pycon

   >>> nodes = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
   >>> tree = Tree(nodes)
   >>> tree[0]
   (1, 2): 0.2
   >>> tree[1]
   (0, -1): 0.5
   >>> tree[-1]
   (0, -1): 0.5

As a ``Tree`` object is immutable, we cannot change individual nodes in
a ``Tree`` object. However, we can convert the tree to a list of nodes,
modify this list, and create a new tree from this list:

.. cont-doctest

.. code:: pycon

   >>> tree = Tree([Node(1, 2, 0.1), Node(0, -1, 0.5), Node(-2, 3, 0.9)])
   >>> print(tree)
   (1, 2): 0.1
   (0, -1): 0.5
   (-2, 3): 0.9
   >>> nodes = tree[:]
   >>> nodes[0] = Node(0, 1, 0.2)
   >>> nodes[1].left = 2
   >>> tree = Tree(nodes)
   >>> print(tree)
   (0, 1): 0.2
   (2, -1): 0.5
   (-2, 3): 0.9

This guarantees that any ``Tree`` object is always well-formed.

To display a hierarchical clustering solution with visualization
programs such as Java Treeview, it is better to scale all node distances
such that they are between zero and one. This can be accomplished by
calling the ``scale`` method on an existing ``Tree`` object:

.. code:: pycon

   >>> tree.scale()

This method takes no arguments, and returns ``None``.

Before drawing the tree, you may also want to reorder the tree nodes. A
hierarchical clustering solution of :math:`n` items can be drawn as
:math:`2^{n-1}` different but equivalent dendrograms by switching the
left and right subnode at each node. The ``tree.sort(order)`` method
visits each node in the hierarchical clustering tree and verifies if the
average order value of the left subnode is less than or equal to the
average order value of the right subnode. If not, the left and right
subnodes are exchanged. Here, the order values of the items are given by
the user. In the resulting dendrogram, items in the left-to-right order
will tend to have increasing order values. The method will return the
indices of the elements in the left-to-right order after sorting:

.. code:: pycon

   >>> indices = tree.sort(order)

such that item ``indices[i]`` will occur at position :math:`i` in the
dendrogram.

After hierarchical clustering, the items can be grouped into :math:`k`
clusters based on the tree structure stored in the ``Tree`` object by
cutting the tree:

.. code:: pycon

   >>> clusterid = tree.cut(nclusters=1)

where ``nclusters`` (defaulting to ``1``) is the desired number of
clusters :math:`k`. This method ignores the top :math:`k-1` linking
events in the tree structure, resulting in :math:`k` separated clusters
of items. The number of clusters :math:`k` should be positive, and less
than or equal to the number of items. This method returns an array
``clusterid`` containing the number of the cluster to which each item is
assigned. Clusters are numbered :math:`0` to :math:`k-1` in their
left-to-right order in the dendrogram.
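
For example, cutting the small tree constructed earlier into two
clusters (a sketch; the cluster numbers follow the left-to-right order
of the dendrogram):

.. code:: pycon

   >>> from Bio.Cluster import Node, Tree
   >>> tree = Tree([Node(1, 2, 0.1), Node(0, -1, 0.5), Node(-2, 3, 0.9)])
   >>> clusterid = tree.cut(nclusters=2)
   >>> len(clusterid)  # one cluster number for each of the four items
   4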

Performing hierarchical clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To perform hierarchical clustering, use the ``treecluster`` function in
``Bio.Cluster``.

.. code:: pycon

   >>> from Bio.Cluster import treecluster
   >>> tree = treecluster(data)

where the following arguments are defined:

-  | ``data``
   | Array containing the data for the items.

-  | ``mask`` (default: ``None``)
   | Array of integers showing which data are missing. If
     ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
     ``None``, then all data are present.

-  | ``weight`` (default: ``None``)
   | The weights to be used when calculating distances. If ``weight`` is
     ``None``, then equal weights are assumed.

-  | ``transpose`` (default: ``0``)
   | Determines if rows (``transpose`` is ``False``) or columns
     (``transpose`` is ``True``) are to be clustered.

-  | ``method`` (default: ``'m'``)
   | Defines the linkage method to be used:

   -  ``method=='s'``: pairwise single-linkage clustering

   -  ``method=='m'``: pairwise maximum- (or complete-) linkage
      clustering

   -  ``method=='c'``: pairwise centroid-linkage clustering

   -  ``method=='a'``: pairwise average-linkage clustering

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

To apply hierarchical clustering on a precalculated distance matrix,
specify the ``distancematrix`` argument when calling ``treecluster``
function instead of the ``data`` argument:

.. code:: pycon

   >>> from Bio.Cluster import treecluster
   >>> tree = treecluster(distancematrix=distance)

In this case, the following arguments are defined:

-  | ``distancematrix``
   | The distance matrix, which can be specified in three ways:

   -  as a 2D Numerical Python array (in which only the left-lower part
      of the array will be accessed):

      .. code:: python

         distance = array([[0.0, 1.1, 2.3], [1.1, 0.0, 4.5], [2.3, 4.5, 0.0]])

   -  as a 1D Numerical Python array containing consecutively the
      distances in the left-lower part of the distance matrix:

      .. code:: python

         distance = array([1.1, 2.3, 4.5])

   -  as a list containing the rows of the left-lower part of the
      distance matrix:

      .. code:: python

         distance = [array([]), array([1.1]), array([2.3, 4.5])]

   These three expressions correspond to the same distance matrix. As
   ``treecluster`` may shuffle the values in the distance matrix as part
   of the clustering algorithm, be sure to save this array in a
   different variable before calling ``treecluster`` if you need it
   later.

-  | ``method``
   | The linkage method to be used:

   -  ``method=='s'``: pairwise single-linkage clustering

   -  ``method=='m'``: pairwise maximum- (or complete-) linkage
      clustering

   -  ``method=='a'``: pairwise average-linkage clustering

   While pairwise single-, maximum-, and average-linkage clustering can
   be calculated from the distance matrix alone, pairwise
   centroid-linkage cannot.

When calling ``treecluster``, either ``data`` or ``distancematrix``
should be ``None``.

This function returns a ``Tree`` object. This object contains
:math:`\left(\textrm{number of items} - 1\right)` nodes, where the
number of items is the number of rows if rows were clustered, or the
number of columns if columns were clustered. Each node describes a
pairwise linking event, where the node attributes ``left`` and ``right``
each contain the number of one item or subnode, and ``distance`` the
distance between them. Items are numbered from 0 to
:math:`\left(\textrm{number of items} - 1\right)`, while clusters are
numbered -1 to :math:`-\left(\textrm{number of items}-1\right)`.
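
Putting this together, a minimal sketch (with made-up data) that
clusters four rows hierarchically, either from the data directly or
from a precalculated distance matrix, and then extracts two clusters:

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import distancematrix, treecluster
   >>> data = np.array([[0.0, 0.1],
   ...                  [0.2, 0.0],
   ...                  [9.0, 9.1],
   ...                  [9.2, 9.0]])
   >>> tree = treecluster(data, method="a", dist="e")
   >>> clusterid = tree.cut(nclusters=2)
   >>> distances = distancematrix(data, dist="e")  # may be modified in place by treecluster
   >>> tree2 = treecluster(distancematrix=distances, method="s")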

Self-Organizing Maps
--------------------

Self-Organizing Maps (SOMs) were invented by Kohonen to describe neural
networks (see for instance Kohonen, 1997
[Kohonen1997]_). Tamayo (1999) first applied
Self-Organizing Maps to gene expression data
[Tamayo1999]_.

SOMs organize items into clusters that are situated in some topology.
Usually a rectangular topology is chosen. The clusters generated by SOMs
are such that neighboring clusters in the topology are more similar to
each other than clusters far from each other in the topology.

The first step to calculate a SOM is to randomly assign a data vector to
each cluster in the topology. If rows are being clustered, then the
number of elements in each data vector is equal to the number of
columns.

An SOM is then generated by taking rows one at a time, and finding which
cluster in the topology has the closest data vector. The data vector of
that cluster, as well as those of the neighboring clusters, are adjusted
using the data vector of the row under consideration. The adjustment is
given by

.. math:: \Delta \underline{x}_{\textrm{cell}} = \tau \cdot \left(\underline{x}_{\textrm{row}} - \underline{x}_{\textrm{cell}} \right).

The parameter :math:`\tau` decreases at each iteration step. We have
used a simple linear function of the iteration step:

.. math:: \tau = \tau_{\textrm{init}} \cdot \left(1 - {i \over n}\right),

where :math:`\tau_{\textrm{init}}` is the initial value of :math:`\tau` as
specified by the user, :math:`i` is the number of the current iteration
step, and :math:`n` is the total number of iteration steps to be
performed. While changes are made rapidly in the beginning of the
iteration, at the end of the iteration only small changes are made.

All clusters within a radius :math:`R` are adjusted using the gene under
consideration. This radius decreases as the calculation progresses as

.. math:: R = R_{\textrm{max}} \cdot \left(1 - {i \over n}\right),

in which the maximum radius is defined as

.. math:: R_{\textrm{max}} = \sqrt{N_x^2 + N_y^2},

where :math:`\left(N_x, N_y\right)` are the dimensions of the rectangle
defining the topology.

The function ``somcluster`` implements the complete algorithm to
calculate a Self-Organizing Map on a rectangular grid. First it
initializes the random number generator. The node data are then
initialized using the random number generator. The order in which genes
or samples are used to modify the SOM is also randomized. The total
number of iterations in the SOM algorithm is specified by the user.

To run ``somcluster``, use

.. code:: pycon

   >>> from Bio.Cluster import somcluster
   >>> clusterid, celldata = somcluster(data)

where the following arguments are defined:

-  | ``data`` (required)
   | Array containing the data for the items.

-  | ``mask`` (default: ``None``)
   | Array of integers showing which data are missing. If
     ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
     ``None``, then all data are present.

-  | ``weight`` (default: ``None``)
   | The weights to be used when calculating distances. If
     ``weight`` is ``None``, then equal weights are assumed.

-  | ``transpose`` (default: ``0``)
   | Determines if rows (``transpose`` is ``0``) or columns
     (``transpose`` is ``1``) are to be clustered.

-  | ``nxgrid, nygrid`` (default: ``2, 1``)
   | The number of cells horizontally and vertically in the rectangular
     grid on which the Self-Organizing Map is calculated.

-  | ``inittau`` (default: ``0.02``)
   | The initial value for the parameter :math:`\tau` that is used in
     the SOM algorithm. The default value for ``inittau`` is 0.02, which
     was used in Michael Eisen’s Cluster/TreeView program.

-  | ``niter`` (default: ``1``)
   | The number of iterations to be performed.

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

This function returns the tuple ``(clusterid, celldata)``:

-  | ``clusterid``:
   | An array with two columns, where the number of rows is equal to the
     number of items that were clustered. Each row contains the
     :math:`x` and :math:`y` coordinates of the cell in the rectangular
     SOM grid to which the item was assigned.

-  | ``celldata``:
   | An array with dimensions
     :math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of columns}\right)`
     if rows are being clustered, or
     :math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of rows}\right)`
     if columns are being clustered. Each element ``[ix][iy]`` of this
     array is a 1D vector containing the gene expression data for the
     centroid of the cluster in the grid cell with coordinates
     ``[ix][iy]``.
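
A minimal sketch (with made-up data) calculating a 2 x 1 Self-Organizing
Map with 100 iterations:

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import somcluster
   >>> data = np.array([[0.0, 0.1],
   ...                  [0.2, 0.0],
   ...                  [9.0, 9.1],
   ...                  [9.2, 9.0]])
   >>> clusterid, celldata = somcluster(data, nxgrid=2, nygrid=1, niter=100)
   >>> clusterid.shape  # (number of items, 2): grid coordinates of each item
   (4, 2)
   >>> celldata.shape  # (nxgrid, nygrid, number of columns)
   (2, 1, 2)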

Principal Component Analysis
----------------------------

Principal Component Analysis (PCA) is a widely used technique for
analyzing multivariate data. A practical example of applying Principal
Component Analysis to gene expression data is presented by Yeung and
Ruzzo (2001) [Yeung2001]_.

In essence, PCA is a coordinate transformation in which each row in the
data matrix is written as a linear sum over basis vectors called
principal components, which are ordered and chosen such that each
maximally explains the remaining variance in the data vectors. For
example, an :math:`n \times 3` data matrix can be represented as an
ellipsoidal cloud of :math:`n` points in three dimensional space. The
first principal component is the longest axis of the ellipsoid, the
second principal component the second longest axis of the ellipsoid, and
the third principal component is the shortest axis. Each row in the data
matrix can be reconstructed as a suitable linear combination of the
principal components. However, in order to reduce the dimensionality of
the data, usually only the most important principal components are
retained. The remaining variance present in the data is then regarded as
unexplained variance.

The principal components can be found by calculating the eigenvectors of
the covariance matrix of the data. The corresponding eigenvalues
determine how much of the variance present in the data is explained by
each principal component.

Before applying principal component analysis, typically the mean is
subtracted from each column in the data matrix. In the example above,
this effectively centers the ellipsoidal cloud around its centroid in 3D
space, with the principal components describing the variation of points
in the ellipsoidal cloud with respect to their centroid.

The function ``pca`` below first uses the singular value decomposition
to calculate the eigenvalues and eigenvectors of the data matrix. The
singular value decomposition is implemented as a translation in C of the
Algol procedure ``svd`` [Golub1971]_, which uses
Householder bidiagonalization and a variant of the QR algorithm. The
principal components, the coordinates of each data vector along the
principal components, and the eigenvalues corresponding to the principal
components are then evaluated and returned in decreasing order of the
magnitude of the eigenvalue. If data centering is desired, the mean
should be subtracted from each column in the data matrix before calling
the ``pca`` routine.

To apply Principal Component Analysis to a rectangular matrix ``data``,
use

.. code:: pycon

   >>> from Bio.Cluster import pca
   >>> columnmean, coordinates, components, eigenvalues = pca(data)

This function returns a tuple
``columnmean, coordinates, components, eigenvalues``:

-  | ``columnmean``
   | Array containing the mean over each column in ``data``.

-  | ``coordinates``
   | The coordinates of each row in ``data`` with respect to the
     principal components.

-  | ``components``
   | The principal components.

-  | ``eigenvalues``
   | The eigenvalues corresponding to each of the principal components.

The original matrix ``data`` can be recreated by calculating
``columnmean + numpy.dot(coordinates, components)``.
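
This relation can be verified numerically (a minimal sketch with a
small made-up data matrix; NumPy is used for the check):

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import pca
   >>> data = np.array([[ 3.1,  4.9,  0.8],
   ...                  [ 6.2, 10.1,  2.1],
   ...                  [ 9.0, 15.2,  2.8],
   ...                  [11.9, 19.8,  4.1]])
   >>> columnmean, coordinates, components, eigenvalues = pca(data)
   >>> np.allclose(data, columnmean + np.dot(coordinates, components))
   True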

Handling Cluster/TreeView-type files
------------------------------------

Cluster/TreeView are GUI-based programs for clustering gene expression
data. They were originally written by `Michael
Eisen <http://rana.lbl.gov>`__ while at Stanford University
[Eisen1998]_. ``Bio.Cluster`` contains functions for
reading and writing data files that correspond to the format specified
for Cluster/TreeView. In particular, by saving a clustering result in
that format, TreeView can be used to visualize the clustering results.
We recommend using Alok Saldanha’s `Java TreeView program
<http://jtreeview.sourceforge.net/>`__ [Saldanha2004]_, which can
display hierarchical as well as :math:`k`-means clustering results.

An object of the class ``Record`` contains all information stored in a
Cluster/TreeView-type data file. To store the information contained in
the data file in a ``Record`` object, we first open the file and then
read it:

.. code:: pycon

   >>> from Bio import Cluster
   >>> with open("mydatafile.txt") as handle:
   ...     record = Cluster.read(handle)
   ...

This two-step process gives you some flexibility in the source of the
data. For example, you can use

.. code:: pycon

   >>> import gzip  # Python standard library
   >>> handle = gzip.open("mydatafile.txt.gz", "rt")

to open a gzipped file, or

.. code:: pycon

   >>> from urllib.request import urlopen
   >>> from io import TextIOWrapper
   >>> url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Cluster/cyano.txt"
   >>> handle = TextIOWrapper(urlopen(url))

to open a file stored on the Internet before calling ``read``.

The ``read`` command reads the tab-delimited text file
``mydatafile.txt`` containing gene expression data in the format
specified for Michael Eisen’s Cluster/TreeView program. In this file
format, rows represent genes and columns represent samples or
observations. For a simple time course, a minimal input file would look
like this:

.. container:: center

   ======= ========= ========== ====== ======= =======
   YORF    0 minutes 30 minutes 1 hour 2 hours 4 hours
   YAL001C 1         1.3        2.4    5.8     2.4
   YAL002W 0.9       0.8        0.7    0.5     0.2
   YAL003W 0.8       2.1        4.2    10.1    10.1
   YAL005C 1.1       1.3        0.8            0.4
   YAL010C 1.2       1          1.1    4.5     8.3
   ======= ========= ========== ====== ======= =======

Each row (gene) has an identifier that always goes in the first column.
In this example, we are using yeast open reading frame codes. Each
column (sample) has a label in the first row. In this example, the
labels describe the time at which a sample was taken. The first column
of the first row contains a special field that tells the program what
kind of objects are in each row. In this case, YORF stands for yeast
open reading frame. This field can be any alphanumeric value. The
remaining cells in the table contain data for the appropriate gene and
sample. For example, the value 5.8 in the YAL001C row under the 2 hours
column means that the observed value for gene YAL001C at 2 hours was
5.8. Missing values are acceptable and are designated by empty cells
(e.g. YAL005C at 2 hours).

The input file may contain additional information. A maximal input file
would look like this:

.. container:: center

   ======= ========================== ======= ====== === === === ==== ====
   YORF    NAME                       GWEIGHT GORDER 0   30  1   2    4
   EWEIGHT                                           1   1   1   1    0
   EORDER                                            5   3   2   1    1
   YAL001C TFIIIC 138 KD SUBUNIT      1       1      1   1.3 2.4 5.8  2.4
   YAL002W UNKNOWN                    0.4     3      0.9 0.8 0.7 0.5  0.2
   YAL003W ELONGATION FACTOR EF1-BETA 0.4     2      0.8 2.1 4.2 10.1 10.1
   YAL005C CYTOSOLIC HSP70            0.4     5      1.1 1.3 0.8      0.4
   ======= ========================== ======= ====== === === === ==== ====

The added columns NAME, GWEIGHT, and GORDER and rows EWEIGHT and EORDER
are optional. The NAME column allows you to specify a label for each
gene that is distinct from the ID in column 1.

A ``Record`` object has the following attributes:

-  | ``data``
   | The data array containing the gene expression data. Genes are
     stored row-wise, while samples are stored column-wise.

-  | ``mask``
   | This array shows which elements in the ``data`` array, if any, are
     missing. If ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If
     no data were found to be missing, ``mask`` is set to ``None``.

-  | ``geneid``
   | This is a list containing a unique identifier for each gene (e.g.,
     ORF numbers).

-  | ``genename``
   | This is a list containing a description for each gene (e.g., the
     gene name). If not present in the data file, ``genename`` is set
     to ``None``.

-  | ``gweight``
   | The weights that are to be used to calculate the distance in
     expression profile between genes. If not present in the data file,
     ``gweight`` is set to ``None``.

-  | ``gorder``
   | The preferred order in which genes should be stored in an output
     file. If not present in the data file, ``gorder`` is set to
     ``None``.

-  | ``expid``
   | This is a list containing a description of each sample, e.g.
     experimental condition.

-  | ``eweight``
   | The weights that are to be used to calculate the distance in
     expression profile between samples. If not present in the data
     file, ``eweight`` is set to ``None``.

-  | ``eorder``
   | The preferred order in which samples should be stored in an output
     file. If not present in the data file, ``eorder`` is set to
     ``None``.

-  | ``uniqid``
   | The string that was used instead of UNIQID in the data file.

After loading a ``Record`` object, each of these attributes can be
accessed and modified directly. For example, the data can be
log-transformed by taking the logarithm of ``record.data``.
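
For instance, a base-2 logarithmic transform can be applied in place (a
minimal sketch, assuming the file has been read into ``record`` as
above and that all data values are positive and none are missing):

.. code:: pycon

   >>> import numpy as np
   >>> record.data = np.log2(record.data)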

Calculating the distance matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To calculate the distance matrix between the items stored in the record,
use

.. code:: pycon

   >>> matrix = record.distancematrix()

where the following arguments are defined:

-  | ``transpose`` (default: ``False``)
   | Determines if the distances between the rows of ``data`` are to be
     calculated (``transpose`` is ``False``), or between the columns of
     ``data`` (``transpose`` is ``True``).

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

This function returns the distance matrix as a list of rows, where the
number of columns of each row is equal to the row number (see section
:ref:`sec:distancematrix`).
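
Because only the lower-left part of the (symmetric) distance matrix is
stored, the first index must be the larger of the two when looking up a
distance (a minimal sketch, assuming ``record`` was read as above):

.. code:: pycon

   >>> matrix = record.distancematrix()
   >>> d = matrix[5][2]  # distance between gene 5 and gene 2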

Calculating the cluster centroids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To calculate the centroids of clusters of items stored in the record,
use

.. code:: pycon

   >>> cdata, cmask = record.clustercentroids()

where the following arguments are defined:

-  | ``clusterid`` (default: ``None``)
   | Vector of integers showing to which cluster each item belongs. If
     ``clusterid`` is not given, then all items are assumed to belong to
     the same cluster.

-  | ``method`` (default: ``'a'``)
   | Specifies whether the arithmetic mean (``method=='a'``) or the
     median (``method=='m'``) is used to calculate the cluster center.

-  | ``transpose`` (default: ``False``)
   | Determines if the centroids of the rows of ``data`` are to be
     calculated (``transpose`` is ``False``), or the centroids of the
     columns of ``data`` (``transpose`` is ``True``).

This function returns the tuple ``cdata, cmask``; see section
:ref:`sec:clustercentroids` for a description.
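
For example, the centroids of a :math:`k`-means partition of the genes
can be obtained by passing the cluster assignments returned by
``kcluster`` (a minimal sketch; the number of clusters and passes are
arbitrary choices):

.. code:: pycon

   >>> clusterid, error, nfound = record.kcluster(nclusters=5, npass=100)
   >>> cdata, cmask = record.clustercentroids(clusterid=clusterid)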

.. _calculating-the-distance-between-clusters-1:

Calculating the distance between clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To calculate the distance between clusters of items stored in the
record, use

.. code:: pycon

   >>> distance = record.clusterdistance()

where the following arguments are defined:

-  | ``index1`` (default: ``0``)
   | A list containing the indices of the items belonging to the first
     cluster. A cluster containing only one item :math:`i` can be
     represented either as a list ``[i]``, or as an integer ``i``.

-  | ``index2`` (default: ``0``)
   | A list containing the indices of the items belonging to the second
     cluster. A cluster containing only one item :math:`i` can be
     represented either as a list ``[i]``, or as an integer ``i``.

-  | ``method`` (default: ``'a'``)
   | Specifies how the distance between clusters is defined:

   -  ``'a'``: Distance between the two cluster centroids (arithmetic
      mean);

   -  ``'m'``: Distance between the two cluster centroids (median);

   -  ``'s'``: Shortest pairwise distance between items in the two
      clusters;

   -  ``'x'``: Longest pairwise distance between items in the two
      clusters;

   -  ``'v'``: Average over the pairwise distances between items in the
      two clusters.

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

-  | ``transpose`` (default: ``False``)
   | If ``transpose`` is ``False``, calculate the distance between the
     rows of ``data``. If ``transpose`` is ``True``, calculate the
     distance between the columns of ``data``.
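
For example, the shortest pairwise distance between two groups of genes
can be calculated as follows (a minimal sketch; the item indices are
arbitrary choices):

.. code:: pycon

   >>> distance = record.clusterdistance(index1=[0, 1, 5], index2=[4, 7], method="s")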

.. _performing-hierarchical-clustering-1:

Performing hierarchical clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To perform hierarchical clustering on the items stored in the record,
use

.. code:: pycon

   >>> tree = record.treecluster()

where the following arguments are defined:

-  | ``transpose`` (default: ``False``)
   | Determines if rows (genes; ``transpose`` is ``False``) or columns
     (samples; ``transpose`` is ``True``) are to be clustered.

-  | ``method`` (default: ``'m'``)
   | Defines the linkage method to be used:

   -  ``method=='s'``: pairwise single-linkage clustering

   -  ``method=='m'``: pairwise maximum- (or complete-) linkage
      clustering

   -  ``method=='c'``: pairwise centroid-linkage clustering

   -  ``method=='a'``: pairwise average-linkage clustering

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

This function returns a ``Tree`` object. This object contains
:math:`\left(\textrm{number of items} - 1\right)` nodes, where the
number of items is the number of rows if rows were clustered, or the
number of columns if columns were clustered. Each node describes a
pairwise linking event, where the node attributes ``left`` and ``right``
each contain the number of one item or subnode, and ``distance`` the
distance between them. Items are numbered from 0 to
:math:`\left(\textrm{number of items} - 1\right)`, while clusters are
numbered -1 to :math:`-\left(\textrm{number of items}-1\right)`.
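
The individual nodes can be inspected by indexing the ``Tree`` object
(a minimal sketch, assuming ``record`` was read as above; each line
printed shows the two joined items or subnodes and the distance between
them):

.. code:: pycon

   >>> tree = record.treecluster(method="s")
   >>> for i in range(len(tree)):
   ...     node = tree[i]
   ...     print(node.left, node.right, node.distance)
   ...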

Performing :math:`k`-means or :math:`k`-medians clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To perform :math:`k`-means or :math:`k`-medians clustering on the items
stored in the record, use

.. code:: pycon

   >>> clusterid, error, nfound = record.kcluster()

where the following arguments are defined:

-  | ``nclusters`` (default: ``2``)
   | The number of clusters :math:`k`.

-  | ``transpose`` (default: ``False``)
   | Determines if rows (``transpose`` is ``False``) or columns
     (``transpose`` is ``True``) are to be clustered.

-  | ``npass`` (default: ``1``)
   | The number of times the :math:`k`-means/-medians clustering
     algorithm is performed, each time with a different (random) initial
     condition. If ``initialid`` is given, the value of ``npass`` is
     ignored and the clustering algorithm is run only once, as it
     behaves deterministically in that case.

-  | ``method`` (default: ``'a'``)
   | Describes how the center of a cluster is found:

   -  ``method=='a'``: arithmetic mean (:math:`k`-means clustering);

   -  ``method=='m'``: median (:math:`k`-medians clustering).

   For other values of ``method``, the arithmetic mean is used.

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

This function returns a tuple ``(clusterid, error, nfound)``, where
``clusterid`` is an integer array containing the number of the cluster
to which each row or column was assigned, ``error`` is the
within-cluster sum of distances for the optimal clustering solution, and
``nfound`` is the number of times this optimal solution was found.

Calculating a Self-Organizing Map
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To calculate a Self-Organizing Map of the items stored in the record,
use

.. code:: pycon

   >>> clusterid, celldata = record.somcluster()

where the following arguments are defined:

-  | ``transpose`` (default: ``False``)
   | Determines if rows (``transpose`` is ``False``) or columns
     (``transpose`` is ``True``) are to be clustered.

-  | ``nxgrid, nygrid`` (default: ``2, 1``)
   | The number of cells horizontally and vertically in the rectangular
     grid on which the Self-Organizing Map is calculated.

-  | ``inittau`` (default: ``0.02``)
   | The initial value for the parameter :math:`\tau` that is used in
     the SOM algorithm. The default value for ``inittau`` is 0.02, which
     was used in Michael Eisen’s Cluster/TreeView program.

-  | ``niter`` (default: ``1``)
   | The number of iterations to be performed.

-  | ``dist`` (default: ``'e'``, Euclidean distance)
   | Defines the distance function to be used (see
     :ref:`sec:distancefunctions`).

This function returns the tuple ``(clusterid, celldata)``:

-  | ``clusterid``:
   | An array with two columns, where the number of rows is equal to the
     number of items that were clustered. Each row contains the
     :math:`x` and :math:`y` coordinates of the cell in the rectangular
     SOM grid to which the item was assigned.

-  | ``celldata``:
   | An array with dimensions
     :math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of columns}\right)`
     if rows are being clustered, or
     :math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of rows}\right)`
     if columns are being clustered. Each element ``[ix][iy]`` of this
     array is a 1D vector containing the gene expression data for the
     centroid of the cluster in the grid cell with coordinates
     ``[ix][iy]``.
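
For example, the grid cell assigned to each gene can be listed
alongside its identifier (a minimal sketch; the grid size and number of
iterations are arbitrary choices):

.. code:: pycon

   >>> clusterid, celldata = record.somcluster(nxgrid=3, nygrid=3, niter=100)
   >>> for name, (ix, iy) in zip(record.geneid, clusterid):
   ...     print(name, ix, iy)
   ...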

Saving the clustering result
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To save the clustering result, use

.. code:: pycon

   >>> record.save(jobname, geneclusters, expclusters)

where the following arguments are defined:

-  | ``jobname``
   | The string ``jobname`` is used as the base name for names of the
     files that are to be saved.

-  | ``geneclusters``
   | This argument describes the gene (row-wise) clustering result. In
     case of :math:`k`-means clustering, this is a 1D array containing
     the number of the cluster each gene belongs to. It can be
     calculated using ``kcluster``. In case of hierarchical clustering,
     ``geneclusters`` is a ``Tree`` object.

-  | ``expclusters``
   | This argument describes the (column-wise) clustering result for the
     experimental conditions. In case of :math:`k`-means clustering,
     this is a 1D array containing the number of the cluster each
     experimental condition belongs to. It can be calculated using
     ``kcluster``. In case of hierarchical clustering, ``expclusters``
     is a ``Tree`` object.

This method writes the text files ``jobname.cdt``, ``jobname.gtr``,
``jobname.atr``, ``jobname*.kgg``, and/or ``jobname*.kag`` for
subsequent reading by the Java TreeView program. If ``geneclusters`` and
``expclusters`` are both ``None``, this method only writes the text file
``jobname.cdt``; this file can subsequently be read into a new
``Record`` object.

Example calculation
-------------------

This is an example of a hierarchical clustering calculation, using
single linkage clustering for genes and maximum linkage clustering for
experimental conditions. As the Euclidean distance is being used for
gene clustering, it is necessary to scale the node distances in
``genetree`` such that they are all between zero and one. This is needed
for the Java TreeView code to display the tree diagram correctly. To
cluster the experimental conditions, the uncentered correlation is being
used. No scaling is needed in this case, as the distances in ``exptree``
are already between zero and two.

The example data ``cyano.txt`` can be found in Biopython’s
``Tests/Cluster`` subdirectory and is from the paper
Hihara *et al.* 2001 [Hihara2001]_.

.. doctest ../Tests/Cluster lib:numpy

.. code:: pycon

   >>> from Bio import Cluster
   >>> with open("cyano.txt") as handle:
   ...     record = Cluster.read(handle)
   ...
   >>> genetree = record.treecluster(method="s")
   >>> genetree.scale()
   >>> exptree = record.treecluster(dist="u", transpose=1)
   >>> record.save("cyano_result", genetree, exptree)

This will create the files ``cyano_result.cdt``, ``cyano_result.gtr``,
and ``cyano_result.atr``.

Similarly, we can save a :math:`k`-means clustering solution:

.. doctest ../Tests/Cluster lib:numpy

.. code:: pycon

   >>> from Bio import Cluster
   >>> with open("cyano.txt") as handle:
   ...     record = Cluster.read(handle)
   ...
   >>> (geneclusters, error, ifound) = record.kcluster(nclusters=5, npass=1000)
   >>> (expclusters, error, ifound) = record.kcluster(nclusters=2, npass=100, transpose=1)
   >>> record.save("cyano_result", geneclusters, expclusters)

This will create the files ``cyano_result_K_G5_A2.cdt``,
``cyano_result_K_G5.kgg``, and ``cyano_result_K_A2.kag``.