.. _`chapter:cluster`:
Cluster analysis
================
Cluster analysis is the grouping of items into clusters based on the
similarity of the items to each other. In bioinformatics, clustering is
widely used in gene expression data analysis to find groups of genes
with similar gene expression profiles. This may identify functionally
related genes, as well as suggest the function of presently unknown
genes.
The Biopython module ``Bio.Cluster`` provides commonly used clustering
algorithms and was designed with the application to gene expression data
in mind. However, this module can also be used for cluster analysis of
other types of data. ``Bio.Cluster`` and the underlying C Clustering
Library are described by De Hoon *et al.* [DeHoon2004]_.
The following four clustering approaches are implemented in
``Bio.Cluster``:
- Hierarchical clustering (pairwise centroid-, single-, complete-, and
average-linkage);
- :math:`k`-means, :math:`k`-medians, and :math:`k`-medoids clustering;
- Self-Organizing Maps;
- Principal Component Analysis.
Data representation
~~~~~~~~~~~~~~~~~~~
The data to be clustered are represented by an :math:`n \times m`
Numerical Python array ``data``. Within the context of gene expression
data clustering, typically the rows correspond to different genes
whereas the columns correspond to different experimental conditions. The
clustering algorithms in ``Bio.Cluster`` can be applied both to rows
(genes) and to columns (experiments).
Missing values
~~~~~~~~~~~~~~
The :math:`n \times m` Numerical Python integer array ``mask`` indicates
if any of the values in ``data`` are missing. If ``mask[i, j] == 0``,
then ``data[i, j]`` is missing and is ignored in the analysis.
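For example, a minimal sketch of constructing such a mask with NumPy
(the data values are made up for illustration):

.. code:: pycon

   >>> import numpy as np
   >>> data = np.array([[1.2, 0.8, 9.0], [0.5, 0.7, 0.9]])
   >>> mask = np.ones(data.shape, dtype=int)  # 1 means the value is present
   >>> mask[0, 2] = 0  # mark data[0, 2] as missing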
Random number generator
~~~~~~~~~~~~~~~~~~~~~~~
The :math:`k`-means/medians/medoids clustering algorithms and
Self-Organizing Maps (SOMs) include the use of a random number
generator. The uniform random number generator in ``Bio.Cluster`` is
based on the algorithm by L’Ecuyer [Lecuyer1988]_,
while random numbers following the binomial distribution are generated
using the BTPE algorithm by Kachitvichyanukul and Schmeiser
[Kachitvichyanukul1988]_. The random number generator
is initialized automatically during its first call. As this random
number generator uses a combination of two multiplicative linear
congruential generators, two (integer) seeds are needed for
initialization, for which we use the system-supplied random number
generator ``rand`` (in the C standard library). We initialize this
generator by calling ``srand`` with the epoch time in seconds, and use
the first two random numbers generated by ``rand`` as seeds for the
uniform random number generator in ``Bio.Cluster``.
.. _`sec:distancefunctions`:
Distance functions
------------------
In order to cluster items into groups based on their similarity, we
should first define what exactly we mean by *similar*. ``Bio.Cluster``
provides eight distance functions, indicated by a single character, to
measure similarity, or conversely, distance:
- ``'e'``: Euclidean distance;
- ``'b'``: City-block distance;
- ``'c'``: Pearson correlation coefficient;
- ``'a'``: Absolute value of the Pearson correlation coefficient;
- ``'u'``: Uncentered Pearson correlation (equivalent to the cosine of
the angle between two data vectors);
- ``'x'``: Absolute uncentered Pearson correlation;
- ``'s'``: Spearman’s rank correlation;
- ``'k'``: Kendall’s :math:`\tau`.
The first two are true distance functions that satisfy the triangle
inequality:
.. math:: d\left(\underline{u},\underline{v}\right) \leq d\left(\underline{u},\underline{w}\right) + d\left(\underline{w},\underline{v}\right) \textrm{ for all } \underline{u}, \underline{v}, \underline{w},
and are therefore referred to as *metrics*. In everyday language, this
means that the shortest distance between two points is a straight line.
The remaining six distance measures are related to the correlation
coefficient, where the distance :math:`d` is defined in terms of the
correlation :math:`r` by :math:`d=1-r`. Note that these distance
functions are *semi-metrics* that do not satisfy the triangle
inequality. For example, for
.. math:: \underline{u}=\left(1,0,-1\right);
.. math:: \underline{v}=\left(1,1,0\right);
.. math:: \underline{w}=\left(0,1,1\right);
we find a Pearson distance
:math:`d\left(\underline{u},\underline{w}\right) = 1.8660`, while
:math:`d\left(\underline{u},\underline{v}\right)+d\left(\underline{v},\underline{w}\right) = 1.6340`.
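This violation of the triangle inequality can be checked directly with
NumPy; the following sketch is independent of ``Bio.Cluster`` and only
illustrates the calculation:

.. code:: pycon

   >>> import numpy as np
   >>> u, v, w = [1, 0, -1], [1, 1, 0], [0, 1, 1]
   >>> def pearson_distance(a, b):
   ...     return 1 - np.corrcoef(a, b)[0, 1]
   ...
   >>> print(round(pearson_distance(u, w), 4))
   1.866
   >>> print(round(pearson_distance(u, v) + pearson_distance(v, w), 4))
   1.634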
Euclidean distance
~~~~~~~~~~~~~~~~~~
In ``Bio.Cluster``, we define the Euclidean distance as
.. math:: d = {1 \over n} \sum_{i=1}^{n} \left(x_i-y_i\right)^{2}.
Only those terms are included in the summation for which both
:math:`x_i` and :math:`y_i` are present, and the denominator :math:`n`
is chosen accordingly. As the expression data :math:`x_i` and
:math:`y_i` are subtracted directly from each other, we should make sure
that the expression data are properly normalized when using the
Euclidean distance.
City-block distance
~~~~~~~~~~~~~~~~~~~
The city-block distance, alternatively known as the Manhattan distance,
is related to the Euclidean distance. Whereas the Euclidean distance
corresponds to the length of the shortest path between two points, the
city-block distance is the sum of distances along each dimension. As
gene expression data tend to have missing values, in ``Bio.Cluster`` we
define the city-block distance as the sum of distances divided by the
number of dimensions:
.. math:: d = {1 \over n} \sum_{i=1}^n \left|x_i-y_i\right|.
This is equal to the distance you would have to walk between two points
in a city, where you have to walk along city blocks. As for the
Euclidean distance, the expression data are subtracted directly from
each other, and we should therefore make sure that they are properly
normalized.
The Pearson correlation coefficient
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Pearson correlation coefficient is defined as
.. math:: r = \frac{1}{n} \sum_{i=1}^n \left( \frac{x_i -\bar{x}}{\sigma_x} \right) \left(\frac{y_i -\bar{y}}{\sigma_y} \right),
in which :math:`\bar{x}, \bar{y}` are the sample mean of :math:`x` and
:math:`y` respectively, and :math:`\sigma_x, \sigma_y` are the sample
standard deviation of :math:`x` and :math:`y`. The Pearson correlation
coefficient is a measure for how well a straight line can be fitted to a
scatterplot of :math:`x` and :math:`y`. If all the points in the
scatterplot lie on a straight line, the Pearson correlation coefficient
is either +1 or -1, depending on whether the slope of the line is positive
or negative. If the Pearson correlation coefficient is equal to zero,
there is no correlation between :math:`x` and :math:`y`.
The *Pearson distance* is then defined as
.. math:: d_{\textrm{P}} \equiv 1 - r.
As the Pearson correlation coefficient lies between -1 and 1, the
Pearson distance lies between 0 and 2.
Absolute Pearson correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By taking the absolute value of the Pearson correlation, we find a
number between 0 and 1. If the absolute value is 1, all the points in
the scatter plot lie on a straight line with either a positive or a
negative slope. If the absolute value is equal to zero, there is no
correlation between :math:`x` and :math:`y`.
The corresponding distance is defined as
.. math:: d_{\textrm A} \equiv 1 - \left|r\right|,
where :math:`r` is the Pearson correlation coefficient. As the absolute
value of the Pearson correlation coefficient lies between 0 and 1, the
corresponding distance lies between 0 and 1 as well.
In the context of gene expression experiments, the absolute correlation
is equal to 1 if the gene expression profiles of two genes are either
exactly the same or exactly opposite. The absolute correlation
coefficient should therefore be used with care.
Uncentered correlation (cosine of the angle)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In some cases, it may be preferable to use the *uncentered correlation*
instead of the regular Pearson correlation coefficient. The uncentered
correlation is defined as
.. math:: r_{\textrm U} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i}{\sigma_x^{(0)}} \right) \left(\frac{y_i}{\sigma_y^{(0)}} \right),
where
.. math::
\begin{aligned}
\sigma_x^{(0)} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}; \\
\sigma_y^{(0)} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2}.
\end{aligned}
This is the same expression as for the regular Pearson correlation
coefficient, except that the sample means :math:`\bar{x}, \bar{y}` are
set equal to zero. The uncentered correlation may be appropriate if
there is a zero reference state. For instance, in the case of gene
expression data given in terms of log-ratios, a log-ratio equal to zero
corresponds to the green and red signal being equal, which means that
the experimental manipulation did not affect the gene expression.
The distance corresponding to the uncentered correlation coefficient is
defined as
.. math:: d_{\textrm{U}} \equiv 1 - r_{\textrm{U}},
where :math:`r_{\textrm{U}}` is the uncentered correlation. As the
uncentered correlation coefficient lies between -1 and 1, the
corresponding distance lies between 0 and 2.
The uncentered correlation is equal to the cosine of the angle of the
two data vectors in :math:`n`-dimensional space, and is often referred
to as such.
Absolute uncentered correlation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As for the regular Pearson correlation, we can define a distance measure
using the absolute value of the uncentered correlation:
.. math:: d_{\textrm{AU}} \equiv 1 - \left|r_{\textrm{U}}\right|,
where :math:`r_{\textrm{U}}` is the uncentered correlation coefficient. As
the absolute value of the uncentered correlation coefficient lies
between 0 and 1, the corresponding distance lies between 0 and 1 as
well.
Geometrically, the absolute value of the uncentered correlation is equal
to the cosine between the supporting lines of the two data vectors
(i.e., the angle without taking the direction of the vectors into
consideration).
Spearman rank correlation
~~~~~~~~~~~~~~~~~~~~~~~~~
The Spearman rank correlation is an example of a non-parametric
similarity measure, and tends to be more robust against outliers than
the Pearson correlation.
To calculate the Spearman rank correlation, we replace each data value
by its rank, obtained by ordering the values within each data vector.
We then calculate the Pearson correlation between the two rank vectors
instead of the data vectors.
As in the case of the Pearson correlation, we can define a distance
measure corresponding to the Spearman rank correlation as
.. math:: d_{\textrm{S}} \equiv 1 - r_{\textrm{S}},
where :math:`r_{\textrm{S}}` is the Spearman rank correlation.
Kendall’s :math:`\tau`
~~~~~~~~~~~~~~~~~~~~~~
Kendall’s :math:`\tau` is another example of a non-parametric similarity
measure. It is similar to the Spearman rank correlation, but instead of
the ranks themselves only the relative ranks are used to calculate
:math:`\tau` (see Snedecor & Cochran [Snedecor1989]_).
We can define a distance measure corresponding to Kendall’s :math:`\tau`
as
.. math:: d_{\textrm{K}} \equiv 1 - \tau.
As Kendall’s :math:`\tau` is always between -1 and 1, the corresponding
distance will be between 0 and 2.
Weighting
~~~~~~~~~
For most of the distance functions available in ``Bio.Cluster``, a
weight vector can be applied. The weight vector contains weights for the
items in the data vector. If the weight for item :math:`i` is
:math:`w_i`, then that item is treated as if it occurred :math:`w_i`
times in the data. The weights do not have to be integers.
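As an illustration, the weighted analog of the Euclidean distance
defined above can be written out by hand, treating each weight as a
multiplicity (a sketch with made-up values, not a call into
``Bio.Cluster``):

.. code:: pycon

   >>> import numpy as np
   >>> x = np.array([1.0, 2.0, 3.0])
   >>> y = np.array([2.0, 4.0, 6.0])
   >>> w = np.array([1.0, 2.0, 1.0])  # the second item counts twice
   >>> print(np.sum(w * (x - y) ** 2) / np.sum(w))
   4.5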
.. _`sec:distancematrix`:
Calculating the distance matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The distance matrix is a square matrix with all pairwise distances
between the items in ``data``, and can be calculated by the function
``distancematrix`` in the ``Bio.Cluster`` module:
.. code:: pycon
>>> from Bio.Cluster import distancematrix
>>> matrix = distancematrix(data)
where the following arguments are defined:
- | ``data`` (required)
| Array containing the data for the items.
- | ``mask`` (default: ``None``)
| Array of integers showing which data are missing. If
``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
``None``, then all data are present.
- | ``weight`` (default: ``None``)
| The weights to be used when calculating distances. If ``weight`` is
``None``, then equal weights are assumed.
- | ``transpose`` (default: ``0``)
| Determines if the distances between the rows of ``data`` are to be
calculated (``transpose`` is ``False``), or between the columns of
``data`` (``transpose`` is ``True``).
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
To save memory, the distance matrix is returned as a list of 1D arrays.
The number of columns in each row is equal to the row number. Hence, the
first row has zero elements. For example,
.. code:: pycon
>>> from numpy import array
>>> from Bio.Cluster import distancematrix
>>> data = array([[0, 1, 2, 3],
... [4, 5, 6, 7],
... [8, 9, 10, 11],
... [1, 2, 3, 4]]) # fmt: skip
...
>>> distances = distancematrix(data, dist="e")
yields a distance matrix
.. code:: pycon
>>> distances
[array([], dtype=float64), array([ 16.]), array([ 64., 16.]), array([ 1., 9., 49.])]
which can be rewritten as
.. code:: python
[array([], dtype=float64), array([16.0]), array([64.0, 16.0]), array([1.0, 9.0, 49.0])]
This corresponds to the distance matrix:
.. math::
\left(
\begin{array}{cccc}
0 & 16 & 64 & 1 \\
16 & 0 & 16 & 9 \\
64 & 16 & 0 & 49 \\
1 & 9 & 49 & 0
\end{array}
\right).
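If the full square form is more convenient, the packed representation
returned by ``distancematrix`` can be expanded with a few lines of NumPy
(a sketch; ``distances`` is the list of 1D arrays from the example
above):

.. code:: pycon

   >>> import numpy as np
   >>> n = len(distances)
   >>> full = np.zeros((n, n))
   >>> for i, row in enumerate(distances):
   ...     full[i, :i] = row  # fill the lower-triangular part
   ...     full[:i, i] = row  # mirror into the upper-triangular part
   ...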
Calculating cluster properties
------------------------------
.. _`sec:clustercentroids`:
Calculating the cluster centroids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The centroid of a cluster can be defined either as the mean or as the
median of each dimension over all cluster items. The function
``clustercentroids`` in ``Bio.Cluster`` can be used to calculate either:
.. code:: pycon
>>> from Bio.Cluster import clustercentroids
>>> cdata, cmask = clustercentroids(data)
where the following arguments are defined:
- | ``data`` (required)
| Array containing the data for the items.
- | ``mask`` (default: ``None``)
| Array of integers showing which data are missing. If
``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
``None``, then all data are present.
- | ``clusterid`` (default: ``None``)
| Vector of integers showing to which cluster each item belongs. If
``clusterid`` is ``None``, then all items are assumed to belong to
the same cluster.
- | ``method`` (default: ``'a'``)
| Specifies whether the arithmetic mean (``method=='a'``) or the
median (``method=='m'``) is used to calculate the cluster center.
- | ``transpose`` (default: ``0``)
| Determines if the centroids of the rows of ``data`` are to be
calculated (``transpose`` is ``False``), or the centroids of the
columns of ``data`` (``transpose`` is ``True``).
This function returns the tuple ``(cdata, cmask)``. The centroid data
are stored in the 2D Numerical Python array ``cdata``, with missing data
indicated by the 2D Numerical Python integer array ``cmask``. The
dimensions of these arrays are
:math:`\left(\textrm{number of clusters}, \textrm{number of columns}\right)`
if ``transpose`` is ``0``, or
:math:`\left(\textrm{number of rows}, \textrm{number of clusters}\right)`
if ``transpose`` is ``1``. Each row (if ``transpose`` is ``0``) or
column (if ``transpose`` is ``1``) contains the averaged data
corresponding to the centroid of each cluster.
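For example, a minimal sketch with made-up data and cluster
assignments:

.. code:: pycon

   >>> from numpy import array
   >>> from Bio.Cluster import clustercentroids
   >>> data = array([[0.0, 1.0], [0.2, 1.4], [4.0, 5.0]])
   >>> clusterid = array([0, 0, 1])  # items 0 and 1 form one cluster
   >>> cdata, cmask = clustercentroids(data, clusterid=clusterid, method="a")
   >>> cdata.shape  # (number of clusters, number of columns)
   (2, 2)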
Calculating the distance between clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Given a distance function between *items*, we can define the distance
between two *clusters* in several ways. The distance between the
arithmetic means of the two clusters is used in pairwise
centroid-linkage clustering and in :math:`k`-means clustering. In
:math:`k`-medoids clustering, the distance between the medians of the
two clusters is used instead. The shortest pairwise distance between
items of the two clusters is used in pairwise single-linkage clustering,
while the longest pairwise distance is used in pairwise maximum-linkage
clustering. In pairwise average-linkage clustering, the distance between
two clusters is defined as the average over the pairwise distances.
To calculate the distance between two clusters, use
.. code:: pycon
>>> from Bio.Cluster import clusterdistance
>>> distance = clusterdistance(data)
where the following arguments are defined:
- | ``data`` (required)
| Array containing the data for the items.
- | ``mask`` (default: ``None``)
| Array of integers showing which data are missing. If
``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
``None``, then all data are present.
- | ``weight`` (default: ``None``)
| The weights to be used when calculating distances. If ``weight`` is
``None``, then equal weights are assumed.
- | ``index1`` (default: ``0``)
| A list containing the indices of the items belonging to the first
cluster. A cluster containing only one item :math:`i` can be
represented either as a list ``[i]``, or as an integer ``i``.
- | ``index2`` (default: ``0``)
| A list containing the indices of the items belonging to the second
cluster. A cluster containing only one item :math:`i` can be
represented either as a list ``[i]``, or as an integer ``i``.
- | ``method`` (default: ``'a'``)
| Specifies how the distance between clusters is defined:
- ``'a'``: Distance between the two cluster centroids (arithmetic
mean);
- ``'m'``: Distance between the two cluster centroids (median);
- ``'s'``: Shortest pairwise distance between items in the two
clusters;
- ``'x'``: Longest pairwise distance between items in the two
clusters;
- ``'v'``: Average over the pairwise distances between items in the
two clusters.
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
- | ``transpose`` (default: ``0``)
| If ``transpose`` is ``False``, calculate the distance between the
rows of ``data``. If ``transpose`` is ``True``, calculate the
distance between the columns of ``data``.
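For instance, with the made-up data used above, the distance between
the cluster formed by the first two items and the third item could be
calculated as follows (a sketch):

.. code:: pycon

   >>> from numpy import array
   >>> from Bio.Cluster import clusterdistance
   >>> data = array([[0.0, 1.0], [0.2, 1.4], [4.0, 5.0]])
   >>> distance = clusterdistance(data, index1=[0, 1], index2=2, method="a", dist="e")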
Partitioning algorithms
-----------------------
Partitioning algorithms divide items into :math:`k` clusters such that
the sum of distances over the items to their cluster centers is minimal.
The number of clusters :math:`k` is specified by the user. Three
partitioning algorithms are available in ``Bio.Cluster``:
- :math:`k`-means clustering
- :math:`k`-medians clustering
- :math:`k`-medoids clustering
These algorithms differ in how the cluster center is defined. In
:math:`k`-means clustering, the cluster center is defined as the mean
data vector averaged over all items in the cluster. Instead of the mean,
in :math:`k`-medians clustering the median is calculated for each
dimension in the data vector. Finally, in :math:`k`-medoids clustering
the cluster center is defined as the item which has the smallest sum of
distances to the other items in the cluster. This clustering algorithm
is suitable for cases in which the distance matrix is known but the
original data matrix is not available, for example when clustering
proteins based on their structural similarity.
The expectation-maximization (EM) algorithm is used to find this
partitioning into :math:`k` groups. In the initialization of the EM
algorithm, we randomly assign items to clusters. To ensure that no empty
clusters are produced, we use the binomial distribution to randomly
choose the number of items in each cluster to be one or more. We then
randomly permute the cluster assignments to items such that each item
has an equal probability to be in any cluster. Each cluster is thus
guaranteed to contain at least one item.
We then iterate:
- Calculate the centroid of each cluster, defined as either the mean,
the median, or the medoid of the cluster;
- Calculate the distances of each item to the cluster centers;
- For each item, determine which cluster centroid is closest;
- Reassign each item to its closest cluster, or stop the iteration if
no further item reassignments take place.
To avoid clusters becoming empty during the iteration, in
:math:`k`-means and :math:`k`-medians clustering the algorithm keeps
track of the number of items in each cluster, and prohibits the last
remaining item in a cluster from being reassigned to a different
cluster. For :math:`k`-medoids clustering, such a check is not needed,
as the item that functions as the cluster centroid has a zero distance
to itself, and will therefore never be closer to a different cluster.
As the initial assignment of items to clusters is done randomly, usually
a different clustering solution is found each time the EM algorithm is
executed. To find the optimal clustering solution, the :math:`k`-means
algorithm is repeated many times, each time starting from a different
initial random clustering. The sum of distances of the items to their
cluster center is saved for each run, and the solution with the smallest
value of this sum will be returned as the overall clustering solution.
How often the EM algorithm should be run depends on the number of items
being clustered. As a rule of thumb, we can consider how often the
optimal solution was found; this number is returned by the partitioning
algorithms as implemented in this library. If the optimal solution was
found many times, it is unlikely that better solutions exist than the
one that was found. However, if the optimal solution was found only
once, there may well be other solutions with a smaller within-cluster
sum of distances. If the number of items is large (more than several
hundred), it may be difficult to find the globally optimal solution.
The EM algorithm terminates when no further reassignments take place. We
noticed that for some sets of initial cluster assignments, the EM
algorithm fails to converge due to the same clustering solution
reappearing periodically after a small number of iteration steps. We
therefore check for the occurrence of such periodic solutions during the
iteration. After a given number of iteration steps, the current
clustering result is saved as a reference. By comparing the clustering
result after each subsequent iteration step to the reference state, we
can determine if a previously encountered clustering result is found. In
such a case, the iteration is halted. If after a given number of
iterations the reference state has not yet been encountered, the current
clustering solution is saved to be used as the new reference state.
Initially, ten iteration steps are executed before resaving the
reference state. This number of iteration steps is doubled each time, to
ensure that periodic behavior with longer periods can also be detected.
:math:`k`-means and :math:`k`-medians
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The :math:`k`-means and :math:`k`-medians algorithms are implemented as
the function ``kcluster`` in ``Bio.Cluster``:
.. code:: pycon
>>> from Bio.Cluster import kcluster
>>> clusterid, error, nfound = kcluster(data)
where the following arguments are defined:
- | ``data`` (required)
| Array containing the data for the items.
- | ``nclusters`` (default: ``2``)
| The number of clusters :math:`k`.
- | ``mask`` (default: ``None``)
| Array of integers showing which data are missing. If
``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
``None``, then all data are present.
- | ``weight`` (default: ``None``)
| The weights to be used when calculating distances. If ``weight`` is
``None``, then equal weights are assumed.
- | ``transpose`` (default: ``0``)
| Determines if rows (``transpose`` is ``0``) or columns
(``transpose`` is ``1``) are to be clustered.
- | ``npass`` (default: ``1``)
| The number of times the :math:`k`-means/-medians clustering
algorithm is performed, each time with a different (random) initial
condition. If ``initialid`` is given, the value of ``npass`` is
ignored and the clustering algorithm is run only once, as it
behaves deterministically in that case.
- | ``method`` (default: ``'a'``)
| Specifies how the center of a cluster is found:
- ``method=='a'``: arithmetic mean (:math:`k`-means clustering);
- ``method=='m'``: median (:math:`k`-medians clustering).
For other values of ``method``, the arithmetic mean is used.
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`). Whereas all eight distance
measures are accepted by ``kcluster``, from a theoretical viewpoint
it is best to use the Euclidean distance for the :math:`k`-means
algorithm, and the city-block distance for :math:`k`-medians.
- | ``initialid`` (default: ``None``)
| Specifies the initial clustering to be used for the EM algorithm.
If ``initialid`` is ``None``, then a different random initial
clustering is used for each of the ``npass`` runs of the EM
algorithm. If ``initialid`` is not ``None``, then it should be
equal to a 1D array containing the cluster number (between ``0``
and ``nclusters-1``) for each item. Each cluster should contain at
least one item. With the initial clustering specified, the EM
algorithm is deterministic.
This function returns a tuple ``(clusterid, error, nfound)``, where
``clusterid`` is an integer array containing the number of the cluster
to which each row or column was assigned, ``error`` is the
within-cluster sum of distances for the optimal clustering solution, and
``nfound`` is the number of times this optimal solution was found.
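A typical call might then look as follows (a sketch; ``data`` is the
array of item profiles, and the parameter values are illustrative):

.. code:: pycon

   >>> from Bio.Cluster import kcluster
   >>> clusterid, error, nfound = kcluster(
   ...     data, nclusters=3, npass=100, method="a", dist="e"
   ... )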
:math:`k`-medoids clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``kmedoids`` routine performs :math:`k`-medoids clustering on a
given set of items, using the distance matrix and the number of clusters
passed by the user:
.. code:: pycon
>>> from Bio.Cluster import kmedoids
>>> clusterid, error, nfound = kmedoids(distance)
where the following arguments are defined:
- | ``distance`` (required)
| The matrix containing the distances between the items; this matrix
can be specified in three ways:
- as a 2D Numerical Python array (in which only the left-lower part
of the array will be accessed):
.. code:: python
distance = array([[0.0, 1.1, 2.3], [1.1, 0.0, 4.5], [2.3, 4.5, 0.0]])
- as a 1D Numerical Python array containing consecutively the
distances in the left-lower part of the distance matrix:
.. code:: python
distance = array([1.1, 2.3, 4.5])
- as a list containing the rows of the left-lower part of the
distance matrix:
.. code:: python
distance = [array([]), array([1.1]), array([2.3, 4.5])]
These three expressions correspond to the same distance matrix.
- | ``nclusters`` (default: ``2``)
| The number of clusters :math:`k`.
- | ``npass`` (default: ``1``)
| The number of times the :math:`k`-medoids clustering algorithm is
performed, each time with a different (random) initial condition.
If ``initialid`` is given, the value of ``npass`` is ignored, as
the clustering algorithm behaves deterministically in that case.
- | ``initialid`` (default: ``None``)
| Specifies the initial clustering to be used for the EM algorithm.
If ``initialid`` is ``None``, then a different random initial
clustering is used for each of the ``npass`` runs of the EM
algorithm. If ``initialid`` is not ``None``, then it should be
equal to a 1D array containing the cluster number (between ``0``
and ``nclusters-1``) for each item. Each cluster should contain at
least one item. With the initial clustering specified, the EM
algorithm is deterministic.
This function returns a tuple ``(clusterid, error, nfound)``, where
``clusterid`` is an array containing the number of the cluster to which
each item was assigned, ``error`` is the within-cluster sum of distances
for the optimal :math:`k`-medoids clustering solution, and ``nfound`` is
the number of times the optimal solution was found. Note that the
cluster number in ``clusterid`` is defined as the item number of the
item representing the cluster centroid.
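For example, using the 1D packed form of the distance matrix shown
above (a sketch with illustrative values):

.. code:: pycon

   >>> from numpy import array
   >>> from Bio.Cluster import kmedoids
   >>> distance = array([1.1, 2.3, 4.5])  # packed lower-triangular distances
   >>> clusterid, error, nfound = kmedoids(distance, nclusters=2, npass=10)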
Hierarchical clustering
-----------------------
Hierarchical clustering methods are inherently different from the
:math:`k`-means clustering method. In hierarchical clustering, the
similarity in the expression profiles of genes or experimental
conditions is represented in the form of a tree structure. This tree
structure can be shown graphically by programs such as Treeview and Java
Treeview, which has contributed to the popularity of hierarchical
clustering in the analysis of gene expression data.
The first step in hierarchical clustering is to calculate the distance
matrix, specifying all the distances between the items to be clustered.
Next, we create a node by joining the two closest items. Subsequent
nodes are created by pairwise joining of items or nodes based on the
distance between them, until all items belong to the same node. A tree
structure can then be created by retracing which items and nodes were
merged. Unlike the EM algorithm, which is used in :math:`k`-means
clustering, the complete process of hierarchical clustering is
deterministic.
Several flavors of hierarchical clustering exist, which differ in how
the distance between subnodes is defined in terms of their members. In
``Bio.Cluster``, pairwise single, maximum, average, and centroid linkage
are available.
- In pairwise single-linkage clustering, the distance between two nodes
is defined as the shortest distance among the pairwise distances
between the members of the two nodes.
- In pairwise maximum-linkage clustering, alternatively known as
pairwise complete-linkage clustering, the distance between two nodes
is defined as the longest distance among the pairwise distances
between the members of the two nodes.
- In pairwise average-linkage clustering, the distance between two
nodes is defined as the average over all pairwise distances between
the items of the two nodes.
- In pairwise centroid-linkage clustering, the distance between two
nodes is defined as the distance between their centroids. The
centroids are calculated by taking the mean over all the items in a
cluster. As the distance from each newly formed node to existing
nodes and items need to be calculated at each step, the computing
time of pairwise centroid-linkage clustering may be significantly
longer than for the other hierarchical clustering methods. Another
peculiarity is that (for a distance measure based on the Pearson
correlation), the distances do not necessarily increase when going up
in the clustering tree, and may even decrease. This is caused by an
inconsistency between the centroid calculation and the distance
calculation when using the Pearson correlation: Whereas the Pearson
correlation effectively normalizes the data for the distance
calculation, no such normalization occurs for the centroid
calculation.
For pairwise single-, complete-, and average-linkage clustering, the
distance between two nodes can be found directly from the distances
between the individual items. Therefore, the clustering algorithm does
not need access to the original gene expression data, once the distance
matrix is known. For pairwise centroid-linkage clustering, however, the
centroids of newly formed subnodes can only be calculated from the
original data and not from the distance matrix.
The implementation of pairwise single-linkage hierarchical clustering is
based on the SLINK algorithm [Sibson1973]_, which is
much faster and more memory-efficient than a straightforward
implementation of pairwise single-linkage clustering. The clustering
result produced by this algorithm is identical to the clustering
solution found by the conventional single-linkage algorithm. The
single-linkage hierarchical clustering algorithm implemented in this
library can be used to cluster large gene expression data sets, for
which conventional hierarchical clustering algorithms fail due to
excessive memory requirements and running time.
Representing a hierarchical clustering solution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The result of hierarchical clustering consists of a tree of nodes, in
which each node joins two items or subnodes. Usually, we are not only
interested in which items or subnodes are joined at each node, but also
in their similarity (or distance) as they are joined. To store one node
in the hierarchical clustering tree, we make use of the class ``Node``,
which is defined in ``Bio.Cluster``. An instance of ``Node`` has three
attributes:
- ``left``
- ``right``
- ``distance``
Here, ``left`` and ``right`` are integers referring to the two items or
subnodes that are joined at this node, and ``distance`` is the distance
between them. The items being clustered are numbered from 0 to
:math:`\left(\textrm{number of items} - 1\right)`, while clusters are
numbered from -1 to :math:`-\left(\textrm{number of items}-1\right)`.
Note that the number of nodes is one less than the number of items.
To create a new ``Node`` object, we need to specify ``left`` and
``right``; ``distance`` is optional.
.. doctest . lib:numpy
.. code:: pycon
>>> from Bio.Cluster import Node
>>> Node(2, 3)
(2, 3): 0
>>> Node(2, 3, 0.91)
(2, 3): 0.91
The attributes ``left``, ``right``, and ``distance`` of an existing
``Node`` object can be modified directly:
.. cont-doctest
.. code:: pycon
>>> node = Node(4, 5)
>>> node.left = 6
>>> node.right = 2
>>> node.distance = 0.73
>>> node
(6, 2): 0.73
An error is raised if ``left`` and ``right`` are not integers, or if
``distance`` cannot be converted to a floating-point value.
The Python class ``Tree`` represents a full hierarchical clustering
solution. A ``Tree`` object can be created from a list of ``Node``
objects:
.. doctest . lib:numpy
.. code:: pycon
>>> from Bio.Cluster import Node, Tree
>>> nodes = [Node(1, 2, 0.2), Node(0, 3, 0.5), Node(-2, 4, 0.6), Node(-1, -3, 0.9)]
>>> tree = Tree(nodes)
>>> print(tree)
(1, 2): 0.2
(0, 3): 0.5
(-2, 4): 0.6
(-1, -3): 0.9
The ``Tree`` initializer checks if the list of nodes is a valid
hierarchical clustering result:
.. cont-doctest
.. code:: pycon
>>> nodes = [Node(1, 2, 0.2), Node(0, 2, 0.5)]
>>> Tree(nodes)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: Inconsistent tree
Individual nodes in a ``Tree`` object can be accessed using square
brackets:
.. cont-doctest
.. code:: pycon
>>> nodes = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
>>> tree = Tree(nodes)
>>> tree[0]
(1, 2): 0.2
>>> tree[1]
(0, -1): 0.5
>>> tree[-1]
(0, -1): 0.5
As a ``Tree`` object is immutable, we cannot change individual nodes in
a ``Tree`` object. However, we can convert the tree to a list of nodes,
modify this list, and create a new tree from this list:
.. cont-doctest
.. code:: pycon
>>> tree = Tree([Node(1, 2, 0.1), Node(0, -1, 0.5), Node(-2, 3, 0.9)])
>>> print(tree)
(1, 2): 0.1
(0, -1): 0.5
(-2, 3): 0.9
>>> nodes = tree[:]
>>> nodes[0] = Node(0, 1, 0.2)
>>> nodes[1].left = 2
>>> tree = Tree(nodes)
>>> print(tree)
(0, 1): 0.2
(2, -1): 0.5
(-2, 3): 0.9
This guarantees that any ``Tree`` object is always well-formed.
To display a hierarchical clustering solution with visualization
programs such as Java Treeview, it is better to scale all node distances
such that they are between zero and one. This can be accomplished by
calling the ``scale`` method on an existing ``Tree`` object:
.. code:: pycon
>>> tree.scale()
This method takes no arguments, and returns ``None``.
Before drawing the tree, you may also want to reorder the tree nodes. A
hierarchical clustering solution of :math:`n` items can be drawn as
:math:`2^{n-1}` different but equivalent dendrograms by switching the
left and right subnode at each node. The ``tree.sort(order)`` method
visits each node in the hierarchical clustering tree and verifies if the
average order value of the left subnode is less than or equal to the
average order value of the right subnode. If not, the left and right
subnodes are exchanged. Here, the order values of the items are given by
the user. In the resulting dendrogram, items in the left-to-right order
will tend to have increasing order values. The method will return the
indices of the elements in the left-to-right order after sorting:
.. code:: pycon
>>> indices = tree.sort(order)
such that item ``indices[i]`` will occur at position :math:`i` in the
dendrogram.
After hierarchical clustering, the items can be grouped into :math:`k`
clusters based on the tree structure stored in the ``Tree`` object by
cutting the tree:
.. code:: pycon
>>> clusterid = tree.cut(nclusters=1)
where ``nclusters`` (defaulting to ``1``) is the desired number of
clusters :math:`k`. This method ignores the top :math:`k-1` linking
events in the tree structure, resulting in :math:`k` separated clusters
of items. The number of clusters :math:`k` should be positive, and less
than or equal to the number of items. This method returns an array
``clusterid`` containing the number of the cluster to which each item is
assigned. Clusters are numbered :math:`0` to :math:`k-1` in their
left-to-right order in the dendrogram.
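As a small sketch, cutting the two-node tree from the earlier example
into two clusters separates item 0 from items 1 and 2:

.. code:: pycon

   >>> from Bio.Cluster import Node, Tree
   >>> tree = Tree([Node(1, 2, 0.2), Node(0, -1, 0.5)])
   >>> clusterid = tree.cut(nclusters=2)  # drops the top linking event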
Performing hierarchical clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To perform hierarchical clustering, use the ``treecluster`` function in
``Bio.Cluster``.
.. code:: pycon
>>> from Bio.Cluster import treecluster
>>> tree = treecluster(data)
where the following arguments are defined:
- | ``data``
| Array containing the data for the items.
- | ``mask`` (default: ``None``)
| Array of integers showing which data are missing. If
``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
``None``, then all data are present.
- | ``weight`` (default: ``None``)
| The weights to be used when calculating distances. If ``weight`` is
``None``, then equal weights are assumed.
- | ``transpose`` (default: ``0``)
| Determines if rows (``transpose`` is ``False``) or columns
(``transpose`` is ``True``) are to be clustered.
- | ``method`` (default: ``'m'``)
| defines the linkage method to be used:
- ``method=='s'``: pairwise single-linkage clustering
- ``method=='m'``: pairwise maximum- (or complete-) linkage
clustering
- ``method=='c'``: pairwise centroid-linkage clustering
- ``method=='a'``: pairwise average-linkage clustering
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
To apply hierarchical clustering on a precalculated distance matrix,
specify the ``distancematrix`` argument when calling ``treecluster``
function instead of the ``data`` argument:
.. code:: pycon
>>> from Bio.Cluster import treecluster
>>> tree = treecluster(distancematrix=distance)
In this case, the following arguments are defined:
- | ``distancematrix``
| The distance matrix, which can be specified in three ways:
- as a 2D Numerical Python array (in which only the left-lower part
of the array will be accessed):
.. code:: python
distance = array([[0.0, 1.1, 2.3], [1.1, 0.0, 4.5], [2.3, 4.5, 0.0]])
- as a 1D Numerical Python array containing consecutively the
distances in the left-lower part of the distance matrix:
.. code:: python
distance = array([1.1, 2.3, 4.5])
- as a list containing the rows of the left-lower part of the
distance matrix:
.. code:: python
distance = [array([]), array([1.1]), array([2.3, 4.5])]
These three expressions correspond to the same distance matrix. As
``treecluster`` may shuffle the values in the distance matrix as part
of the clustering algorithm, be sure to save this array in a
different variable before calling ``treecluster`` if you need it
later.
- | ``method``
| The linkage method to be used:
- ``method=='s'``: pairwise single-linkage clustering
- ``method=='m'``: pairwise maximum- (or complete-) linkage
clustering
- ``method=='a'``: pairwise average-linkage clustering
While pairwise single-, maximum-, and average-linkage clustering can
be calculated from the distance matrix alone, pairwise
centroid-linkage cannot.
When calling ``treecluster``, either ``data`` or ``distancematrix``
should be ``None``.
This function returns a ``Tree`` object. This object contains
:math:`\left(\textrm{number of items} - 1\right)` nodes, where the
number of items is the number of rows if rows were clustered, or the
number of columns if columns were clustered. Each node describes a
pairwise linking event, where the node attributes ``left`` and ``right``
each contain the number of one item or subnode, and ``distance`` the
distance between them. Items are numbered from 0 to
:math:`\left(\textrm{number of items} - 1\right)`, while clusters are
numbered -1 to :math:`-\left(\textrm{number of items}-1\right)`.
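Putting these steps together, a typical hierarchical clustering run
might look like this (a sketch, assuming ``data`` holds the item
profiles as before):

.. code:: pycon

   >>> from Bio.Cluster import treecluster
   >>> tree = treecluster(data, method="s", dist="e")  # single linkage
   >>> tree.scale()  # scale node distances to [0, 1] for Java Treeview
   >>> clusterid = tree.cut(nclusters=2)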
Self-Organizing Maps
--------------------
Self-Organizing Maps (SOMs) were invented by Kohonen to describe neural
networks (see for instance Kohonen, 1997
[Kohonen1997]_). Tamayo (1999) first applied
Self-Organizing Maps to gene expression data
[Tamayo1999]_.
SOMs organize items into clusters that are situated in some topology.
Usually a rectangular topology is chosen. The clusters generated by SOMs
are such that neighboring clusters in the topology are more similar to
each other than clusters far from each other in the topology.
The first step in calculating an SOM is to randomly assign a data vector to
each cluster in the topology. If rows are being clustered, then the
number of elements in each data vector is equal to the number of
columns.
An SOM is then generated by taking rows one at a time, and finding which
cluster in the topology has the closest data vector. The data vector of
that cluster, as well as those of the neighboring clusters, are adjusted
using the data vector of the row under consideration. The adjustment is
given by
.. math:: \Delta \underline{x}_{\textrm{cell}} = \tau \cdot \left(\underline{x}_{\textrm{row}} - \underline{x}_{\textrm{cell}} \right).
The parameter :math:`\tau` decreases at each iteration step. We have
used a simple linear function of the iteration step:
.. math:: \tau = \tau_{\textrm{init}} \cdot \left(1 - {i \over n}\right),
where :math:`\tau_{\textrm{init}}` is the initial value of :math:`\tau` as
specified by the user, :math:`i` is the number of the current iteration
step, and :math:`n` is the total number of iteration steps to be
performed. While changes are made rapidly in the beginning of the
iteration, at the end of iteration only small changes are made.
All clusters within a radius :math:`R` of the closest cluster are
adjusted toward the gene under consideration. This radius decreases as
the calculation progresses as
.. math:: R = R_{\textrm{max}} \cdot \left(1 - {i \over n}\right),
in which the maximum radius is defined as
.. math:: R_{\textrm{max}} = \sqrt{N_x^2 + N_y^2},
where :math:`\left(N_x, N_y\right)` are the dimensions of the rectangle
defining the topology.
The function ``somcluster`` implements the complete algorithm to
calculate a Self-Organizing Map on a rectangular grid. First it
initializes the random number generator. The node data are then
initialized using the random number generator. The order in which genes
or samples are used to modify the SOM is also randomized. The total
number of iterations in the SOM algorithm is specified by the user.
To run ``somcluster``, use
.. code:: pycon
>>> from Bio.Cluster import somcluster
>>> clusterid, celldata = somcluster(data)
where the following arguments are defined:
- | ``data`` (required)
| Array containing the data for the items.
- | ``mask`` (default: ``None``)
| Array of integers showing which data are missing. If
``mask[i, j] == 0``, then ``data[i, j]`` is missing. If ``mask`` is
``None``, then all data are present.
- | ``weight`` (default: ``None``)
| contains the weights to be used when calculating distances. If
``weight`` is ``None``, then equal weights are assumed.
- | ``transpose`` (default: ``0``)
| Determines if rows (``transpose`` is ``0``) or columns
(``transpose`` is ``1``) are to be clustered.
- | ``nxgrid, nygrid`` (default: ``2, 1``)
| The number of cells horizontally and vertically in the rectangular
grid on which the Self-Organizing Map is calculated.
- | ``inittau`` (default: ``0.02``)
| The initial value for the parameter :math:`\tau` that is used in
the SOM algorithm. The default value for ``inittau`` is 0.02, which
was used in Michael Eisen’s Cluster/TreeView program.
- | ``niter`` (default: ``1``)
| The number of iterations to be performed.
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
This function returns the tuple ``(clusterid, celldata)``:
- | ``clusterid``:
| An array with two columns, where the number of rows is equal to the
number of items that were clustered. Each row contains the
:math:`x` and :math:`y` coordinates of the cell in the rectangular
SOM grid to which the item was assigned.
- | ``celldata``:
| An array with dimensions
:math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of columns}\right)`
if rows are being clustered, or
:math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of rows}\right)`
if columns are being clustered. Each element ``[ix][iy]`` of this
array is a 1D vector containing the gene expression data for the
centroid of the cluster in the grid cell with coordinates
``[ix][iy]``.
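For example, to calculate a Self-Organizing Map on a 3×3 grid (a
sketch; the parameter values are illustrative):

.. code:: pycon

   >>> from Bio.Cluster import somcluster
   >>> clusterid, celldata = somcluster(data, nxgrid=3, nygrid=3, niter=100)
   >>> x, y = clusterid[0]  # grid cell to which the first item was assigned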
Principal Component Analysis
----------------------------
Principal Component Analysis (PCA) is a widely used technique for
analyzing multivariate data. A practical example of applying Principal
Component Analysis to gene expression data is presented by Yeung and
Ruzzo (2001) [Yeung2001]_.
In essence, PCA is a coordinate transformation in which each row in the
data matrix is written as a linear sum over basis vectors called
principal components, which are ordered and chosen such that each
maximally explains the remaining variance in the data vectors. For
example, an :math:`n \times 3` data matrix can be represented as an
ellipsoidal cloud of :math:`n` points in three dimensional space. The
first principal component is the longest axis of the ellipsoid, the
second principal component the second longest axis of the ellipsoid, and
the third principal component is the shortest axis. Each row in the data
matrix can be reconstructed as a suitable linear combination of the
principal components. However, in order to reduce the dimensionality of
the data, usually only the most important principal components are
retained. The remaining variance present in the data is then regarded as
unexplained variance.
The principal components can be found by calculating the eigenvectors of
the covariance matrix of the data. The corresponding eigenvalues
determine how much of the variance present in the data is explained by
each principal component.
Before applying principal component analysis, typically the mean is
subtracted from each column in the data matrix. In the example above,
this effectively centers the ellipsoidal cloud around its centroid in 3D
space, with the principal components describing the variation of points
in the ellipsoidal cloud with respect to their centroid.
The function ``pca`` below first uses the singular value decomposition
to calculate the eigenvalues and eigenvectors of the data matrix. The
singular value decomposition is implemented as a translation in C of the
Algol procedure ``svd`` [Golub1971]_, which uses
Householder bidiagonalization and a variant of the QR algorithm. The
principal components, the coordinates of each data vector along the
principal components, and the eigenvalues corresponding to the principal
components are then evaluated and returned in decreasing order of the
magnitude of the eigenvalue. If data centering is desired, the mean
should be subtracted from each column in the data matrix before calling
the ``pca`` routine.
To apply Principal Component Analysis to a rectangular matrix ``data``,
use
.. code:: pycon
>>> from Bio.Cluster import pca
>>> columnmean, coordinates, components, eigenvalues = pca(data)
This function returns a tuple
``columnmean, coordinates, components, eigenvalues``:
- | ``columnmean``
| Array containing the mean over each column in ``data``.
- | ``coordinates``
| The coordinates of each row in ``data`` with respect to the
principal components.
- | ``components``
| The principal components.
- | ``eigenvalues``
| The eigenvalues corresponding to each of the principal components.
The original matrix ``data`` can be recreated by calculating
``columnmean + dot(coordinates, components)``.
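This reconstruction can be checked numerically (a sketch with made-up
data):

.. code:: pycon

   >>> import numpy as np
   >>> from Bio.Cluster import pca
   >>> data = np.array([[1.0, 2.0, 3.0],
   ...                  [4.0, 5.0, 7.0],
   ...                  [9.0, 8.0, 4.0]])  # made-up values
   >>> columnmean, coordinates, components, eigenvalues = pca(data)
   >>> reconstructed = columnmean + np.dot(coordinates, components)
   >>> print(np.allclose(data, reconstructed))
   True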
Handling Cluster/TreeView-type files
------------------------------------
Cluster/TreeView are GUI-based programs for clustering gene expression
data. They were originally written by `Michael
Eisen <http://rana.lbl.gov>`__ while at Stanford University
[Eisen1998]_. ``Bio.Cluster`` contains functions for
reading and writing data files that correspond to the format specified
for Cluster/TreeView. In particular, by saving a clustering result in
that format, TreeView can be used to visualize the clustering results.
We recommend using Alok Saldanha’s Java TreeView program
(http://jtreeview.sourceforge.net/)
[Saldanha2004]_, which can display hierarchical as well
as :math:`k`-means clustering results.
An object of the class ``Record`` contains all information stored in a
Cluster/TreeView-type data file. To store the information contained in
the data file in a ``Record`` object, we first open the file and then
read it:
.. code:: pycon
>>> from Bio import Cluster
>>> with open("mydatafile.txt") as handle:
... record = Cluster.read(handle)
...
This two-step process gives you some flexibility in the source of the
data. For example, you can use
.. code:: pycon
>>> import gzip # Python standard library
>>> handle = gzip.open("mydatafile.txt.gz", "rt")
to open a gzipped file, or
.. code:: pycon
>>> from urllib.request import urlopen
>>> from io import TextIOWrapper
>>> url = "https://raw.githubusercontent.com/biopython/biopython/master/Tests/Cluster/cyano.txt"
>>> handle = TextIOWrapper(urlopen(url))
to open a file stored on the Internet before calling ``read``.
The ``read`` command reads the tab-delimited text file
``mydatafile.txt`` containing gene expression data in the format
specified for Michael Eisen’s Cluster/TreeView program. In this file
format, rows represent genes and columns represent samples or
observations. For a simple time course, a minimal input file would look
like this:
.. container:: center
======= ========= ========== ====== ======= =======
YORF    0 minutes 30 minutes 1 hour 2 hours 4 hours
YAL001C 1         1.3        2.4    5.8     2.4
YAL002W 0.9       0.8        0.7    0.5     0.2
YAL003W 0.8       2.1        4.2    10.1    10.1
YAL005C 1.1       1.3        0.8            0.4
YAL010C 1.2       1          1.1    4.5     8.3
======= ========= ========== ====== ======= =======
Each row (gene) has an identifier that always goes in the first column.
In this example, we are using yeast open reading frame codes. Each
column (sample) has a label in the first row. In this example, the
labels describe the time at which a sample was taken. The first column
of the first row contains a special field that tells the program what
kind of objects are in each row. In this case, YORF stands for yeast
open reading frame. This field can be any alphanumeric value. The
remaining cells in the table contain data for the appropriate gene and
sample. The 5.8 in the YAL001C row under the 2 hours column means that
the observed value for gene YAL001C at 2 hours was 5.8. Missing values
are acceptable and are designated by empty cells (e.g., YAL005C at 2
hours).
The input file may contain additional information. A maximal input file
would look like this:
.. container:: center
======= ========================== ======= ====== === === === ==== ====
YORF    NAME                       GWEIGHT GORDER 0   30  1   2    4
EWEIGHT                                           1   1   1   1    0
EORDER                                            5   3   2   1    1
YAL001C TFIIIC 138 KD SUBUNIT      1       1      1   1.3 2.4 5.8  2.4
YAL002W UNKNOWN                    0.4     3      0.9 0.8 0.7 0.5  0.2
YAL003W ELONGATION FACTOR EF1-BETA 0.4     2      0.8 2.1 4.2 10.1 10.1
YAL005C CYTOSOLIC HSP70            0.4     5      1.1 1.3 0.8      0.4
======= ========================== ======= ====== === === === ==== ====
The added columns NAME, GWEIGHT, and GORDER and rows EWEIGHT and EORDER
are optional. The NAME column allows you to specify a label for each
gene that is distinct from the ID in column 1.
A ``Record`` object has the following attributes:
- | ``data``
| The data array containing the gene expression data. Genes are
stored row-wise, while samples are stored column-wise.
- | ``mask``
| This array shows which elements in the ``data`` array, if any, are
missing. If ``mask[i, j] == 0``, then ``data[i, j]`` is missing. If
no data were found to be missing, ``mask`` is set to ``None``.
- | ``geneid``
| This is a list containing a unique description for each gene (e.g.,
ORF numbers).
- | ``genename``
| This is a list containing a description for each gene (e.g., the
gene name). If not present in the data file, ``genename`` is set to
``None``.
- | ``gweight``
| The weights that are to be used to calculate the distance in
expression profile between genes. If not present in the data file,
``gweight`` is set to ``None``.
- | ``gorder``
| The preferred order in which genes should be stored in an output
file. If not present in the data file, ``gorder`` is set to
``None``.
- | ``expid``
| This is a list containing a description of each sample, e.g.
experimental condition.
- | ``eweight``
| The weights that are to be used to calculate the distance in
expression profile between samples. If not present in the data
file, ``eweight`` is set to ``None``.
- | ``eorder``
| The preferred order in which samples should be stored in an output
file. If not present in the data file, ``eorder`` is set to
``None``.
- | ``uniqid``
| The string that was used instead of UNIQID in the data file.
After loading a ``Record`` object, each of these attributes can be
accessed and modified directly. For example, the data can be
log-transformed by taking the logarithm of ``record.data``.
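For example, assuming NumPy is available and a log-transform is
appropriate for your data, this could be done as follows (a sketch):
.. code:: pycon
>>> import numpy as np
>>> record.data = np.log(record.data)  # log-transform all values in place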
Calculating the distance matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To calculate the distance matrix between the items stored in the record,
use
.. code:: pycon
>>> matrix = record.distancematrix()
where the following arguments are defined:
- | ``transpose`` (default: ``0``)
| Determines if the distances between the rows of ``data`` are to be
calculated (``transpose`` is ``False``), or between the columns of
``data`` (``transpose`` is ``True``).
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
This function returns the distance matrix as a list of rows, where the
number of columns of each row is equal to the row number (see section
:ref:`sec:distancematrix`).
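For example, to use the Pearson correlation distance instead of the
default (a sketch, reusing the ``record`` loaded above):
.. code:: pycon
>>> matrix = record.distancematrix(dist="c")  # 'c': Pearson correlation
>>> d = matrix[2][1]  # distance between items 1 and 2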
Calculating the cluster centroids
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To calculate the centroids of clusters of items stored in the record,
use
.. code:: pycon
>>> cdata, cmask = record.clustercentroids()
where the following arguments are defined:
- | ``clusterid`` (default: ``None``)
| Vector of integers showing to which cluster each item belongs. If
``clusterid`` is not given, then all items are assumed to belong to
the same cluster.
- | ``method`` (default: ``'a'``)
| Specifies whether the arithmetic mean (``method=='a'``) or the
median (``method=='m'``) is used to calculate the cluster center.
- | ``transpose`` (default: ``0``)
| Determines if the centroids of the rows of ``data`` are to be
calculated (``transpose`` is ``False``), or the centroids of the
columns of ``data`` (``transpose`` is ``True``).
This function returns the tuple ``cdata, cmask``; see section
:ref:`sec:clustercentroids` for a description.
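For example, the centroids of a :math:`k`-means partition can be
obtained by combining this method with ``kcluster``, described below (a
sketch):
.. code:: pycon
>>> clusterid, error, nfound = record.kcluster(nclusters=3)
>>> cdata, cmask = record.clustercentroids(clusterid=clusterid)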
.. _calculating-the-distance-between-clusters-1:
Calculating the distance between clusters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To calculate the distance between clusters of items stored in the
record, use
.. code:: pycon
>>> distance = record.clusterdistance()
where the following arguments are defined:
- | ``index1`` (default: ``0``)
| A list containing the indices of the items belonging to the first
cluster. A cluster containing only one item :math:`i` can be
represented either as a list ``[i]``, or as an integer ``i``.
- | ``index2`` (default: ``0``)
| A list containing the indices of the items belonging to the second
cluster. A cluster containing only one item :math:`i` can be
represented either as a list ``[i]``, or as an integer ``i``.
- | ``method`` (default: ``'a'``)
| Specifies how the distance between clusters is defined:
- ``'a'``: Distance between the two cluster centroids (arithmetic
mean);
- ``'m'``: Distance between the two cluster centroids (median);
- ``'s'``: Shortest pairwise distance between items in the two
clusters;
- ``'x'``: Longest pairwise distance between items in the two
clusters;
- ``'v'``: Average over the pairwise distances between items in the
two clusters.
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
- | ``transpose`` (default: ``0``)
| If ``transpose`` is ``False``, calculate the distance between the
rows of ``data``. If ``transpose`` is ``True``, calculate the
distance between the columns of ``data``.
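As a sketch, the longest pairwise distance between a cluster holding
items 0 and 1 and a cluster holding items 2 and 3 would be calculated
as:
.. code:: pycon
>>> distance = record.clusterdistance(index1=[0, 1], index2=[2, 3], method="x")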
.. _performing-hierarchical-clustering-1:
Performing hierarchical clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To perform hierarchical clustering on the items stored in the record,
use
.. code:: pycon
>>> tree = record.treecluster()
where the following arguments are defined:
- | ``transpose`` (default: ``0``)
| Determines if genes (rows; ``transpose`` is ``False``) or samples
(columns; ``transpose`` is ``True``) are to be clustered.
- | ``method`` (default: ``'m'``)
| Defines the linkage method to be used:
- ``method=='s'``: pairwise single-linkage clustering
- ``method=='m'``: pairwise maximum- (or complete-) linkage
clustering
- ``method=='c'``: pairwise centroid-linkage clustering
- ``method=='a'``: pairwise average-linkage clustering
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
This function returns a ``Tree`` object. This object contains
:math:`\left(\textrm{number of items} - 1\right)` nodes, where the
number of items is the number of rows if rows were clustered, or the
number of columns if columns were clustered. Each node describes a
pairwise linking event, where the node attributes ``left`` and ``right``
each contain the number of one item or subnode, and ``distance`` the
distance between them. Items are numbered from 0 to
:math:`\left(\textrm{number of items} - 1\right)`, while clusters are
numbered -1 to :math:`-\left(\textrm{number of items}-1\right)`.
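For example, the first linking event can be inspected through its node
attributes (a sketch; indexing the ``Tree`` by node number as shown
assumes the interface described above):
.. code:: pycon
>>> tree = record.treecluster(method="s")
>>> node = tree[0]  # first pairwise linking event
>>> joined = (node.left, node.right)  # the two items or subnodes joined
>>> height = node.distance  # the distance between them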
Performing :math:`k`-means or :math:`k`-medians clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To perform :math:`k`-means or :math:`k`-medians clustering on the items
stored in the record, use
.. code:: pycon
>>> clusterid, error, nfound = record.kcluster()
where the following arguments are defined:
- | ``nclusters`` (default: ``2``)
| The number of clusters :math:`k`.
- | ``transpose`` (default: ``0``)
| Determines if rows (``transpose`` is ``0``) or columns
(``transpose`` is ``1``) are to be clustered.
- | ``npass`` (default: ``1``)
| The number of times the :math:`k`-means/-medians clustering
algorithm is performed, each time with a different (random) initial
condition. If ``initialid`` is given, the value of ``npass`` is
ignored and the clustering algorithm is run only once, as it
behaves deterministically in that case.
- | ``method`` (default: ``'a'``)
| Describes how the center of a cluster is found:
- ``method=='a'``: arithmetic mean (:math:`k`-means clustering);
- ``method=='m'``: median (:math:`k`-medians clustering).
For other values of ``method``, the arithmetic mean is used.
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
This function returns a tuple ``(clusterid, error, nfound)``, where
``clusterid`` is an integer array containing the number of the cluster
to which each row or column was assigned, ``error`` is the
within-cluster sum of distances for the optimal clustering solution, and
``nfound`` is the number of times this optimal solution was found.
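For example, a :math:`k`-medians clustering into four clusters with 100
random restarts could be requested as follows (a sketch):
.. code:: pycon
>>> clusterid, error, nfound = record.kcluster(nclusters=4, method="m", npass=100)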
Calculating a Self-Organizing Map
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To calculate a Self-Organizing Map of the items stored in the record,
use
.. code:: pycon
>>> clusterid, celldata = record.somcluster()
where the following arguments are defined:
- | ``transpose`` (default: ``0``)
| Determines if rows (``transpose`` is ``0``) or columns
(``transpose`` is ``1``) are to be clustered.
- | ``nxgrid, nygrid`` (default: ``2, 1``)
| The number of cells horizontally and vertically in the rectangular
grid on which the Self-Organizing Map is calculated.
- | ``inittau`` (default: ``0.02``)
| The initial value for the parameter :math:`\tau` that is used in
the SOM algorithm. The default value for ``inittau`` is 0.02, which
was used in Michael Eisen’s Cluster/TreeView program.
- | ``niter`` (default: ``1``)
| The number of iterations to be performed.
- | ``dist`` (default: ``'e'``, Euclidean distance)
| Defines the distance function to be used (see
:ref:`sec:distancefunctions`).
This function returns the tuple ``(clusterid, celldata)``:
- | ``clusterid``:
| An array with two columns, where the number of rows is equal to the
number of items that were clustered. Each row contains the
:math:`x` and :math:`y` coordinates of the cell in the rectangular
SOM grid to which the item was assigned.
- | ``celldata``:
| An array with dimensions
:math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of columns}\right)`
if rows are being clustered, or
:math:`\left(\verb|nxgrid|, \verb|nygrid|, \textrm{number of rows}\right)`
if columns are being clustered. Each element ``[ix][iy]`` of this
array is a 1D vector containing the gene expression data for the
centroid of the cluster in the grid cell with coordinates
``[ix][iy]``.
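For example, a :math:`3 \times 3` map trained for 100 iterations could
be calculated as follows (a sketch using the parameters described
above):
.. code:: pycon
>>> clusterid, celldata = record.somcluster(nxgrid=3, nygrid=3, niter=100)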
Saving the clustering result
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To save the clustering result, use
.. code:: pycon
>>> record.save(jobname, geneclusters, expclusters)
where the following arguments are defined:
- | ``jobname``
| The string ``jobname`` is used as the base name for names of the
files that are to be saved.
- | ``geneclusters``
| This argument describes the gene (row-wise) clustering result. In
case of :math:`k`-means clustering, this is a 1D array containing
the number of the cluster each gene belongs to. It can be
calculated using ``kcluster``. In case of hierarchical clustering,
``geneclusters`` is a ``Tree`` object.
- | ``expclusters``
| This argument describes the (column-wise) clustering result for the
experimental conditions. In case of :math:`k`-means clustering,
this is a 1D array containing the number of the cluster each
experimental condition belongs to. It can be calculated using
``kcluster``. In case of hierarchical clustering, ``expclusters``
is a ``Tree`` object.
This method writes the text files ``jobname.cdt``, ``jobname.gtr``,
``jobname.atr``, ``jobname*.kgg``, and/or ``jobname*.kag`` for
subsequent reading by the Java TreeView program. If ``geneclusters`` and
``expclusters`` are both ``None``, this method only writes the text file
``jobname.cdt``; this file can subsequently be read into a new
``Record`` object.
Example calculation
-------------------
This is an example of a hierarchical clustering calculation, using
single linkage clustering for genes and maximum linkage clustering for
experimental conditions. As the Euclidean distance is being used for
gene clustering, it is necessary to scale the node distances in
``genetree`` such that they are all between zero and one. This is needed
for the Java TreeView code to display the tree diagram correctly. To
cluster the experimental conditions, the uncentered correlation is being
used. No scaling is needed in this case, as the distances in ``exptree``
are already between zero and two.
The example data ``cyano.txt`` can be found in Biopython’s
``Tests/Cluster`` subdirectory and is from the paper
Hihara *et al.* 2001 [Hihara2001]_.
.. doctest ../Tests/Cluster lib:numpy
.. code:: pycon
>>> from Bio import Cluster
>>> with open("cyano.txt") as handle:
... record = Cluster.read(handle)
...
>>> genetree = record.treecluster(method="s")
>>> genetree.scale()
>>> exptree = record.treecluster(dist="u", transpose=1)
>>> record.save("cyano_result", genetree, exptree)
This will create the files ``cyano_result.cdt``, ``cyano_result.gtr``,
and ``cyano_result.atr``.
Similarly, we can save a :math:`k`-means clustering solution:
.. doctest ../Tests/Cluster lib:numpy
.. code:: pycon
>>> from Bio import Cluster
>>> with open("cyano.txt") as handle:
... record = Cluster.read(handle)
...
>>> (geneclusters, error, ifound) = record.kcluster(nclusters=5, npass=1000)
>>> (expclusters, error, ifound) = record.kcluster(nclusters=2, npass=100, transpose=1)
>>> record.save("cyano_result", geneclusters, expclusters)
This will create the files ``cyano_result_K_G2_A2.cdt``,
``cyano_result_K_G2.kgg``, and ``cyano_result_K_A2.kag``.