<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.6, http://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>KMeans (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)</title>

<meta name="description" content="KMeans (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="keywords" content="KMeans (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<link href="index.html#Top" rel="start" title="Top">
<link href="Contents.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Cluster.html#Cluster" rel="up" title="Cluster">
<link href="SOM.html#SOM" rel="next" title="SOM">
<link href="Hierarchical.html#Hierarchical" rel="prev" title="Hierarchical">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>


</head>

<body lang="en">
<span id="KMeans"></span><div class="header">
<p>
Next: <a href="SOM.html#SOM" accesskey="n" rel="next">SOM</a>, Previous: <a href="Hierarchical.html#Hierarchical" accesskey="p" rel="prev">Hierarchical</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> &nbsp; [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>

<span id="The-k_002dmeans-Clustering-Algorithm"></span><h3 class="section">4.2 The <em>k</em>-means Clustering Algorithm</h3>

<img src="images/kmeans.png" alt="images/kmeans">

<p>The
<i>k</i>-means
 clustering algorithm is a simple but popular form of cluster analysis. The basic idea is that you start with a collection of items (e.g., genes) and some chosen number of clusters
(<i>k</i>)
 you want to find. The items are initially assigned to clusters at random. The
<i>k</i>-means
 clustering then proceeds by repeated application of a two-step process (see the sketch after this list) in which
</p><ol>
<li> the mean vector for all items in each cluster is computed;
</li><li> items are reassigned to the cluster whose center is closest to the item.
</li></ol>
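<p>To make the two-step process concrete, here is a minimal sketch in plain Python. It is illustrative only, not the optimized C routine that Cluster 3.0 actually uses, and the function name <code>kmeans_once</code> is hypothetical.
</p>
<div class="example">
<pre class="example">import random

def kmeans_once(items, k, max_iter=100):
    """One k-means run: items is a list of equal-length numeric vectors."""
    # Step 0: randomly assign each item to one of the k clusters.
    assignment = [random.randrange(k) for _ in items]
    for _ in range(max_iter):
        # Step 1: compute the mean vector of the items in each cluster.
        means = []
        for c in range(k):
            members = [x for x, a in zip(items, assignment) if a == c]
            if not members:
                members = items  # fallback for an empty cluster
            dim = len(members[0])
            means.append([sum(x[d] for x in members) / len(members)
                          for d in range(dim)])
        # Step 2: reassign each item to the cluster whose mean is closest
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((xd - md) ** 2
                                  for xd, md in zip(x, means[c])))
            for x in items
        ]
        if new_assignment == assignment:  # no item moved: converged
            break
        assignment = new_assignment
    return assignment
</pre></div>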

<p>Since the initial cluster assignment is random, different runs of the
<i>k</i>-means
 clustering algorithm may not give the same final clustering solution. To deal with this, the 
<i>k</i>-means
 clustering algorithm is repeated many times, each time starting from a different initial clustering. The sum of distances within the clusters is used to compare clustering solutions; the solution with the smallest sum of within-cluster distances is saved.
</p>
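<p>With the Pycluster interface that is distributed alongside Cluster 3.0, this repeated-trials strategy is a single call. The sketch below assumes Pycluster and NumPy are installed and uses random toy data; <code>npass</code> is the number of trials, and the call returns the best solution found.
</p>
<div class="example">
<pre class="example">from Pycluster import kcluster
import numpy

data = numpy.random.rand(100, 5)   # toy data: 100 genes, 5 arrays

# Run k-means 50 times from random initial clusterings; kcluster keeps
# the solution with the smallest within-cluster sum of distances.
clusterid, error, nfound = kcluster(data, nclusters=4, npass=50,
                                    method='a',  # 'a' = arithmetic mean
                                    dist='e')    # 'e' = Euclidean distance

# error:  within-cluster sum of distances of the best solution found
# nfound: how many times that optimal solution was found (the number
#         Cluster 3.0 reports in its status bar)
</pre></div>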
<p>The number of runs that should be done depends on how difficult it is to find the optimal solution, which in turn depends on the number of genes involved. Cluster therefore shows in the status bar how many times the optimal solution has been found. If this number is one, there may be a clustering solution with an even lower sum of within-cluster distances. The
<i>k</i>-means
 clustering algorithm should then be repeated with more trials. If the optimal solution is found many times, the solution that has been found is likely to have the lowest possible within-cluster sum of distances. We can then assume that the
<i>k</i>-means
 clustering procedure has found the overall optimal clustering solution.
</p>
<p>It should be noted that, in general, the
<i>k</i>-means
 clustering algorithm finds a clustering solution with a smaller within-cluster sum of distances than the hierarchical clustering techniques. 
</p>
<p>The parameters that control
<i>k</i>-means
 clustering are
</p><ul>
<li> the number of clusters
(<i>k</i>);
</li><li> the number of trials.
</li></ul>


<p>The output is simply an assignment of items to clusters. The implementation here rearranges the rows and/or columns based on the cluster to which they were assigned.
The output data file is <samp><var>JobName</var>_K_GKg_AKa.cdt</samp>, where <samp>_GKg</samp> is included if genes were organized, and <samp>_AKa</samp> is included if arrays were organized. Here, <samp>Kg</samp> and <samp>Ka</samp> represent the number of clusters for gene clustering and array clustering, respectively. This file contains the gene expression data, organized by cluster by rearranging the rows and columns. In addition, the files <samp><var>JobName</var>_K_GKg.kgg</samp> and <samp><var>JobName</var>_K_AKa.kag</samp> are created, containing a list of genes/arrays and the cluster to which each was assigned. For example, clustering genes into five clusters and arrays into three produces <samp><var>JobName</var>_K_G5_A3.cdt</samp>, together with <samp><var>JobName</var>_K_G5.kgg</samp> and <samp><var>JobName</var>_K_A3.kag</samp>.
</p>
<p>Whereas 
<i>k</i>-means
 clustering as implemented in Cluster 3.0 allows any of the eight distance measures to be used, we recommend using the Euclidean distance or city-block distance instead of the distance measures based on the Pearson correlation, for the same reason as in the case of pairwise centroid-linkage hierarchical clustering. The distance measures based on the Pearson correlation effectively normalize the data vectors when calculating the distance, whereas no normalization is used when calculating the cluster centroid. To use
<i>k</i>-means
 clustering with a distance measure based on the Pearson correlation, it is better to first normalize the data appropriately (using the &quot;Adjust Data&quot; tab) before running the
<i>k</i>-means
 algorithm.
</p>
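<p>As an illustration of this advice, one simple normalization is to standardize each gene (row) to mean 0 and standard deviation 1 before clustering with the Euclidean distance: for standardized vectors of length <i>n</i>, the squared Euclidean distance equals 2<i>n</i>(1&nbsp;&minus;&nbsp;<i>r</i>), where <i>r</i> is the Pearson correlation, so this is effectively correlation-based clustering with a consistent centroid step. The sketch below assumes NumPy; the exact options offered by the &quot;Adjust Data&quot; tab may differ.
</p>
<div class="example">
<pre class="example">import numpy

def standardize_rows(data):
    """Scale each row of `data` to mean 0 and standard deviation 1."""
    centered = data - data.mean(axis=1, keepdims=True)
    return centered / centered.std(axis=1, keepdims=True)

normalized = standardize_rows(numpy.random.rand(100, 5))
# ...then cluster `normalized` with dist='e' as in the kcluster example
# above, instead of clustering the raw data with a correlation distance.
</pre></div>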
<p>Cluster also implements a slight variation on
<i>k</i>-means
 clustering, known as
<i>k</i>-medians
 clustering, in which the median instead of the mean of the items in a cluster is used. In a theoretical sense, it is best to use
<i>k</i>-means
 with the Euclidean distance, and
<i>k</i>-medians
 with the city-block distance.
</p>
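<p>The only change relative to the <i>k</i>-means sketch above is the center computation, as in the following hypothetical helper; in the <code>kcluster</code> call shown earlier, passing <code>method='m'</code> selects the median instead of the mean.
</p>
<div class="example">
<pre class="example">import statistics

def cluster_center(members, use_median=False):
    """Per-dimension mean of a cluster, or median for k-medians.

    The median minimizes the city-block (L1) distance, just as the mean
    minimizes the squared Euclidean distance, which is why each center
    is paired with its matching distance measure.
    """
    dims = range(len(members[0]))
    if use_median:
        return [statistics.median(x[d] for x in members) for d in dims]
    return [sum(x[d] for x in members) / len(members) for d in dims]
</pre></div>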
<hr>
<div class="header">
<p>
Next: <a href="SOM.html#SOM" accesskey="n" rel="next">SOM</a>, Previous: <a href="Hierarchical.html#Hierarchical" accesskey="p" rel="prev">Hierarchical</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> &nbsp; [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>



</body>
</html>