1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370
|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.6, http://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Hierarchical (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)</title>
<meta name="description" content="Hierarchical (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="keywords" content="Hierarchical (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<link href="index.html#Top" rel="start" title="Top">
<link href="Contents.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Cluster.html#Cluster" rel="up" title="Cluster">
<link href="KMeans.html#KMeans" rel="next" title="KMeans">
<link href="Cluster.html#Cluster" rel="prev" title="Cluster">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>
</head>
<body lang="en">
<span id="Hierarchical"></span><div class="header">
<p>
Next: <a href="KMeans.html#KMeans" accesskey="n" rel="next">KMeans</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>
<span id="Hierarchical-Clustering"></span><h3 class="section">4.1 Hierarchical Clustering</h3>
<img src="images/hierarchical.png" alt="images/hierarchical">
<p>The <em>Hierarchical Clustering</em> tab allows you to perform hierarchical clustering on your
data. This is a powerful and useful method for analyzing all sorts of large
genomic datasets. Many published applications of this analysis are given in the references
section at the end.
</p>
<p>Cluster currently performs four types of binary, agglomerative, hierarchical clustering.
The basic idea is to assemble a set of items (genes or arrays) into a tree, where items are
joined by very short branches if they are very similar to each other, and by increasingly
longer branches as their similarity decreases.
</p>
<p>The first step in hierarchical clustering is to calculate the distance matrix between the gene expression data. Once this matrix of distances is computed, the clustering begins.
Agglomerative hierarchical processing consists of repeated cycles where
the two closest remaining items (those with the smallest distance) are joined by a
node/branch of a tree, with the length of the branch set to the distance between the joined
items. The two joined items are removed from list of items being processed and replaced by a
item that represents the new branch. The distances between this new item and all other
remaining items are computed, and the process is repeated until only one item remains.
</p>
<p>Note that once clustering commences, we are working with items that are true items (e.g.
a single gene) and items that are pseudo-items that contain a number of true items. There
are a variety of ways to compute distances when we are dealing with pseudo-items, and
Cluster currently provides four choices, which are called centroid linkage, single linkage, complete linkage, and average linkage.
<strong>Note that in older versions of Cluster, centroid linkage was referred to as average linkage.</strong>
</p>
<span id="Centroid-Linkage-Clustering"></span><h4 class="subsection">4.1.1 Centroid Linkage Clustering</h4>
<p>If you click Centroid Linkage Clustering, a vector is assigned to each pseudo-item, and this vector is used to compute the distances between this pseudo-item and all remaining items or pseudo-items using the same similarity metric as was used to calculate the initial similarity matrix. The vector is the average of the vectors of all actual items (e.g. genes) contained within the pseudo-item. Thus, when a new branch of the tree is formed joining together a branch with 5 items and an actual item, the new pseudo-item is assigned a vector that is the average of the 6 vectors it contains, and not the average of the two joined items (note that missing values are not used in the average, and a pseudo-item can have a missing value if all of the items it contains are missing values in the corresponding row/column).
</p>
<p>Note that from a theoretical perspective, Centroid Linkage Clustering is peculiar if it is used in combination with one of the distance measures that are based on the Pearson correlation. For these distance measures, the data vectors are implicitly normalized when calculating the distance (for example, by subtracting the mean and dividing by the standard deviation when calculating the Pearson correlation. However, when two genes are joined and their centroid is calculated by averaging their data vectors, no normalization is applied. This may lead to the surprising result that distances may decrease when we go up in the tree representing the hierarchical clustering result. For example, consider this data set:
<DIV ALIGN=center>
<table cellpadding=5 cellspacing=0 border=1>
<th>
<td>Exp 1</td>
<td>Exp 2</td>
<td>Exp 3</td>
<td>Exp 4</td>
</th>
<tr align=center>
<td>Gene 1</td>
<td>0.96</td>
<td>0.07</td>
<td>0.97</td>
<td>0.98</td>
</tr>
<tr align=center>
<td>Gene 2</td>
<td>0.50</td>
<td>0.28</td>
<td>0.29</td>
<td>0.77</td>
</tr>
<tr align=center>
<td>Gene 3</td>
<td>0.08</td>
<td>0.96</td>
<td>0.51</td>
<td>0.51</td>
</tr>
<tr align=center>
<td>Gene 4</td>
<td>0.14</td>
<td>0.19</td>
<td>0.41</td>
<td>0.51</td>
</tr>
</table>
</DIV>
Performing pairwise centroid-linkage hierarchical clustering on this data set, using the Pearson distance as the distance measure, produces the clustering result
</p><ul>
<li> Gene 1 joins Gene 2 at distance 0.47
</li><li> (Gene 1, Gene 2) joins Gene 4 at distance 0.46
</li><li> (Gene 1, Gene 2, Gene 4) joins Gene 3 at distance 1.62
</li></ul>
<p>This may result in ill-formed dendrograms. For an example, see the Java TreeView manual. A solution is to use the Euclidean or the city-block distance, or to use one of the other hierarchical clustering routines, which don’t suffer from this issue regardless of the distance measure being used.
</p>
<span id="Single-Linkage-Clustering"></span><h4 class="subsection">4.1.2 Single Linkage Clustering</h4>
<p>In Single Linkage Clustering the distance between two items
<i>x</i>
and
<i>y</i>
is the minimum of all pairwise distances between items contained in
<i>x</i>
and
<i>y</i>.
Unlike centroid linkage clustering, in single linkage clustering no further distances need to be calculated once the distance matrix is known.
</p>
<p>In Cluster 3.0, as of version 1.29 the implementation of single linkage clustering is based on the SLINK algorithm (see Sibson, 1973). Whereas this algorithm yields the exact same clustering result as conventional single-linkage hierarchical clustering, it is much faster and more memory-efficient (being linear in the memory requirements, compared to quadratic for the conventional algorithm). Hence, single-linkage hierarchical clustering can be used to cluster large gene expression data sets, for which centroid-, complete-, and average-linkage fail due to lack of memory.
</p>
<span id="Complete-Linkage-Clustering"></span><h4 class="subsection">4.1.3 Complete Linkage Clustering</h4>
<p>In Complete Linkage Clustering the distance between two items
<i>x</i>
and
<i>y</i>
is the maximum of all pairwise distances between items contained in
<i>x</i>
and
<i>y</i>.
As in single linkage clustering, no other distances need to be calculated once the distance matrix is known.
</p>
<span id="Average-Linkage-Clustering"></span><h4 class="subsection">4.1.4 Average Linkage Clustering</h4>
<p>In average linkage clustering, the distance between two items
<i>x</i>
and
<i>y</i>
is the mean of all pairwise distances between items contained in
<i>x</i>
and
<i>y</i>.
</p>
<span id="Weighting"></span><h4 class="subsection">4.1.5 Weighting</h4>
<p>Weighting: By default, all of the observations for a given item are treated equally. In
some cases you may want to give some observations more weight than others. For
example, if you have duplicate copies of a gene on your array, you might want to
downweight each individual copy when computing distances between arrays. You can
specify weights using the ‘<samp>GWEIGHT</samp>’ (gene weight) and ‘<samp>EWEIGHT</samp>’ (experiment weight)
parameters in the input file. By default all weights are set to 1.0. Thus, the actual formula,
with weights included, for the Pearson correlation of
<i>x</i> = {<i>x</i><SUB>1</SUB>, <i>x</i><SUB>2</SUB>, ..., <i>x<SUB>n</SUB></i>}
and
<i>y</i> = {<i>y</i><SUB>1</SUB>, <i>y</i><SUB>2</SUB>, ..., <i>y<SUB>n</SUB></i>}
with observation weights of
<i>w</i> = {<i>w</i><SUB>1</SUB>, <i>w</i><SUB>2</SUB>, ..., <i>w<SUB>n</SUB></i>}
is
<DIV ALIGN=center><TABLE CELLSPACING=0 CELLPADDING=0>
<TR VALIGN=middle><TD NOWRAP><I>r</I> = </TD>
<TD NOWRAP><TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD NOWRAP ALIGN=center><TABLE CELLSPACING=0 CELLPADDING=0>
<TR VALIGN=middle>
<td align="center"><i>n</i><br>
<big><big><big>∑</big></big></big><small><br>
<i>i</i><small> </small>=<small> </small>1</small>
</td>
<TD NOWRAP> <I>w</I><SUB><I>i</I></SUB> </TD>
<TD NOWRAP><BIG><BIG><BIG><BIG><BIG>(</BIG></BIG></BIG></BIG></BIG></TD>
<TD NOWRAP>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD NOWRAP ALIGN=center>
<I>x</I><SUB><I>i</I></SUB> - <span style="text-decoration: overline;"><i>x</i> </span>
</TD>
</TR>
<TR>
<TD BGCOLOR=black><TABLE BORDER=0 WIDTH="100%" CELLSPACING=0 CELLPADDING=1><TR><TD></TD></TR></TABLE></TD>
</TR>
<TR>
<TD NOWRAP ALIGN=center>σ<SUB><I>x</I></SUB></TD>
</TR>
</TABLE>
</TD>
<TD NOWRAP><BIG><BIG><BIG><BIG><BIG>)</BIG></BIG></BIG></BIG></BIG></TD>
<TD NOWRAP><BIG><BIG><BIG><BIG><BIG>(</BIG></BIG></BIG></BIG></BIG></TD>
<TD NOWRAP><TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD NOWRAP ALIGN=center>
<I>y</I><SUB><I>i</I></SUB> - <span style="text-decoration: overline;"><i>y</i> </span>
</TD>
</TR>
<TR>
<TD BGCOLOR=black><TABLE BORDER=0 WIDTH="100%" CELLSPACING=0 CELLPADDING=1><TR><TD></TD></TR></TABLE></TD>
</TR>
<TR>
<TD NOWRAP ALIGN=center>σ<SUB><I>y</I></SUB></TD>
</TR></TABLE>
</TD>
<TD NOWRAP><BIG><BIG><BIG><BIG><BIG>)</BIG></BIG></BIG></BIG></BIG></TD>
</TR>
</TABLE></TD>
</TR>
<TR><TD BGCOLOR=black><TABLE BORDER=0 WIDTH="100%" CELLSPACING=0 CELLPADDING=1><TR><TD></TD></TR></TABLE></TD>
</TR>
<TR>
<TD NOWRAP ALIGN=center><TABLE CELLSPACING=0 CELLPADDING=0>
<TR VALIGN=middle>
<td align="center"><i>n</i><br>
<big><big><big>∑</big></big></big><small><br>
<i>i</i><small> </small>=<small> </small>1</small>
</td>
<TD NOWRAP> <I>w</I><SUB><I>i</I></SUB> </TD>
</TR>
</TABLE></TD>
</TR></TABLE></TD>
</TR></TABLE></DIV>
Note that when you are clustering rows (genes), you are using column (array) weights.
It is possible to compute weights as well based on a not entirely well understood function.
If you want to compute weights for clustering genes, select the check box in the Genes panel of the Hierarchical Clustering tab. <br>
</p>
<img src="images/weight.png" alt="images/weight">
<p>This will expose a Weight Options dialog box in the Arrays panel (I realize this
placement is a bit counterintuitive, but it makes sense as you will see below).
The idea behind the Calculate Weights option is to weight each row (the same idea
applies to columns as well) based on the local density of row vectors in its vicinity, with a
high density vicinity resulting in a low weight and a low density vicinity resulting in a
higher weight. This is implemented by assigning a local density score
<i>L</i>(<i>i</i>)
to each row
<i>i</i>.
<br>
<DIV ALIGN=center>
<table cellpadding=0 cellspacing=0>
<tr>
<td nowrap><i>L</i>(<i>i</i>) = </td>
<td></td>
<td align="center"><br>
<big><big><big>∑</big></big></big><small><br>
<i>j</i> with <i>d</i>(<i>i</i>,<i>j</i>) ≤ <i>k</i></small></td>
<td><big><big><big>(</big></big></big></td>
<td>
<table cellspacing=0 cellpadding=0>
<tr><td nowrap align=center><i>k</i> - <i>d</i>(<i>i</i>,<i>j</i>)
</td></tr>
<tr><td bgcolor=BLACK><table border=0 width="100%" cellspacing=0 cellpadding=1><tr><td></td></tr></table></td></tr>
<tr><td nowrap align=CENTER><i>k</i></td></tr>
</table>
</td>
<td><big><big><big>)</big></big></big></td>
<td align="center"><small><i>n</i><br><br><br></small></td>
</tr>
</table></div>
where the cutoff
<i>k</i>
and the exponent
<i>n</i>
are user supplied parameters. The weight for each row is
1/<i>L</i>.
Note that
<i>L</i>(<i>i</i>)
is always at least 1, since
<i>d</i>(<i>i</i>,<i>i</i>) = 0.
Each other row that is within the distance
<i>k</i>
of row
<i>i</i>
increases
<i>L</i>(<i>i</i>)
and decreases the weight. The larger
<i>d</i>(<i>i</i>,<i>j</i>),
the less
<i>L</i>(<i>i</i>)
is increased. Values of
<i>n</i>
greater than 1 mean that the contribution to
<i>L</i>(<i>i</i>)
drops off rapidly as
<i>d</i>(<i>i</i>,<i>j</i>)
increases.
</p>
<span id="Ordering-of-Output-File"></span><h4 class="subsection">4.1.6 Ordering of Output File</h4>
<p>The result of a clustering run is a tree or pair of trees (one for genes one for arrays).
However, to visualize the results in TreeView, it is necessary to use this tree to reorder the
rows and/or columns in the initial datatable. Note that if you simply draw all of the node
in the tree in the following manner, a natural ordering of items emerges:
<br>
<img src="images/order.png" alt="images/order">
</p>
<p>Thus, any tree can be used to generate an ordering. However, the ordering for any given
tree is not unique. There is a family of
2<SUP><i>N</i>-1</SUP>
ordering consistent with any tree of
<i>N</i>
items;
you can flip any node on the tree (exchange the bottom and top branches) and you will
get a new ordering that is equally consistent with the tree. By default, when Cluster joins
two items, it randomly places one item on the top branch and the other on the bottom
branch. It is possible to guide this process to generate the best ordering consistent with
a given tree. This is done by using the ‘<samp>GORDER</samp>’ (gene order) and ‘<samp>EORDER</samp>’ (experiment
order) parameters in the input file, or by running a self-organizing map (see section
below) prior to clustering. By default, Cluster sets the order parameter for each
row/column to 1. When a new node is created, Cluster compares the order parameters of
the two joined items, and places the item with the smaller order value on the top branch.
The order parameter for a node is the average of the order parameters of its members.
Thus, if you want the gene order produced by Cluster to be as close as possible (without
violating the structure of the tree) to the order in your input file, you use the ‘<samp>GORDER</samp>’
column, and assign a value that increases for each row. Note that order parameters do not
have to be unique.
</p>
<span id="Output-Files"></span><h4 class="subsection">4.1.7 Output Files</h4>
<p>Cluster writes up to three output files for each hierarchical clustering run. The root
filename of each file is whatever text you enter into the Job Name dialog box. When you
load a file, Job Name is set to the root filename of the input file. The three output files are
<samp><var>JobName</var>.cdt</samp>, <samp><var>JobName</var>.gtr</samp>, <samp><var>JobName</var>.atr</samp>
The <samp>.cdt</samp> (for clustered data table) file contains the original data with the rows and
columns reordered based on the clustering result. It is the same format as the input files,
except that an additional column and/or row is added if clustering is performed on genes
and/or arrays. This additional column/row contains a unique identifier for each
row/column that is linked to the description of the tree structure in the <samp>.gtr</samp> and <samp>.atr</samp> files.
The <samp>.gtr</samp> (gene tree) and <samp>.atr</samp> (array tree) files are tab-delimited text files that report on the
history of node joining in the gene or array clustering (note that these files are produced
only when clustering is performed on the corresponding axis). When clustering begins
each item to be clustered is assigned a unique identifier (e.g. ‘<samp>GENE1X</samp>’ or ‘<samp>ARRY42X</samp>’ — the
‘<samp>X</samp>’ is a relic from the days when this was written in Perl and substring searches were
used). These identifiers are added to the .cdt file. As each node is generated, it receives a
unique identifier as well, starting is ‘<samp>NODE1X</samp>’, ‘<samp>NODE2X</samp>’, etc. Each joining event is
stored in the <samp>.gtr</samp> or <samp>.atr</samp> file as a row with the node identifier, the identifiers of the two
joined elements, and the similarity score for the two joined elements. These files look like:<br>
<img src="images/tree.png" alt="images/tree">
</p><table>
<tr><td><code>NODE1X</code></td><td><code>GENE1X</code></td><td><code>GENE4X</code></td><td><code>0.98</code></td></tr>
<tr><td><code>NODE2X</code></td><td><code>GENE5X</code></td><td><code>GENE2X</code></td><td><code>0.80</code></td></tr>
<tr><td><code>NODE3X</code></td><td><code>NODE1X</code></td><td><code>GENE3X</code></td><td><code>0.72</code></td></tr>
<tr><td><code>NODE4X</code></td><td><code>NODE2X</code></td><td><code>NODE3X</code></td><td><code>0.60</code></td></tr>
</table>
<br>
<p>The <samp>.gtr</samp> and/or <samp>.atr</samp> files are automatically read in TreeView when you open the
corresponding <samp>.cdt</samp> file.
</p>
<hr>
<div class="header">
<p>
Next: <a href="KMeans.html#KMeans" accesskey="n" rel="next">KMeans</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
</body>
</html>
|