1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142
|
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.6, http://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>PCA (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)</title>
<meta name="description" content="PCA (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="keywords" content="PCA (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<link href="index.html#Top" rel="start" title="Top">
<link href="Contents.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Cluster.html#Cluster" rel="up" title="Cluster">
<link href="Command.html#Command" rel="next" title="Command">
<link href="SOM.html#SOM" rel="prev" title="SOM">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>
</head>
<body lang="en">
<span id="PCA"></span><div class="header">
<p>
Previous: <a href="SOM.html#SOM" accesskey="p" rel="prev">SOM</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>
<span id="Principal-Component-Analysis"></span><h3 class="section">4.4 Principal Component Analysis</h3>
<img src="images/pca.png" alt="images/pca">
<p>Principal Component Analysis (PCA) is a widely used technique for analyzing multivariate data. A practical example of applying Principal Component Analysis to gene expression data is presented by Yeung and Ruzzo (2001).
</p>
<p>In essence, PCA is a coordinate transformation in which each row in the data matrix is written as a linear sum over basis vectors called principal components, which are ordered and chosen such that each maximally explains the remaining variance in the data vectors. For example, an <em>n \times 3</em> data matrix can be represented as an ellipsoidal cloud of <em>n</em> points in three dimensional space. The first principal component is the longest axis of the ellipsoid, the second principal component the second longest axis of the ellipsoid, and the third principal component is the shortest axis. Each row in the data matrix can be reconstructed as a suitable linear combination of the principal components. However, in order to reduce the dimensionality of the data, usually only the most important principal components are retained. The remaining variance present in the data is then regarded as unexplained variance.
</p>
<p>The principal components can be found by calculating the eigenvectors of the covariance matrix of the data. The corresponding eigenvalues determine how much of the variance present in the data is explained by each principal component.
</p>
<p>Before applying PCA, typically the mean is subtracted from each column in the data matrix. In the example above, this effectively centers the ellipsoidal cloud around its centroid in 3D space, with the principal components describing the variation of poins in the ellipsoidal cloud with respect to their centroid.
</p>
<p>In Cluster, you can apply PCA to the rows (genes) of the data matrix, or to the columns (microarrays) of the data matrix. In each case, the output consists of two files. When applying PCA to genes, the names of the output files are <samp><var>JobName</var>_pca_gene.pc.txt</samp> and <samp><var>JobName</var>_pca_gene.coords.txt</samp>, where the former contains contains the principal components, and the latter contains the coordinates of each row in the data matrix with respect to the principal components. When applying PCA to the columns in the data matrix, the respective file names are <samp><var>JobName</var>_pca_array.pc.txt</samp> and <samp><var>JobName</var>_pca_array.coords.txt</samp>. The original data matrix can be recovered from the principal components and the coordinates.
</p>
<p>As an example, consider this input file:
</p><table>
<tr><td><code>UNIQID</code></td><td><code>EXP1</code></td><td><code>EXP2</code></td><td><code>EXP3</code></td></tr>
<tr><td><code>GENE1</code></td><td><code>3</code></td><td><code>4</code></td><td><code>-2</code></td></tr>
<tr><td><code>GENE2</code></td><td><code>4</code></td><td><code>1</code></td><td><code>-3</code></td></tr>
<tr><td><code>GENE3</code></td><td><code>1</code></td><td><code>-8</code></td><td><code>7</code></td></tr>
<tr><td><code>GENE4</code></td><td><code>-6</code></td><td><code>6</code></td><td><code>4</code></td></tr>
<tr><td><code>GENE5</code></td><td><code>0</code></td><td><code>-3</code></td><td><code>8</code></td></tr>
</table>
<p>Applying PCA to the rows (genes) of the data in this input file generates a coordinate file containing
</p><table>
<tr><td><code>UNIQID</code></td><td><code>NAME</code></td><td><code>GWEIGHT</code></td><td><code> 13.513398</code></td><td><code>10.162987</code></td><td><code>2.025283</code></td></tr>
<tr><td><code>GENE1 </code></td><td><code>GENE1</code></td><td><code>1.000000</code></td><td><code> 6.280326</code></td><td><code>-2.404095</code></td><td><code>-0.760157</code></td></tr>
<tr><td><code>GENE2 </code></td><td><code>GENE2</code></td><td><code>1.000000</code></td><td><code> 4.720801</code></td><td><code>-4.995230</code></td><td><code> 0.601424</code></td></tr>
<tr><td><code>GENE3 </code></td><td><code>GENE3</code></td><td><code>1.000000</code></td><td><code> -8.755665</code></td><td><code>-2.117608</code></td><td><code> 0.924161</code></td></tr>
<tr><td><code>GENE4 </code></td><td><code>GENE4</code></td><td><code>1.000000</code></td><td><code> 3.443490</code></td><td><code> 8.133673</code></td><td><code> 0.621082</code></td></tr>
<tr><td><code>GENE5 </code></td><td><code>GENE5</code></td><td><code>1.000000</code></td><td><code> -5.688953</code></td><td><code> 1.383261</code></td><td><code>-1.386509</code></td></tr>
</table>
<p>where the first line shows the eigenvalues of the principal components, and a prinpical component file containing
</p><table>
<tr><td><code>EIGVALUE</code></td><td><code>EXP1</code></td><td><code>EXP2</code></td><td><code>EXP3</code></td></tr>
<tr><td><code>MEAN</code></td><td><code> 0.400000</code></td><td><code>0.000000</code></td><td><code> 2.800000</code></td></tr>
<tr><td><code>13.513398</code></td><td><code> 0.045493</code></td><td><code>0.753594</code></td><td><code>-0.655764</code></td></tr>
<tr><td><code>10.162987</code></td><td><code>-0.756275</code></td><td><code>0.454867</code></td><td><code> 0.470260</code></td></tr>
<tr><td><code>2.025283</code></td><td><code>-0.652670</code></td><td><code>-0.474545</code></td><td><code>-0.590617</code></td></tr>
</table>
<p>with the eigenvalues of the principal components shown in the first column. From this principal component decomposition, we can regenerate the original data matrix as follows:
<p>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr> <td> ⎛ </td> <td align=right> 6.280326 </td> <td width=80 align=right> -2.404095 </td> <td width=80 align=right> -0.760157 </td> <td> ⎞ </td> </tr>
<tr> <td> ⎜ </td> <td align=right> 4.720801 </td> <td align=right> -4.995230 </td> <td align=right> 0.601424 </td> <td> ⎟ </td> </tr>
<tr> <td> ⎜ </td> <td align=right> -8.755665 </td> <td align=right> -2.117608 </td> <td align=right> 0.924161 </td> <td> ⎟ </td> </tr>
<tr> <td> ⎜ </td> <td align=right> 3.443490 </td> <td align=right> 8.133673 </td> <td align=right> 0.621082 </td> <td> ⎟ </td> </tr>
<tr> <td> ⎝ </td> <td align=right> -5.688953 </td> <td align=right> 1.383261 </td> <td align=right> -1.386509 </td> <td> ⎠ </td> </tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr><td><br></td></tr>
<tr><td><br></td></tr>
<tr><td>·</td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr></tr>
<tr> <td> ⎛ </td> <td align=right> 0.045493 </td> <td width=80 align=right> 0.753594 </td> <td width=80 align=right> -0.655764</td> <td> ⎞ </td></tr>
<tr> <td> ⎜ </td> <td align=right> -0.756275 </td> <td align=right> 0.454867 </td> <td align=right> 0.470260 </td> <td> ⎟ </td> </tr>
<tr> <td> ⎝ </td> <td align=right> -0.652670 </td> <td align=right> -0.474545 </td> <td align=right> -0.590617 </td> <td> ⎠ </td> </tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr><td><br></td></tr>
<tr><td><br></td></tr>
<tr><td>+</td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr> <td> ⎛ </td> <td align=right> 0.4 </td> <td width=40 align=right> 0.0 </td> <td width=40 align=right> 2.8 </td> <td> ⎞ </td></tr>
<tr> <td> ⎜ </td> <td align=right> 0.4 </td> <td align=right> 0.0 </td> <td align=right> 2.8 </td> <td> ⎟ </td></tr>
<tr> <td> ⎜ </td> <td align=right> 0.4 </td> <td align=right> 0.0 </td> <td align=right> 2.8 </td> <td> ⎟ </td></tr>
<tr> <td> ⎜ </td> <td align=right> 0.4 </td> <td align=right> 0.0 </td> <td align=right> 2.8 </td> <td> ⎟ </td></tr>
<tr> <td> ⎝ </td> <td align=right> 0.4 </td> <td align=right> 0.0 </td> <td align=right> 2.8 </td> <td> ⎠ </td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr><td><br></td></tr>
<tr><td><br></td></tr>
<tr><td>=</td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr> <td> ⎛ </td> <td align=right> 3 </td> <td width=40 align=right> 4 </td> <td width=40 align=right> -2 </td> <td> ⎞ </td></tr>
<tr> <td> ⎜ </td> <td align=right> 4 </td> <td align=right> 1 </td> <td align=right> -3 </td> <td> ⎟ </td></tr>
<tr> <td> ⎜ </td> <td align=right> 1 </td> <td align=right> -8 </td> <td align=right> 7 </td> <td> ⎟ </td></tr>
<tr> <td> ⎜ </td> <td align=right> -6 </td> <td align=right> 6 </td> <td align=right> 4 </td> <td> ⎟ </td></tr>
<tr> <td> ⎝ </td> <td align=right> 0 </td> <td align=right> -3</td> <td align=right> 8 </td> <td> ⎠ </td></tr>
</table>
</p>
Note that the coordinate file <samp><var>JobName</var>_pca_gene.coords.txt</samp> is a valid input file to Cluster 3.0. Hence, it can be loaded into Cluster 3.0 for further analysis, possibly after removing columns with low eigenvalues.
</p>
<hr>
<div class="header">
<p>
Previous: <a href="SOM.html#SOM" accesskey="p" rel="prev">SOM</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
</body>
</html>
|