<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.6, http://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>PCA (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)</title>

<meta name="description" content="PCA (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="keywords" content="PCA (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<link href="index.html#Top" rel="start" title="Top">
<link href="Contents.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="Cluster.html#Cluster" rel="up" title="Cluster">
<link href="Command.html#Command" rel="next" title="Command">
<link href="SOM.html#SOM" rel="prev" title="SOM">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>


</head>

<body lang="en">
<span id="PCA"></span><div class="header">
<p>
Previous: <a href="SOM.html#SOM" accesskey="p" rel="prev">SOM</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> &nbsp; [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>
<span id="Principal-Component-Analysis"></span><h3 class="section">4.4 Principal Component Analysis</h3>

<img src="images/pca.png" alt="images/pca">

<p>Principal Component Analysis (PCA) is a widely used technique for analyzing multivariate data.  A practical example of applying Principal Component Analysis to gene expression data is presented by Yeung and Ruzzo (2001).
</p>
<p>In essence, PCA is a coordinate transformation in which each row in the data matrix is written as a linear sum over basis vectors called principal components, which are ordered and chosen such that each maximally explains the remaining variance in the data vectors. For example, an <em>n</em> &times; 3 data matrix can be represented as an ellipsoidal cloud of <em>n</em> points in three-dimensional space. The first principal component is the longest axis of the ellipsoid, the second principal component is the second longest axis, and the third principal component is the shortest axis. Each row in the data matrix can be reconstructed as a suitable linear combination of the principal components. However, in order to reduce the dimensionality of the data, usually only the most important principal components are retained. The remaining variance present in the data is then regarded as unexplained variance.
</p>
<p>The principal components can be found by calculating the eigenvectors of the covariance matrix of the data. The corresponding eigenvalues determine how much of the variance present in the data is explained by each principal component.
</p>
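<p>As a minimal sketch of this idea, the following Python code finds the first principal component of the small 5 &times; 3 example matrix used later in this section by applying power iteration to the covariance matrix. The variable names, the choice of power iteration, and the normalisation of the covariance by <em>n</em> - 1 are assumptions made for illustration; this is not a description of how Cluster computes the decomposition internally. The eigenvalue printed here is that of the sample covariance matrix and need not be on the same scale as the values Cluster writes to its output files.
</p>
<div class="example">
<pre class="example">
```python
# Sketch: first principal component by power iteration on the covariance matrix.
# The data are the 5 x 3 example matrix from this section; all names are illustrative.
data = [
    [3.0, 4.0, -2.0],
    [4.0, 1.0, -3.0],
    [1.0, -8.0, 7.0],
    [-6.0, 6.0, 4.0],
    [0.0, -3.0, 8.0],
]
n = len(data)
m = len(data[0])

# Subtract the mean of each column.
means = [sum(row[j] for row in data) / n for j in range(m)]
centered = [[row[j] - means[j] for j in range(m)] for row in data]

# Sample covariance matrix (assumption: normalised by n - 1).
cov = [
    [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
    for i in range(m)
]


def matvec(a, v):
    """Multiply the square matrix a by the vector v."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]


def norm(v):
    """Euclidean length of the vector v."""
    return sum(x * x for x in v) ** 0.5


# Power iteration converges to the eigenvector with the largest eigenvalue.
v = [1.0, 1.0, 1.0]
for _ in range(200):
    w = matvec(cov, v)
    length = norm(w)
    v = [x / length for x in w]

# Rayleigh quotient: the eigenvalue that belongs to the unit vector v.
cv = matvec(cov, v)
eigenvalue = sum(v[i] * cv[i] for i in range(m))

# The trace of the covariance matrix is the total variance in the data.
total_variance = sum(cov[i][i] for i in range(m))
explained = eigenvalue / total_variance

print("first principal component:", [round(x, 6) for x in v])
print("covariance eigenvalue:", round(eigenvalue, 6))
print("fraction of variance explained:", round(explained, 4))
```
</pre></div>
<p>The sign of an eigenvector is arbitrary, so a different starting vector may yield the same principal component multiplied by -1.
</p>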
<p>Before applying PCA, the mean is typically subtracted from each column in the data matrix. In the example above, this effectively centers the ellipsoidal cloud around its centroid in 3D space, with the principal components describing the variation of points in the ellipsoidal cloud with respect to their centroid.
</p>
<p>In Cluster, you can apply PCA to the rows (genes) of the data matrix, or to the columns (microarrays) of the data matrix. In each case, the output consists of two files. When applying PCA to genes, the names of the output files are <samp><var>JobName</var>_pca_gene.pc.txt</samp> and <samp><var>JobName</var>_pca_gene.coords.txt</samp>, where the former contains the principal components, and the latter contains the coordinates of each row in the data matrix with respect to the principal components. When applying PCA to the columns in the data matrix, the respective file names are <samp><var>JobName</var>_pca_array.pc.txt</samp> and <samp><var>JobName</var>_pca_array.coords.txt</samp>. The original data matrix can be recovered from the principal components and the coordinates.
</p>
<p>As an example, consider this input file:
</p><table>
<tr><td><code>UNIQID</code></td><td><code>EXP1</code></td><td><code>EXP2</code></td><td><code>EXP3</code></td></tr>
<tr><td><code>GENE1</code></td><td><code>3</code></td><td><code>4</code></td><td><code>-2</code></td></tr>
<tr><td><code>GENE2</code></td><td><code>4</code></td><td><code>1</code></td><td><code>-3</code></td></tr>
<tr><td><code>GENE3</code></td><td><code>1</code></td><td><code>-8</code></td><td><code>7</code></td></tr>
<tr><td><code>GENE4</code></td><td><code>-6</code></td><td><code>6</code></td><td><code>4</code></td></tr>
<tr><td><code>GENE5</code></td><td><code>0</code></td><td><code>-3</code></td><td><code>8</code></td></tr>
</table>
<p>Applying PCA to the rows (genes) of the data in this input file generates a coordinate file containing
</p><table>
<tr><td><code>UNIQID</code></td><td><code>NAME</code></td><td><code>GWEIGHT</code></td><td><code> 13.513398</code></td><td><code>10.162987</code></td><td><code>2.025283</code></td></tr>
<tr><td><code>GENE1 </code></td><td><code>GENE1</code></td><td><code>1.000000</code></td><td><code>  6.280326</code></td><td><code>-2.404095</code></td><td><code>-0.760157</code></td></tr>
<tr><td><code>GENE2 </code></td><td><code>GENE2</code></td><td><code>1.000000</code></td><td><code>  4.720801</code></td><td><code>-4.995230</code></td><td><code> 0.601424</code></td></tr>
<tr><td><code>GENE3 </code></td><td><code>GENE3</code></td><td><code>1.000000</code></td><td><code> -8.755665</code></td><td><code>-2.117608</code></td><td><code> 0.924161</code></td></tr>
<tr><td><code>GENE4 </code></td><td><code>GENE4</code></td><td><code>1.000000</code></td><td><code>  3.443490</code></td><td><code> 8.133673</code></td><td><code> 0.621082</code></td></tr>
<tr><td><code>GENE5 </code></td><td><code>GENE5</code></td><td><code>1.000000</code></td><td><code> -5.688953</code></td><td><code> 1.383261</code></td><td><code>-1.386509</code></td></tr>
</table>
<p>where the first line shows the eigenvalues of the principal components, and a principal component file containing
</p><table>
<tr><td><code>EIGVALUE</code></td><td><code>EXP1</code></td><td><code>EXP2</code></td><td><code>EXP3</code></td></tr>
<tr><td><code>MEAN</code></td><td><code> 0.400000</code></td><td><code>0.000000</code></td><td><code> 2.800000</code></td></tr>
<tr><td><code>13.513398</code></td><td><code> 0.045493</code></td><td><code>0.753594</code></td><td><code>-0.655764</code></td></tr>
<tr><td><code>10.162987</code></td><td><code>-0.756275</code></td><td><code>0.454867</code></td><td><code> 0.470260</code></td></tr>
<tr><td><code>2.025283</code></td><td><code>-0.652670</code></td><td><code>-0.474545</code></td><td><code>-0.590617</code></td></tr>
</table>
<p>with the eigenvalues of the principal components shown in the first column. From this principal component decomposition, we can regenerate the original data matrix as follows:
</p>
<p>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr> <td> &#X239B; </td> <td align=right>  6.280326 </td> <td width=80 align=right> -2.404095 </td> <td width=80 align=right> -0.760157 </td> <td> &#X239E; </td> </tr>
<tr> <td> &#X239C; </td> <td align=right>  4.720801 </td> <td align=right> -4.995230 </td> <td align=right>  0.601424 </td> <td> &#X239F; </td> </tr>
<tr> <td> &#X239C; </td> <td align=right> -8.755665 </td> <td align=right> -2.117608 </td> <td align=right>  0.924161 </td> <td> &#X239F; </td> </tr>
<tr> <td> &#X239C; </td> <td align=right>  3.443490 </td> <td align=right>  8.133673 </td> <td align=right>  0.621082 </td> <td> &#X239F; </td> </tr>
<tr> <td> &#X239D; </td> <td align=right> -5.688953 </td> <td align=right>  1.383261 </td> <td align=right> -1.386509 </td> <td> &#X23A0; </td> </tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr><td><br></td></tr>
<tr><td><br></td></tr>
<tr><td>&middot;</td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr></tr>
<tr> <td> &#X239B; </td> <td align=right>  0.045493 </td> <td width=80 align=right>  0.753594 </td> <td width=80 align=right>  -0.655764</td> <td> &#X239E; </td></tr>
<tr> <td> &#X239C; </td> <td align=right> -0.756275 </td> <td align=right> 0.454867 </td> <td align=right>  0.470260 </td>  <td> &#X239F; </td> </tr>
<tr> <td> &#X239D; </td> <td align=right> -0.652670 </td> <td align=right> -0.474545 </td> <td align=right> -0.590617 </td>  <td> &#X23A0; </td> </tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr><td><br></td></tr>
<tr><td><br></td></tr>
<tr><td>+</td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr> <td> &#X239B; </td> <td align=right>  0.4 </td> <td width=40 align=right>  0.0 </td> <td width=40 align=right> 2.8 </td> <td> &#X239E; </td></tr>
<tr> <td> &#X239C; </td> <td align=right>  0.4 </td> <td align=right>  0.0 </td> <td align=right> 2.8 </td> <td> &#X239F; </td></tr>
<tr> <td> &#X239C; </td> <td align=right>  0.4 </td> <td align=right>  0.0 </td> <td align=right> 2.8 </td> <td> &#X239F; </td></tr>
<tr> <td> &#X239C; </td> <td align=right>  0.4 </td> <td align=right>  0.0 </td> <td align=right> 2.8 </td> <td> &#X239F; </td></tr>
<tr> <td> &#X239D; </td> <td align=right>  0.4 </td> <td align=right>  0.0 </td> <td align=right> 2.8 </td> <td> &#X23A0; </td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr><td><br></td></tr>
<tr><td><br></td></tr>
<tr><td>=</td></tr>
</table>
<table style="display:inline" cellspacing=0 cellpadding=0>
<tr> <td> &#X239B; </td> <td align=right>  3 </td> <td width=40 align=right>  4 </td> <td width=40 align=right> -2 </td> <td> &#X239E; </td></tr>
<tr> <td> &#X239C; </td> <td align=right>  4 </td> <td align=right> 1 </td> <td align=right> -3 </td> <td> &#X239F; </td></tr>
<tr> <td> &#X239C; </td> <td align=right>  1 </td> <td align=right> -8 </td> <td align=right> 7 </td> <td> &#X239F; </td></tr>
<tr> <td> &#X239C; </td> <td align=right>  -6 </td> <td align=right> 6 </td> <td align=right> 4 </td> <td> &#X239F; </td></tr>
<tr> <td> &#X239D; </td> <td align=right>  0 </td> <td align=right> -3</td> <td align=right> 8 </td> <td> &#X23A0; </td></tr>
</table>
</p>
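<p>The matrix identity above can be checked numerically. The following Python sketch multiplies the coordinates by the principal components and adds the column means, using the values copied from the two output tables; the function name <code>reconstruct</code> and the variable names are illustrative only. Because the tabulated values are rounded to six decimals, the result agrees with the input file only up to small rounding errors.
</p>
<div class="example">
<pre class="example">
```python
# Sketch: rebuild the data matrix from coordinates, principal components and means.
# All numbers are copied from the example output tables in this section.
coords = [
    [6.280326, -2.404095, -0.760157],
    [4.720801, -4.995230, 0.601424],
    [-8.755665, -2.117608, 0.924161],
    [3.443490, 8.133673, 0.621082],
    [-5.688953, 1.383261, -1.386509],
]
components = [  # one principal component per row
    [0.045493, 0.753594, -0.655764],
    [-0.756275, 0.454867, 0.470260],
    [-0.652670, -0.474545, -0.590617],
]
mean = [0.4, 0.0, 2.8]


def reconstruct(coords, components, mean):
    """Return coords . components + mean, computed row by row."""
    result = []
    for c in coords:
        row = [
            sum(c[k] * components[k][j] for k in range(len(components))) + mean[j]
            for j in range(len(mean))
        ]
        result.append(row)
    return result


rebuilt = reconstruct(coords, components, mean)
for row in rebuilt:
    print([round(x, 3) for x in row])
```
</pre></div>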
<p>Note that the coordinate file <samp><var>JobName</var>_pca_gene.coords.txt</samp> is a valid input file to Cluster 3.0. Hence, it can be loaded into Cluster 3.0 for further analysis, possibly after removing columns with low eigenvalues.
</p>
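<p>To gauge the effect of removing columns with low eigenvalues, the following Python sketch keeps only the first two principal components of the example and compares the resulting approximation with the original data. In this example, each value on the first line of the coordinate file equals the Euclidean length of the coordinate column below it (the squares of the first column sum to about 182.61, which is 13.513398 squared). The sketch therefore assumes that the squares of these values are proportional to the variance explained by each component. The choice <code>k = 2</code> and all variable names are illustrative.
</p>
<div class="example">
<pre class="example">
```python
# Sketch: keep only the first k principal components and measure what is lost.
# All numbers are copied from the example tables in this section.
header_values = [13.513398, 10.162987, 2.025283]  # first line of the coordinate file
coords = [
    [6.280326, -2.404095, -0.760157],
    [4.720801, -4.995230, 0.601424],
    [-8.755665, -2.117608, 0.924161],
    [3.443490, 8.133673, 0.621082],
    [-5.688953, 1.383261, -1.386509],
]
components = [  # one principal component per row
    [0.045493, 0.753594, -0.655764],
    [-0.756275, 0.454867, 0.470260],
    [-0.652670, -0.474545, -0.590617],
]
mean = [0.4, 0.0, 2.8]
original = [
    [3.0, 4.0, -2.0],
    [4.0, 1.0, -3.0],
    [1.0, -8.0, 7.0],
    [-6.0, 6.0, 4.0],
    [0.0, -3.0, 8.0],
]

k = 2  # number of leading components to keep (illustrative choice)

# Assumption: the squared header values are proportional to explained variance.
squares = [s * s for s in header_values]
retained = sum(squares[:k]) / sum(squares)

# Approximate each row using only the first k coordinates and components.
approx = [
    [sum(c[i] * components[i][j] for i in range(k)) + mean[j]
     for j in range(len(mean))]
    for c in coords
]

# Largest deviation between the approximation and the original data.
max_error = max(
    abs(approx[r][j] - original[r][j])
    for r in range(len(original))
    for j in range(len(mean))
)

print("fraction of variance retained:", round(retained, 4))
print("largest absolute reconstruction error:", round(max_error, 3))
```
</pre></div>
<p>Setting <code>k = 3</code> in this sketch retains all components, in which case the approximation matches the original data up to rounding.
</p>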
<hr>
<div class="header">
<p>
Previous: <a href="SOM.html#SOM" accesskey="p" rel="prev">SOM</a>, Up: <a href="Cluster.html#Cluster" accesskey="u" rel="up">Cluster</a> &nbsp; [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>



</body>
</html>