<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.6, http://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Data (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)</title>

<meta name="description" content="Data (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="keywords" content="Data (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<link href="index.html#Top" rel="start" title="Top">
<link href="Contents.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="index.html#Top" rel="up" title="Top">
<link href="Distance.html#Distance" rel="next" title="Distance">
<link href="Introduction.html#Introduction" rel="prev" title="Introduction">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>


</head>

<body lang="en">
<span id="Data"></span><div class="header">
<p>
Next: <a href="Distance.html#Distance" accesskey="n" rel="next">Distance</a>, Previous: <a href="Introduction.html#Introduction" accesskey="p" rel="prev">Introduction</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> &nbsp; [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>
<span id="Loading_002c-filtering_002c-and-adjusting-data"></span><h2 class="chapter">2 Loading, filtering, and adjusting data</h2>

<div class="float">
<img src="images/cluster.png" alt="images/cluster">
</div>
<p>Data can be loaded into Cluster by choosing Load data file under the File menu.
A number of options are provided for adjusting and
filtering the data you have loaded. These functions are accessed via the Filter Data and
Adjust Data tabs.
</p>
<span id="Loading-Data"></span><h3 class="section">2.1 Loading Data</h3>

<p>The first step in using Cluster is to import data. Currently, Cluster only reads tab-delimited text files in a particular format, described below. Such tab-delimited text files can be created and exported in any standard spreadsheet program, such as Microsoft Excel. An example datafile can be found under the File format help item in the Help menu. This contains all the information you need for making a Cluster input file.
</p>
<p>By convention, in Cluster input tables, rows represent genes and columns represent samples or observations (e.g. a single microarray hybridization). For a simple timecourse, a minimal Cluster input file would look like this:<br>
</p>
<img src="images/minifile.png" alt="images/minifile">
<br>
<p>Each row (gene) has an identifier (in green) that always goes in the first column. Here we are using yeast open reading frame codes. Each column (sample) has a label (in blue) that is always in the first row; here the labels describe the time at which a sample was taken. The first column of the first row contains a special field (in red) that tells the program what kind of objects are in each row. In this case, YORF stands for yeast open reading frame. This field can be any alpha-numeric value. It is used in TreeView to specify how rows are linked to external websites.
</p>
<p>The remaining cells in the table contain data for the appropriate gene and sample. The 5.8 in row 2 column 4 means that the observed data value for gene YAL001C at 2 hours was 5.8. Missing values are acceptable and are designated by empty cells (e.g.  YAL005C at 2 hours).
</p>
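<p>As a rough illustration of this layout, the following Python sketch parses a minimal tab-delimited table held in memory. It is not Cluster's own loader: the function name <code>read_cluster_table</code>, the sample labels other than 2 hours, and all values other than the 5.8 for YAL001C are assumptions made up for the example.
</p>
<div class="example">
<pre class="example">
```python
import csv
import io

# Hypothetical miniature input: tab-delimited, ids in column 1, labels in row 1.
# Only YORF, YAL001C, YAL005C and the 5.8 value come from the text above;
# every other label and number is made up for illustration.
SAMPLE = (
    "YORF\t0 hours\t1 hour\t2 hours\n"
    "YAL001C\t1.0\t1.3\t5.8\n"
    "YAL005C\t1.1\t2.4\t\n"
)


def read_cluster_table(handle):
    """Return (row_type, sample_labels, gene_ids, data) from a minimal file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    row_type, sample_labels = header[0], header[1:]
    gene_ids, data = [], []
    for fields in reader:
        gene_ids.append(fields[0])
        # An empty cell denotes a missing value; store it as None.
        data.append([float(cell) if cell.strip() else None for cell in fields[1:]])
    return row_type, sample_labels, gene_ids, data


row_type, sample_labels, gene_ids, data = read_cluster_table(io.StringIO(SAMPLE))
print(row_type, sample_labels)
print(gene_ids[0], data[0])
print(gene_ids[1], data[1])
```
</pre></div>
<p>Empty cells become <code>None</code>, so the missing YAL005C measurement at 2 hours stays distinguishable from a measured value of zero.
</p>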
<p>It is possible to have additional information in the input file. A maximal Cluster input file would look like this:<br>
</p>
<img src="images/maxifile.png" alt="images/maxifile">
<br>
<p>The yellow columns and rows are optional. By default, TreeView uses the ID in column
1 as a label for each gene. The NAME column allows you to specify a label for each gene
that is distinct from the ID in column 1. The other rows and columns will be described
later in this text.
</p>
<p>When Cluster 3.0 opens the data file, the number of columns in each row is checked. If a given row contains fewer or more columns than needed, an error message is displayed.<br>
</p>
<img src="images/fileerror.png" alt="images/fileerror">
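<p>A similar check can be run on a file before loading it. The sketch below is only an approximation of the idea, not Cluster's actual implementation; the helper name <code>find_bad_rows</code> and the example rows are hypothetical. It compares the number of tab-separated fields on every line with the number found in the first line.
</p>
<div class="example">
<pre class="example">
```python
def find_bad_rows(lines):
    """Return (line_number, found, expected) for rows whose column count
    differs from that of the header line."""
    # Strip only the newline: trailing tabs mark empty (missing) cells.
    rows = [line.rstrip("\n").split("\t") for line in lines]
    expected = len(rows[0])
    return [
        (number, len(fields), expected)
        for number, fields in enumerate(rows, start=1)
        if len(fields) != expected
    ]


# Hypothetical rows: line 3 is missing one cell, line 4 has one extra cell.
lines = [
    "YORF\t0 hours\t1 hour\t2 hours\n",
    "YAL001C\t1.0\t1.3\t5.8\n",
    "YAL002W\t0.9\t1.1\n",
    "YAL003W\t1.2\t0.8\t0.7\t0.4\n",
]
for number, found, expected in find_bad_rows(lines):
    print(f"line {number}: found {found} columns, expected {expected}")
```
</pre></div>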

<span id="Demo-data"></span><h3 class="heading">Demo data</h3>

<p>A demo datafile, which will be used in all of the examples here, is available
at <a href="http://rana.lbl.gov/downloads/data/demo.txt">http://rana.lbl.gov/downloads/data/demo.txt</a> and is mirrored at <a href="http://bonsai.hgc.jp/~mdehoon/software/cluster/demo.txt">http://bonsai.hgc.jp/~mdehoon/software/cluster/demo.txt</a>.
</p>
<p>The datafile contains yeast gene expression data
described in Eisen <em>et al.</em> (1998) [see references at end]. Download this data and load it
into Cluster. Cluster will give you information about the loaded datafile. <br>
</p>
<img src="images/filemanager.png" alt="images/filemanager">

<span id="Filtering-Data"></span><h3 class="section">2.2 Filtering Data</h3>

<img src="images/filter.png" alt="images/filter">

<p>The Filter Data tab allows you to remove genes that do not have certain desired
properties from your dataset. The currently available properties that can be used to filter
data are
</p><ul>
<li> <strong>% Present &gt;= X</strong>. This removes all genes that have missing values in greater than
(100-<i>X</i>)
 percent of the columns.
</li><li> <strong>SD (Gene Vector) &gt;= X</strong>. This removes all genes that have standard deviations of
observed values less than
<i>X</i>.
</li><li> <strong>At least X Observations with abs(Val) &gt;= Y</strong>. This removes all genes that do not have at least
<i>X</i>
 observations with absolute values greater than or equal to
<i>Y</i>.
</li><li> <strong>MaxVal-MinVal &gt;= X</strong>. This removes all genes whose maximum minus minimum values
are less than
<i>X</i>.
</li></ul>
<p>These are fairly self-explanatory. When you press Filter, the filters are not applied to the
dataset immediately. You are first told how many genes would pass the filter. If
you want to apply the filter, press Accept; otherwise, no changes are made.
<br>
</p>
<img src="images/accept.png" alt="images/accept">
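<p>The four filters and the preview-then-accept behaviour can be sketched in Python as follows. This is an illustration under stated assumptions rather than Cluster's code: the function and parameter names are hypothetical, missing values are represented as <code>None</code>, the gene names and values are made up, and the use of the sample standard deviation is an assumption.
</p>
<div class="example">
<pre class="example">
```python
import statistics


def passes_filters(row, min_present=None, min_sd=None, min_count=None,
                   min_abs=None, min_range=None):
    """Return True if a gene (list of floats, None = missing) passes
    every filter that has been enabled by giving it a threshold."""
    values = [v for v in row if v is not None]
    if min_present is not None:
        # % Present >= X
        if not (100.0 * len(values) / len(row) >= min_present):
            return False
    if min_sd is not None:
        # SD (Gene Vector) >= X; sample standard deviation is an assumption.
        if not (len(values) >= 2 and statistics.stdev(values) >= min_sd):
            return False
    if min_count is not None and min_abs is not None:
        # At least X observations with abs(Val) >= Y
        if not (sum(1 for v in values if abs(v) >= min_abs) >= min_count):
            return False
    if min_range is not None:
        # MaxVal - MinVal >= X
        if not (values and max(values) - min(values) >= min_range):
            return False
    return True


# Hypothetical dataset: gene name -> observed values.
genes = {
    "geneA": [0.1, -0.2, 0.05, None],
    "geneB": [2.0, -1.5, 0.3, 1.1],
    "geneC": [None, None, None, 0.4],
}
settings = dict(min_present=80, min_range=1.0)

# Preview: count the genes that would pass, without changing the data.
passing = [name for name, row in genes.items() if passes_filters(row, **settings)]
print(f"{len(passing)} of {len(genes)} genes would pass")

# Accept: only now is the dataset actually reduced.
accept = True
if accept:
    genes = {name: genes[name] for name in passing}
```
</pre></div>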

<span id="Adjusting-Data"></span><h3 class="section">2.3 Adjusting Data</h3>

<img src="images/adjust.png" alt="images/adjust">

<p>From the Adjust Data tab, you can perform a number of operations that alter the
underlying data in the imported table. These operations are
</p><ul>
<li> <strong>Log Transform Data</strong>: replace all data values
<i>x</i>
 by
log<SUB>2</SUB> (<i>x</i>).
</li><li> <strong>Center genes [mean or median]</strong>: Subtract the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0.
</li><li> <strong>Center arrays [mean or median]</strong>: Subtract the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0.
</li><li> <strong>Normalize genes</strong>: Multiply all values in each row of data by a scale factor
<i>S</i>
 so that the sum of the squares of the values in each row is 1.0 (a separate
<i>S</i>
 is computed for each row).
</li><li> <strong>Normalize arrays</strong>: Multiply all values in each column of data by a scale factor
<i>S</i>
 so that the sum of the squares of the values in each column is 1.0 (a separate
<i>S</i>
 is computed for each column).
</li></ul>
<p>These operations do not commute, so the order in which they are applied is
very important, and you should consider it carefully before you apply them.
The order of operations is (only checked operations are performed):
</p><ul>
<li> Log transform all values.
</li><li> Center rows by subtracting the mean or median.
</li><li> Normalize rows.
</li><li> Center columns by subtracting the mean or median.
</li><li> Normalize columns.
</li></ul>

<span id="Log-transformation"></span><h4 class="subsection">2.3.1 Log transformation</h4>

<p>The results of many DNA microarray experiments are fluorescent
ratios. Ratio measurements are most naturally processed in log space. Consider an
experiment where you are looking at gene expression over time, and the results are
relative expression levels compared to time 0. Assume at timepoint 1, a gene is
unchanged, at timepoint 2 it is up 2-fold and at timepoint 3 it is down 2-fold relative to
time 0. The raw ratio values are 1.0, 2.0 and 0.5. In most applications, you want to think
of 2-fold up and 2-fold down as being the same magnitude of change, but in an opposite
direction. In raw ratio space, however, the difference between timepoint 1 and 2 is +1.0,
while between timepoint 1 and 3 it is -0.5. Thus mathematical operations that use the
difference between values would treat the 2-fold up change as twice as significant
as the 2-fold down change. Usually, you do not want this. In log space (we use log base 2
for simplicity) the data points become 0, 1.0, -1.0. With these values, 2-fold up and 2-fold
down are symmetric about 0. For most applications, we recommend you work in log
space.
</p>
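<p>The arithmetic in this example can be checked with a few lines of Python, using the three ratio values given above; the variable names are arbitrary.
</p>
<div class="example">
<pre class="example">
```python
import math

ratios = [1.0, 2.0, 0.5]  # timepoints 1, 2 and 3, relative to time 0
logs = [math.log2(r) for r in ratios]

# Differences from timepoint 1 in raw ratio space: asymmetric.
print(ratios[1] - ratios[0], ratios[2] - ratios[0])  # 1.0 -0.5

# The same differences in log2 space: equal size, opposite sign.
print(logs[1] - logs[0], logs[2] - logs[0])  # 1.0 -1.0
```
</pre></div>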
<span id="Mean_002fMedian-Centering"></span><h4 class="subsection">2.3.2 Mean/Median Centering</h4>

<p>Consider a now common experimental design where you are
looking at a large number of tumor samples all compared to a common reference sample
made from a collection of cell-lines. For each gene, you have a series of ratio values that
are relative to the expression level of that gene in the reference sample. Since the
reference sample really has nothing to do with your experiment, you want your analysis
to be independent of the amount of a gene present in the reference sample. This is
achieved by adjusting the values of each gene to reflect their variation from some
property of the series of observed values such as the mean or median. This is what mean
and/or median centering of genes does. Centering makes less sense in experiments where
the reference sample is part of the experiment, as it is in many timecourses.
</p>
<p>Centering the data for columns/arrays can also be used to remove certain types of biases.
The results of many two-color fluorescent hybridization experiments are not corrected for
systematic biases in ratios that are the result of differences in RNA amounts, labeling
efficiency and image acquisition parameters. Such biases have the effect of multiplying
ratios for all genes by a fixed scalar. Mean or median centering the data in log-space has
the effect of correcting this bias, although it should be noted that an assumption is being
made in correcting this bias, which is that the average gene in a given experiment is
expected to have a ratio of 1.0 (or log-ratio of 0).
</p>
<p>In general, I recommend the use of median rather than mean centering, as it is more robust against outliers.
</p>
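<p>To see why centering an array in log space removes a multiplicative bias, consider the following Python sketch. The ratios and the bias factor are hypothetical, and the function name <code>median_center_log</code> is made up for the example. Multiplying every ratio by the same scalar adds a constant to every log-ratio, and subtracting the median removes that constant again.
</p>
<div class="example">
<pre class="example">
```python
import math
import statistics


def median_center_log(ratios):
    """Log2-transform the ratios of one array, then subtract their median."""
    logs = [math.log2(r) for r in ratios]
    middle = statistics.median(logs)
    return [v - middle for v in logs]


# Hypothetical ratios for five genes on one array; the median ratio is 1.0.
true_ratios = [0.5, 1.0, 1.0, 2.0, 4.0]
bias = 1.7  # hypothetical bias, e.g. from unequal labeling efficiency
observed = [bias * r for r in true_ratios]

corrected = median_center_log(observed)
expected = median_center_log(true_ratios)

# The biased and unbiased arrays agree after centering (up to rounding).
print([round(v, 6) for v in corrected])
print([round(v, 6) for v in expected])
```
</pre></div>
<p>The correction only recovers the true log-ratios here because the median gene in the made-up data has a ratio of 1.0, which is exactly the assumption described above.
</p>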
<span id="Normalization"></span><h4 class="subsection">2.3.3 Normalization</h4>

<p>Normalization sets the magnitude (sum of the squares of the values) of a row/column
vector to 1.0. Most of the distance metrics used by Cluster work with internally
normalized data vectors, but the data are output as they were originally entered. If you
want to output normalized vectors, you should select this option.
</p>
<p>A sample series of operations for raw data would be:
</p><ul>
<li> Adjust Cycle 1) log transform
</li><li> Adjust Cycle 2) median center genes and arrays
</li><li> repeat (2) five to ten times
</li><li> Adjust Cycle 3) normalize genes and arrays
</li><li> repeat (3) five to ten times
</li></ul>

<p>This results in a log-transformed, median-polished (i.e. all row-wise and column-wise
median values are close to zero) and normalized (i.e. all row and column magnitudes are
close to 1.0) dataset.
After performing these operations you should save the dataset.
</p>
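<p>The repeated median centering of genes and arrays in the sample series above can be sketched in Python as follows. Only that step is covered, not the normalization cycles. The function names and the table values are hypothetical, and missing values are assumed to be absent.
</p>
<div class="example">
<pre class="example">
```python
import statistics


def transpose(table):
    return [list(col) for col in zip(*table)]


def center_rows(table):
    """Subtract the row-wise median from every row."""
    centered = []
    for row in table:
        middle = statistics.median(row)
        centered.append([v - middle for v in row])
    return centered


def center_columns(table):
    """Subtract the column-wise median from every column."""
    return transpose(center_rows(transpose(table)))


def largest_median(table):
    """Largest absolute row or column median, as a measure of convergence."""
    medians = [statistics.median(row) for row in table]
    medians += [statistics.median(col) for col in transpose(table)]
    return max(abs(m) for m in medians)


def median_polish(table, cycles=10):
    """Center genes (rows), then arrays (columns), `cycles` times."""
    for _ in range(cycles):
        table = center_columns(center_rows(table))
    return table


# Hypothetical table of already log-transformed values.
table = [
    [1.0, 2.0, 6.0],
    [2.0, 4.0, 3.0],
    [5.0, 1.0, 0.0],
]
print(largest_median(center_rows(table)))    # rows centered only
print(largest_median(median_polish(table)))  # rows and columns, repeated
```
</pre></div>
<p>Centering the rows alone leaves a non-zero column median in this table, which is why both directions are centered. Centering the columns can in turn disturb the row medians, which is why the cycle is repeated.
</p>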



</body>
</html>