<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 6.6, http://www.gnu.org/software/texinfo/ -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Data (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)</title>
<meta name="description" content="Data (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="keywords" content="Data (Cluster 3.0 for Windows, Mac OS X, Linux, Unix)">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<link href="index.html#Top" rel="start" title="Top">
<link href="Contents.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="index.html#Top" rel="up" title="Top">
<link href="Distance.html#Distance" rel="next" title="Distance">
<link href="Introduction.html#Introduction" rel="prev" title="Introduction">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.indentedblock {margin-right: 0em}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
kbd {font-style: oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
span.nolinebreak {white-space: nowrap}
span.roman {font-family: initial; font-weight: normal}
span.sansserif {font-family: sans-serif; font-weight: normal}
ul.no-bullet {list-style: none}
-->
</style>
</head>
<body lang="en">
<span id="Data"></span><div class="header">
<p>
Next: <a href="Distance.html#Distance" accesskey="n" rel="next">Distance</a>, Previous: <a href="Introduction.html#Introduction" accesskey="p" rel="prev">Introduction</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> [<a href="Contents.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>]</p>
</div>
<hr>
<span id="Loading_002c-filtering_002c-and-adjusting-data"></span><h2 class="chapter">2 Loading, filtering, and adjusting data</h2>
<div class="float">
<img src="images/cluster.png" alt="images/cluster">
</div>
<p>Data can be loaded into Cluster by choosing Load data file under the File menu.
A number of options are provided for adjusting and
filtering the data you have loaded. These functions are accessed via the Filter Data and
Adjust Data tabs.
</p>
<span id="Loading-Data"></span><h3 class="section">2.1 Loading Data</h3>
<p>The first step in using Cluster is to import data. Currently, Cluster only reads tab-delimited text files in a particular format, described below. Such tab-delimited text files can be created and exported in any standard spreadsheet program, such as Microsoft Excel. An example datafile can be found under the File format help item in the Help menu. This contains all the information you need for making a Cluster input file.
</p>
<p>By convention, in Cluster input tables rows represent genes and columns represent samples or observations (e.g. a single microarray hybridization). For a simple timecourse, a minimal Cluster input file would look like this:<br>
</p>
<img src="images/minifile.png" alt="images/minifile">
<br>
<p>Each row (gene) has an identifier (in green) that always goes in the first column. Here we are using yeast open reading frame codes. Each column (sample) has a label (in blue) that is always in the first row; here the labels describe the time at which a sample was taken. The first column of the first row contains a special field (in red) that tells the program what kind of objects are in each row. In this case, YORF stands for yeast open reading frame. This field can be any alpha-numeric value. It is used in TreeView to specify how rows are linked to external websites.
</p>
<p>The remaining cells in the table contain data for the appropriate gene and sample. The 5.8 in row 2 column 4 means that the observed data value for gene YAL001C at 2 hours was 5.8. Missing values are acceptable and are designated by empty cells (e.g. YAL005C at 2 hours).
</p>
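<p>The layout described above can be read with a few lines of Python. This is only a minimal sketch and not Cluster's own parser: the function name read_cluster_table, the column labels and most of the numbers are made up for illustration. Only the 5.8 for YAL001C at 2 hours and the empty cell for YAL005C at 2 hours follow the text.
</p>

```python
import csv
import io

def read_cluster_table(text):
    """Parse a minimal tab-delimited Cluster-style table held in a string."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header = rows[0]
    id_type = header[0]   # special first field, e.g. "YORF"
    samples = header[1:]  # one label per data column
    genes, data = [], []
    for row in rows[1:]:
        genes.append(row[0])
        # An empty cell is a missing value; store it as None.
        data.append([float(c) if c.strip() else None for c in row[1:]])
    return id_type, samples, genes, data

# Hypothetical file contents: only the 5.8 and the empty cell follow the text.
example = (
    "YORF\t0 hours\t1 hour\t2 hours\n"
    "YAL001C\t1.0\t1.3\t5.8\n"
    "YAL005C\t1.0\t0.8\t\n"
)
id_type, samples, genes, data = read_cluster_table(example)
print(id_type, samples)
print(genes, data)
```

<p>Representing missing values as None keeps them distinct from a measured value of 0, which matters for the filtering and centering steps described later.
</p>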
<p>It is possible to have additional information in the input file. A maximal Cluster input file would look like this:<br>
</p>
<img src="images/maxifile.png" alt="images/maxifile">
<br>
<p>The yellow columns and rows are optional. By default, TreeView uses the ID in column
1 as a label for each gene. The NAME column allows you to specify a label for each gene
that is distinct from the ID in column 1. The other rows and columns will be described
later in this text.
</p>
<p>When Cluster 3.0 opens the data file, the number of columns in each row is checked. If a given row contains fewer or more columns than expected, an error message is displayed.<br>
</p>
<img src="images/fileerror.png" alt="images/fileerror">
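<p>If you want to catch this kind of problem before loading a file, a check along the following lines may help. It is a sketch under the assumption that the header row defines the expected number of columns; the function name check_column_counts and the sample rows are hypothetical, and Cluster's own check may differ in detail.
</p>

```python
def check_column_counts(text):
    """Return (line_number, found, expected) for each row of the wrong width."""
    lines = text.splitlines()
    expected = len(lines[0].split("\t"))  # the header row sets the width
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        found = len(line.split("\t"))
        if found != expected:
            problems.append((number, found, expected))
    return problems

# Hypothetical table: line 3 is one column short, line 4 has one too many.
example = (
    "YORF\t0 hours\t1 hour\t2 hours\n"
    "YAL001C\t1.0\t1.3\t5.8\n"
    "YAL002W\t0.9\t1.1\n"
    "YAL003W\t1.2\t0.7\t2.4\t3.1\n"
)
problems = check_column_counts(example)
for number, found, expected in problems:
    print(f"line {number}: found {found} columns, expected {expected}")
```

<p>An empty cell between two tabs still counts as a column, so rows with missing values pass this check as long as the tabs are present.
</p>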
<span id="Demo-data"></span><h3 class="heading">Demo data</h3>
<p>A demo datafile, which will be used in all of the examples here, is available
at <a href="http://rana.lbl.gov/downloads/data/demo.txt">http://rana.lbl.gov/downloads/data/demo.txt</a> and is mirrored at <a href="http://bonsai.hgc.jp/~mdehoon/software/cluster/demo.txt">http://bonsai.hgc.jp/~mdehoon/software/cluster/demo.txt</a>.
</p>
<p>The datafile contains yeast gene expression data
described in Eisen <em>et al.</em> (1998) [see references at end]. Download this data and load it
into Cluster. Cluster will give you information about the loaded datafile. <br>
</p>
<img src="images/filemanager.png" alt="images/filemanager">
<span id="Filtering-Data"></span><h3 class="section">2.2 Filtering Data</h3>
<img src="images/filter.png" alt="images/filter">
<p>The Filter Data tab allows you to remove genes that do not have certain desired
properties from your dataset. The currently available properties that can be used to filter
data are
</p><ul>
<li> <strong>% Present >= X</strong>. This removes all genes that have missing values in more than
(100-<i>X</i>)
percent of the columns.
</li><li> <strong>SD (Gene Vector) >= X</strong>. This removes all genes that have standard deviations of
observed values less than
<i>X</i>.
</li><li> <strong>At least X Observations with abs(Val) >= Y</strong>. This removes all genes that do not have at least
<i>X</i>
observations with absolute values greater than or equal to
<i>Y</i>.
</li><li> <strong>MaxVal-MinVal >= X</strong>. This removes all genes whose maximum minus minimum values
are less than
<i>X</i>.
</li></ul>
<p>These are fairly self-explanatory. When you press filter, the filters are not immediately
applied to the dataset. You are first told how many genes would pass the filter. If
you want to apply the filter, press Accept; otherwise no changes are made.
<br>
</p>
<img src="images/accept.png" alt="images/accept">
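<p>The four filters in this section can be sketched in Python as follows. This is an illustration rather than Cluster's implementation: the names passes and apply_filter, the parameter names and the three example genes are hypothetical, missing values are represented as None, and the use of the sample standard deviation is an assumption.
</p>

```python
import statistics

def passes(values, min_present=None, min_sd=None,
           min_count=None, min_abs=None, min_range=None):
    """Return True if one gene's row passes every filter that is set."""
    present = [v for v in values if v is not None]
    if min_present is not None:
        if 100.0 * len(present) / len(values) < min_present:
            return False
    if min_sd is not None:
        # Assumption: sample standard deviation over the observed values.
        if len(present) < 2 or statistics.stdev(present) < min_sd:
            return False
    if min_count is not None and min_abs is not None:
        big = [v for v in present if abs(v) >= min_abs]
        if len(big) < min_count:
            return False
    if min_range is not None:
        if not present or max(present) - min(present) < min_range:
            return False
    return True

def apply_filter(table, **criteria):
    """Names of the genes that would pass; the table itself is not changed."""
    return [name for name, values in table.items() if passes(values, **criteria)]

# Hypothetical rows; None marks a missing value.
genes = {
    "geneA": [1.0, 2.0, 3.0, None],
    "geneB": [0.1, None, None, None],
    "geneC": [0.5, 0.5, 0.5, 0.5],
}
print(apply_filter(genes, min_present=50))
print(apply_filter(genes, min_sd=0.5))
```

<p>As with the Accept step, apply_filter only reports which genes would pass, so you can inspect the count before deciding to discard anything.
</p>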
<span id="Adjusting-Data"></span><h3 class="section">2.3 Adjusting Data</h3>
<img src="images/adjust.png" alt="images/adjust">
<p>From the Adjust Data tab, you can perform a number of operations that alter the
underlying data in the imported table. These operations are
</p><ul>
<li> <strong>Log Transform Data</strong>: replace all data values
<i>x</i>
by
log<SUB>2</SUB> (<i>x</i>).
</li><li> <strong>Center genes [mean or median]</strong>: Subtract the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0.
</li><li> <strong>Center arrays [mean or median]</strong>: Subtract the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0.
</li><li> <strong>Normalize genes</strong>: Multiply all values in each row of data by a scale factor
<i>S</i>
so that the sum of the squares of the values in each row is 1.0 (a separate
<i>S</i>
is computed for each row).
</li><li> <strong>Normalize arrays</strong>: Multiply all values in each column of data by a scale factor
<i>S</i>
so that the sum of the squares of the values in each column is 1.0 (a separate
<i>S</i>
is computed for each column).
</li></ul>
<p>These operations do not commute, so the order in which they are applied is
very important, and you should consider it carefully before you apply them.
The order of operations is (only checked operations are performed):
</p><ul>
<li> Log transform all values.
</li><li> Center rows by subtracting the mean or median.
</li><li> Normalize rows.
</li><li> Center columns by subtracting the mean or median.
</li><li> Normalize columns.
</li></ul>
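<p>The fixed order listed above, and the reason the order matters, can be illustrated with a short Python sketch. It assumes a complete table with no missing values and no all-zero rows or columns; the function names and keyword arguments are hypothetical and are not part of Cluster.
</p>

```python
import math
import statistics

def transpose(data):
    return [list(col) for col in zip(*data)]

def center_rows(data, stat=statistics.mean):
    out = []
    for row in data:
        middle = stat(row)  # row-wise mean or median
        out.append([x - middle for x in row])
    return out

def normalize_rows(data):
    out = []
    for row in data:
        scale = math.sqrt(sum(x * x for x in row))
        out.append([x / scale for x in row])  # sum of squares becomes 1.0
    return out

def adjust(data, log=False, center_genes=None, normalize_genes=False,
           center_arrays=None, normalize_arrays=False):
    """Apply only the requested operations, always in the same fixed order."""
    if log:
        data = [[math.log2(x) for x in row] for row in data]
    if center_genes is not None:
        data = center_rows(data, center_genes)
    if normalize_genes:
        data = normalize_rows(data)
    if center_arrays is not None:
        data = transpose(center_rows(transpose(data), center_arrays))
    if normalize_arrays:
        data = transpose(normalize_rows(transpose(data)))
    return data

# Swapping two steps changes the result for the same input row.
row = [[1.0, 2.0, 3.0]]
centered_then_normalized = normalize_rows(center_rows(row))
normalized_then_centered = center_rows(normalize_rows(row))
print(centered_then_normalized)
print(normalized_then_centered)

adjusted = adjust([[1.0, 2.0, 4.0]], log=True, center_genes=statistics.median)
print(adjusted)
```

<p>Centering first and then normalizing gives a row with unit magnitude, while normalizing first and then centering gives a row whose magnitude is smaller than 1.0, which is why the order is fixed rather than left to chance.
</p>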
<span id="Log-transformation"></span><h4 class="subsection">2.3.1 Log transformation</h4>
<p>The results of many DNA microarray experiments are fluorescent
ratios. Ratio measurements are most naturally processed in log space. Consider an
experiment where you are looking at gene expression over time, and the results are
relative expression levels compared to time 0. Assume at timepoint 1, a gene is
unchanged, at timepoint 2 it is up 2-fold and at timepoint 3 it is down 2-fold relative to
time 0. The raw ratio values are 1.0, 2.0 and 0.5. In most applications, you want to think
of 2-fold up and 2-fold down as being the same magnitude of change, but in an opposite
direction. In raw ratio space, however, the difference between timepoint 1 and 2 is +1.0,
while between timepoint 1 and 3 is -0.5. Thus mathematical operations that use the
difference between values would think that the 2-fold up change was twice as significant
as the 2-fold down change. Usually, you do not want this. In log space (we use log base 2
for simplicity) the data points become 0, 1.0, -1.0. With these values, 2-fold up and 2-fold
down are symmetric about 0. For most applications, we recommend you work in log
space.
</p>
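<p>The arithmetic in this example can be checked directly. The short Python snippet below uses the three ratio values from the text; the variable names are only for illustration.
</p>

```python
import math

ratios = [1.0, 2.0, 0.5]  # timepoints 1, 2 and 3 relative to time 0

# Differences in raw ratio space are not symmetric.
raw_up = ratios[1] - ratios[0]
raw_down = ratios[2] - ratios[0]

# In log base 2 space, 2-fold up and 2-fold down are symmetric about 0.
log_ratios = [math.log2(r) for r in ratios]

print(raw_up, raw_down)
print(log_ratios)
```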
<span id="Mean_002fMedian-Centering"></span><h4 class="subsection">2.3.2 Mean/Median Centering</h4>
<p>Consider a now common experimental design where you are
looking at a large number of tumor samples all compared to a common reference sample
made from a collection of cell-lines. For each gene, you have a series of ratio values that
are relative to the expression level of that gene in the reference sample. Since the
reference sample really has nothing to do with your experiment, you want your analysis
to be independent of the amount of a gene present in the reference sample. This is
achieved by adjusting the values of each gene to reflect their variation from some
property of the series of observed values such as the mean or median. This is what mean
and/or median centering of genes does. Centering makes less sense in experiments where
the reference sample is part of the experiment, as it is in many timecourses.
</p>
<p>Centering the data for columns/arrays can also be used to remove certain types of biases.
The results of many two-color fluorescent hybridization experiments are not corrected for
systematic biases in ratios that are the result of differences in RNA amounts, labeling
efficiency and image acquisition parameters. Such biases have the effect of multiplying
ratios for all genes by a fixed scalar. Mean or median centering the data in log-space has
the effect of correcting this bias, although it should be noted that an assumption is being
made in correcting this bias, which is that the average gene in a given experiment is
expected to have a ratio of 1.0 (or log-ratio of 0).
</p>
<p>In general, we recommend the use of median rather than mean centering, as it is more robust against outliers.
</p>
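<p>The effect of centering an array in log space can be seen in a small Python sketch. The ratios and the bias factor of 1.5 are invented for illustration, and the example relies on the assumption stated above, namely that the median gene has a true ratio of 1.0.
</p>

```python
import math
import statistics

def median_center(values):
    middle = statistics.median(values)
    return [v - middle for v in values]

# Hypothetical true ratios for five genes on one array (median ratio is 1.0).
true_ratios = [0.5, 1.0, 1.0, 2.0, 4.0]

# A systematic bias multiplies every ratio on the array by the same scalar.
bias = 1.5
observed = [r * bias for r in true_ratios]

log_true = [math.log2(r) for r in true_ratios]
log_observed = [math.log2(r) for r in observed]

# In log space the bias is an additive shift, which median centering removes.
centered = median_center(log_observed)
print(log_true)
print(centered)
```

<p>If the median gene on the array did not have a true ratio of 1.0, the centered values would be shifted by the log of that ratio, which is the assumption the text warns about.
</p>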
<span id="Normalization"></span><h4 class="subsection">2.3.3 Normalization</h4>
<p>Normalization sets the magnitude (sum of the squares of the values) of a row/column
vector to 1.0. Most of the distance metrics used by Cluster work with internally
normalized data vectors, but the data are output as they were originally entered. If you
want to output normalized vectors, you should select this option.
</p>
<p>A sample series of operations for raw data would be:
</p><ul>
<li> Adjust Cycle 1) log transform
</li><li> Adjust Cycle 2) median center genes and arrays
</li><li> repeat (2) five to ten times
</li><li> Adjust Cycle 3) normalize genes and arrays
</li><li> repeat (3) five to ten times
</li></ul>
<p>This results in a log-transformed, median polished (i.e. all row-wise and column-wise
median values are close to zero) and normalized (i.e. all row and column magnitudes are
close to 1.0) dataset.
After performing these operations you should save the dataset.
</p>
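<p>The repeated adjust cycles can be sketched in Python as follows. This is an illustration only: the function names, the number of cycles and the 3 by 3 table of ratios are hypothetical, the table is assumed to have no missing values, and leaving all-zero rows unchanged during normalization is a choice made for this sketch.
</p>

```python
import math
import statistics

def transpose(data):
    return [list(col) for col in zip(*data)]

def median_center_rows(data):
    out = []
    for row in data:
        middle = statistics.median(row)
        out.append([x - middle for x in row])
    return out

def normalize_rows(data):
    out = []
    for row in data:
        scale = math.sqrt(sum(x * x for x in row))
        # Choice made for this sketch: leave all-zero rows unchanged.
        out.append([x / scale for x in row] if scale else list(row))
    return out

def median_polish(data, cycles=5):
    """Repeat: median center genes (rows), then arrays (columns)."""
    for _ in range(cycles):
        data = median_center_rows(data)
        data = transpose(median_center_rows(transpose(data)))
    return data

def normalize_both(data, cycles=5):
    """Repeat: normalize genes (rows), then arrays (columns)."""
    for _ in range(cycles):
        data = normalize_rows(data)
        data = transpose(normalize_rows(transpose(data)))
    return data

# Hypothetical raw ratios: three genes (rows) by three arrays (columns).
raw = [[1.0, 2.0, 4.0], [0.5, 1.0, 8.0], [2.0, 0.25, 1.0]]
logged = [[math.log2(x) for x in row] for row in raw]
polished = median_polish(logged)
final = normalize_both(polished)
print(polished)
print(final)
```

<p>Because the last step in each cycle acts on columns, the column medians and column magnitudes are the ones that are met exactly at the end, while the row values are only approximate. Rows and columns can all have magnitude exactly 1.0 only when the table has as many rows as columns, since the total sum of squares would otherwise have to equal both counts at once.
</p>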
</body>
</html>