1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359
|
<!DOCTYPE book [
<!ENTITY % sgml.features "IGNORE">
<!ENTITY % xml.features "INCLUDE">
<!ENTITY % ent-mmlalias
PUBLIC "-//W3C//ENTITIES Aiases for MathML 2.0//EN"
"/usr/share/xml/schema/w3c/mathml/dtd/mmlalias.ent" >
%ent-mmlalias;
]>
<article xmlns="http://docbook.org/ns/docbook" version="5.0"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xml:lang="en">
<info><title><application>BAli-Phy</application> Tutorial</title>
<author><personname><firstname>Benjamin</firstname><surname>Redelings</surname></personname></author>
</info>
<section xml:id="intro"><info><title>Introduction</title></info>
<para>
Before you start this tutorial, please <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/download.php">download</link> and install bali-phy, following the installation instructions in the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">manual</link>.</para>
</section>
<section xml:id="work_directory"><info><title>Setting up the <filename>~/alignment_files</filename> directory</title></info>
<para>Go to your home directory:
% cd ~
Make a directory called alignment_files inside it:
% mkdir alignment_files
Go into the <filename>alignment_files</filename> directory:
% cd alignment_files
Download the example alignment files:
% wget http://www.bali-phy.org/examples.tgz
Alternatively, you can use <command>curl</command>
% curl -O http://www.bali-phy.org/examples.tgz
Extract the compressed archive:
% tar -zxf examples.tgz
Take a look inside the <filename>examples</filename> directory:
% ls examples
Take a look at an input file:
% less examples/5S-rRNA/5d.fasta
Get some information about the alignment:
% alignment-info examples/5S-rRNA/5d.fasta
</para>
</section>
<section xml:id="command_line_options"><info><title>Command line options</title></info>
<para>
What version of bali-phy are you running? When was it compiled? Which compiler? For what computer type?
% bali-phy -v
Look at the list of command line options:
% bali-phy --help
Look at them with the ability to scroll back:
% bali-phy --help | less
Some options have a short form which is a single letter:
% bali-phy -h | less
</para>
<section ><info><title>DNA and RNA</title></info>
<para>
Analyze a data set, but don't begin MCMC. (This is useful to know if the analysis works, what model will be used,
compute likelihoods, etc.)
% cd ~/alignment_files/examples
% bali-phy --test 5S-rRNA/5d.fasta
Finally, run an analysis! (This is just 50 iterations, so its not a real run.)
% bali-phy 5S-rRNA/5d.fasta --iterations=50
If you specify <parameter>--imodel=none</parameter>, then the alignment won't be estimated, and indels will be ignored (just like <application>MrBayes</application>).
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --imodel=none
You can specify the alphabet, substitution model, insertion/deletion model, etc.
Defaults are used if you don't specify.
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --alphabet=DNA --smodel=TN --imodel=RS07
You can change this to the GTR, if you want:
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --alphabet=DNA --smodel=GTR --imodel=RS07
You can add gamma[4]+INV rate heterogeneity:
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --alphabet=DNA --smodel=GTR+gamma_inv[4] --imodel=RS07
</para>
</section>
<section ><info><title>Amino Acids</title></info>
<para>
When the data set contains amino acids, the default substitution model is the LG model:
% bali-phy EF-Tu/12d.fasta --iterations=50
</para>
</section>
<section ><info><title>Codons</title></info>
<para>
What alphabet is used here? What substitution model?
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test
What happens when trying to use the Nielsen and Yang (1998) M0 model (e.g. dN/dS)?
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test --smodel=M0
The M0 model requires a codon alphabet:
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test --smodel=M0 --alphabet=Codons
The M0 model takes a <emphasis>nucleotide</emphasis> exchange model as a parameter. This parameter is optional, and the default is HKY, which you could specify as <userinput>M0[HKY]</userinput>. You can change this to be more flexible:
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test --smodel=M0[GTR] --alphabet=Codons
The M7 model is a mixture of M0 codon models:
% bali-phy Globins/bglobin.fasta --test --smodel=M7 --alphabet=Codons
The M7 model has parameters as well. Here are the defaults:
% bali-phy Globins/bglobin.fasta --test --smodel=M7[4,HKY,F61] --alphabet=Codons
It is possible to specify some of the parameters and leave others at their default value:
% bali-phy Globins/bglobin.fasta --test --smodel=M7[,TN] --alphabet=Codons
</para>
</section>
<section><info><title>Multiple Genes</title></info>
<para>
You can also run analysis of multiple genes simultaneously:
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta --test
It is assumed that all genes evolve on the same tree, but with different rates.
By default, each gene gets an default substitution model based on whether it
contains DNA/RNA or amino acids.</para>
<para><xref linkend="multigene"/> describes multigene analyses in more
detail. It describes how to specify different models and rates for
different partitions, and how to fix the alignment for some genes.
It also describes how to specify that some partitions should share the
same parameter values.
</para>
</section>
</section>
<section ><info><title>Output</title></info>
<para>
See <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html#output">Section 6: Output</link> of the manual for more information about this section.
</para>
<para>
Try running an analysis with a few more iterations.
% bali-phy 5S-rRNA/25-muscle.fasta &
Run another copy of the same analysis:
% bali-phy 5S-rRNA/25-muscle.fasta &
You can take a look at your running jobs:
% jobs
</para>
<section ><info><title>Inspecting output files</title></info>
<para>
Look at the directories that were created to store the output files:
% ls
% ls 25-muscle-1/
% ls 25-muscle-2/
See how many iterations have been completed so far:
% wc -l 25-muscle-1/C1.p 25-muscle-2/C1.p
Wait a second, and repeat the command.
% wc -l 25-muscle-1/C1.p 25-muscle-2/C1.p
See if you can determine the following information from the beginning of the C1.out file:
<orderedlist>
<listitem>What command was run?</listitem>
<listitem>When was it run?</listitem>
<listitem>Which computer was it run on?</listitem>
<listitem>Which directory was it run in?</listitem>
<listitem>Which directory contains the output files?</listitem>
<listitem>What was the process id (PID) of the running bali-phy program?</listitem>
<listitem>What random seed was used?</listitem>
<listitem>What was the input file?</listitem>
<listitem>What alphabet was used to read in the sequence data?</listitem>
<listitem>What substitution model was used to analyze the sequence data?</listitem>
<listitem>What insertion/deletion model was used to analyze the sequence data?</listitem>
</orderedlist>
% less 25-muscle-1/C1.out
Examine the file containing the sampled trees:
% less 25-muscle-1/C1.trees
Examine the file containing the sampled alignments:
% less 25-muscle-1/C1.P1.fastas
Examine the file containing the successive best alignment/tree pairs visited:
% less 25-muscle-1/C1.MAP
</para>
</section>
<section ><info><title>Summarizing the output</title></info>
<para>
Try summarizing the sampled numerical parameters (e.g. not trees and alignments):
% statreport --help
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p > Report
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p --mean > Report
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p --mean --median > Report
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p --mean --median --mode > Report
% less Report
Now lets examine the summaries using a graphical program. If you are using Windows or Mac, run Tracer, and press the <guilabel>+</guilabel> button to add these files. What kind of ESS do you get? If you are using Linux, do
% tracer 25-muscle-1/C1.p 25-muscle-2/C1.p &
Now lets compute the consensus tree for these two runs:
% trees-consensus --help
% trees-consensus 25-muscle-1/C1.trees 25-muscle-2/C1.trees > c50.PP.tree
% trees-consensus 25-muscle-1/C1.trees 25-muscle-2/C1.trees --report=consensus > c50.PP.tree
% less consensus
% figtree c50.PP.tree &
Now lets see if there is evidence that the two runs have not converged yet.
% trees-bootstrap --help
% trees-bootstrap 25-muscle-1/C1.trees 25-muscle-2/C1.trees > partitions.bs
% less partitions.bs
</para>
</section>
<section ><info><title>Generating an HTML Report</title></info>
<para>
Now lets use the analysis script to run all the summaries and make a report:
% bp-analyze.pl 25-muscle-1/ 25-muscle-2/
% firefox Results/index.html &
This PERL script runs <application>statreport</application> and <application>trees-consensus</application> for you. Take a look at what commands were run:
% less Results/bp-analyze.log
</para>
</section>
</section>
<section><info><title>Starting and stopping the program</title></info>
<para>We didn't specify the number of iterations to run in the section above, so the two analyses will run for
100,000 iterations, or until you stop them yourself. See <link xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:href="http://www.bali-phy.org/README.html#mixing_and_convergence">Section
10: Convergence and Mixing: Is it done yet?</link> of the manual for
more information about when to stop an analysis.</para>
<para>In order to stop a running job, you need to kill it. One way of stopping
bali-phy analyses is this:
% killall bali-phy
However, beware: if you are running multiple analyses, this will
terminate all of them.
</para>
</section>
<section xml:id="multigene"><info><title>Multi-gene analyses</title></info>
<para>In this section we'll practice running analyses with multiple
partitions. Dividing the data into multiple partitions is useful
because different partitions can have different models, or can have
different parameters for the same model. This is described in more
detail in section 4.3 of the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">manual</link>.</para>
<section><info><title>A simple multi-gene analysis</title></info>
<para>Let's look at a data set that is divided into three partitions:
% alignment-info ITS/ITS1-trimmed.fasta
% alignment-info ITS/5.8S.fasta
% alignment-info ITS/ITS2-trimmed.fasta
</para>
<section><info><title>Running the analysis</title></info>
<para>
We can run an analysis of this partitioned data simply by supplying a
number of different alignment files as input to bali-phy. Let's run an analysis
of these three alignment files:
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta --smodel=TN --imodel=RS07 &
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta --smodel=TN --imodel=RS07 &
You could leave off the <userinput>--smodel=TN --imodel=RS07</userinput>
part of the command line:
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta &
This would give the same output, since TN and RS07 are the defaults.
</para>
</section>
<section><info><title>What did the analysis do?</title></info>
<para>Now, lets look at sampled continuous parameters:
% statreport ITS1-trimmed-5.8S-ITS2-trimmed-1/C1.p | less
% tracer ITS1-trimmed-5.8S-ITS2-trimmed-1/C1.p &
You'll see that each partition has a TN (Tamura-Nei) substitution
model, as well as an RS07 indel model. Each partition has its own
copy of the TN parameters and the RS07 parameters.</para>
</section>
<section><info><title>Question</title></info>
<para>The partitions share a common tree shape, including the same
relative branch lengths. However, the size of the tree for each
partition is different. We scale the whole shared tree by
mu1 in partition 1, mu2 in partition 2, etc. The mu parameters give
the average branch length in that partition. Thus, partitions with a
smaller mu value have slower evolution.</para>
<para>Do the different partitions of this data set have the same
evolutionary rates? Do the different partitions of this data set have
the same base frequencies?</para>
</section>
</section>
<section><info><title>Using different models in different
partitions</title></info>
<section><info><title>Using different substitution models</title></info>
<para>
Now lets try to specify different models for different partitions.
Here we've used a command-line trick with curly braces {} to avoid
typing some things multiple times.
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --smodel=1:GTR --smodel=2:HKY --smodel=3:TN &
</para>
<para>
We've also specified different substitution models for each
partition. Take a look at the <filename>C1.p</filename> file for
this analysis to see what parameters appear.
</para>
</section>
</section>
<section><info><title>Using different indel models models</title></info>
<para>
We can also specify different indel models for each partition:
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --imodel=1:RS07 --imodel=2:none --imodel=3:RS07 --test
There are only two indel models: RS07, and none. Specifying
<userinput>--imodel=none</userinput> removes the insertion-deletion
model and parameters for a partition. It also disables alignment estimation for that partition.
</para>
</section>
<section><info><title>Sharing model parameter between partitions</title></info>
We can also specify that some partitions with the same model also share the
same parameters for the model:
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --smodel=1,3:GTR --imodel=1,3:RS07 --smodel=2:TN --imodel=2:none --test
This means that the information is pooled between the partitions to
estimate the shared parameters.
</section>
<section><info><title>Sharing substitution rates between partitions</title></info>
We can also specify that some partitions with the same model also share the
same parameters for the model:
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --smodel=1,3:GTR --imodel=1,3:RS07 --smodel=2:TN --imodel=2:none --same-scale=1,3:mu1 --test
This means that the branch lengths for partitions 1 and 3 are the
same, instead of being independently estimated.
</section>
</section>
<section><info><title>Option files</title></info>
You can collect command line options into a file for later use. Make
a text file called <filename>analysis1.script</filename>:
<programlisting>
align = ITS/ITS1-trimmed.fasta
align = ITS/5.8S.fasta
align = ITS/ITS2-trimmed.fasta
smodel = 1,3:TN+DP[3]
smodel = 2:TN
imodel = 2:none
same-scale = 1,3:mu1
</programlisting>
You can run the analysis like this:
% bali-phy -c analysis1.script &
</section>
<section><info><title>Dataset preparation</title></info>
<section><info><title>Splitting and Merging Alignments</title></info>
<para>BAli-Phy generally wants you to split concatenated gene regions in order to analyze them.
% cd ~/alignment-files/examples/ITS/
% alignment-cat -c1-223 ITS-region.fasta > 1.fasta
% alignment-cat -c224-379 ITS-region.fasta > 2.fasta
% alignment-cat -c378-551 ITS-region.fasta > 3.fasta
</para>
Later you might want to put them back together again:
% alignment-cat 1.fasta 2.fasta 3.fasta > 123.fasta
</section>
<section><info><title>Shrinking the data set</title></info>
<para>
You might want to reduce the number of taxa while attempting to preserve phylogenetic diversity:
% alignment-thin --down-to=30 ITS-region.fasta > ITS-region-thinned.fasta
</para>
<para>
You can specify that certain sequences should not be removed:
% alignment-thin --down-to=30 --keep=Csaxicola420 ITS-region.fasta > ITS-region-thinned.fasta
</para>
</section>
<section><info><title>Cleaning the data set</title></info>
Keep only columns with a minimum number of residues:
% alignment-thin --min-letters=5 ITS-region.fasta > ITS-region-censored.fasta
Keep only sequences that are not too short:
% alignment-thin --longer-than=250 ITS-region.fasta > ITS-region-long.fasta
Remove 10 sequences with the smallest number of conserved residues:
% alignment-thin --remove-crazy=10 ITS-region.fasta > ITS-region-sane.fasta
</section>
</section>
</article>
|