File: Tutorial2.tut.xml

package info (click to toggle)
bali-phy 4.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 15,392 kB
  • sloc: cpp: 120,442; xml: 13,966; haskell: 9,975; python: 2,936; yacc: 1,328; perl: 1,169; lex: 912; sh: 343; makefile: 26
file content (359 lines) | stat: -rw-r--r-- 16,397 bytes parent folder | download | duplicates (6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
<!DOCTYPE book [
<!ENTITY % sgml.features "IGNORE">
<!ENTITY % xml.features "INCLUDE">

<!ENTITY % ent-mmlalias
      PUBLIC "-//W3C//ENTITIES Aiases for MathML 2.0//EN"
             "/usr/share/xml/schema/w3c/mathml/dtd/mmlalias.ent" >
%ent-mmlalias;
]>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" 
         xmlns:mml="http://www.w3.org/1998/Math/MathML"
	 xml:lang="en">
  <info><title><application>BAli-Phy</application> Tutorial</title>
    <author><personname><firstname>Benjamin</firstname><surname>Redelings</surname></personname></author>
  </info>

  <section xml:id="intro"><info><title>Introduction</title></info>
<para>
Before you start this tutorial, please <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/download.php">download</link> and install bali-phy, following the installation instructions in the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">manual</link>.</para>
  </section>
  <section xml:id="work_directory"><info><title>Setting up the <filename>~/alignment_files</filename> directory</title></info>

<para>Go to your home directory:
% cd ~
Make a directory called alignment_files inside it:
% mkdir alignment_files
Go into the <filename>alignment_files</filename> directory:
% cd alignment_files
Download the example alignment files:
% wget http://www.bali-phy.org/examples.tgz
Alternatively, you can use <command>curl</command>
% curl -O http://www.bali-phy.org/examples.tgz
Extract the compressed archive:
% tar -zxf examples.tgz
Take a look inside the <filename>examples</filename> directory:
% ls examples
Take a look at an input file:
% less examples/5S-rRNA/5d.fasta
Get some information about the alignment:
% alignment-info examples/5S-rRNA/5d.fasta
</para>
  </section>

  <section xml:id="command_line_options"><info><title>Command line options</title></info>

<para>
What version of bali-phy are you running?  When was it compiled?  Which compiler?  For what computer type?
% bali-phy -v
Look at the list of command line options:
% bali-phy --help 
Look at them with the ability to scroll back:
% bali-phy --help | less
Some options have a short form which is a single letter:
% bali-phy -h | less
</para>

<section ><info><title>DNA and RNA</title></info>
<para>
Analyze a data set, but don't begin MCMC.  (This is useful to know if the analysis works, what model will be used,
compute likelihoods, etc.)
% cd ~/alignment_files/examples
% bali-phy --test 5S-rRNA/5d.fasta
Finally, run an analysis!  (This is just 50 iterations, so its not a real run.)
% bali-phy 5S-rRNA/5d.fasta --iterations=50
If you specify <parameter>--imodel=none</parameter>, then the alignment won't be estimated, and indels will be ignored (just like <application>MrBayes</application>).
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --imodel=none
You can specify the alphabet, substitution model, insertion/deletion model, etc.
Defaults are used if you don't specify.
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --alphabet=DNA --smodel=TN --imodel=RS07 
You can change this to the GTR, if you want:
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --alphabet=DNA --smodel=GTR --imodel=RS07 
You can add gamma[4]+INV rate heterogeneity:
% bali-phy 5S-rRNA/5d.fasta --iterations=50 --alphabet=DNA --smodel=GTR+gamma_inv[4] --imodel=RS07
</para>
</section>

<section ><info><title>Amino Acids</title></info>
<para>
When the data set contains amino acids, the default substitution model is the LG model:
% bali-phy EF-Tu/12d.fasta --iterations=50
</para>
</section>

<section ><info><title>Codons</title></info>
<para>
What alphabet is used here?  What substitution model?
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test
What happens when trying to use the Nielsen and Yang (1998) M0 model (e.g. dN/dS)?
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test --smodel=M0 
The M0 model requires a codon alphabet:
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test --smodel=M0 --alphabet=Codons 
The M0 model takes a <emphasis>nucleotide</emphasis> exchange model as a parameter.  This parameter is optional, and the default is HKY, which you could specify as <userinput>M0[HKY]</userinput>.  You can change this to be more flexible:
% bali-phy HIV/chain-2005/env-clustal-codons.fasta --test --smodel=M0[GTR] --alphabet=Codons 
The M7 model is a mixture of M0 codon models:
% bali-phy Globins/bglobin.fasta --test --smodel=M7 --alphabet=Codons 
The M7 model has parameters as well.  Here are the defaults:
% bali-phy Globins/bglobin.fasta --test --smodel=M7[4,HKY,F61] --alphabet=Codons 
It is possible to specify some of the parameters and leave others at their default value:
% bali-phy Globins/bglobin.fasta --test --smodel=M7[,TN] --alphabet=Codons 
</para>
</section>

<section><info><title>Multiple Genes</title></info>
<para>
You can also run analysis of multiple genes simultaneously:
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta --test
It is assumed that all genes evolve on the same tree, but with different rates.  
By default, each gene gets an default substitution model based on whether it
contains DNA/RNA or amino acids.</para>

<para><xref linkend="multigene"/> describes multigene analyses in more
detail.  It describes how to specify different models and rates for
different partitions, and how to fix the alignment for some genes.
It also describes how to specify that some partitions should share the
same parameter values.
</para>
</section>

  </section>

<section ><info><title>Output</title></info>
<para>
See <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html#output">Section 6: Output</link> of the manual for more information about this section.
</para>
<para>
Try running an analysis with a few more iterations.
% bali-phy 5S-rRNA/25-muscle.fasta &#38;
Run another copy of the same analysis:
% bali-phy 5S-rRNA/25-muscle.fasta &#38;
You can take a look at your running jobs:
% jobs
</para>
<section ><info><title>Inspecting output files</title></info>
<para>
Look at the directories that were created to store the output files:
% ls
% ls 25-muscle-1/
% ls 25-muscle-2/
See how many iterations have been completed so far:
% wc -l 25-muscle-1/C1.p 25-muscle-2/C1.p
Wait a second, and repeat the command.
% wc -l 25-muscle-1/C1.p 25-muscle-2/C1.p
See if you can determine the following information from the beginning of the C1.out file:
<orderedlist>
  <listitem>What command was run?</listitem>
  <listitem>When was it run?</listitem>
  <listitem>Which computer was it run on?</listitem>
  <listitem>Which directory was it run in?</listitem>
  <listitem>Which directory contains the output files?</listitem>
  <listitem>What was the process id (PID) of the running bali-phy program?</listitem>
  <listitem>What random seed was used?</listitem>
  <listitem>What was the input file?</listitem>
  <listitem>What alphabet was used to read in the sequence data?</listitem>
  <listitem>What substitution model was used to analyze the sequence data?</listitem>
  <listitem>What insertion/deletion model was used to analyze the sequence data?</listitem>
</orderedlist>
% less 25-muscle-1/C1.out
Examine the file containing the sampled trees:
% less 25-muscle-1/C1.trees
Examine the file containing the sampled alignments:
% less 25-muscle-1/C1.P1.fastas
Examine the file containing the successive best alignment/tree pairs visited:
% less 25-muscle-1/C1.MAP
</para>
</section>

<section ><info><title>Summarizing the output</title></info>
<para>
Try summarizing the sampled numerical parameters (e.g. not trees and alignments):
% statreport --help
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p > Report
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p --mean > Report
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p --mean --median > Report
% statreport 25-muscle-1/C1.p 25-muscle-2/C1.p --mean --median --mode > Report
% less Report
Now lets examine the summaries using a graphical program.  If you are using Windows or Mac, run Tracer, and press the <guilabel>+</guilabel> button to add these files.  What kind of ESS do you get?  If you are using Linux, do 
% tracer 25-muscle-1/C1.p 25-muscle-2/C1.p &#38;
Now lets compute the consensus tree for these two runs:
% trees-consensus --help
% trees-consensus 25-muscle-1/C1.trees 25-muscle-2/C1.trees > c50.PP.tree
% trees-consensus 25-muscle-1/C1.trees 25-muscle-2/C1.trees --report=consensus > c50.PP.tree
% less consensus
% figtree c50.PP.tree &#38;
Now lets see if there is evidence that the two runs have not converged yet.
% trees-bootstrap --help
% trees-bootstrap 25-muscle-1/C1.trees 25-muscle-2/C1.trees > partitions.bs
% less partitions.bs
</para>
</section>
<section ><info><title>Generating an HTML Report</title></info>
<para>
Now lets use the analysis script to run all the summaries and make a report:
% bp-analyze.pl 25-muscle-1/ 25-muscle-2/
% firefox Results/index.html &#38;
This PERL script runs <application>statreport</application> and <application>trees-consensus</application> for you.  Take a look at what commands were run:
% less Results/bp-analyze.log
</para>
</section>
</section>

<section><info><title>Starting and stopping the program</title></info>
<para>We didn't specify the number of iterations to run in the section above, so the two analyses will run for
100,000 iterations, or until you stop them yourself.  See <link xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:href="http://www.bali-phy.org/README.html#mixing_and_convergence">Section
10: Convergence and Mixing: Is it done yet?</link> of the manual for
more information about when to stop an analysis.</para>
<para>In order to stop a running job, you need to kill it. One way of stopping
bali-phy analyses is this:
% killall bali-phy
However, beware: if you are running multiple analyses, this will
terminate all of them.
</para>
</section>

<section xml:id="multigene"><info><title>Multi-gene analyses</title></info>
<para>In this section we'll practice running analyses with multiple
partitions.  Dividing the data into multiple partitions is useful
because different partitions can have different models, or can have
different parameters for the same model. This is described in more
detail in section 4.3 of the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">manual</link>.</para> 

<section><info><title>A simple multi-gene analysis</title></info>
<para>Let's look at a data set that is divided into three partitions:
% alignment-info ITS/ITS1-trimmed.fasta
% alignment-info ITS/5.8S.fasta
% alignment-info ITS/ITS2-trimmed.fasta
</para>
<section><info><title>Running the analysis</title></info>
<para>
We can run an analysis of this partitioned data simply by supplying a
number of different alignment files as input to bali-phy.  Let's run an analysis
of these three alignment files:
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta --smodel=TN --imodel=RS07 &#38;
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta --smodel=TN --imodel=RS07 &#38;
You could leave off the <userinput>--smodel=TN --imodel=RS07</userinput>
part of the command line:
% bali-phy ITS/ITS1-trimmed.fasta ITS/5.8S.fasta ITS/ITS2-trimmed.fasta &#38;
This would give the same output, since TN and RS07 are the defaults.
</para>
</section>
<section><info><title>What did the analysis do?</title></info>
<para>Now, lets look at sampled continuous parameters:
% statreport ITS1-trimmed-5.8S-ITS2-trimmed-1/C1.p | less
% tracer ITS1-trimmed-5.8S-ITS2-trimmed-1/C1.p &#38;
You'll see that each partition has a TN (Tamura-Nei) substitution
model, as well as an RS07 indel model.  Each partition has its own
copy of the TN parameters and the RS07 parameters.</para>

</section>
<section><info><title>Question</title></info>
<para>The partitions share a common tree shape, including the same
relative branch lengths.  However, the size of the tree for each
partition is different.  We scale the whole shared tree by
mu1 in partition 1, mu2 in partition 2, etc.  The mu parameters give
the average branch length in that partition.  Thus, partitions with a
smaller mu value have slower evolution.</para>

<para>Do the different partitions of this data set have the same
evolutionary rates? Do the different partitions of this data set have
the same base frequencies?</para>
</section>

</section>

<section><info><title>Using different models in different
partitions</title></info>

<section><info><title>Using different substitution models</title></info>
<para>
Now lets try to specify different models for different partitions.
Here we've used a command-line trick with curly braces {} to avoid
typing some things multiple times.
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --smodel=1:GTR --smodel=2:HKY --smodel=3:TN &#38;
</para>
<para>
We've also specified different substitution models for each
partition.  Take a look at the <filename>C1.p</filename> file for
this analysis to see what parameters appear.
</para>
</section>
</section>

<section><info><title>Using different indel models models</title></info>
<para>
We can also specify different indel models for each partition:
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --imodel=1:RS07 --imodel=2:none --imodel=3:RS07 --test
There are only two indel models: RS07, and none.  Specifying
<userinput>--imodel=none</userinput> removes the insertion-deletion
model and parameters for a partition.  It also disables alignment estimation for that partition.
</para>
</section>
<section><info><title>Sharing model parameter between partitions</title></info>
We can also specify that some partitions with the same model also share the
same parameters for the model:
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --smodel=1,3:GTR --imodel=1,3:RS07 --smodel=2:TN --imodel=2:none --test
This means that the information is pooled between the partitions to
estimate the shared parameters.
</section>
<section><info><title>Sharing substitution rates between partitions</title></info>
We can also specify that some partitions with the same model also share the
same parameters for the model:
% bali-phy ITS/{ITS1-trimmed,5.8S,ITS2-trimmed}.fasta --smodel=1,3:GTR --imodel=1,3:RS07 --smodel=2:TN --imodel=2:none --same-scale=1,3:mu1 --test
This means that the branch lengths for partitions 1 and 3 are the
same, instead of being independently estimated.
</section>
</section>

<section><info><title>Option files</title></info>
You can collect command line options into a file for later use.  Make
a text file called <filename>analysis1.script</filename>:
<programlisting>
align = ITS/ITS1-trimmed.fasta
align = ITS/5.8S.fasta
align = ITS/ITS2-trimmed.fasta
smodel = 1,3:TN+DP[3]
smodel = 2:TN
imodel = 2:none
same-scale = 1,3:mu1
</programlisting>
You can run the analysis like this:
% bali-phy -c analysis1.script &#38;
</section>

<section><info><title>Dataset preparation</title></info>
<section><info><title>Splitting and Merging Alignments</title></info>
<para>BAli-Phy generally wants you to split concatenated gene regions in order to analyze them.
% cd ~/alignment-files/examples/ITS/
% alignment-cat -c1-223 ITS-region.fasta > 1.fasta
% alignment-cat -c224-379 ITS-region.fasta > 2.fasta
% alignment-cat -c378-551 ITS-region.fasta > 3.fasta
</para>
Later you might want to put them back together again:
% alignment-cat 1.fasta 2.fasta 3.fasta > 123.fasta
</section>

<section><info><title>Shrinking the data set</title></info>
<para>
You might want to reduce the number of taxa while attempting to preserve phylogenetic diversity:
% alignment-thin --down-to=30 ITS-region.fasta > ITS-region-thinned.fasta
</para>
<para>
You can specify that certain sequences should not be removed:
% alignment-thin --down-to=30 --keep=Csaxicola420 ITS-region.fasta > ITS-region-thinned.fasta
</para>
</section>

<section><info><title>Cleaning the data set</title></info>
Keep only columns with a minimum number of residues:
% alignment-thin --min-letters=5 ITS-region.fasta > ITS-region-censored.fasta
Keep only sequences that are not too short:
% alignment-thin --longer-than=250 ITS-region.fasta > ITS-region-long.fasta
Remove 10 sequences with the smallest number of conserved residues:
% alignment-thin --remove-crazy=10 ITS-region.fasta > ITS-region-sane.fasta
</section>


</section>
</article>