File: Tutorial3.tut.xml

package info (click to toggle)
bali-phy 4.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 15,392 kB
  • sloc: cpp: 120,442; xml: 13,966; haskell: 9,975; python: 2,936; yacc: 1,328; perl: 1,169; lex: 912; sh: 343; makefile: 26
file content (383 lines) | stat: -rw-r--r-- 19,847 bytes parent folder | download | duplicates (6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
<!DOCTYPE book [
<!ENTITY % sgml.features "IGNORE">
<!ENTITY % xml.features "INCLUDE">

<!ENTITY % ent-mmlalias
      PUBLIC "-//W3C//ENTITIES Aiases for MathML 2.0//EN"
             "/usr/share/xml/schema/w3c/mathml/dtd/mmlalias.ent" >
%ent-mmlalias;
]>
<article xmlns="http://docbook.org/ns/docbook" version="5.0" 
         xmlns:mml="http://www.w3.org/1998/Math/MathML"
	 xml:lang="en">
  <info><title><application>BAli-Phy</application> Tutorial (for version 3.1)</title>
    <author><personname><firstname>Benjamin</firstname><surname>Redelings</surname></personname></author>
  </info>

  <section xml:id="intro"><info><title>Introduction</title></info>
<para>
Before you start this tutorial, please <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/download.php">download</link> and install bali-phy, following the installation instructions in the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">manual</link>.</para>
  </section>
  <section xml:id="work_directory"><info><title>Setting up the <filename>~/alignment_files</filename> directory</title></info>

<para>Go to your home directory:
% cd ~
Make a directory called alignment_files inside it:
% mkdir alignment_files
Go into the <filename>alignment_files</filename> directory:
% cd alignment_files
Download the example alignment files:
% wget http://www.bali-phy.org/examples.tgz
Alternatively, you can use <command>curl</command>
% curl -O http://www.bali-phy.org/examples.tgz
Extract the compressed archive:
% tar -zxf examples.tgz
Take a look inside the <filename>examples</filename> directory:
% ls examples
Take a look at an input file (you can press 'q' to exit 'less'):
% less examples/5S-rRNA/5d.fasta
Get some information about the alignment:
% alignment-info examples/5S-rRNA/5d.fasta
</para>
  </section>

<section xml:id="command_line_options"><info><title>Command line options</title></info>

<para>
What version of bali-phy are you running?  When was it compiled?  Which compiler?  For what computer type?
% bali-phy -v
Look at the list of command line options:
% bali-phy --help | less
Some options have a short form which is a single letter:
% bali-phy -h | less
</para>

<para>
You can also show help for advanced options:
% bali-phy --help=advanced | less
% bali-phy --help=expert | less
You can get help on the command line options:
% bali-phy --help=iterations | less
You can also get help on models, distributions, and functions:
% bali-phy --help=tn93 | less
% bali-phy --help=normal | less
% bali-phy --help=quantile
You can also get help on a topic using a less verbose syntax:
% bali-phy help tn93
% bali-phy help normal
% bali-phy help quantile
</para>

</section>

<section ><info><title>Analysis 1: variable alignment</title></info>
<para>Let's do an analysis of intergenic transcribed spacer (ITS) genes from 20 specimens of lichenized fungi (Gaya et al, 2011 analyzes 68). This analysis will estimate the alignment, phylogeny, and evolutionary parameters using MCMC.</para>

<para>This data set is divided into three gene regions, or partitions.  It is assumed that all genes evolve on the same tree, but may have different rates and evolutionary parameters. Let's look at the sequences.  How long are they?
% cd ~/alignment_files/examples/ITS
% alignment-info ITS1.fasta
% alignment-info 5.8S.fasta
% alignment-info ITS2.fasta
</para>

<para>By default, each gene gets a default substitution model based on whether it contains DNA/RNA or amino acids. By running bali-phy with the <userinput>--test</userinput> option, we can reveal what substitution models and priors will be used, without actually starting a run.
% bali-phy ITS1.fasta 5.8S.fasta ITS2.fasta --test
</para>

<para>Now start a run.  BAli-Phy will create an directory called <filename>ITS1-5.8S-ITS2-1</filename> to hold output files from this analysis.  
% bali-phy ITS1.fasta 5.8S.fasta ITS2.fasta &#38;

Run another copy of the same analysis, which will create a new directory called <filename>ITS-2</filename> for its own output files.  This additional run will take advantage of a second processor, and will also help detect when the runs have performed enough iterations.
% bali-phy ITS1.fasta 5.8S.fasta ITS2.fasta &#38;

You can take a look at your running jobs.  There should be two bali-phy jobs running.
% jobs
After each iteration, one line containing values for numerical parameters (such as the nucleotide frequencies) is written to the files <filename>ITS1-5.8S-ITS2-1/C1.log</filename> and <filename>ITS1-5.8S-ITS2-2/C1.log</filename>.  So we can check the number of iterations completed by looking at the number of lines in these files:
% wc -l ITS1-5.8S-ITS2-*/C1.log

The mean or the median of these values can be used as an estimate of the parameter.  The variance indicates the degree of uncertainty.  Let's look at the initial parameter estimates:
% statreport ITS1-5.8S-ITS2-1/C1.log ITS1-5.8S-ITS2-2/C1.log | less
% statreport ITS1-5.8S-ITS2-1/C1.log ITS1-5.8S-ITS2-2/C1.log --mean | less

The program Tracer graphically displays the posterior probability distribution for each parameter.  (If you are using Windows or Mac, first run Tracer, and then press the <guilabel>+</guilabel> button to add the files.)
% tracer ITS1-5.8S-ITS2-1/C1.log ITS1-5.8S-ITS2-2/C1.log &#38;

How does the evolutionary process for these genes differ in:
<orderedlist>
  <listitem>substitution rate? (<userinput>Scale[1]</userinput>, <userinput>Scale[2]</userinput>, ...)</listitem>
  <listitem>insertion-deletion rate? (<userinput>I1/rs07:log_rate</userinput>, <userinput>I2/rs07:log_rate</userinput>, ...)</listitem>
  <listitem>nucleotide frequencies? (<userinput>S1/tn93:pi[A]</userinput>, <userinput>S1/tn93:pi[C]</userinput>, ... )</listitem>
  <listitem>number of indels? (<userinput>#indels</userinput>)</listitem>
</orderedlist>
</para>
</section>

<section><info><title>Analysis 2: fixed alignment</title></info>
<para>
Let's also start a <emphasis>fixed alignment</emphasis> analysis.  This will estimate the tree and evolutionary parameters, but keep the alignment constant, similar to running MrBayes, BEAST, PhyloBayes, or RevBayes.
% bali-phy ITS1.fasta 5.8S.fasta ITS2.fasta --name ITS-fixed -I none &#38;
The <userinput>-I none</userinput> is a short form of <userinput>--imodel=none</userinput>, where <parameter>imodel</parameter> means the insertion-deletion model.  When there's no model of insertions and deletions, then the alignment must be kept fixed.
</para>
<para>The <userinput>--name</userinput> option means that output files will be created in the directory <filename>ITS-fixed-1</filename>.
</para>
</section>

<section xml:id="substitution_models"><info><title>Complex substitution models</title></info>
<para>While those analyses are running, let's look at how to specify more complex substitution models in bali-phy.</para>
<section><info><title>Defaults</title></info>
<para>When you don't specify values for parameters like <parameter>imodel</parameter>, bali-phy uses sensible defaults.  For example, these two commands are equivalent:
% cd ~/alignment_files/examples/
% bali-phy 5S-rRNA/25-muscle.fasta --test
% bali-phy 5S-rRNA/25-muscle.fasta --test --alphabet=RNA --smodel=tn93 --imodel=rs07
You can change the substitution model from the Tamura-Nei model to the General Time-Reversible model:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S gtr
Here the <userinput>-S gtr</userinput> is a short form of <userinput>--smodel=gtr</userinput>, where <parameter>smodel</parameter> means the substitution model.
</para>
</section>
<section><info><title>Rate variation</title></info>
<para>
You can also allow different sites to evolve at 5 different rates using the gamma[4]+INV model of rate heterogeneity:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S gtr+Rates.gamma[4]+inv
You can allow 5 different rates that are all independently estimated:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S gtr+Rates.free[n=5]
</para>
</section>
<section><info><title>Codon models</title></info>
<para>
We can also conduct codon-based analyses using the Nielsen and Yang (1998) model of diversifying positive selection (dN/dS):
% bali-phy Globins/bglobin.fasta --test -S gy94+f1x4
The gy94 model takes a nucleotide exchange model as a parameter. This parameter is optional, and the default is hky85, which you could specify as gy94[,hky85_sym]. You can change this to be more flexible: 
% bali-phy Globins/bglobin.fasta --test -S gy94[,gtr_sym]+f1x4
You can make the codon frequencies to be generated from a single set of nucleotide frequencies:
% bali-phy Globins/bglobin.fasta --test -S gy94[,gtr_sym]+mg94
The M7 model allows different sites to have different dN/dS values, where the probability of dN/dS values follows a beta distribution:
% bali-phy Globins/bglobin.fasta --test -S m7
The M7 model has parameters as well. Here are the defaults:
% bali-phy Globins/bglobin.fasta --test -S m7[4,hky85_sym,f61]
The M3 model allows different sites to have different dN/dS values, but directly estimates what these values are:
% bali-phy Globins/bglobin.fasta --test -S m3[n=3]
The M8a_Test model allows testing for positive selection in some fraction of the sites:
% bali-phy Globins/bglobin.fasta --test -S m8a_test[4,hky85_sym,f3x4]
</para>
</section>

<section><info><title>Fixing parameter values</title></info>
<para>
We can use the TN93+Gamma[4]+INV model without specifying parameters:
% bali-phy Globins/bglobin.fasta --test -S tn93+Rates.gamma+inv
However, we can also fix parameter values:
% bali-phy Globins/bglobin.fasta --test -S tn93+Rates.gamma[n=4,alpha=1]+inv[p_inv=0.2]
Here we have set the shape parameter for the Gamma distribution to 1, and the
fraction of invariant sites to 20%.  Since these parameters are fixed, they will
not be estimated and their values will not be shown in the log file.
</para>
<para>
You can see the parameters for a model by using the <userinput>help</userinput> command, as in:
% bali-phy help Rates.gamma
This will show the default value or default prior for each parameter, if there is one.
</para>
</section>

<section><info><title>Priors</title></info>
<para>
By default the fraction of invariant sites follows a uniform[0,1] distribution:
% bali-phy help inv
However, we can specify an alternative prior:
% bali-phy Globins/bglobin.fasta --test -S tn93+Rates.gamma[n=4]+inv[p_inv~uniform[0,0.2]]
We can also specify parameters as positional arguments instead of using variable names:
% bali-phy Globins/bglobin.fasta --test -S tn93+Rates.gamma[4]+inv[~uniform[0,0.2]]
Here "<userinput>~</userinput>" indicates a sample from the uniform distribution instead of the distribution
itself.
</para>
<para>
The insertion-deletion model also has parameters.
% bali-phy help rs07
Here the default value for rs07:mean_length is exponential[10,1].  This indicates
a random value that is obtained by sampling an Exponential random variable with mean 10
and then adding 1 to it.
</para>

</section>
</section>

<section><info><title>Generating an HTML Report</title></info>
<para>
See <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html#output">Section 6: Output</link> of the manual for more information about this section.
</para>
<section ><info><title>Inspecting output files</title></info>
<para>
Look at the directories that were created to store the output files:
% cd ~/alignment_files/examples/ITS/
% ls
Here is a shell trick with curly braces {} to avoid typing some things multiple times, illustrated with the <userinput>echo</userinput> command.  Let's see how many more iterations have been completed since we last checked:
% echo "hi!"
% echo ITS1-5.8S-ITS2-{1,2}/C1.log
% wc -l ITS1-5.8S-ITS2-{1,2}/C1.log
Examine the file containing the sampled trees:
% less ITS1-5.8S-ITS2-1/C1.trees
Examine the file containing the sampled alignments:
% less ITS1-5.8S-ITS2-1/C1.P1.fastas
Examine the file containing the successive best alignment/tree pairs visited:
% less ITS1-5.8S-ITS2-1/C1.MAP
The beginning of the <filename>C1.out</filename> file also contains a lot of information
about what command was run, when it was run, and what the process ID (PID) of the running program is.
% less ITS1-5.8S-ITS2-1/C1.out
</para>
</section>
<section ><info><title>Command-line summary of output files</title></info>
<para>
Try summarizing the sampled numerical parameters (e.g. not trees and alignments).
% statreport ITS1-5.8S-ITS2-{1,2}/C1.log --mean --median > Report
% less Report

Now lets compute the consensus tree for these two runs.  When figtree asks for a name for the branch supports, type <userinput>PP</userinput> in the box.
% trees-consensus ITS1-5.8S-ITS2-{1,2}/C1.trees > c50.PP.tree
% figtree c50.PP.tree &#38;

To look at the posterior probability of individual splits in the tree:
% trees-consensus ITS1-5.8S-ITS2-{1,2}/C1.trees --report=consensus > c50.PP.tree
% less consensus

Now lets see if there is evidence that the two runs have not converged yet:
% trees-bootstrap ITS1-5.8S-ITS2-{1,2}/C1.trees > partitions.bs
% grep ASDSF partitions.bs
The ASDSF (Average Standard-Deviation of Split Frequences) is a measure of how much the estimated posterior probability of splits differ between the two runs.  If it is greater than 0.01 then you should probably accumulate more iterations.  The MSDSF (Maximum SDSF) indicates the variation between runs for the worst split.
</para>

</section>

<section ><info><title>Generating an HTML Report</title></info>
<para>
Now lets use the analysis script to run all the summaries and make an HTML report:
% bp-analyze ITS1-5.8S-ITS2-{1,2}/
% firefox Results/index.html &#38;
This PERL script runs <application>statreport</application> and <application>trees-consensus</application> for you.  Take a look at what commands were run:
% less Results/bp-analyze.log
The report should give us an indication of
<orderedlist>
  <listitem>What is the majority consensus tree?</listitem>
  <listitem>What is consensus alignment for each partition?</listitem>
  <listitem>How much alignment uncertainty is there in each partition?</listitem>
  <listitem>How much do the split frequencies differ between runs?</listitem>
  <listitem>What is the effective sample size (Ne) for the Scale[1]?  For |A1|?</listitem>
</orderedlist>
</para>

</section>
</section>

<section><info><title>Starting and stopping the program</title></info>
<para>We didn't specify the number of iterations to run in the section above, so the two analyses will run for
100,000 iterations, or until you stop them yourself.  See <link xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:href="http://www.bali-phy.org/README.html#mixing_and_convergence">Section
10: Convergence and Mixing: Is it done yet?</link> of the manual for
more information about when to stop an analysis.</para>
<para>
Let's stop the bali-phy runs now.  In order to stop a running job, you need to kill it.
We can use the PID mentioned in C1.out to kill a specific process. Let's kill the fixed-alignment
analysis first.
% less ITS-fixed-1/C1.out
% kill <replaceable>PID</replaceable>
% jobs
We can also kill all running bali-phy processes:
% killall bali-phy
However, beware: if you are running multiple analyses, this will terminate all of them.
</para>
</section>

<section xml:id="multigene"><info><title>Specifying the model for each partition</title></info>
<para>For analyses with multiple partitions, we might want to use different models for different
partitions.  When two partitions have the same model, we might also want them to have the same
parameters.  This is described in more detail in section 4.3 of the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">manual</link>.</para> 

<section><info><title>Using different substitution models</title></info>
<para>
Now lets specify different substitution models for different partitions.

% cd ~/alignment_files/examples/ITS
% bali-phy {ITS1,5.8S,ITS2}.fasta -S 1:gtr -S 2:hky85 -S 3:tn93 --test
</para>
<para>
</para>
</section>

<section><info><title>Disabling alignment estimation for some partitions</title></info>
<para>
We can also disable alignment estimation for some, but not all, partitions:
% bali-phy {ITS1,5.8S,ITS2}.fasta -I 1:rs07 -I 2:none -I 3:rs07 --test
Specifying <userinput>-I none</userinput> removes the insertion-deletion
model and parameters for partition 2 and also disables alignment estimation for that partition.</para>
<para>Note that there is no longer an I3 indel model.  Partition #3 now has the I2 indel model.
</para>
</section>
<section><info><title>Sharing model parameters between partitions</title></info>
<para>We can also specify that some partitions with the same model also share the
same parameters for that model:
% bali-phy {ITS1,5.8S,ITS2}.fasta -S 1,3:gtr -I 1,3:rs07 -S 2:tn93 -I 2:none --test
This means that the information is <emphasis>pooled</emphasis> between the partitions to better estimate the shared parameters.</para>

<para>Take a look at the model parameters, and the parentheticals after the model descriptions.  You should see that there is no longer an S3 substitution model or an I3 indel model.  Instead, partitions #1 and #3 share the S1 substitution model and the I1 indel model.
</para>

</section>
<section><info><title>Sharing substitution rates between partitions</title></info>
<para>We can also specify that some partitions share the same scaling factor for branch lengths:
% bali-phy {ITS1,5.8S,ITS2}.fasta -S 1,3:gtr -I 1,3:rs07 -S 2:tn93 -I 2:none --scale=1,3: --test
This means that the branch lengths for partitions 1 and 3 are the same, instead of being independently estimated.</para>
<para>Take a look at the model parameters.  There is no longer a Scale[3] parameter.  Instead, partitions 1 and 3 share Scale[1].</para>
</section>
</section>

<section><info><title>Option files</title></info>
You can collect command line options into a file for later use.  Make
a text file called <filename>analysis1.script</filename>:
<programlisting>align = ITS1.fasta
align = 5.8S.fasta
align = ITS2.fasta
smodel = 1,3:tn93+Rates.free[n=3]
smodel = 2:tn93
imodel = 2:none
scale = 1,3:
</programlisting>
You can test the analysis like this:
% bali-phy -c analysis1.script --test
You can run it like this:
% bali-phy -c analysis1.script &#38;
</section>

<section><info><title>Dataset preparation</title></info>
<section><info><title>Splitting and Merging Alignments</title></info>
<para>BAli-Phy generally wants you to split concatenated gene regions in order to analyze them.
% cd ~/alignment_files/examples/ITS-many/
% alignment-cat -c1-223 ITS-region.fasta > 1.fasta
% alignment-cat -c224-379 ITS-region.fasta > 2.fasta
% alignment-cat -c378-551 ITS-region.fasta > 3.fasta
</para>
Later you might want to put them back together again:
% alignment-cat 1.fasta 2.fasta 3.fasta > 123.fasta
</section>

<section><info><title>Shrinking the data set</title></info>
<para>
You might want to reduce the number of taxa while attempting to preserve phylogenetic diversity:
% alignment-thin --down-to=30 ITS-region.fasta > ITS-region-thinned.fasta
</para>
<para>
You can specify that certain sequences should not be removed:
% alignment-thin --down-to=30 --keep=Csaxicola420 ITS-region.fasta > ITS-region-thinned.fasta
</para>
</section>

<section><info><title>Cleaning the data set</title></info>
Keep only sequences that are not too short:
% alignment-thin --longer-than=250 ITS-region.fasta > ITS-region-long.fasta
Remove 10 sequences with the smallest number of conserved residues:
% alignment-thin --remove-crazy=10 ITS-region.fasta > ITS-region-sane.fasta
Keep only columns with a minimum number of residues:
% alignment-thin --min-letters=5 ITS-region.fasta > ITS-region-censored.fasta
</section>


</section>
</article>