1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585
|
<!DOCTYPE book [
<!ENTITY % sgml.features "IGNORE">
<!ENTITY % xml.features "INCLUDE">
<!ENTITY % ent-mmlalias
PUBLIC "-//W3C//ENTITIES Aiases for MathML 2.0//EN"
"/usr/share/xml/schema/w3c/mathml/dtd/mmlalias.ent" >
%ent-mmlalias;
]>
<article xmlns="http://docbook.org/ns/docbook" version="5.0"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xml:lang="en">
<info><title><application>BAli-Phy</application> Tutorial (for version 4.0-beta11)</title>
<author><personname><firstname>Benjamin</firstname><surname>Redelings</surname></personname></author>
</info>
<section xml:id="intro"><info><title>Introduction</title></info>
<para>This tutorial offers a quick overview of bali-phy with an emphasis on concrete analysis of real datasets. Other documentation includes:
<itemizedlist>
<listitem>The <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.xhtml">Users Guide</link> is much more complete.</listitem>
<listitem>Run <userinput><replaceable>tool</replaceable> --help</userinput> to see command line options for <replaceable>tool</replaceable>.</listitem>
<listitem>Run <userinput>man <replaceable>tool</replaceable></userinput> to see the manual page for <replaceable>tool</replaceable>.</listitem>
<listitem>Manual pages for bali-phy and tools are also available <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/docs.php#manpages">online</link>.</listitem>
</itemizedlist>
</para>
<para>
Before you start this tutorial, please <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/download.php">download</link> and install bali-phy, following the installation instructions in the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">User Guide</link>.</para>
</section>
<section xml:id="work_directory"><info><title>Setting up the <filename>~/alignment_files</filename> directory</title></info>
<para>Go to your home directory:
% cd ~
Make a directory called alignment_files inside it:
% mkdir alignment_files
Go into the <filename>alignment_files</filename> directory:
% cd alignment_files
Download the example alignment files:
% wget http://www.bali-phy.org/examples.tgz
Alternatively, you can use <command>curl</command>
% curl -O http://www.bali-phy.org/examples.tgz
Extract the compressed archive:
% tar -zxf examples.tgz
Take a look inside the <filename>examples</filename> directory:
% ls examples
Take a look at an input file (you can press 'q' to exit 'less'):
% less examples/5S-rRNA/5d.fasta
Get some information about the alignment:
% alignment-info examples/5S-rRNA/5d.fasta
</para>
</section>
<section xml:id="command_line_options"><info><title>Command line options</title></info>
<para>
What version of bali-phy are you running? When was it compiled? Which compiler? For what computer type?
% bali-phy -v
% bali-phy --version
Look at the list of command line options:
% bali-phy -h | less
% bali-phy --help | less
% bali-phy help | less
By default, not all the options are shown. To see more options, try:
% bali-phy help advanced | less
% bali-phy help expert | less
</para>
<para>All bali-phy tools also take <userinput>--help</userinput>.
% alignment-thin --help | less
% alignment-cat --help | less
</para>
</section>
<section xml:id="help"><info><title>Help</title></info>
<para>
You can also get help on models, distributions, functions, and command-line options:
% bali-phy help tn93 | less // Tamura-Nei (1993) <emphasis>model</emphasis>
% bali-phy help normal | less // Normal <emphasis>distribution</emphasis>
% bali-phy help quantile | less // quantile <emphasis>function</emphasis>
You can find out which models, distributions, and functions are available?
% bali-phy help models | less // What models are available?
% bali-phy help distributions | less // What distributions are available?
% bali-phy help functions | less // What functions are available?
You can get extended help on each command line option:
% bali-phy help iterations | less // --iterations=<num> <emphasis>command line option</emphasis>
% bali-phy help alphabet | less // --alphabet=<a> <emphasis>command line option</emphasis>
</para>
</section>
<section ><info><title>Analysis 1: 5S rRNA -- free alignment versus fixed alignment</title></info>
<!-- Talk about: jobs, fg, bg-->
<!-- Talk about: bali-phy doesn't stop itself-->
<!-- Talk about: MCMC? Make some slides-->
<para>Let's estimate a phylogeny on a very small data set under both a free alignment and a fixed alignment to see how the alignment affects our confidence in the estimated topology.</para>
<section ><info><title>Free alignment</title></info>
Let's get started by going to the examples directory:
% cd ~/alignment_files/examples/5S-rRNA
Now lets start a run.
% bali-phy 5d-clustalw.fasta --smodel 'gtr +> Rates.gamma(4) +> inv' --name 5d-freeA &
<para>Yay! Now you have started your first copy of the analysis. BAli-Phy will create a directory called <filename>5d-freeA-1</filename> to hold output files from this analysis, as directed by the <userinput>--name 5d-freeA</userinput> option.</para>
<para>Let's run another copy of the same analysis:</para>
% bali-phy 5d-clustalw.fasta --smodel 'gtr +> Rates.gamma(4) +> inv' --name 5d-freeA &
<para>
This command which will create another directory called <filename>5d-freeA-2</filename> for its own output files.
This additional run will take advantage of a second processor. It will also allow us to determine if the two runs have performed enough iterations to agree.
</para>
You can take a look at jobs running in the background. There should be two bali-phy jobs running:
% jobs
The previous command only shows jobs started by the shell you are typing in. You can also examine
all processes running on your computer. bali-phy should show up near the top since it using a lot
of the CPU:
% top // press 'q' to exit top
In order to see how many iterations have completed, check the number of lines in the log files:
% wc -l 5d-*/C1.log
To summarize the results generated so far, run the summary script and then view the results in a web browser:
% bp-analyze 5d-freeA-1/ 5d-freeA-2/
% firefox Results/index.html &
When you have about 2000 samples from each run, summarize the latest results:
% wc -l 5d-*/C1.log
% bp-analyze 5d-freeA-1/ 5d-freeA-2/
% mv Results 5d-freeA.Results
% firefox 5d-freeA.Results/index.html &
Now lets kill the bali-phy processes:
% top // use top to find the PID (process id) for each of the bali-phy processes
% kill <replaceable>pid1</replaceable>
% kill <replaceable>pid2</replaceable>
% top // check that the jobs have stopped.
Now let's look at the tree in figtree:
% figtree 5d-freeA.Results/c50.PP.tree &
<orderedlist>
<listitem>When it says "Please select a name for these values", enter <userinput>PP</userinput>.</listitem>
<listitem>Then, enable "Branch Labels".</listitem>
<listitem>Under "Branch Labels / Display", select "PP".</listitem>
<listitem>You can increase the font size to make the figure more readable.</listitem>
<listitem>You can also increase the font size of the tip labels.</listitem>
</orderedlist>
<section><info><title>Parameter questions:</title></info>
<itemizedlist>
<listitem>What is the meaning of <userinput>inv:p_inv</userinput> parameter? (Run <userinput>bali-phy help inv</userinput>)</listitem>
<listitem>What is the prior distribution on <userinput>inv:p_inv</userinput>? (See <emphasis>Model and priors / Substitution model</emphasis>)</listitem>
<listitem>What is the posterior median for <userinput>inv:p_inv</userinput>? (See <emphasis>Scalar Variables</emphasis>)</listitem>
</itemizedlist>
</section>
<section><info><title>Alignment questions:</title></info>
<itemizedlist>
<listitem>Which part of the alignment is most uncertain? (See <emphasis>Alignment Distribution / Best (WPD) / AU</emphasis>)</listitem>
</itemizedlist>
</section>
<section><info><title>Tree questions (free alignment):</title></info>
<itemizedlist>
<listitem>How many internal branches are in the 50% consensus tree?</listitem>
<listitem>What is the posterior probability for each of these internal branches?</listitem>
</itemizedlist>
</section>
</section>
<section ><info><title>Fixed alignment</title></info>
% cd ~/alignment_files/examples
% bali-phy 5d-clustalw.fasta -S 'gtr +> Rates.gamma(4) +> inv' -I none -n 5d-fixedA &
% bali-phy 5d-clustalw.fasta -S 'gtr +> Rates.gamma(4) +> inv' -I none -n 5d-fixedA &
The <userinput>-I none</userinput> is a short form of <userinput>--imodel=none</userinput>, where <parameter>imodel</parameter> means the insertion-deletion model. When there's no model of insertions and deletions, then the alignment must be kept fixed.
<para>Summarize the analysis after about 2000 iterations:</para>
% wc -l 5d-*/C1.log
% bp-analyze 5d-fixedA-1/ 5d-fixedA-2
<para>Examine the majority consensus tree for the fixed alignment analysis:</para>
% figtree Results/c50.PP.tree &
<section><info><title>Tree questions (fixed alignment):</title></info>
<itemizedlist>
<listitem>How many internal branches are in the 50% consensus tree?</listitem>
<listitem>What is the posterior probability for each of these internal branches?</listitem>
<listitem>For this data set, how does fixing the alignment affect our confidence in the tree? Why might this be?</listitem>
</itemizedlist>
You can kill the remaining runs after you answer these questions:
% killall bali-phy
However, beware: if you are running multiple analyses, this will terminate all of them.
</section>
</section>
</section>
<section ><info><title>Analysis 2: ITS sequences -- multi-gene analysis</title></info>
<para>Let's do an analysis of intergenic transcribed spacer (ITS) genes from 20 specimens of lichenized fungi (Gaya et al, 2011 analyzes 68 specimens). The ITS genes contain a lot of variation, but are hard to align. This analysis will show how to estimate the alignment, phylogeny, and evolutionary parameters using MCMC. Averaging over all alignments makes it safe to use the ITS region without throwing out ambiguously-aligned regions.</para>
<para>This data set is divided into three gene regions, or partitions. It is assumed that all genes evolve on the same tree, but may have different rates and evolutionary parameters. Let's look at the sequences. How long are they?
% cd ~/alignment_files/examples/ITS
% alignment-info ITS1.fasta
% alignment-info 5.8S.fasta
% alignment-info ITS2.fasta
</para>
<para>By default, each gene gets a default substitution model based on whether it contains DNA/RNA or amino acids. By running bali-phy with the <userinput>--test</userinput> option, we can reveal what substitution models and priors will be used, without actually starting a run.
% bali-phy ITS1.fasta 5.8S.fasta ITS2.fasta --test
</para>
Let's run two copies of this analysis:
% bali-phy ITS1.fasta 5.8S.fasta ITS2.fasta &
% bali-phy ITS1.fasta 5.8S.fasta ITS2.fasta &
Since we did not specify a name for the analysis, bali-phy creates the name <filename>ITS1-5.8S-ITS2</filename> from the input file names.
Let's analyze the results, even if the analysis has not yet converged:
% bp-analyze ITS1-5.8S-ITS2-1/ ITS1-5.8S-ITS2-2/
% firefox Results/index.html &
The program Tracer graphically displays the posterior probability distribution for each parameter.
% tracer ITS1-5.8S-ITS2-1/C1.log ITS1-5.8S-ITS2-2/C1.log &
If you are using Windows or Mac, first run Tracer, and then press the <guilabel>+</guilabel> button to add the files.
<section><info><title>Questions about parameter differences between genes</title></info>
These genes have different evolutionary processes. How does the evolutionary process for these genes differ in:
<orderedlist>
<listitem>substitution rate? (<userinput>Scale[1]</userinput>, <userinput>Scale[2]</userinput>, ...)</listitem>
<listitem>insertion-deletion rate? (<userinput>I1/rs07:log_rate</userinput>, <userinput>I2/rs07:log_rate</userinput>, ...)</listitem>
<listitem>nucleotide frequencies? (<userinput>S1/tn93:pi[A]</userinput>, <userinput>S1/tn93:pi[C]</userinput>, ... )</listitem>
<listitem>number of indels? (<userinput>#indels</userinput>)</listitem>
</orderedlist>
</section>
</section>
<section><info><title>Analysis 3: Exons and Introns</title></info>
Let's look at a data set containing both exons and introns. The pattern of dividing a sequence into high-rate and low-rate regions also applies to stems and loops in RNA sequences. It is helpful to allow different insertion-deletion rates in conserved and non-conserved regions of a gene.
<section><info><title>Extracting parts of a gene</title></info>
First, lets split the gene into separate FASTA files for each intron and exon:
% cd ~/alignment_files/examples/ferns/
% alignment-cat -c 5-8 -e cleaned.fasta > exon1.fasta
% alignment-cat -c 9-453 -e cleaned.fasta > intron1.fasta
% alignment-cat -c 454-596 -e cleaned.fasta > exon2.fasta
% alignment-cat -c 597-864 -e cleaned.fasta > intron2.fasta
% alignment-cat -c 865-948 -e cleaned.fasta > exon3.fasta
% alignment-cat -c 949-1328 -e cleaned.fasta > intron3.fasta
% alignment-cat -c 1329-1393 -e cleaned.fasta > exon4.fasta
Now, let's put the pieces back together in case we need the complete gene in the future:
% alignment-cat exon1.fasta intron1.fasta exon2.fasta intron2.fasta exon3.fasta intron3.fasta exon4.fasta > combined.fasta
It is also possible to refer to character sets without creating a file for each one, as mentioned below.
</section>
<section><info><title>The Rates.free model</title></info>
<para>We will add <userinput>-S 'gtr+>Rates.free(n=4)'</userinput> to use the free-rates model of substitution with 4 categories:
% bali-phy exon1.fasta intron1.fasta exon2.fasta intron2.fasta exon3.fasta intron3.fasta exon4.fasta -S 'gtr+>Rates.free(n=4)' --test
This model is more general than Rates.gamma, since it estimates 4 free rates and the frequencies
of the 4 rates. These 8 parameters have 6 degrees of freedom, which is a lot more than the 1
degree of freedom for the <userinput>Rates.gamma</userinput> model. Furthermore, each partition
by default gets a separate copy of the model, which leads to 42 degrees of freedom
just for the <userinput>Rates.free</userinput> part of the model!
</para>
</section>
<section><info><title>Fixing the alignment of some partitions</title></info>
<para>
We will now add <userinput>-I 1,3,5,7:none</userinput> to disable alignment estimation, but only for the exons:
% bali-phy exon1.fasta intron1.fasta exon2.fasta intron2.fasta exon3.fasta intron3.fasta exon4.fasta -S 'gtr+>Rates.free(n=4)' -I 1,3,5,7:none --test
</para>
</section>
<section><info><title>Linking parameters between partitions</title></info>
We can reduce the number of parameters by using one set of parameters for the exons, and one set
for the introns. This is called <emphasis>linking</emphasis> the models.
% bali-phy exon1.fasta intron1.fasta exon2.fasta intron2.fasta exon3.fasta intron3.fasta exon4.fasta -I 1,3,5,7:none -S '1,3,5,7:gtr+>Rates.free(n=4)' -S '2,4,6:gtr+>Rates.free(n=4)' --test
We would also like to share the indel model between introns:
% bali-phy exon1.fasta intron1.fasta exon2.fasta intron2.fasta exon3.fasta intron3.fasta exon4.fasta -I 1,3,5,7:none -S '1,3,5,7:gtr+>Rates.free(n=4)' -S '2,4,6:gtr +> Rates.free(n=4)' -I 2,4,6: --test
</section>
<section><info><title>Character sets</title></info>
<para>Instead of creating separate files for each intron and exon, we can also refer to
character sets of the combined file:
% bali-phy cleaned.fasta:5-8 cleaned.fasta:454-596 cleaned.fasta:865-948 cleaned.fasta:1329-1393 cleaned.fasta:9-453 cleaned.fasta:597-864 cleaned.fasta:949-1328 -I 1,2,3,4:none -S '1,2,3,4:gtr +> Rates.free(n=4)' -S '5,6,7:gtr +> Rates.free(n=4)' -I 5,6,7: --test
However, the command line is getting so long that it is difficult to manage.
</para>
</section>
<section><info><title>Option files</title></info>
<para>Instead of writing all options on the command line, We can put some of them into a
text file that is easier to manage and edit. Let's make a text file called <filename>options1.txt</filename>:
<programlisting>align = cleaned.fasta:5-8 # exon1
align = cleaned.fasta:9-453 # intron1
align = cleaned.fasta:454-596 # exon2
align = cleaned.fasta:597-864 # intron2
align = cleaned.fasta:865-948 # exon3
align = cleaned.fasta:949-1328 # intron3
align = cleaned.fasta:1329-1393 # exon4
# exons
smodel = 1,3,5,7:gtr +> Rates.free(n=4)
imodel = 1,3,5,7:none
# introns
smodel = 2,4,6:gtr +> Rates.free(n=4)
imodel = 2,4,6:
</programlisting>
</para>
We can then run
% bali-phy -c options1.txt --test
</section>
Let's run two copies of this analysis:
% bali-phy -c options1.txt &
% bali-phy -c options1.txt &
<section><info><title>Questions about intron alignments</title></info>
<itemizedlist>
<listitem>How do bali-phy alignments of the introns differ from muscle and mafft alignments of the introns?</listitem>
<listitem>How much certainty or uncertainty do the intron alignments have?</listitem>
<listitem>How do the evolutionary rates of the exons compare to the evolutionary rates of the introns?</listitem>
<listitem>Would it be reasonable to link the intron rates? How about the exon rates?</listitem>
</itemizedlist>
</section>
<section><info><title>Exons as codons + introns as nucleotides</title></info>
Another thing that we can do is to use a codon model on the combined exons, while using a nucleotide model on the introns.
In order to load nucleotide sequences as codons, three conditions must be met:
<orderedlist>
<listitem>
<para><emphasis role="strong">Sequence lengths must be multiples of three.</emphasis></para>
<para><code>AAA --- TTT</code> is OK</para>
<para><code>AAA --- T--</code> is not OK.</para>
</listitem>
<listitem>
<para><emphasis role="strong">Gaps must be codon-aligned.</emphasis></para>
<para><code>AAA --- TTT</code> is OK</para>
<para><code>AA- --A TT-</code> is not OK.</para>
</listitem>
<listitem>
<para><emphasis role="strong">The sequences must be in-frame and not have stop codons.</emphasis></para>
<para><code>ATG AAT AAA</code> is OK, because it translates to <code>MNK</code>.</para>
<para><code>AAT GAA TAA</code> is not OK, because it translates to <code>NE*</code>, where<code>*</code> is a stop codon.</para>
</listitem>
</orderedlist>
Note that spaces in FASTA files are allowed, but not required.
<para>Let's first extract the coding data:
% alignment-cat exon{1,2,3,4}.fasta -c2-295 > coding.fasta // Concatenate exons and chop off partial codons at beginning and end
% sed -i 's/-/N/g' coding.fasta // Change codons like A-- to ANN at end of sequence
We can then construct an option file <filename>options2.txt</filename>:</para>
<programlisting>
align = intron1.fasta
align = intron2.fasta
align = intron3.fasta
align = coding.fasta
smodel = 1,2,3:gtr +> Rates.free(n=4)
imodel = 1,2,3:rs07
smodel = 4:gy94
#smodel = 4:gy94_ext(gtr_sym) # A GTR-based version of GY94 a.k.a. M0.
#smodel = 4:gtr +> fMutSel0
#smodel = 4:function(w:fMutSel0(omega=w)) +> m3
imodel = 4:none
</programlisting>
We can choose any one of the three codon models for the coding sequences in the 4th partition.
% bali-phy -c options2.txt --test
Try changing the model to <userinput>fMutSel0</userinput>. Does the number of parameters increase or decrease?
</section>
<section><info><title>Exons as amino acids + introns as nucleotides</title></info>
We can also translate the coding sequence into a protein sequence, while using a nucleotide model on the introns.
Let's first extract the coding data:
% alignment-translate < coding.fasta > protein.fasta
<para>We can then construct an option file <filename>options3.txt</filename>:</para>
<programlisting>
align = intron1.fasta
align = intron2.fasta
align = intron3.fasta
align = protein.fasta
smodel = 1,2,3:gtr +> Rates.free(n=4)
imodel = 1,2,3:rs07
smodel = 4:lg08 +> Rates.logNormal(n=4)
imodel = 4:none
</programlisting>
% bali-phy -c options3.txt --test
</section>
</section>
<section><info><title>Analysis 4: ITS sequences - with a better model</title></info>
<para>Let's revisit the ITS sequences used in analysis 2. Now that you've been exposed to different models in different partitions, lets use a more complex substitution model for the ITS partitions (<userinput>--smodel 1,3:tn93 +> Rates.free(n=3)</userinput>). Note that this also links the substitution models for the ITS partitions.</para>
<para>We can also fix the alignment for 5.8S partition, since it has almost no indels.
<programlisting>align = ITS1.fasta
align = 5.8S.fasta
align = ITS2.fasta
smodel = 1,3:tn93 +> Rates.free(n=3)
smodel = 2:tn93
imodel = 2:none
scale = 1,3:
</programlisting>
</para>
<para>This file additionally links the <emphasis>scale</emphasis> for the two ITS partitions (<userinput>--scale=1,3:</userinput>). This forces them to share the same evolutionary rate, and allows more precise estimates of the shared scale. You can run the analysis with unlinked scales by removing this option in order to determine if (i) the scales are similar enough to share and (ii) the scales have a high enough variance that they need to be linked.</para>
</section>
<section xml:id="substitution_models"><info><title>Complex substitution models</title></info>
<para>While those analyses are running, let's look at how to specify more complex substitution models in bali-phy.</para>
<section><info><title>Defaults</title></info>
<para>When you don't specify values for parameters like <parameter>imodel</parameter>, bali-phy uses sensible defaults. For example, these two commands are equivalent:
% cd ~/alignment_files/examples/
% bali-phy 5S-rRNA/25-muscle.fasta --test
% bali-phy 5S-rRNA/25-muscle.fasta --test --alphabet=RNA --smodel=tn93 --imodel=rs07
You can change the substitution model from the Tamura-Nei model to the General Time-Reversible model:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S gtr
Here the <userinput>-S gtr</userinput> is a short form of <userinput>--smodel=gtr</userinput>, where <parameter>smodel</parameter> means the substitution model.
</para>
</section>
<section><info><title>Rate variation</title></info>
<para>
You can also allow different sites to evolve at 5 different rates using the gamma[4] + INV model of rate heterogeneity:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S 'gtr +> Rates.gamma(4) +> inv'
You can also use a logNormal instead of a gamma. logNormal+INV is sometimes better behaved than gamma+INV, because the smallest rate bin under logNormal is not quite as close to 0.
% bali-phy 5S-rRNA/25-muscle.fasta --test -S 'gtr +> Rates.logNormal(4) +> inv'
You can allow 5 different rates that are all independently estimated:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S 'gtr +> Rates.free(n=5)'
</para>
</section>
<section><info><title>Heterotachy</title></info>
<para>
Heterotachy occurs when the evolutionary rate changes across the tree. One version of heterotachy is the Tuffley-Steel (1998) model, where each state alternatives between an ON and OFF state.
% bali-phy 5S-rRNA/25-muscle.fasta --test -S 'gtr +> Covarion.ts98'
The Huelsenbeck (2002) model allows combining rate variation with Tuffley-Steel heterotachy:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S 'gtr +> Rates.gamma +> Covarion.hb02'
You can also use the Huelsenbeck model with the free-rates model:
% bali-phy 5S-rRNA/25-muscle.fasta --test -S 'gtr +> Rates.free +> Covarion.hb02'
</para>
</section>
<section><info><title>Codon models</title></info>
<para>
We can also conduct codon-based analyses using the Nielsen and Yang (1998) model of diversifying positive selection (dN/dS):
% bali-phy Globins/bglobin.fasta --test -S 'gy94(pi=f1x4)'
The gy94_ext model extends the gy94 model by taking a nucleotide exchange model as a parameter. The gy94 model is equivalent to gy94_ext(hky85_sym). However, if you want a more flexible codon model, you could use gtr_sym:
% bali-phy Globins/bglobin.fasta --test -S 'gy94_ext(gtr_sym,pi=f1x4)'
You can make the codon frequencies to be generated from a single set of nucleotide frequencies:
% bali-phy Globins/bglobin.fasta --test -S 'mg94_ext(gtr)'
The M7 model allows different sites to have different dN/dS values, where the probability of dN/dS values follows a beta distribution:
% bali-phy Globins/bglobin.fasta --test -S m7
The M7 model has parameters as well. Here are the defaults:
% bali-phy Globins/bglobin.fasta --test -S 'function(w:gy94(omega=w)) +> m7'
The M3 model allows different sites to have different dN/dS values, but directly estimates what these values are:
% bali-phy Globins/bglobin.fasta --test -S 'm3(n=3)'
The M8a_Test model allows testing for positive selection in some fraction of the sites:
% bali-phy Globins/bglobin.fasta --test -S 'function(w:gy94(omega=w,pi=f3x4)) +> m8a_test'
</para>
</section>
<section><info><title>Fixing parameter values</title></info>
<para>
We can use the TN93+Gamma(4)+INV model without specifying parameters:
% bali-phy Globins/bglobin.fasta --test -S 'tn93 +> Rates.gamma +> inv'
However, we can also fix parameter values:
% bali-phy Globins/bglobin.fasta --test -S 'tn93 +> Rates.gamma(n=4,alpha=1) +> inv(p_inv=0.2)'
Here we have set the shape parameter for the Gamma distribution to 1, and the
fraction of invariant sites to 20%. Since these parameters are fixed, they will
not be estimated and their values will not be shown in the log file.
</para>
<para>
You can see the parameters for a model by using the <userinput>help</userinput> command, as in:
% bali-phy help Rates.gamma
This will show the default value or default prior for each parameter, if there is one.
</para>
</section>
<section><info><title>Priors</title></info>
<para>
By default the fraction of invariant sites follows a uniform(0,1) distribution:
% bali-phy help inv
However, we can specify an alternative prior:
% bali-phy Globins/bglobin.fasta --test -S 'tn93 +> Rates.gamma(n=4) +> inv(p_inv~uniform(0,0.2))'
We can also specify parameters as positional arguments instead of using variable names:
% bali-phy Globins/bglobin.fasta --test -S 'tn93 +> Rates.gamma(4) +> inv(~uniform(0,0.2))'
Here "<userinput>~</userinput>" indicates a sample from the uniform distribution instead of the distribution
itself.
</para>
<para>
The insertion-deletion model also has parameters.
% bali-phy help rs07
Here the default value for rs07:mean_length is exponential(10,1). This indicates
a random value that is obtained by sampling an Exponential random variable with mean 10
and then adding 1 to it.
</para>
</section>
</section>
<section xml:id="multigene"><info><title>Specifying the model for each partition</title></info>
<para>For analyses with multiple partitions, we might want to use different models for different
partitions. When two partitions have the same model, we might also want them to have the same
parameters. This is described in more detail in section 4.3 of the <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.bali-phy.org/README.html">manual</link>.</para>
<section><info><title>Using different substitution models</title></info>
<para>
Now lets specify different substitution models for different partitions.
% cd ~/alignment_files/examples/ITS
% bali-phy {ITS1,5.8S,ITS2}.fasta -S 1:gtr -S 2:hky85 -S 3:tn93 --test
</para>
<para>
</para>
</section>
<section><info><title>Disabling alignment estimation for some partitions</title></info>
<para>
We can also disable alignment estimation for some, but not all, partitions:
% bali-phy {ITS1,5.8S,ITS2}.fasta -I 1:rs07 -I 2:none -I 3:rs07 --test
Specifying <userinput>-I none</userinput> removes the insertion-deletion
model and parameters for partition 2 and also disables alignment estimation for that partition.</para>
<para>Note that there is no longer an I3 indel model. Partition #3 now has the I2 indel model.
</para>
</section>
<section><info><title>Sharing model parameters between partitions</title></info>
<para>We can also specify that some partitions with the same model also share the
same parameters for that model:
% bali-phy {ITS1,5.8S,ITS2}.fasta -S 1,3:gtr -I 1,3:rs07 -S 2:tn93 -I 2:none --test
This means that the information is <emphasis>pooled</emphasis> between the partitions to better estimate the shared parameters.</para>
<para>Take a look at the model parameters, and the parentheticals after the model descriptions. You should see that there is no longer an S3 substitution model or an I3 indel model. Instead, partitions #1 and #3 share the S1 substitution model and the I1 indel model.
</para>
</section>
<section><info><title>Sharing substitution rates between partitions</title></info>
<para>We can also specify that some partitions share the same scaling factor for branch lengths:
% bali-phy {ITS1,5.8S,ITS2}.fasta -S 1,3:gtr -I 1,3:rs07 -S 2:tn93 -I 2:none --scale=1,3: --test
This means that the branch lengths for partitions 1 and 3 are the same, instead of being independently estimated.</para>
<para>Take a look at the model parameters. There is no longer a Scale[3] parameter. Instead, partitions 1 and 3 share Scale[1].</para>
</section>
</section>
<section><info><title>Dataset preparation</title></info>
<section><info><title>Splitting and merging Alignments</title></info>
<para>You might want to split concatenated gene regions:
% cd ~/alignment_files/examples/ITS-many/
% alignment-cat -c1-223 ITS-region.fasta > 1.fasta
% alignment-cat -c224-379 ITS-region.fasta > 2.fasta
% alignment-cat -c378-551 ITS-region.fasta > 3.fasta
</para>
Later you might want to put them back together again:
% alignment-cat 1.fasta 2.fasta 3.fasta > 123.fasta
</section>
<section><info><title>Shrinking the data set</title></info>
<para>
You might want to reduce the number of taxa while attempting to preserve phylogenetic diversity:
% alignment-thin --cutoff=5 ITS-region.fasta -v --preserve=Csaxicola420 > ITS-region-thinned.fasta
This removes sequences with 5 or fewer differences to the closest other sequence.
<itemizedlist>
<listitem>The <userinput>-v</userinput> option shows which sequences are removed and the distance to their closest neighbor.</listitem>
<listitem>The <userinput>--preserve</userinput> option keeps specific sequences in the data set.</listitem>
</itemizedlist>
You can also specify that data sets should be shrunk to a specific number of sequences:
% alignment-thin --down-to=30 ITS-region.fasta -v > ITS-region-thinned.fasta
</para>
<para>To see all options, run:
% alignment-thin --help
</para>
</section>
<section><info><title>Cleaning the data set</title></info>
Keep only sequences that are not too short:
% alignment-thin --longer-than=250 ITS-region.fasta > ITS-region-long.fasta
Remove 10 sequences with the smallest number of conserved residues:
% alignment-thin --remove-crazy=10 ITS-region.fasta > ITS-region-sane.fasta
Keep only columns with a minimum number of residues:
% alignment-thin --min-letters=5 ITS-region.fasta > ITS-region-censored.fasta
</section>
</section>
</article>
|