<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head><title>Predicting Genes with AUGUSTUS</title>
<link rel="stylesheet" type="text/css" href="augustus.css">
<script src="tutorial.js" type="text/javascript"></script>
</head><body>
<font size=-1>
Navigate to <a href="index.html">Lab Session on AUGUSTUS</a>.
<a href="scipio.html">Using Scipio</a>.
<a href="training.html">Training AUGUSTUS</a>.
<a href="ppx.html">AUGUSTUS-PPX</a>.
</font>
<div align="right">Show <a href="javascript:allOn()">all</a> / <a href="javascript:allOff()">no</a> details.</div>
<h1>Predicting Genes with AUGUSTUS</h1>
This tutorial describes various typical settings for predicting genes with AUGUSTUS.
<h2 id="abinitiopred">1. PREDICT GENES AB INITIO</h2>
<i>Ab initio</i> prediction means that no input other than the target genome itself is used.
Below, you will find examples of predictions that use evidence (hints); here we
use none.
<span class="assignment">Predict the genes in the range 7,000,001-7,500,000 of chr2R of <i>D. melanogaster</i></span>. Use the FASTA format file <a href="data/chr2R.fa"><tt>chr2R.fa</tt></a>, which includes the whole chromosome 2R.
<span class="assignment">To shorten this test run</span> (or when running several
jobs in parallel) you should <span class="assignment">specify a subrange</span> of the
input sequence using the parameters
<tt>--predictionStart</tt> and <tt>--predictionEnd</tt>.
<pre class="code">augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa > augustus.abinitio.gff # takes ~1m</pre>
In this example, I am using the <tt>fly</tt> parameters for comparability with the
predictions below. Of course, the self-trained <tt>bug</tt> parameters also work.
The output file <span class="result"><tt>augustus.abinitio.gff</tt></span> now contains
the predicted gene structures in GFF format with additional comments (lines starting with #).
<pre class="code">
# This output was generated with AUGUSTUS (version 2.5).
...
# start gene g1
chr2R AUGUSTUS gene 7007533 7010935 0.02 - . g1
chr2R AUGUSTUS transcript 7007533 7010935 0.02 - . g1.t1
chr2R AUGUSTUS tts 7007533 7007533 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7007533 7008630 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS stop_codon 7007610 7007612 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008631 7008694 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008812 7008865 0.88 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7009192 7009251 0.95 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008695 7008811 0.88 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008695 7008811 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008866 7009191 0.99 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008866 7009191 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7009252 7009353 0.95 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7009252 7009429 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS start_codon 7009351 7009353 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7010820 7010935 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS tss 7010935 7010935 . - . transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR
# YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN
# NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL
# NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK
# DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK
# RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ]
# end gene g1
...
</pre>
For a description of the GFF format see <a href="http://www.sanger.ac.uk/resources/software/gff/spec.html">the GFF definition at the Sanger Centre</a>.<br><br>
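<p>For a quick overview of the predictions before loading them into a browser, the CDS lines can be tallied per transcript. The following is a minimal sketch and not part of the AUGUSTUS package: the function name <tt>cds_summary</tt> and the temporary sample file are made up for illustration, and it assumes that the GFF columns are tab-separated, as in real AUGUSTUS output.</p>

```shell
# Count CDS features and sum their lengths for every transcript.
# Hypothetical helper; expects tab-separated AUGUSTUS GFF as input.
cds_summary() {
  awk -F'\t' '
    $3 == "CDS" {
      # column 9 looks like: transcript_id "g1.t1"; gene_id "g1";
      split($9, part, "\"")
      id = part[2]
      count[id]++
      bases[id] += $5 - $4 + 1   # GFF coordinates are 1-based and inclusive
    }
    END {
      for (id in count) printf "%s\t%d\t%d\n", id, count[id], bases[id]
    }
  ' "$1" | LC_ALL=C sort
}

# Sample input: the four CDS lines of g1 shown above, written with real tabs.
sample=$(mktemp)
attr='transcript_id "g1.t1"; gene_id "g1";'
{
  printf 'chr2R\tAUGUSTUS\tCDS\t7007610\t7008630\t1\t-\t1\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\tCDS\t7008695\t7008811\t0.88\t-\t1\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\tCDS\t7008866\t7009191\t0.99\t-\t0\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\tCDS\t7009252\t7009353\t0.95\t-\t0\t%s\n' "$attr"
} > "$sample"

# Output columns: transcript, number of CDS features, coding bases.
result=$(cds_summary "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

<p>For g1 this reports 4 CDS features with 1566 coding bases, i.e. 522 codons including the stop codon. On a full output file, replace the sample with <tt>augustus.abinitio.gff</tt>.</p>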
If you also want the protein sequences you can retrieve them with
<pre class="code">getAnnoFasta.pl augustus.abinitio.gff</pre>
which extracts the peptide sequences into a file <span class="result"><tt>augustus.abinitio.aa</tt></span>:
<pre class="code">
>g1.t1
MNKLNLVLITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFR
EHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNINNEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNR
DNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSLNVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAG
VDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMKDMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEA
TYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGKRVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ
>g2.t1
MRHRNKGAVKRKGPSAGAEQEQELKKPKSEFSNGFKRYITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKRYGDIYVMPGMFGRKDWVT
...
</pre>
<h2 id="customtrack">2. MAKE A CUSTOM GENE PREDICTION TRACK ON THE UCSC GENOME BROWSER</h2>
In order to visually inspect our results and to compare with the FlyBase annotation we will
now make a custom track of the gene structures in <span class="result"><tt>augustus.abinitio.gff</tt></span>. We need to create a few header lines in the custom track file which we can either do manually with an editor or like below on the command line (cut and paste).
<pre class="code">
echo -e "browser position chr2R:7000000-7050000\n\
browser hide multiz15way bdtnpChipper\n\
track name=abinitio description=\"Augustus ab initio predictions\" db=dm3 visibility=3" > abinitio.browser
grep -P "AUGUSTUS\tCDS" augustus.abinitio.gff >> abinitio.browser
</pre>
With the <tt>grep</tt> command we kept only the lines that specify the coding exon coordinates.<br>
The resulting file <span class="result"><tt>abinitio.browser</tt></span> is now in UCSC custom track format and looks like this:
<pre class="code">
browser position chr2R:7000000-7050000
browser hide multiz15way bdtnpChipper
track name=abinitio description="Augustus ab initio predictions" db=dm3 visibility=3
chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008695 7008811 0.88 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008866 7009191 0.99 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7009252 7009353 0.95 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7011376 7012396 1 - 1 transcript_id "g2.t1"; gene_id "g2";
chr2R AUGUSTUS CDS 7012461 7012577 0.92 - 1 transcript_id "g2.t1"; gene_id "g2";
chr2R AUGUSTUS CDS 7012632 7012957 0.99 - 0 transcript_id "g2.t1"; gene_id "g2";
...
</pre>
<p>Now upload this file as a custom track on the UCSC genome browser.</p>
<a href="javascript:onoff('aibr')" class="dlink"><span id="aibr" title="aibrd" class="dcross">[+]</span>
<span class="dtitle">How again?</span></a> <br>
<div id="aibrd" class="details" style="display:none;">
<ol>
<li><span class="assignment">open the <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3" target="_blank">browser for <i>Drosophila melanogaster</i></a>,</span></li>
<li> <span class="assignment">click on "add custom tracks" or "manage custom tracks"</span>,</li>
<li> <span class="assignment">upload the file <tt>abinitio.browser</tt></span> and</li>
<li> <span class="assignment">click on the link in the column "Pos" in the table
of custom tracks</span>.</li>
</ol>
</div><br>
The lazy ones can look at the result by clicking this link to a previously
<a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/results/abinitio.browser" target="_blank">prepared custom track</a>.
<h2 id="prephints">3. PREPARE HINTS</h2>
<p>Hints are extrinsic evidence about the location and structure of genes, given in
a particular GFF format. When predicting genes, AUGUSTUS can incorporate these hints,
which change the likelihood of gene structure candidates. It will tend to
predict gene structures that are in agreement with the hints. </p>
<h4>Sources of Hints</h4>
<table>
<tr>
<td>ESTs or mRNAs</td>
<td>transcriptome reads, long enough to span several exons (454, Sanger)</td>
</tr>
<tr>
<td>RNA-Seq</td>
<td>high coverage short read transcriptome sequences (Illumina, SOLiD)</td>
</tr>
<tr>
<td>genomic conservation </td>
<td></td>
</tr>
<tr>
<td>MSMS</td>
<td></td>
</tr>
</table>
<p> Below, we will practice the preparation of hints from ESTs or mRNA
(<a href="#hest">3.1</a>) and from RNA-Seq (<a href="#hrnaseq">3.2</a>).
</p>
<h3 id="hest">3.1 From ESTs</h3>
As an example, we will use a set of 8458 ESTs which map to
chr2R:7000000-8000000 of <i>Drosophila</i>: <a href="data/est.chr2R.7M-8M.fa">est.chr2R.7M-8M.fa</a>.
<p><span class="assignment">Align the ESTs against chr2R</span> using BLAT.</p>
<pre class="code">
blat -noHead chr2R.fa est.chr2R.7M-8M.fa est.psl # takes ~3m
</pre>
This creates an alignment file <span class="result"><tt>est.psl</tt></span>
in <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format2">PSL format</a>:
<pre class="code" style="font-size:small">
440 5 0 2 0 0 1 1281 - gi|1703783 447 0 447 chr2R 21146708 7776697 7778425 2 197,250, 0,197, 7776697,7778175,
494 3 0 1 0 0 2 65 + gi|1703784 498 0 498 chr2R 21146708 7775550 7776113 3 452,12,34, 0,452,464, 7775550,7776003,7776079,
...
</pre>
However, typically some ESTs align well to very many places in the genome. BLAT also
includes short local alignments starting from 30bp. For this reason, we further
<span class="assignment">filter the alignments</span>:
<pre class="code">cat est.psl | filterPSL.pl --best --minCover=80 > est.f.psl</pre>
<span class="result"><tt>est.f.psl</tt></span> now contains, for each query, only
the best alignment(s), and only if they cover at least 80% of the query length.
This reduces the number of alignments:
<pre class="code">
wc -l est.psl est.f.psl
# 10487 est.psl
# 8606 est.f.psl
</pre>
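<p>To get a feeling for what a threshold such as <tt>--minCover=80</tt> refers to, the fraction of each query covered by an alignment can be estimated directly from the PSL columns. The sketch below <i>assumes</i> that coverage is (matches + misMatches) / qSize, i.e. PSL columns 1, 2 and 11; <tt>filterPSL.pl</tt> may compute it differently in detail. The function name <tt>psl_coverage</tt> is hypothetical.</p>

```shell
# Print the query name and the percentage of the query covered by each alignment.
# Assumption: coverage = (matches + misMatches) / qSize (PSL columns 1, 2, 11).
psl_coverage() {
  awk '{ printf "%s\t%.1f\n", $10, 100 * ($1 + $2) / $11 }' "$1"
}

# Sample input: the two alignments from the PSL excerpt above.
sample=$(mktemp)
{
  printf '440\t5\t0\t2\t0\t0\t1\t1281\t-\tgi|1703783\t447\t0\t447\tchr2R\t21146708\t7776697\t7778425\t2\t197,250,\t0,197,\t7776697,7778175,\n'
  printf '494\t3\t0\t1\t0\t0\t2\t65\t+\tgi|1703784\t498\t0\t498\tchr2R\t21146708\t7775550\t7776113\t3\t452,12,34,\t0,452,464,\t7775550,7776003,7776079,\n'
} > "$sample"

result=$(psl_coverage "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

<p>Both sample alignments cover more than 99% of their query under this definition, so they would pass an 80% threshold.</p>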
We can have a look at those EST alignments by <span class="assignment">creating another custom browser track</span>:
<pre class="code">
echo -e "browser position chr2R:7000000-7050000\n\
track name=ESTs description=\"EST alignments\" db=dm3 visibility=4" > ests.browser
cat est.f.psl >> ests.browser
gzip ests.browser
</pre>
<p>You can now <span class="assignment">upload <span class="result"><tt>ests.browser.gz</tt></span> as another custom track</span> or <span class="assignment">click on this <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/results/ests.browser.gz" target="_blank">prepared custom track</a></span>.</p>
Next, <span class="assignment">generate hints from the EST alignments</span>:
<pre class="code">blat2hints.pl --nomult --in=est.f.psl --out=hints.est.gff</pre>
The script <tt>blat2hints.pl</tt> identifies the positions of likely introns,
exons and exonic regions (termed <i>exonpart</i> or <i>ep</i>) from the alignments.
The file <span class="result"><tt>hints.est.gff</tt></span> now contains these hints sorted by the third column. However, they are internally grouped by the group name following grp= in the last column. For example,<br>
<pre class="code">grep 15058069 hints.est.gff</pre>
yields
<pre class="code">
chr2R b2h ep 7559574 7559803 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h ep 7560550 7560814 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h exon 7560222 7560347 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h intron 7559804 7560221 0 . . grp=gi|15058069;pri=4;src=E
chr2R b2h intron 7560348 7560549 0 . . grp=gi|15058069;pri=4;src=E
</pre>
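<p>Because the hints of one EST are scattered across the sorted file, it can help to summarize each group. The following sketch prints, for every <tt>grp=</tt> value, the number of hints and the genomic span they cover. It is an illustration only: <tt>group_extent</tt> is a hypothetical name, and the code assumes tab-separated columns.</p>

```shell
# For every hint group, print the number of hints and the genomic span.
# Hypothetical helper; hints without a grp= attribute are pooled as "(none)".
group_extent() {
  awk -F'\t' '
    {
      grp = "(none)"
      if (match($9, /grp=[^;]*/)) grp = substr($9, RSTART + 4, RLENGTH - 4)
      n[grp]++
      if (!(grp in lo) || $4 + 0 < lo[grp]) lo[grp] = $4 + 0
      if (!(grp in hi) || $5 + 0 > hi[grp]) hi[grp] = $5 + 0
    }
    END { for (g in n) printf "%s\t%d\t%d\t%d\n", g, n[g], lo[g], hi[g] }
  ' "$1" | LC_ALL=C sort
}

# Sample input: the five hints of group gi|15058069 shown above.
sample=$(mktemp)
attr='grp=gi|15058069;pri=4;src=E'
{
  printf 'chr2R\tb2h\tep\t7559574\t7559803\t0\t.\t.\t%s\n' "$attr"
  printf 'chr2R\tb2h\tep\t7560550\t7560814\t0\t.\t.\t%s\n' "$attr"
  printf 'chr2R\tb2h\texon\t7560222\t7560347\t0\t.\t.\t%s\n' "$attr"
  printf 'chr2R\tb2h\tintron\t7559804\t7560221\t0\t.\t.\t%s\n' "$attr"
  printf 'chr2R\tb2h\tintron\t7560348\t7560549\t0\t.\t.\t%s\n' "$attr"
} > "$sample"

# Output columns: group, number of hints, leftmost start, rightmost end.
result=$(group_extent "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

<p>For the example group this yields 5 hints spanning 7559574 to 7560814, which is the extent of that EST alignment on chr2R.</p>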
<h3 id="hrnaseq">3.2 From RNA-Seq</h3>
Massive amounts of (short) transcriptome reads from next generation sequencing
methods like Illumina first need to be aligned to the genome. Recently, a large
number of short read aligners have been developed. For the sake of this tutorial we will
assume that we have already aligned the reads to the genome and that we have
two files:
<ol>
<li>The file <a href="data/chr2R.7M-8M.wig"><tt>chr2R.7M-8M.wig</tt></a> contains a coverage graph that gives, for each base in the window chr2R:7000000-8000000, the number of
read alignments that cover the position. The UCSC group has a <a href="http://genome.ucsc.edu/FAQ/FAQformat.html#format6">description of the wiggle format (.wig)</a>.</li>
<li>The file <a href="data/hints.rnaseq.intron.gff"><tt>hints.rnaseq.intron.gff</tt></a>
contains likely intron positions, inferred from gaps in the query of the read alignments. Together with the intron boundaries the multiplicity (mult) is reported, which
counts the number of alignments that support the given intron candidate, if there is more than one.</li>
</ol>
In this <a href="http://bioinf.uni-greifswald.de/augustus/binaries/readme.rnaseq.html">readme
about AUGUSTUS in the RGASP assessment</a> a method is described that would produce two such files. TopHat is a spliced read mapper for RNA-Seq that also produces
a coverage graph and reports introns with their multiplicity.
<p>
<span class="assignment">Upload the file <tt>chr2R.7M-8M.wig</tt> as a UCSC custom track</span> (<tt>gzip</tt>ing would speed up upload) or <span class="assignment">click on this <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/data/chr2R.7M-8M.wig.gz" target="_blank">prepared custom track</a></span>.</p>
<p><span class="assignment">Generate hints about exonic regions from the coverage graph</span> (wig file):
<pre class="code">
cat chr2R.7M-8M.wig | wig2hints.pl --width=10 --margin=10 --minthresh=2 --minscore=4 \
--src=W --type=ep --radius=4.5 > hints.rnaseq.ep.gff
</pre>
The resulting <span class="result"><tt>hints.rnaseq.ep.gff</tt></span> now
contains hints in a GFF format that <tt>augustus</tt> understands:
<pre class="code">
chr2R w2h ep 7007551 7007560 5.300 . . src=W;mult=5;
chr2R w2h ep 7007561 7007570 7.400 . . src=W;mult=7;
chr2R w2h ep 7007571 7007580 9.700 . . src=W;mult=9;
chr2R w2h ep 7007581 7007590 10.200 . . src=W;mult=10;
</pre>
Again, the multiplicity (mult) is given in the last column; here it reflects the
average coverage in the given interval. <tt>augustus</tt> will consider each
such line as evidence that the sequence interval is <i>part of an exon</i> (ep=exonpart).
<h3 id="concath">3.3 Concatenate all Hints</h3>
Now, <span class="assignment">join the hints from all sources</span> into one file:
<pre class="code">cat hints.est.gff hints.rnaseq.intron.gff hints.rnaseq.ep.gff > hints.gff</pre>
<span class="result"><tt>hints.gff</tt></span> now contains the exon, intron and exonpart hints from ESTs as well as the intron and exonpart hints from RNA-Seq.
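<p>After concatenating, it is worth checking that every source actually ended up in the file. The sketch below tallies hints by source letter (<tt>src=</tt>) and feature type. It is only an illustration: <tt>hints_by_source</tt> is a hypothetical name, and the sample consists of a few hint lines shown earlier in this tutorial rather than the full <tt>hints.gff</tt>.</p>

```shell
# Tally hints by source (src=) and feature type (column 3).
# Hypothetical helper for a quick sanity check of a concatenated hints file.
hints_by_source() {
  awk -F'\t' '
    {
      src = "?"
      if (match($9, /src=[^;]*/)) src = substr($9, RSTART + 4, RLENGTH - 4)
      n[src "\t" $3]++
    }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" | LC_ALL=C sort
}

# Sample input: two EST hints and two RNA-Seq exonpart hints from above.
sample=$(mktemp)
{
  printf 'chr2R\tb2h\tep\t7559574\t7559803\t0\t.\t.\tgrp=gi|15058069;pri=4;src=E\n'
  printf 'chr2R\tb2h\tintron\t7559804\t7560221\t0\t.\t.\tgrp=gi|15058069;pri=4;src=E\n'
  printf 'chr2R\tw2h\tep\t7007551\t7007560\t5.300\t.\t.\tsrc=W;mult=5;\n'
  printf 'chr2R\tw2h\tep\t7007561\t7007570\t7.400\t.\t.\tsrc=W;mult=7;\n'
} > "$sample"

# Output columns: source letter, feature type, number of hints.
result=$(hints_by_source "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

<p>Run on the real <tt>hints.gff</tt>, a source letter that is missing from the output would point to an input file that was empty or left out of the <tt>cat</tt> command.</p>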
<h2 id="extr">4. SET HINT PARAMETERS</h2>
The strength of the influence of the hints can be adjusted with a few parameters
from no influence (ab initio) up to the point where they should be trusted
completely and "anchor" the gene structure. These parameters are stored in an
extrinsic configuration file.
The folder <tt>config/extrinsic</tt> of the AUGUSTUS package contains a
few examples.
<p>Start by <span class="assignment">copying another extrinsic configuration file</span>:</p>
<pre class="code">cp $AUGUSTUS_CONFIG_PATH/extrinsic/extrinsic.M.RM.E.W.cfg extrinsic.bug.cfg</pre>
Now <span class="assignment">edit <tt class="result">extrinsic.bug.cfg</tt></span> so that the
non-comment lines are like this. Alternatively, you may just copy that file from
the result files and edit some of the bold numbers below.
<pre class="code">
[SOURCES]
M E W
[SOURCE-PARAMETERS]
[GENERAL]
start 1 1 M 1 1e+100 E 1 1 W 1 1
stop 1 1 M 1 1e+100 E 1 1 W 1 1
tss 1 1 M 1 1e+100 E 1 1 W 1 1
tts 1 1 M 1 1e+100 E 1 1 W 1 1
ass 1 1 M 1 1e+100 E 1 1 W 1 1
dss 1 1 M 1 1e+100 E 1 1 W 1 1
exonpart 1 <span style="font-weight:bold;color:red">.997</span> M 1 1e+100 E 1 <span style="font-weight:bold;color:darkgreen">1e2</span> W 1 <span style="font-weight:bold;color:darkgreen">1.05</span>
exon 1 1 M 1 1e+100 E 1 <span style="font-weight:bold;color:darkgreen">1e4</span> W 1 1
intronpart 1 1 M 1 1e+100 E 1 1 W 1 1
intron 1 <span style="font-weight:bold;color:red">.3</span> M 1 1e+100 E 1 <span style="font-weight:bold;color:darkgreen">1e6</span> W 1 1
CDSpart 1 1 <span style="font-weight:bold;color:red">0.985</span> M 1 1e+100 E 1 1 W 1 1
CDS 1 1 M 1 1e+100 E 1 1 W 1 1
UTRpart 1 1 <span style="font-weight:bold;color:red">.96</span> M 1 1e+100 E 1 1 W 1 1
UTR 1 1 M 1 1e+100 E 1 1 W 1 1
irpart 1 1 M 1 1e+100 E 1 1 W 1 1
nonexonpart 1 1 M 1 1e+100 E 1 1 W 1 1
genicpart 1 1 M 1 1e+100 E 1 1 W 1 1
</pre>
The bold green numbers specify the <i>bonus</i> that a gene structure candidate gets
for being compatible with a hint of that type and source.
<p>For example, the <span style="font-weight:bold;color:darkgreen">1e6</span> in the intron row after the source letter E means that for each intron hint from
ESTs (src=E), gene structures that have an intron
with both boundaries exactly as given in the hint are rewarded by a factor of 1 million
relative to gene structures that disregard the intron hint.</p>
<p>The <span style="font-weight:bold;color:darkgreen">1.05</span> in the exonpart row after the letter W specifies that for each exonpart hint in the RNA-Seq hints file
(src=W), every gene structure that has an exon including the range of the hint gets
a relative bonus factor 1.05 <i>per multiplicity</i>.</p>
<p>The red numbers specify a penalty (malus) for gene structures with unsupported
features. For example, the <span style="font-weight:bold;color:red">.3</span> in
the intron row means that every intron candidate
<i>that has no intron hints supporting it</i> is penalized by multiplying its relative probability by the factor 0.3. If you decrease this number further
(say from .3 to .001), fewer introns unsupported by spliced transcriptome
reads should be predicted. This would likely decrease the false positive intron rate, but more true unsupported introns would be missed as well.
</p>
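<p>To make the effect of a bonus "per multiplicity" concrete, the following sketch computes the factor that each exonpart hint from section 3.2 would contribute with the W bonus of 1.05. It <i>assumes</i> that the bonus compounds, i.e. that a hint with multiplicity <i>m</i> counts like <i>m</i> single hints and contributes 1.05<sup><i>m</i></sup>; this is one reading of the description above, not a statement about the AUGUSTUS implementation. The function name <tt>ep_bonus</tt> is hypothetical.</p>

```shell
# Print start, end, mult and the resulting bonus for each exonpart hint.
# Assumption: the per-hint bonus compounds as bonus^mult.
# Usage: ep_bonus HINTSFILE BONUS
ep_bonus() {
  awk -F'\t' -v bonus="$2" '
    $3 == "ep" {
      mult = 1   # hints without a mult= attribute count once
      if (match($9, /mult=[0-9]+/)) mult = substr($9, RSTART + 5, RLENGTH - 5)
      printf "%d\t%d\t%d\t%.3f\n", $4, $5, mult, bonus ^ mult
    }
  ' "$1"
}

# Sample input: the four RNA-Seq exonpart hints shown in section 3.2.
sample=$(mktemp)
{
  printf 'chr2R\tw2h\tep\t7007551\t7007560\t5.300\t.\t.\tsrc=W;mult=5;\n'
  printf 'chr2R\tw2h\tep\t7007561\t7007570\t7.400\t.\t.\tsrc=W;mult=7;\n'
  printf 'chr2R\tw2h\tep\t7007571\t7007580\t9.700\t.\t.\tsrc=W;mult=9;\n'
  printf 'chr2R\tw2h\tep\t7007581\t7007590\t10.200\t.\t.\tsrc=W;mult=10;\n'
} > "$sample"

result=$(ep_bonus "$sample" 1.05)
printf '%s\n' "$result"
rm -f "$sample"
```

<p>Under this assumption a single hint with mult=10 yields a factor of about 1.63, which is tiny compared to the 1e6 bonus of an EST intron hint. RNA-Seq exonpart hints therefore nudge the prediction only when many of them agree.</p>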
For more information, look into one of the extrinsic.cfg files.
<h2 id="predh">5. PREDICT GENES USING HINTS</h2>
<span class="assignment">Predict the genes in the range 7,000,001-7,500,000 of chr2R of <i>D. melanogaster</i> using evidence from <tt>hints.gff</tt></span>.
<pre class="code">augustus --species=fly --predictionStart=7000001 --predictionEnd=7500000 chr2R.fa \
--extrinsicCfgFile=extrinsic.bug.cfg --hintsfile=hints.gff > augustus.hints.gff # takes ~9m</pre>
<p>The species <tt>fly</tt> contains UTR parameters, which we didn't have the time to
train for <tt>bug</tt>. When using RNA-Seq as hints it is better to use a model
with UTRs, as a significant fraction of reads map to UTRs. It is also possible
to use <tt>bug</tt> here, though.</p>
The output <span class="result"><tt>augustus.hints.gff</tt></span> now looks like this:
<pre class="code">
# This output was generated with AUGUSTUS (version 2.5).
...
# start gene g1
chr2R AUGUSTUS gene 7007533 7009385 0.2 - . g1
chr2R AUGUSTUS transcript 7007533 7009385 0.2 - . g1.t1
chr2R AUGUSTUS tts 7007533 7007533 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7007533 7008630 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS stop_codon 7007610 7007612 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008631 7008694 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7008812 7008865 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS intron 7009192 7009251 1 - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7007610 7008630 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008695 7008811 1 - 1 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008695 7008811 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7008866 7009191 1 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7008866 7009191 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS CDS 7009252 7009353 0.94 - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS exon 7009252 7009385 . - . transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS start_codon 7009351 7009353 . - 0 transcript_id "g1.t1"; gene_id "g1";
chr2R AUGUSTUS tss 7009385 7009385 . - . transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MNTLSSARSVAIYVGPVRSSRSASVLAHEQAKSSITEEHKTYDEIPRPNKFKFMRAFMPGGEFQNASITEYTSAMRKR
# YGDIYVMPGMFGRKDWVTTFNTKDIEMVFRNEGIWPRRDGLDSIVYFREHVRPDVYGEVQGLVASQNEAWGKLRSAINPIFMQPRGLRMYYEPLSNIN
# NEFIERIKEIRDPKTLEVPEDFTDEISRLVFESLGLVAFDRQMGLIRKNRDNSDALTLFQTSRDIFRLTFKLDIQPSMWKIISTPTYRKMKRTLNDSL
# NVAQKMLKENQDALEKRRQAGEKINSNSMLERLMEIDPKVAVIMSLDILFAGVDATATLLSAVLLCLSKHPDKQAKLREELLSIMPTKDSLLNEENMK
# DMPYLRAVIKETLRYYPNGLGTMRTCQNDVILSGYRVPKGTTVLLGSNVLMKEATYYPRPDEFLPERWLRDPETGKKMQVSPFTFLPFGFGPRMCIGK
# RVVDLEMETTVAKLIRNFHVEFNRDASRPFKTMFVMEPAITFPFKFTDIEQ]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 100
# CDS exons: 4/4
# E: 4
# W: 4
# CDS introns: 3/3
# E: 3
# 5'UTR exons and introns: 1/1
# E: 1
# 3'UTR exons and introns: 1/1
# W: 1
# hint groups fully obeyed: 137
# E: 4 (gi|15542574,SRR023546.8642467/1)
# W: 133
# incompatible hint groups: 18
# E: 18 (gi|13769068,gi|4203815,gi|15543927,gi|38623822,gi|15539951,gi|14693753,gi|14699170,...)
# end gene g1
...
</pre>
<p>After each predicted transcript, a short summary follows of how well this transcript is supported by
and compatible with the hints. Note that AUGUSTUS now predicts alternative
splice forms (ending e.g. in .t2).</p>
Finally, <span class="assignment">make another custom track with the predictions using hints</span>:
<pre class="code">
echo -e "browser position chr2R:7299000-7318000\n\
track name=withhints description=\"Augustus predictions using hints\" db=dm3 visibility=3" > withhints.browser
grep -P "AUGUSTUS\t(CDS|exon)" augustus.hints.gff >> withhints.browser
</pre>
In the last line we select from the predictions just the lines specifying exons and CDS. The additional exon lines make it possible to identify the UTRs (if you used <tt>fly</tt>)
by subtracting the CDS ranges from the exon ranges.
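<p>The subtraction of CDS ranges from exon ranges can be sketched as follows. This is an illustration, not a script from the AUGUSTUS package: <tt>utr_ranges</tt> is a hypothetical name, and the code assumes tab-separated GFF with at most one CDS segment per exon of a transcript.</p>

```shell
# Print the parts of each exon that are not covered by a CDS of the same
# transcript, i.e. the UTR ranges. Hypothetical helper.
utr_ranges() {
  awk -F'\t' '
    $3 == "CDS"  { nc++; ctx[nc] = $9; cs[nc] = $4 + 0; ce[nc] = $5 + 0 }
    $3 == "exon" { ne++; etx[ne] = $9; es[ne] = $4 + 0; ee[ne] = $5 + 0; seq[ne] = $1 }
    END {
      for (i = 1; i <= ne; i++) {
        coding = 0
        for (j = 1; j <= nc; j++) {
          # skip CDS segments of other transcripts or without overlap
          if (ctx[j] != etx[i] || cs[j] > ee[i] || ce[j] < es[i]) continue
          coding = 1
          if (cs[j] > es[i]) printf "%s\t%d\t%d\n", seq[i], es[i], cs[j] - 1
          if (ce[j] < ee[i]) printf "%s\t%d\t%d\n", seq[i], ce[j] + 1, ee[i]
        }
        # an exon without any CDS is UTR over its whole length
        if (!coding) printf "%s\t%d\t%d\n", seq[i], es[i], ee[i]
      }
    }
  ' "$1"
}

# Sample input: exon and CDS lines of g1 from augustus.hints.gff shown above.
sample=$(mktemp)
attr='transcript_id "g1.t1"; gene_id "g1";'
{
  printf 'chr2R\tAUGUSTUS\texon\t7007533\t7008630\t.\t-\t.\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\tCDS\t7007610\t7008630\t1\t-\t1\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\tCDS\t7008695\t7008811\t1\t-\t1\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\texon\t7008695\t7008811\t.\t-\t.\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\tCDS\t7008866\t7009191\t1\t-\t0\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\texon\t7008866\t7009191\t.\t-\t.\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\tCDS\t7009252\t7009353\t0.94\t-\t0\t%s\n' "$attr"
  printf 'chr2R\tAUGUSTUS\texon\t7009252\t7009385\t.\t-\t.\t%s\n' "$attr"
} > "$sample"

result=$(utr_ranges "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

<p>For g1 this gives the two ranges 7007533-7007609 and 7009354-7009385. Since the gene is on the minus strand, the first is the 3' UTR (ending at the tts) and the second the 5' UTR (starting at the tss).</p>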
<p>Again, you may also just click on this link to a <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://bioinf.uni-greifswald.de/augustus/binaries/tutorial/results/withhints.browser" target="_blank">prepared custom track</a> with the predictions.</p>
</body></html>