File: scipio.html

package info (click to toggle)
augustus 3.4.0%2Bdfsg2-2
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 758,480 kB
  • sloc: cpp: 65,451; perl: 21,436; python: 3,927; ansic: 1,240; makefile: 1,032; sh: 189; javascript: 32
file content (228 lines) | stat: -rw-r--r-- 14,712 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head><title>Scipio</title>
<link rel="stylesheet" type="text/css" href="augustus.css">
<script src="tutorial.js" type="text/javascript"></script>
</head><body>
<font size=-1>
Navigate to <a href="index.html">Lab Session on AUGUSTUS</a>. 
<a href="scipio.html">Using Scipio</a>
<a href="training.html">Training AUGUSTUS</a>.
<a href="prediction.html">Predicting Genes</a>.
<a href="ppx.html">AUGUSTUS-PPX</a>.
</font>
<div align="right">Show <a href="javascript:allOn()">all</a> / <a href="javascript:allOff()">no</a> details.</div>

<h1>Using Scipio to create a (training) set of gene structures</h1>

<p>In the case where a genome and protein sequences from the same or very closely related species are given, 
<a href="http://www.webscipio.org/">Scipio</a> can be used to construct the gene structure from the sequences.
Scipio uses <a href="http://users.soe.ucsc.edu/~kent/">BLAT</a> to align the protein sequences against the genome
and then determines the exact boundaries of the exons, and adds small exons that were not found by BLAT.<br>
</p>

<a href="javascript:onoff('do')" class="dlink"><span id="do" title="dod" class="dcross">[+]</span>
<span class="dtitle">Installing Scipio...</span></a> <br>
<div id="dod" class="details" style="display:none;">
<span class="assignment">Download Scipio</span> from <a href="http://www.webscipio.org/webscipio/download_scipio">
http://www.webscipio.org/webscipio/download_scipio</a>.<br>
Scipio requires that BLAT and the yaml perl module YAML.pm is installed.<br>
<span class="assignment">Download and install <a href="http://hgwdev.cse.ucsc.edu/~kent/exe/linux/">BLAT</a>.</span><br>
On UBUNTU linux <span class="assignment">install <tt>libyaml-perl</tt></span> doing
<pre class="code">sudo apt-get install libyaml-perl</pre>
</div>


<a href="javascript:onoff('da')" class="dlink"><span id="da" title="dad" class="dcross">[+]</span>
<span class="dtitle">Some comments about when to use Scipio...</span></a> <br>
<div id="dad" class="details" style="display:none;">
Scipio is post-processing spliced alignments from BLAT. It is therefore suitable only for cases in which the input protein sequences are expected to be fairly similar to 
the proteins encoded by the genome (e.g. >90% similarity). One use case is the migration of gene structures from one assembly to another assembly. Another use case
is the identification of exon coordinates for protein of the same species as the genome when the gene structures (GFF) is not available (anymore). A third use case
is the mapping of protein sequences from a related known species to a new species, e.g. from human to Orangutan. Scipio is well-suited for draft assemblies with short
contigs, as it is assembling alignments of proteins where different fragments match different contigs.
</div><br>

<h2>1. Run Scipio</h2>

As example input we use data from 5 mega bases of the <i>Drosophila</i> genome.
We first <span class="assignment">create symbolic links</span> to have aliases to the file names with intuitive names
<pre class="code">ln -s chr2R.2M-7M.fa genome.fa
ln -s chr2R.2M-7M.aa proteins.aa
</pre>
(The first two megabases at the chromosome tip were not taken because they are highly repetitive.)<br><br>

<span class="assignment">Run Scipio:</span>
<pre class="code">scipio.1.4.pl --blat_output=prot.vs.genome.psl genome.fa proteins.aa > scipio.yaml   # takes ~7m
</pre> 
The option <tt>--blat_output=prot.vs.genome.psl</tt> means that the time-consuming protein-to-dna alignments are stored, in case
one wants to rerun Scipio with other parameters.

The file <tt class="result">scipio.yaml</tt> now contains for each protein alignment details, in particular also the exact sequence coordinates we
require. (Protein BLAT resolves exon boundaries at best up the accuracy of complete amino acids, which is not enough for 
creating a training set of gene structures.)<br><br>

<span class="assignment">Extract a GFF file from the yaml output</span> and <span class="assignment">convert the GFF output</span> to one suitable for further processing
<pre class="code">cat scipio.yaml | yaml2gff.1.4.pl > scipio.scipiogff
scipiogff2gff.pl --in=scipio.scipiogff --out=scipio.gff
</pre>

The command 
<pre class="code">cat scipio.yaml | yaml2log.1.4.pl > scipio.log</pre>
creates a log file that can be examined to see for each protein whether a perfect alignment was found (status: complete).<br><br>

The file <span class="result"><tt>scipio.gff</tt></span> should now look like this
<pre class="code">
chr2R   Scipio  CDS     900562  900621  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904518  904880  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904940  905131  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     905195  905263  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     3595076 3596041 1.000   +       0       transcript_id "2517"
...
</pre>

<h3> 1.1 Optional: visualize gene structures</h3>
<p>We will check what the gene structures constructed above "look like" using the
<a href="http://genome.ucsc.edu">UCSC Genome Browser</a>. This is here possible as
the UCSC Genome Browser holds <i>Drosophila</i>
assembly dm3. In general, you will have to set up your own browser (e.g. 
<a href="http://gmod.org/wiki/GBrowse">GBrowse</a>).</p>

<span class="assignment">Prepare a custom track file</span>
<pre class="code">
echo -e "browser position chr2R:3000000-3200000\n\
browser hide multiz15way bdtnpChipper\n\
track name=scipio description=\"training and test genes\"\
 db=dm3 offset=2000000 visibility=3" > scipio.browser
cat scipio.gff >> scipio.browser
gzip -c scipio.browser > scipio.browser.gz
</pre>

The file <span class="result"><tt>scipio.browser</tt> and its packed version
<tt>scipio.browser.gz</tt></span> will now look like this
<pre class="code">
browser position chr2R:3000000-3200000
browser hide multiz15way bdtnpChipper
track name=scipio description="training and test genes" db=dm3 offset=2000000 visibility=2
chr2R   Scipio  CDS     900562  900621  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904518  904880  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     904940  905131  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     905195  905263  1.000   +       0       transcript_id "392"
chr2R   Scipio  CDS     3595076 3596041 1.000   +       0       transcript_id "2517"
...
</pre>

This is a format that the UCSC Genome Browser understands. The first three lines tell it
which region to display first (chr2R:3000000-3200000), to hide the tracks multiz15way and
bdtnpChipper (we don't need to look at them), the track name (scipio) a header
for the track, the assembly it refers to (dm3), that the Browser needs to add 2,000,000 to the coordinates because our example genome starts after 2 mega bases only (offset=2000000) and that the display mode should be "full" (visibility=2).

<ol>
<li><span class="assignment">open the <a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3" target="_blank">browser for <i>Drosophila melanogaster</i></a>,</span></li> 
<li> <span class="assignment">click on "add custom tracks" or "manage custom tracks"</span>,</li>
<li> <span class="assignment">upload the file <tt>scipio.browser.gz</tt></span> and
<li> <span class="assignment">click on the link in the column "Pos" in the table
of custom tracks</tt></span>. 
</ol>

The lazy ones can look at the results by clicking this link to a previously
<a href="http://genome.ucsc.edu/cgi-bin/hgTracks?db=dm3&hgt.customText=http://augustus.gobics.de/tutorial/customtracks/scipio.browser.gz" target="_blank">prepared custom track</a>.

<p>
The UCSC Genome Browser group has an excellent <a href="http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#CustomTracks">help page for setting up UCSC custom tracks</a>.</p>

<h2>2. Convert GFF file to Genbank format file</h2>

The training program <tt>etraining</tt> of the AUGUSTUS package requires input in the form of Genbank format,
that includes both the genomic sequences and the sequence coordinates of the coding regions CDS and optionally
of the UTR. In  many cases the coordinates of the gene structures are available in 
<a href="http://www.sanger.ac.uk/resources/software/gff/spec.html">GFF</a> format or 
<a href="http://mblab.wustl.edu/GTF22.html">GTF format</a>.
The tool <tt>gff2gbSmallDNA.pl</tt> that comes with AUGUSTUS can be used to <span class="assignment">convert</span> files in some <span class="assignment">GFF</span> formats or
GTF <span class="assignment">to Genbank format</span>:<br>

<pre class="code">gff2gbSmallDNA.pl scipio.gff genome.fa 1000 genes.raw.gb</pre>

This command takes each gene in <tt>scipio.gff</tt> together with 1000 bp flanking intergenic region upstream
and downstream of the gene and creates a LOCUS in Genbank format.<br><br>

<a href="javascript:onoff('if')" class="dlink"><span id="if" title="ifd" class="dcross">[+]</span>
<span class="dtitle">Input format of gff2gbSmallDNA.pl...</span></a> <br>
<div id="ifd" class="details" style="display:none;">
The last column (column 9) of the GFF input file is used to group exons of the same transcript together. For an example
see the <tt>scipio.gff</tt> from above. In GTF format this is based on whether they agree on the <tt>transcript_id</tt> field. 
If the last column contains the string <tt>Parent=</tt> (as in GFF3) then this is used for grouping. If no grouping 
from the standard formats is detected then the whole 9th column is used for grouping exons to genes.
It is possible to convert genes just containing the coding regions (CDS in the third column) but also 
to convert <b>genes including UTR annotation</b> (keyword UTR or exon) in the third column.

It is also possible to restrict output using lists of gene names to include or exclude:
<pre class="code">Usage: gff2gbSmallDNA.pl gff-file seq-file max-size-of-gene-flanking-DNA output-file [options]
options:
--bad=badfile    Specify a file with gene names. All except these are included in the output.
--good=goodfile  Specify a file with gene names. Only these genes are included in the output.
--overlap        Overlap filtering turned off.
</pre>
</div><br>

<a href="javascript:onoff('fr')" class="dlink"><span id="fr" title="frd" class="dcross">[+]</span>
<span class="dtitle">Size of flanking regions...</span></a> <br>
<div id="frd" class="details" style="display:none;">

The flanking regions upstream and downstream of each gene are used to train non-coding DNA content, i.e. estimate <i>k</i>-mer 
frequencies in these regions. When two neighboring genes in the gff file are closer to each other
than 2*<tt>max-size-of-gene-flanking-DNA</tt>, then then flanking region is decreased such that half of the intergenic
region between the two genes is included in each locus. This avoids to some extend to have overlapping non-coding regions for training
and to include genes in the "intergenic" region.
The choice of the parameter <tt>max-size-of-gene-flanking-DNA</tt> is important for several reasons.
<ul>
<li> The flanking region should be large enough to allow a prepresentative estimate of non coding regions.
When the GFF file only contains CDS, then part of the flanking regions are UTR and may not be representative of
intergenic region (e.g. CpG islands in vertebrates).</li>
<li>Usually the gff file is not complete and genes are missing from it. In that case the flanking regions may contain genic regions
of genes that are missing in the gff file. Apart from the bias on <i>k</i>-mer frequencies this will yield a bad specificity when
these loci are used for assessing accuracy.
<li>A small flanking region may lead to an overestimation of gene level sensitivity if it is used for assessing accuracy. Generally,
gene finders may be more accurate on sequences that contain exactly one locus and only little flanking regions.
</li>
<li>A large flanking region may increase the running time of <tt>optimize_augustus.pl</tt> substantially.</li>
</li>
</ul>
A compromise between above conflicting aims needs to be sought.
</div>

<p>The file <tt>genes.raw.gb</tt> now contains training genes in the right format. However, some fraction of them may
be problematic with respect to AUGUSTUS training. They may contain genes with <b>nonconsensus splice sites, missing start codons,
missing stop codons, in-frame stop codons</b>. To avoid warning messages later and because genes with these features are partially
ignored anyways, we <span class="assignment">remove these problematic locis</span> from <tt>genes.raw.gb</tt>:

<pre class="code">etraining --species=generic --stopCodonExcludedFromCDS=true genes.raw.gb 2> train.err
</pre>
<tt>etraining</tt> complains about and reports problematic genes. The option --stopCodonExcludedFromCDS
determines whether in the input Genbank file the stop codon is considered to be part of the CDS or not. This 
is sometimes the case and sometimes not. When the GFF stems from other sources than Scipio this may have to be set to <tt>false</tt>
rather than to <tt>true</tt>. We <span class="assignment">write the error messages into a file</span> <tt>train.err</tt> which looks like this:

<pre class="code" style="font-size:small">
gene 186 transcr. 1 in sequence chr2R_549753-555284: Initial exon has length < 3!
gene 461 transcr. 1 in sequence chr2R_1034318-1036751: Initial Exon doesn't begin with start codon.
gene 567 transcr. 1 in sequence chr2R_1198437-1201521: Initial Exon doesn't begin with start codon.
gene 4537 transcr. 1 in sequence chr2R_1354359-1361857: Initial Exon doesn't begin with start codon.
gene 4783 transcr. 1 in sequence chr2R_1669860-1673241: Terminal exon doesn't end in stop codon. Variable stopCodonExcludedFromCDS set right?
gene 5161 transcr. 1 in sequence chr2R_2043765-2046183: Single exon gene doesn't begin with atg codon but with ccc
gene 6319 transcr. 1 in sequence chr2R_3734070-3735386: Initial Exon doesn't begin with start codon.
gene 3577 transcr. 1 in sequence chr2R_4767472-4770826: Initial Exon doesn't begin with start codon.
</pre>

</p>
We can now <span class="assignment">filter out the problematic genes</span>, e.g. by issuing
<pre class="code">
cat train.err | perl -pe 's/.*in sequence (\S+): .*/$1/' > badgenes.lst
filterGenes.pl badgenes.lst genes.raw.gb > genes.gb
grep -c "LOCUS" genes.raw.gb genes.gb
# genes.raw.gb:594
# genes.gb:586
</pre>
Here, <tt>badgenes.lst</tt> ist a file with gene names that were reportet in <tt>train.err</tt>.<br><br>
<span class="result"><tt>genes.gb</tt> now contains 586 loci with one "unproblematic" gene each.</span>

</body></html>