File: de_novo.html

package info (click to toggle)
augustus 3.4.0%2Bdfsg2-2
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 758,480 kB
  • sloc: cpp: 65,451; perl: 21,436; python: 3,927; ansic: 1,240; makefile: 1,032; sh: 189; javascript: 32
file content (201 lines) | stat: -rw-r--r-- 8,231 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head><title>Predicting Genes with AUGUSTUS</title>
<link rel="stylesheet" type="text/css" href="augustus.css">
<script src="tutorial.js" type="text/javascript"></script>
</head><body>
<font size=-1>
Navigate to <a href="index.html">main AugCGP tutorial</a>. 
<a href="cactus.html">Cactus alignments and assembly Hubs</a>.
<a href="rnaseq.html">AugCGP with RNA-Seq</a>.
<a href="liftover.html">Annotation transfer with AugCGP</a>.
<a href="rnaseq+liftover.html">Combining RNA-Seq and annotation evidence</a>.
<a href="hgm.html">Cross-species consistency of gene sets</a>.
</font>
<div align="right">Show <a href="javascript:allOn()">all</a> / <a href="javascript:allOff()">no</a> details.</div>

<h1><i>De novo</i> comparative gene prediction with Augustus-cgp</h1>

In this tutorial, we run Augustus-cgp in its most basic form: <i>de novo</i>. That means
only the raw genomes and the alignment of the genomes are input, but no extrinsic evidence, e.g. from transcriptome data.
In most applications, however, RNA-Seq data will be available and it is recommended to incorporate them to obtain a better accuracy.

<h2 id="loadFA">1. Load genomes into an SQLite database</h2>

<span class="assignment">Prepare</span> a text file with a list of species names and location of
the corresponding genome fasta files.<br><br>

<a href="javascript:onoff('cft')" class="dlink"><span id="cft" title="cftd" class="dcross">[+]</span>
<span class="dtitle">file format...</span></a> <br>
<div id="cftd" class="details" style="display:none;">
<pre class="code">
name_of_genome_1  path/to/genome_1
name_of_genome_2  path/to/genome_2
...
name_of_genome_N  path/to/genome_N
</pre>
</div>
</br>

<pre class="code">
for f in $PWD/genomes/*.fa; do echo -ne "$(basename $f .fa)\t$f\n"; done >genomes.tbl
</pre>

The file  <span class="result"><tt>genomes.tbl</tt></span> will now look like this (except for the parent directory)<br>
<pre class="code">
bosTau8 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/bosTau8.fa
canFam3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/canFam3.fa
galGal4 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/galGal4.fa
hg38    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/hg38.fa
mm10    /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/mm10.fa
monDom5 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/monDom5.fa
rheMac3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rheMac3.fa
rn6     /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rn6.fa
</pre>

<span class="assignment">Load</span> the genomes into an SQLite database
<pre class="code">
while read line
do
  species=$(echo "$line" | cut -f 1)
  genome=$(echo "$line" | cut -f 2)
  load2sqlitedb --noIdx --species=$species --dbaccess=vertebrates.db $genome
done &ltgenomes.tbl
</pre>

<span class="assignment">Finalize</span>  database by creating indices on the tables
<pre class="code">
load2sqlitedb --makeIdx --dbaccess=vertebrates.db
</pre>
You can <span class="assignment">check</span> if loading was successful with following
database query
<pre class="code">
sqlite3 -header -column vertebrates.db "\
 SELECT speciesname, \
  sum(end-start+1) AS 'genome length',\
  count(*) AS '# chunks',\
  count(distinct seqnr) AS '# seqs'\
 FROM genomes natural join speciesnames\
 GROUP BY speciesname;"
</pre>
that returns a summary of the genomes in the database, e.g. their sizes (total number of bases per genome),
number of sequences per genome, etc.
<pre class="code">
speciesname  genome length  # chunks    # seqs    
-----------  -------------  ----------  ----------
bosTau8      156091         4           1         
canFam3      184728         4           1         
galGal4      149999         3           1         
hg38         210155         5           1         
mm10         178393         4           1         
monDom5      540519         11          1         
rheMac3      220640         5           1         
rn6          99944          2           1         
</pre>

<h2 id="runAug">2. Run AUGUSTUS in CGP mode</h2>

<span class="assignment">Create</span> a new folder for the de novo experiments and
<span class="assignment">switch</span> to the new directory
<pre class="code">
mkdir augCGP_denovo
cd augCGP_denovo
</pre>
For convenience <span class="assignment">assign</span> each alignment chunk to a job ID by
creating softlinks
<pre class="code">
num=1
for f in ../mafs/*.maf; do ln -s $f $num.maf; ((num++)); done
</pre>
<span class="assignment">Run</span> Augustus in cgp mode on all alignment chunks in parallel (~3min).<br>
<span class="assignment">Use</span> the option <tt>--softmasking=1</tt> in cases where the genomes are soft-masked.<br>
The parameter <tt>--species</tt> is the same as in the standard version of Augustus. You can take
the species identifier that best represents your clade,<br>
(e.g. <tt>--species=human</tt> for mammalian clades, <tt>--species=fly</tt> for fly clades, <tt>--species=chicken</tt> for bird clades, ...)<br>
A complete list of species identifiers can be found in
the directory <tt>augustus/config/species</tt>.
<pre class="code">
for ali in *.maf; do
id=${ali%.maf} # remove .maf suffix
augustus \
--species=human \
--softmasking=1 \
--treefile=../tree.nwk \
--alnfile=$ali \
--dbaccess=../vertebrates.db \
--speciesfilenames=../genomes.tbl \
--alternatives-from-evidence=0 \
--/CompPred/outdir=pred$id &gt aug$id.out 2&gt err$id.out &
done
</pre>
This will generate the folders <span class="result"><tt>pred*/</tt></span> (one for each alignment chunk)
that contain gff files with gene predictions for each input genome.
<pre class="code">
bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff
</pre>

Note that the parallelization with the bash '&' command  above is quite simple and rather for demonstration purposes.<br>
For real applications with several hundreds or thousands of alignment chunks, we recommend to
run job arrays on a compute cluster.

<h2 id="merge">3. Merge gene predictions from parallel runs</h2>
<pre class="code">
mkdir joined_pred
while read line
do
  species=$(echo "$line" | cut -f 1)
  find pred* -name "${species}.cgp.gff" >${species}_gtfs.lst;
  joingenes -f ${species}_gtfs.lst -o joined_pred/$species.gff
done &lt ../genomes.tbl
</pre>
This will create the folder <span class="result"><tt>joined_pred/</tt></span>
with the final gene predictions for each input genome.
<pre class="code">
bosTau8.gff
canFam3.gff
galGal4.gff
hg38.gff
mm10.gff
monDom5.gff
rheMac3.gff
rn6.gff
</pre>

<h2 id="makeBeds">4. Upload gene predictions into the assembly hub</h2>
<span class="assignment">Convert</span> the final gene predictions from gff to BED format and place
each BED file in a separate folder with the name of the corresponding genome. It is important that directory names are consistent with the names in the HAL alignment.
<pre class="code">
for f in joined_pred/*.gff
do
mkdir "$(basename $f .gff)"
gtf2bed.pl <$f >$(basename $f .gff)/augCGP_denovo.bed --itemRgb=225,0,0
done
</pre>
Specify any RGB color you like for the track with option <tt>--itemRgb</tt>, e.g. <span style="color:rgb(255,0,0);">225,0,0.</span><br>
The name of the current directory (i.e. <tt>augCGP_denovo</tt>) will be used as track name on the browser.<br>

<span class="assignment">Switch</span> back to the main working directory <tt>data/</tt>
<pre class="code">
cd ..
</pre>
and rerun the <tt>hal2assemblyHub.py</tt> script. Include gene tracks with option <tt>--bedDirs</tt>

<pre class="code">
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_denovo \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"
</pre>
You can also include gene tracks from other exercises by passing a comma-separated list of directories e.g.
<tt>--bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...</tt><br><br>

<span class="assignment">Repeat</span> <a href="cactus.html#loadHub">4. Load the hub and browser the alignment</a>.<br>
</body></html>