1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head><title>Predicting Genes with AUGUSTUS</title>
<link rel="stylesheet" type="text/css" href="augustus.css">
<script src="tutorial.js" type="text/javascript"></script>
</head><body>
<font size=-1>
Navigate to <a href="index.html">main AugCGP tutorial</a>.
<a href="cactus.html">Cactus alignments and assembly Hubs</a>.
<a href="de_novo.html">AugCGP de novo</a>.
<a href="rnaseq.html">AugCGP with RNA-Seq</a>.
<a href="liftover.html">Annotation transfer with AugCGP</a>.
<a href="hgm.html">Cross-species consistency of gene sets</a>.
</font>
<div align="right">Show <a href="javascript:allOn()">all</a> / <a href="javascript:allOff()">no</a> details.</div>
<h1>Combining Annotation transfer and RNA-Seq-based prediction</h1>
In the typical clade annotation scenario, most genomes are supplemented with RNA-Seq data, whereas a few may represent model organsims for which high quality annotations exist (e.g. human and mouse in the vertebrate clade). This tutorial describes how RNA-Seq evidence and evidence from existing annotations can be combined in Augustus-cgp.
<h2 id="loadAnno">1. Create a database with RNA-Seq and annotation hints</h2>
If you don't have a database with the genomes and RNA-Seq hints, yet, <span class="assignment">follow</span> the instructions in
<a href="rnaseq.html#loadRNASeq">1. Load RNA-Seq hints ...</a><br>
to create the database <tt>vertebrates_rnaseq.db</tt>.<br><br>
<span class="assignment">Make</span> a copy of the database
<pre class="code">
cp vertebrates_rnaseq.db vertebrates_rnaseq+anno.db
</pre>
and <span class="assignment">load</span> the annotation hints from
<a href="liftover.html#refseq2hints">exercise 4.1</a> into the new database
<pre class="code">
load2sqlitedb --species=hg38 --dbaccess=vertebrates_rnaseq+anno.db refseq/hg38.hints.gff
</pre>
You can <span class="assignment">check</span> if loading was successful with following
database query
<pre class="code">
sqlite3 -header -column vertebrates_rnaseq+anno.db "SELECT count(*) AS '#hints',typename,speciesname FROM
(hints as H join featuretypes as F on H.type=F.typeid) natural join speciesnames group by speciesid,typename;"
</pre>
that returns a summary of how many hints of each type are in the database for each species.<br>
Your database should now contain both RNA-Seq hints from vertebrates_rnaseq.db and annotation hints from vertebrates_anno.db
<pre class="code">
#hints typename speciesname
---------- ---------- -----------
3368 exonpart galGal4
129 intron galGal4
86 CDS hg38
7905 exonpart hg38
345 intron hg38
7930 exonpart mm10
378 intron mm10
11050 exonpart rheMac3
265 intron rheMac3
</pre>
<h2 id="extrinsic">3. Prepare an extrinsic config file</h2>
Start by <span class="assignment">copying</span> following extrinsic configuration file:
<pre class="code">
cp ${AUGUSTUS_CONFIG_PATH}extrinsic/extrinsic-cgp.cfg extrinsic-rnaseq+anno.cfg
</pre>
<span class="assignment">Open</span> the <tt>extrinsic-rnaseq+anno.cfg</tt> file with a text editor,
<span class="assignment">go</span> to the first <tt>[GROUP]</tt> section and replace the following line
<pre class="code">
[GROUP] # replace 'none' by the names of genomes with src=W and src=E hints in the database
none
</pre>
by the names of genomes with annotation RNA-Seq hints, i.e.
<pre class="code">
[GROUP]
hg38 mm10 rheMac3 galGal4
</pre>
Note, that in our case, nothing further has to be done, since the only genome with annotation hints - hg38 -
is already covered in the first table. In other applications, you may have genomes with annotations, but
no RNA-Seq data. In this case the names of the genomes that ONLY have annotation hints must be listed in the
second [GROUP] section.<br><br>
<a href="javascript:onoff('cfg')" class="dlink"><span id="cfg" title="cfgd" class="dcross">[+]</span>
<span class="dtitle">format of the extrinsic.cfg file in cgp mode ...</span></a> <br>
<div id="cfgd" class="details" style="display:none;">
In cgp mode hints can be integrated for multiple species.
In order to have different extrinsic config settings for different species,
multiple <tt>[GENERAL]</tt> tables are specified. Each table is followed by a <tt>[GROUP]</tt> section,
a single line, in which a subset of the species is listed, for which the table is valid.
Use the same species identifiers as in the genome alignment and in the phylogenetic tree.
If a species is not assigned to any of the tables, all hints for that species are
ignored. To assign all species to a single table, the key 'all' can be used instead of itemizing
every single species identifier. Use the key 'other' to specify a table for all species, not
listed in any previous table.
Note that the source RM must be specified in case that the softmasking option is turned on.
Also note that all tables have the same dimension, i.e. each table must contain all sources
listed in the section <tt>[SOURCES]</tt>, even sources for which no hints exist for any of species
in group.
</div>
</br>
<h2 id="runAug">4. Run AUGUSTUS-CGP with RNA-Seq and annotation hints</h2>
<span class="assignment">Create</span> a new folder for the liftover experiments and
<span class="assignment">switch</span> to the new directory
<pre class="code">
mkdir augCGP_rnaseq+liftover
cd augCGP_rnaseq+liftover
</pre>
For convenience <span class="assignment">assign</span> each alignment chunk to a job ID by
creating softlinks
<pre class="code">
num=1
for f in ../mafs/*.maf; do ln -s $f $num.maf; ((num++)); done
</pre>
<span class="assignment">Run</span> Augustus with retrieval of hints from the
database (~3min).<br>
<pre class="code">
for id in *.maf
do
augustus \
--species=human \
--softmasking=1 \
--treefile=../tree.nwk \
--alnfile=$id \
--dbaccess=../vertebrates_rnaseq+anno.db \
--speciesfilenames=../genomes.tbl \
--alternatives-from-evidence=0 \
--dbhints=1 \
--UTR=1 \
--allow_hinted_splicesites=atac \
--extrinsicCfgFile=../extrinsic-rnaseq+anno.cfg \
--/CompPred/outdir=pred${id%.maf} > aug${id%.maf}.out 2> err${id%.maf}.out &
done
</pre>
This will generate the folders <span class="result"><tt>pred*/</tt></span> (one for each alignment chunk)
that contain gff files with gene predictions for each input genome.
<pre class="code">
bosTau8.cgp.gff
canFam3.cgp.gff
galGal4.cgp.gff
hg38.cgp.gff
mm10.cgp.gff
monDom5.cgp.gff
rheMac3.cgp.gff
rn6.cgp.gff
</pre>
Note that the parallelization with the bash '&' command above is quite simple and rather for demonstration purposes.<br>
For real applications with several hundreds or thousands of alignment chunks, we recommend to
run job arrays on a compute cluster.
<h2 id="abinitiopred"><a href="de_novo.html#merge">5. Merge gene predictions from parallel runs</a></h2>
<h2 id="makeBeds">6. Upload gene predictions into the assembly hub</h2>
<span class="assignment">Convert</span> the final gene predictions from gff to BED format and place
each BED file in a separate folder with the name of the corresponding genome. It is important that directory names are consistent with the names in the HAL alignment.
<pre class="code">
for f in joined_pred/*.gff
do
mkdir "$(basename $f .gff)"
gtf2bed.pl <$f >$(basename $f .gff)/augCGP_rnaseq+anno.bed --itemRgb=255,165,0
done
</pre>
Specify any RGB color you like for the track with option <tt>--itemRgb</tt>, e.g. <span style="color:rgb(255,165,0);">255,165,0.</span><br>
The name of the current directory (i.e. <tt>augCGP_rnaseq+liftover</tt>) will be used as track name on the browser.<br>
<span class="assignment">Switch</span> back to the main working directory <tt>data/</tt>
<pre class="code">
cd ..
</pre>
and rerun the <tt>hal2assemblyHub.py</tt> script. Include gene tracks with option <tt>--bedDirs</tt>
<pre class="code">
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs augCGP_rnaseq+liftover \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"
</pre>
You can also include gene tracks from other exercises by passing a comma-separated list of directories e.g.
<tt>--bedDirs refseq,augCGP_denovo,augCGP_rnaseq,augCGP_liftover,...</tt><br><br>
<span class="assignment">Repeat</span> <a href="cactus.html#loadHub">4. Load the hub and browser the alignment</a>.<br>
</body></html>
|