1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head><title>Predicting Genes with AUGUSTUS</title>
<link rel="stylesheet" type="text/css" href="augustus.css">
<script src="tutorial.js" type="text/javascript"></script>
</head><body>
<font size=-1>
Navigate to <a href="index.html">Lab Session on AUGUSTUS</a>.
<a href="scipio.html">Using Scipio</a>.
<a href="training.html">Training AUGUSTUS</a>.
<a href="ppx.html">AUGUSTUS-PPX</a>.
</font>
<div align="right">Show <a href="javascript:allOn()">all</a> / <a href="javascript:allOff()">no</a> details.</div>
<h1>Creating a Whole-Genome-Alignment with progressiveCactus</h1>
This tutorial describes how a whole-genome alignment of multiple genomes can be created with the Cactus aligner.<br>
The output alignment is a compressed file in HAL format that can be exported as MAF.
<h2 id="runCactus">1. Running progressive Cactus</h2>
<span class="assignment">Prepare</span> a text file with the species tree followed by a space-separated list of species names and location of
the corresponding genome fasta files (absolute path names required!).<br><br>
<a href="javascript:onoff('cft')" class="dlink"><span id="cft" title="cftd" class="dcross">[+]</span>
<span class="dtitle">file format...</span></a> <br>
<div id="cftd" class="details" style="display:none;">
<pre class="code">
tree in Newick format
name_of_genome_1 absolute/path/to/genome_1
name_of_genome_2 absolute/path/to/genome_2
...
name_of_genome_N absolute/path/to/genome_N
</pre>
For furter information and optional parameters, read the <a href="https://github.com/glennhickey/progressiveCactus">progressiveCactus manual</a>
</div>
</br>
<pre class="code">
cp tree.nwk vertebrates.txt
for f in $PWD/genomes/*.fa; do echo -ne "$(basename $f .fa)\t$f\n"; done >>vertebrates.txt
</pre>
The file <span class="result"><tt>vertebrates.txt</tt></span> will now look like this (except for the parent directory)<br>
<pre class="code">
((monDom5:0.340786,(((hg38:0.035974,rheMac3:0.043601):0.109934,(mm10:0.084509,rn6:0.091589):0.271974):0.020593,(bosTau8:0.18908,canFam3:0
.13303):0.032898):0.258392):0.181168,galGal4:0.559442);
bosTau8 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/bosTau8.fa
canFam3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/canFam3.fa
galGal4 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/galGal4.fa
hg38 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/hg38.fa
mm10 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/mm10.fa
monDom5 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/monDom5.fa
rheMac3 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rheMac3.fa
rn6 /var/www/augustus/htdocs/binaries/tutorial-cgp/data/genomes/rn6.fa
</pre>
<span class="assignment">Run</span> progressiveCactus, e.g. using 8 threads (~12min):
<pre class="code">
runProgressiveCactus.sh vertebrates.txt cactusout vertebrates.hal --maxThreads=8 2>&1 > cactus.out
</pre>
This creates a .hal binary indexed alignment file <span class="result"><tt>vertebrates.hal</tt></span><br><br>
<h2 id="hal2maf">2. Export HAL alignment as MAF</h2>
Next, we will export the HAL alignment as MAF. Observe, that the MAF format is reference-based, i.e. alignments
can only be represented with respect to some reference species. Generally any species can be chosen as reference.
We recommend, however, to choose a reference species that 1.) has a high-quality genome assembly and 2.) is placed somewhere
in the middle of the clade, e.g. in our case human (hg38). For parallelization of the gene prediction step,
we recommend to split the global alignmnent into several overlapping alignment chunks. This can be achieved with <tt>hal2maf_split.pl</tt> using
the parameters <tt>--chunksize</tt> that specifies the maximum chunk size (number of bases covered in the reference genome) and
<tt>--overlap</tt> that specifies the amount of overlap (number of bases) between two consecutive chunks.
Observe that <tt>hal2maf_split.pl</tt> works on top of the HAL tools toolbox, which is also part of the progressiveCactus package.
If the <tt>progressiveCactus/submodules/hal/bin</tt> directory in is not in your global python path, use the parameter <tt>--hal_exec_dir</tt> to point to the directory<br><br>
<span class="assignment">Export</span> <span class="result"><tt>vertebrates.hal</tt></span> as MAF
<pre class="code">
hal2maf_split.pl --halfile vertebrates.hal --refGenome hg38 --cpus 8 --chunksize 50000 --overlap 25000 --outdir mafs
</pre>
This will generate the directory <span class="result"><tt>mafs/</tt></span> that
contains the output alignment chunks in MAF format
<pre class="code">
-rw-r--r-- 1 stefanie bioinf 392K May 24 19:43 chr16.0-49999.maf
-rw-r--r-- 1 stefanie bioinf 307K May 24 19:43 chr16.100000-149999.maf
-rw-r--r-- 1 stefanie bioinf 365K May 24 19:43 chr16.125000-174999.maf
-rw-r--r-- 1 stefanie bioinf 586K May 24 19:43 chr16.150000-199999.maf
-rw-r--r-- 1 stefanie bioinf 395K May 24 19:43 chr16.175000-210154.maf
-rw-r--r-- 1 stefanie bioinf 345K May 24 19:43 chr16.25000-74999.maf
-rw-r--r-- 1 stefanie bioinf 408K May 24 19:43 chr16.50000-99999.maf
-rw-r--r-- 1 stefanie bioinf 418K May 24 19:43 chr16.75000-124999.maf
</pre>
For demonstration purposes, we chose very small values for the chunk size and overlap.
In a more realistic setting, i.e. when working with whole-genome alignments,
we recommend a chunk size in the megabase order, e.g. <tt>--chunksize=2500000</tt> and <tt>--overlap=500000</tt> (default values).<br><br>
<a href="javascript:onoff('nsl')" class="dlink"><span id="nsl" title="nsld" class="dcross">[+]</span>
<span class="dtitle">avoid splitting within genes...</span></a> <br>
<div id="nsld" class="details" style="display:none;">
Gene prediction errors may occur from splitting of the alignment within genes. The smaller the chunk size
is chosen, the more likely genes are truncated at alignment boundaries. To avoid this,
a list of genic regions/intervals may be passed to <tt>hal2maf_split.pl</tt>. In this case, the splitting is
restricted to intergenic regions, e.g. position outside of the given intervals. The genic intervals can be obtained
from existing annotations for the reference species, if available.<br><br>
<span class="assignment">Use</span> the hg38 RefSeq annotation <a href="data/refseq/refseq.hg38.gtf"><tt>refseq/refseq.hg38.gtf</tt></a> to create a text file with a list of genic intervals, in which the splitting is forbidden.
Each genic interval is specified as follows: <tt>seqname <tab> start <tab> end <newline></tt><br><br>
<pre class="code">
gtf2gff.pl --printIntron <refseq/refseq.hg38.gtf --out=/dev/stdout | cut -f 1,4,5 >no_split_list.txt
</pre>
If your annotation is in gtf format as in the example, don't forget to include the intron lines!<br><br>
<span class="assignment">Rerun</span> <tt>hal2maf_split.pl</tt> using the parameter <tt>--no_split_list</tt> to pass <span class="result"><tt>no_split_list.txt</tt></span>
<pre class="code">
hal2maf_split.pl --halfile vertebrates.hal --refGenome hg38 --cpus 8 \
--chunksize 50000 --overlap 25000 --no_split_list no_split_list.txt --outdir mafs_good_split
</pre>
This will generate the directory <span class="result"><tt>mafs_good_split/</tt></span> that
contains the following MAF alignments
<pre class="code">
-rw-r--r-- 1 stefanie bioinf 707K May 25 18:17 chr16.0-91311.maf
-rw-r--r-- 1 stefanie bioinf 442K May 25 18:17 chr16.173067-210154.maf
-rw-r--r-- 1 stefanie bioinf 913K May 25 18:17 chr16.45035-173067.maf
-rw-r--r-- 1 stefanie bioinf 1013K May 25 18:17 chr16.91311-207894.maf
</pre>
Observe that when using the <tt>--no_split_list</tt> option, the maximum chunk size is not always respected.
</div>
<a href="javascript:onoff('msl')" class="dlink"><span id="msl" title="msld" class="dcross">[+]</span>
<span class="dtitle">splitting MAF alignments...</span></a> <br>
<div id="msld" class="details" style="display:none;">
Let's assume you already have a vetted whole-genome alignment in MAF format.
You can split it similarily to HAL alignments into overlapping alignment chunks.
Let <tt>hg38</tt> be the reference genome in <tt>aln.maf</tt>.
<pre class="code">
REF="hg38"
</pre>
<span class="assignment">Create</span> a tab-separated list of reference chromosomes/scaffolds of and their lengths
<pre class="code">
grep "^s $REF" aln.maf | perl -ne '@a=split(/\s+/,$_); $a[1]=~s/.*\.//; print "$a[1]\t$a[5]\n";' | sort -u >seqlist
</pre>
<span class="assignment">Cut</span> <tt>aln.maf</tt> into overlapping pieces with the auxiliary tool <tt>filterMaf.pl</tt> from the <tt>scripts/</tt> folder in the Augustus package.
<pre class="code">
mkdir mafs
CHUNKSIZE=2500000
OVERLAP=500000
cat seqlist | while read line
do
SEQ=$(echo "$line" | cut -f 1)
CHRLEN=$(echo "$line" | cut -f 2)
for ((start=0; start<CHRLEN; start+=(CHUNKSIZE-OVERLAP))); do
end=$((start+CHUNKSIZE))
if [ $end -ge $CHRLEN ]; then
end=$CHRLEN
fi
filterMaf.pl --interval=${REF}.${SEQ}:$start-$end <aln.maf >mafs/${SEQ}:$start-$end.maf
done
done
</pre>
</div>
</br>
<h2 id="makeHub">3. Build a comparative assembly hub</h2>
<pre class="code">
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"
</pre>
Manually <span class="assignment">edit</span> the .html files in the <span class="result"><tt>vertHub/</tt></span> directory, if desired.
<h2 id="loadHub">4. Load the hub and browse the alignment</h2>
<span class="assignment">Copy</span> the <span class="result"><tt>vertHub/</tt></span> directory to a web directory:
<pre class="code">
scp -r vertHub/ stefanie@greifweb1:/var/www/bioinf/htdocs/stefanie/cgp-tutorial/
</pre>
<span class="assignment">Go</span> to the UCSC Genome Browser <a href="http://genome-euro.ucsc.edu/cgi-bin/hgHubConnect">Track Data Hubs</a> page and paste your <a href="http://bioinf.uni-greifswald.de/bioinf/stefanie/cgp-tutorial/vertHub/hub.txt"><tt>hub.txt</tt></a> url into the "My Hubs" tab.
Select a species as reference (e.g. hg38) and click "Go".<br><br>
<img src="figs/hub.png" width="40%" height="40%"/>
<h2 id="loadAnno">5. Upload a reference annotation to the hub</h2>
<span class="assignment">Convert</span> the <a href="data/refseq/refseq.hg38.gtf">hg38 RefSeq annotation</a> to BED format and place it in a folder with the name of the genome
<pre class="code">
mkdir refseq/hg38
gtf2bed.pl <refseq/refseq.hg38.gtf >refseq/hg38/refseq.bed --itemRgb=25,25,112
</pre>
Specify any RGB color you like for the track with option <tt>--itemRgb</tt>, e.g. <span style="color:rgb(25,25,112);">25,25,112.</span><br>
<span class="assignment">Rerun</span> the hal2assemblyHub.py script. Include gene tracks with option --bedDirs.
<pre class="code">
rm -r jobTree # delete job data from previous run
hal2assemblyHub.py vertebrates.hal vertHub --lod \
--alignability --gcContent \
--hub vertCompHub --shortLabel VertebratesCompHub \
--bedDirs refseq \
--tabBed \
--maxThreads=10 --longLabel "Vertebrates Comparative Assembly Hub"
</pre>
For further information on the specification of parameter <tt>--bedDirs</tt>, read section <a href="https://github.com/glennhickey/hal/tree/master/assemblyHub#annotations-provided-by-users">'Annotations provided by users'</a> in the assemblyHub manual.<br><br>
<a href="javascript:onoff('tah')" class="dlink"><span id="tah" title="tahd" class="dcross">[+]</span>
<span class="dtitle">hal2assemblyHub troubleshooting...</span></a> <br>
<div id="tahd" class="details" style="display:none;">
If the <tt>hal2assemblyHub.py</tt> script returns immediately without error message. Delete the jobTree folder and rerun it again.
<pre class="code">
rm -r jobtree
</pre>
Make sure that all auxiliary tools, e.g <tt>hgGcPercent</tt>, are in the global path. If not export
<pre class="code">
export PATH=/home/stefanie/tools/progressiveCactus/bin:/home/stefanie/tools/progressiveCactus/submodules/kentToolBinaries:$PATH
export PYTHONPATH=/home/stefanie/tools/progressiveCactus/submodules:$PYTHONPATH
</pre>
</div><br>
<span class="assignment">Repeat</span> <a href="cactus.html#loadHub">4. Load the hub and browser the alignment</a>.<br>
The new track, should be visible in the 'VertebratesCompHub Track Settings' menu, enable it if necessary.<br><br>
<img src="figs/menu.png" width="40%" height="40%"/>
</body>
</html>
|