1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% %
% Pipeline for training and running the gene finder AUGUSTUS automatically %
% %
% Xin Xing, Ralph Krimmel, Mario Stanke %
% University of Goettingen %
% Contact: Mario Stanke, mario@gobics.de %
% %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1. Introduction
2. System Overview
3. Software Components
4. Installation and Configuration
5. Typical Usage
6. Command Line Arguments
7. Component Scripts
a. autoAugPred.pl
b. autoAugTrain.pl
1. Introduction
---------------
The purpose of this pipeline script is to run the AUGUSTUS training process and gene
prediction algorithm automatically on a given eukaryotic genome with
available cDNA evidence. The complete process contains the following steps.
These are encapsulated in this automatic training pipeline.
-- Construct an initial training set of genes (protein coding regions thereof), e.g. using PASA
-- Train AUGUSTUS to predict coding regions (i.e. without Untranslated Regions (UTR)) using the initial training set.
-- Predict coding regions of genes in your genome (both ab initio and using hints from cDNA)
-- Train the UTR model of AUGUSTUS using EST-supported genes and EST alignments.
-- Predict genes including UTRs in your genome using hints from cDNA.
The first step is optional and can be replaced by preparing your own training set of genes. For a schema of the
pipeline look at the poster in the doc folder.
TIP: Please manually/visually check the results of the pipeline, e.g. by loading them into a browser.
The pipeline outputs files for tracks on a GBrowse.
WARNING: Please do not assume that you get the best possible results from a fully automatic pipeline.
Eukaryotic gene prediction is still a complex craft and some beasts are just different.
2. System Overview
------------------
This program runs on a UNIX/LINUX-based architecture and involves
components written in Perl and C++. Results of the training-set generating step (PASA) are stored
primarily within a MySQL database. The whole pipeline can take in the order of 10 CPU days on a
medium complete genome, so please be patient and consider using a compute cluster.
If you happen to have a Sun Grid Engine available you can configure the pipeline to run in parallel
and completely automatically. If you are using another cluster, the job scripts will be created and you
can then submit them manually using your batch job submission system.
3. Software Components
-----------------------------------
AUGUSTUS: Download from https://github.com/Gaius-Augustus/Augustus
BLAT: You get Jim Kent's alignment program at http://users.soe.ucsc.edu/~kent/src/
Also install pslCDnaFilter which is part of the Jim Kent src tree
(http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/pslCDnaFilter/).
PASA: Download the Program to Assemble Spliced Alignments from
Sourceforge: http://sourceforge.net/projects/pasa
You do not need PASA if you already have a training set of genes for your species.
SCIPIO: Install SCIPIO if you have training genes in the form of protein sequences.
SCIPIO can then find their gene structures in the genome. Most users will not need this.
http://www.webscipio.org/
4. Installation and Configuration
---------------------------------
Install the required components according to their documentation.
Make sure you have set the environment variables AUGUSTUS_CONFIG_PATH and those for
PASA. That means for example lines such as
export AUGUSTUS_CONFIG_PATH=/home/mario/augustus/config
export PASAHOME=/home/mario/PASA
shall be written in your startup script (like ~/.bashrc). Furthermore,
all scripts of AUGUSTUS and PASA shall be executable. For
details see the documentations of AUGUSTUS and PASA.
If you use PASA with GMAP instead of BLAT then please note that PASA is currently not fully compatible
with GMAP versions after the version from 2007-09-28. If you use a newer GMAP version you might have to
remove the "-S" option in the PASA script scripts/process_GMAP_alignments.pl so it says:
gmap_setup -D gmap_db_dir -d $genome_db $genome_db
(Although one user reported to us that GMAP-2012-04-07 needs exactly the -S tag!)
Pay attention on how to configure PASA with MySQL! If you want to run PASA without root privileges, you
should first create a database for PASA (e.g. dbname-PASA, user priv > read and write), and then create
a new database with name PASAspeciesname, where speciesname is the name you use in --species=speciesname.
Furthermore, if you are using a cluster, AUGUSTUS should be installed there as well. The
path to "augustus" should also be concatenated to the "PATH" in your cluster.
In case you are using a Sun Grid Engine and you want to use the option "noninteractive" (see section 6.),
configure your ssh so that you can login on the cluster without having to type a password.
Make sure to include the augustus/scripts directory in your PATH environment variable!
There is test data in augustus/examples/autoAug for testing this pipeline. I recommend you
test the installation on this small set first to quickly detect possible installation problems.
5. Typical Usage
----------------
First create a new directory to store all files created by the pipeline.
Situation 1:
You already have a set of training genes (e.g. constructed from ESTs using PASA).
The way you run the pipeline depends on whether you want to use a compute cluster
to run the prediction jobs in parallel or whether you want to run it on a single PC.
The easiest but slower way is to run everything sequentially in one command, e.g.
>autoAug.pl -g genome.fa -t traingenes.gff --species=yourSpecies --cdna=cdna.fa -v --singleCPU --useexisting
traingenes.gff is the set of training gene structures in GFF format (e.g. as produced by PASA).
Alternatively you can use -t traingenes.gb if you have a training set
in Genbank format. This has the advantage that the training genes can be based
on a different DNA sequence than genome.fa.
cdna.fa is a fasta format file with all available cDNA sequences (ESTs, 454 transcripts, mRNA).
In a first test run we recommend to use the option --optrounds=0 to skip the
optimization, which usually takes most time.
If you have a compute cluster you can remove the option --singleCPU from above
command. In that case the program flow is interrupted at certain times so you
can submit the jobs on your cluster. Simply follow the output prompts. If your
SGE cluster happens to be set as in section 4 you can also use conveniently
>autoAug.pl -g genome.fa -t traingenes.gff --species=yourSpecies --cdna=cdna.fa -v --noninteractive
Note: For this pipeline you only need a training set with coding regions, even
if you want AUGUSTUS to predict the UTRs. It will construct a training set of
UTRs from EST alignments.
Situation 2:
You can just input a cDNA sequence file in FASTA format, and let PASA extract
a training set of genes. You need to install PASA for this separately.
>autoAug.pl -g genome.fa --species=yourSpecies -c cdna.fa -v -v --pasa --useGMAPforPASA --useexisting
Then follow the output prompts. If your cluster is set as in section 4 you can also use the option --noninteractive.
6. Command Line Arguments
-------------------------
The program supports file name arguments with absolute path (i.e. file name begin with "/") as well as with
relative path (beginning with "..", "~" etc.).
The keywords (such as genome) can be abbreviated (e.g. "ge") to the extend that they are still unique.
--genome=genomeFile
genomeFile is a file in Fasta format with the genome assembly
--trainingset=trainingfile
trainingfile is a file with training genes in Genbank format, GFF format or protein FASTA format
--species=sname
sname is species name. This is used to identify the parameter set later. The program will check if a
directory 'sname' already exists under $AUGUSTUS_CONFIG_PATH/species/.
If there is one, the program will be aborted unless you specify --useexisting.
--cdna=cdna.fa
cdna.fa is a cdna sequence file in the FASTA format (typically EST sequences or 454)
--pasa
This argument switches the PASA function on for creating a training set of genes from the cDNA, default: off.
--useGMAPforPASA
use GMAP instead of BLAT in the PASA run
--hints=hintsfile
hintsfile is a file with extrinsic evidence about the location and structure of genes, see documentation of AUGUSTUS.
--estali=estfile
cDNA alignments in PSL format (as generated by BLAT and GMAP) are used to construct UTRs
--noutr
use this if you do not want predictions of untranslated regions and want to safe a bunch of time
However, note that usually the predictions for the coding regions are better with UTRs turned on, because
typical EST alignments at the transcript ends can be better "interpreted" by AUGUSTUS.
--verbose=n
verbosity level 0, 1, 2, 3. Default: 2. You can also use a shortcut: "-v", "-v -v" or "-v -v -v"
--workingdir=/path/to/wd/
In the working directory results and temporary files are stored. Default: current working directory
--noninteractive
switch it on to avoid manual operation in cluster, it works only if you use a SUN GRID ENGINE cluster
--clustername=cname
set you SUN GRID ENGINE cluster name by using --noninteractive. Default: fe
7. Component Scripts
------------------------------------
autoAug.pl uses two the two scripts autoAugTrain.pl and autoAugPred.pl for training and prediction subtasks that
are important steps of their own.
a. autoAugPred.pl predict genes with AUGUSTUS on genomes
Once parameters for a species have been trained, you can use this script to (re)predict genes on your genome,
e.g. when you have new data.
Usage:
autoAugPred.pl [OPTIONS] --genome=genome.fa --species=sname
autoAugPred.pl [OPTIONS] --genome=genome.fa --species=sname --hints=hintsfile
--genome=genome.fa
FASTA file with DNA sequences for training
--species=sname
species name as used by AUGUSTUS
--hints=hintsfile
run AUGUSTUS using extrinsic information from hintsfile
--noninteractive
for SGE users, who have configurated openssh key
with this option one can run AUGUSTUS automatically
--utr
switch on prediction of untranslated regions. Off by default.
--hints=hintsfile
run AUGUSTUS with hintsfile
--verbose
the verbosity level
--remote=clustername
works only by using --noninteractive, specify the SGE cluster name for noninteractive, default by "fe"
b. autoAugTrain.pl: train AUGUSTUS automatically
Usage:
autoAugTrain.pl [OPTIONS] --species=sname --trainingset=training.gb
autoAugTrain.pl [OPTIONS] --species=sname --genome=genome.fa --trainingset=genes.gff
autoAugTrain.pl [OPTIONS] --species=sname --genome=genome.fa --trainingset=proteins.fa
autoAugTrain.pl [OPTIONS] --species=sname --genome=genome.fa --species=sname --utr --estali=cdna.psl --aug=augustus.gff
--species=sname
species name as used by AUGUSTUS
--trainingset=genes.gb
genes.gb is a file with training genes a Genbank format. For examples for the exact format look at
the files in http://bioinf.uni-greifswald.de/webaugustus/datasets
--trainingset=genes.gff
genes.gff is a file with training genes in GFF format
--trainingset=proteins.fa
proteins.fa is a file with protein sequences from the target species or a very close relative
You can use this when the structures 'got lost' or you changed the assembly.
--genome=genome.fa
fasta file with DNA sequences for training
genome.fa should include the corresponding
sequences in this case
--est=cdna.f.psl
EST alignments are used to guess the UTR and its end point.
--utr
Switch it on to train AUGUSTUS with untranslated regions. Off by default
--flanking_DNA=length
flanking_DNA length, default:4000
--verbose
verbosity level of the program, default
level: 1, max level: 3
|