NAME
====
VelvetOptimiser
VERSION
=======
Version 2.2.5
LICENCE
=======
Copyright 2009 - Simon Gladman - CSIRO & Monash University.
simon.gladman@monash.edu
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
MA 02110-1301, USA.
INTRODUCTION
============
The VelvetOptimiser is designed to run as a wrapper script for the Velvet
assembler (Daniel Zerbino, EBI UK) and to assist with optimising the
assembly. It searches a supplied hash value range for the optimum,
estimates the expected coverage and then searches for the optimum coverage
cutoff. It uses Velvet's internal mechanism for estimating insert lengths
for paired end libraries. It can optimise the assemblies by either the
default optimisation condition or by a user supplied one. It outputs the
results to a subdirectory and records all its operations in a logfile.
Expected coverage is estimated using the length weighted mode of the contig
coverage in all active short columns of the stats.txt file.
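
As a rough illustration of a length weighted mode, the sketch below weights each coverage value by contig length and picks the heaviest bin. It is not taken from the optimiser's source: the function name, the (length, coverage) input shape and the rounding of coverage to the nearest integer are all assumptions made for the example.

```python
from collections import defaultdict


def length_weighted_mode(contigs):
    """Return the coverage bin that carries the most total contig length.

    contigs: iterable of (length, coverage) pairs (hypothetical input shape).
    """
    weight_by_cov = defaultdict(int)
    for length, coverage in contigs:
        # Assumption: coverage values are binned to the nearest integer.
        weight_by_cov[round(coverage)] += length
    if not weight_by_cov:
        raise ValueError("no contigs supplied")
    # The mode is the bin with the largest summed length, not the most contigs.
    return max(weight_by_cov, key=weight_by_cov.get)


# Two long contigs near 20x outweigh three short contigs near 45x.
contigs = [(5000, 20.2), (3000, 19.8), (200, 45.0), (150, 44.6), (100, 45.3)]
expected_coverage = length_weighted_mode(contigs)
```

Weighting by length keeps many short, high-coverage repeat contigs from dominating the estimate.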
PREREQUISITES
=============
Velvet >= 0.7.51
Perl >= 5.8.8
BioPerl >= 1.4
GNU utilities: grep sed free cut
COMMAND LINE
============
VelvetOptimiser.pl [options] -f 'velveth input line'
Options:
--help This help.
--V|version! Print version to stdout and exit.
--v|verbose+ Verbose logging, includes all velvet output in the logfile. (default '0').
--s|hashs=i The starting (lower) hash value (default '19').
--e|hashe=i The end (higher) hash value (default '31').
--x|step=i The step in hash search. Minimum 2, even numbers only (default '2').
--f|velvethfiles=s The file section of the velveth command line. (default '0').
--a|amosfile! Turn on velvet's read tracking and amos file output. (default '0').
--o|velvetgoptions=s Extra velvetg options to pass through. eg. -long_mult_cutoff -max_coverage etc (default '').
--t|threads=i The maximum number of simultaneous velvet instances to run. (default '4').
--g|genomesize=f The approximate size of the genome to be assembled in megabases.
Only used in memory use estimation. If not specified, memory use estimation
will not occur. If memory use is estimated, the results are shown and then program exits. (default '0').
--k|optFuncKmer=s The optimisation function used for k-mer choice. (default 'n50').
--c|optFuncCov=s The optimisation function used for cov_cutoff optimisation. (default 'Lbp').
--p|prefix=s The prefix for the output filenames, the default is the date and time in the format DD-MM-YYYY-HH-MM_. (default 'auto').
--d|dir_final=s The name of the directory to put the final output into. (default '.')
--z|upperCovCutoff=f The maximum coverage cutoff to consider as a multiplier of the expected coverage. (default '0.8').
Advanced!: Changing the optimisation function(s)
Velvet optimiser assembly optimisation functions can be built from the following variables:
LNbp = The total number of Ns in large contigs
Lbp = The total number of base pairs in large contigs
Lcon = The number of large contigs
max = The length of the longest contig
n50 = The n50
ncon = The total number of contigs
tbp = The total number of basepairs in contigs
Examples are:
'Lbp' = Just the total basepairs in contigs longer than 1kb
'n50*Lcon' = The n50 times the number of long contigs.
'n50*Lcon/tbp+log(Lbp)' = The n50 times the number of long contigs divided
by the total bases in all contigs plus the log of the number of bases
in long contigs.
EXAMPLES
========
Find the best assembly for a lane of Illumina single-end reads, trying k-values between 27 and 31:
% VelvetOptimiser.pl -s 27 -e 31 -f '-short -fastq s_1_sequence.txt'
Print an estimate of how much RAM is needed by the above command, if we use eight threads at once,
and we estimate our assembled genome to be 4.5 megabases long:
% VelvetOptimiser.pl -s 27 -e 31 -f '-short -fastq s_1_sequence.txt' -g 4.5 -t 8
Find the best assembly for Illumina paired end reads just for k=31, using four threads (eg. quad core CPU),
but optimizing for N50 for k-mer length rather than sum of large contig sizes:
% VelvetOptimiser.pl -s 31 -e 31 -f '-shortPaired -fasta interleaved.fasta' -t 4 --optFuncKmer 'n50'
DETAILED OPTIONS
================
-h or --help
Prints the commandline help to STDOUT.
-V or --version
Prints the program name and version to STDOUT. Note that other information is still
printed to STDERR. You can ignore this by redirecting the output like this:
% VelvetOptimiser.pl --version 2> /dev/null
-v or --verbose
Adds the full velveth and velvetg output to the logfile. (Handy for
looking at the insert lengths and sds that Velvet has chosen for each library.)
-s or --hashs
Parameter type required: odd integer > 0 & <= the MAXKMERLENGTH velvet was compiled with.
Default: 19
This is the lower end of the hash value range that the optimiser will search for the optimum.
If the supplied value is even, it will be lowered by 1.
If the supplied value is higher than MAXKMERLENGTH, it will be dropped to MAXKMERLENGTH.
-e or --hashe
Parameter type required: odd integer >= 'hashs' & <= the MAXKMERLENGTH velvet was compiled with.
Default: MAXKMERLENGTH
This is the upper end of the hash value range that the optimiser will search for the optimum.
If the supplied value is even, it will be lowered by 1.
If the supplied value is higher than MAXKMERLENGTH, it will be dropped to MAXKMERLENGTH.
If the supplied value is lower than 'hashs' then it will be set to equal 'hashs'.
-x or --step
Parameter type required: even integer < difference between 'hashs' and 'hashe'.
Default: 2
This parameter details the number of hash values to skip each increment from 'hashs' to 'hashe' when searching for the optimum.
If the supplied value is odd, it will be lowered by 1.
If the supplied value is less than 2, it will be set to 2.
If the supplied value is greater than the 'hashs' to 'hashe' range, it will be set to the range.
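
The adjustment rules for -s, -e and -x described above can be sketched as follows. This is a minimal illustration, not the optimiser's own code: the function name, the order in which the rules are applied, the default MAXKMERLENGTH of 31 and the handling of a zero-width range are assumptions.

```python
def normalise_hash_range(hashs, hashe, step, max_kmer_length=31):
    """Apply the documented adjustments and return (hashs, hashe, step)."""

    def fix_k(k):
        # Values above MAXKMERLENGTH are dropped to MAXKMERLENGTH.
        k = min(k, max_kmer_length)
        # Even values are lowered by 1.
        return k - 1 if k % 2 == 0 else k

    hashs = fix_k(hashs)
    # A 'hashe' below 'hashs' is set equal to 'hashs'.
    hashe = max(fix_k(hashe), hashs)

    # Odd steps are lowered by 1; steps below 2 are set to 2.
    if step % 2 == 1:
        step -= 1
    step = max(step, 2)
    # Steps larger than the range are set to the range.
    # Assumption: a zero-width range leaves the step untouched.
    span = hashe - hashs
    if span > 0 and step > span:
        step = span
    return hashs, hashe, step


hashs, hashe, step = normalise_hash_range(20, 40, 3)
k_values = list(range(hashs, hashe + 1, step))
```

With the inputs shown, the search would cover the odd k values from 19 to 31.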
-f or --velvethfiles
Parameter type required: string with '' or ""
No default.
This is a required parameter. If this option is not specified, then the optimiser's usage will be displayed.
You need to supply everything you would normally supply velveth at this point except for the hash size and the
directory name in the following format.
{[-file_format][-read_type] filename} repeated for as many read files as you have.
File format options:
-fasta
-fastq
-fasta.gz
-fastq.gz
-bam
-sam
-eland
-gerald
Read type options:
-short
-shortPaired
-short2
-shortPaired2
-long
-longPaired
Examples:
-f 'reads.fna'
reads.fna is short not paired and fasta. (these are the defaults: -short and -fasta)
-f '-shortPaired -fastq paired_reads.fastq -long long_reads.fna'
Two read files supplied, first one is a paired end fastq file and the second is a long single ended read file.
-f '-shortPaired paired_reads_1.fna -shortPaired2 paired_reads_2.fna'
Two read files supplied, both are short paired fastas but come from two different libraries, therefore needing two different CATEGORIES.
There is a fairly extensive checker built into the optimiser to check if the format of the string is correct. However, it won't check the read files for their format (fasta, fastq, eland etc.)
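
For a quick pre-flight check of a -f string before launching a long run, something like the sketch below may help. It is a hypothetical helper, not the optimiser's built-in checker: it only compares dash-prefixed tokens against the file format and read type options listed above and treats every other token as a filename.

```python
import shlex

FILE_FORMATS = {"-fasta", "-fastq", "-fasta.gz", "-fastq.gz",
                "-bam", "-sam", "-eland", "-gerald"}
READ_TYPES = {"-short", "-shortPaired", "-short2", "-shortPaired2",
              "-long", "-longPaired"}


def check_velveth_files(arg):
    """Split a -f string and return (filenames, unknown_flags)."""
    filenames, unknown = [], []
    for token in shlex.split(arg):
        if token.startswith("-"):
            # Flags must be one of the documented format or read type options.
            if token not in FILE_FORMATS and token not in READ_TYPES:
                unknown.append(token)
        else:
            # Assumption: anything without a leading dash is a read file.
            filenames.append(token)
    return filenames, unknown


files, bad = check_velveth_files(
    "-shortPaired -fastq paired_reads.fastq -long long_reads.fna")
```

A non-empty second element points at a mistyped option, for example '-shrt' instead of '-short'.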
-a or --amosfile
Turns on Velvet's read tracking and amos file output.
This option is the same as specifying '-amos_file yes -read_trkg yes' in velvetg. However, it will only be added to the final assembly and not to the intermediate ones.
-o or --velvetgoptions
Parameter type required: string.
No default
String should contain extra options to be passed to velvetg as required such as "-long_mult_cutoff 1" or "-max_coverage 50" etc. Warning, there is no sanity check, so be careful. The optimiser will crash if you give velvetg something it doesn't handle.
-t or --threads
Parameter type required: integer
Specifies the maximum number of threads (simultaneous Velvet instances) to run. It defaults to the number of CPUs in the current computer.
-g or --genomesize
Parameter type required: float.
No default.
This option will run the Optimiser's memory estimator. It will estimate the memory required to run Velvet with the current -f parameter and number of threads. Once the estimator has finished its calculations, it will display the required memory, make a recommendation and then exit the script. This is useful for determining if you will have sufficient free RAM to run the assembly before you start.
You need to supply the approximate size of the genome you are assembling in megabases. For example, for a Salmonella genome I would use: -g 5
-k or --optFuncKmer
Parameter type required: string.
Default: 'n50'
This option will change the function that the Optimiser uses to find the best hash value from the given range. The default is to use the n50. I have found this function to work for me better than the previous single optimisation function, but you may wish to change it. A list of possible variables to use in your optimisation function and some examples are shown below.
-c or --optFuncCov
Parameter type required: string.
Default: 'Lbp'
This option will change the function that the Optimiser uses to find the best coverage cutoff. The default is to use the number of basepairs in contigs greater than 1 kilobase in length. I have found this function to work for me but you may wish to change it. A list of possible variables to use in your optimisation function and some examples are shown below.
Velvet optimiser assembly optimisation functions can be built from the following variables:
LNbp = The total number of Ns in large contigs
Lbp = The total number of base pairs in large contigs
Lcon = The number of large contigs
max = The length of the longest contig
n50 = The n50
ncon = The total number of contigs
tbp = The total number of basepairs in contigs
Examples are:
'Lbp' = Just the total basepairs in contigs longer than 1kb
'n50*Lcon' = The n50 times the number of long contigs.
'n50*Lcon/tbp+log(Lbp)' = The n50 times the number of long contigs divided
by the total bases in all contigs plus the log of the number of bases
in long contigs.
Be warned! The optimiser doesn't care what you supply in this string and will attempt to run anyway. If you give it a nonsensical optimisation function be prepared to receive a nonsensical assembly!
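
To get a feel for how an optimisation function scores an assembly, the sketch below evaluates one of the example strings against a set of made-up assembly statistics. It is an illustration only, not how the optimiser evaluates the string internally: the function name and the numbers are hypothetical, and log is assumed to be the natural logarithm.

```python
import math

VARIABLES = ("LNbp", "Lbp", "Lcon", "max", "n50", "ncon", "tbp")


def evaluate_opt_func(expression, stats):
    """Evaluate an optimisation function string against assembly statistics."""
    names = {name: stats[name] for name in VARIABLES}
    names["log"] = math.log  # assumption: natural logarithm
    # eval() is acceptable for a local experiment; never use it on untrusted input.
    return eval(expression, {"__builtins__": {}}, names)


# Hypothetical statistics for a single assembly.
stats = {"LNbp": 0, "Lbp": 3_900_000, "Lcon": 40, "max": 410_000,
         "n50": 50_000, "ncon": 120, "tbp": 4_000_000}

score = evaluate_opt_func("n50*Lcon/tbp+log(Lbp)", stats)
```

Scoring several assemblies this way and comparing the values is a cheap way to see whether a custom function ranks them the way you expect.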
-p or --prefix
Parameter type required: string
Default: The current date and time in the format "DD-MM-YYYY-HH-MM-SS_"
Names the logfile and the output directory with the supplied prefix, followed by "_logfile.txt" for the logfile and "_data_k" for the output data directory, where k is the optimum hash value.
-d or --dir_final
Parameter type required: string
Default: . (the current directory)
At the completion of the optimiser, any non default string will cause the final velvet output and the Velvet Optimiser logfile to be moved to the directory specified. If the directory already exists, an error is generated and the optimiser stops.
-z or --upperCovCutoff
Parameter type required: float
Default: 0.8
Sets the upper limit of the coverage cutoff search range to this fraction of the expected coverage.
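
As a worked example of the arithmetic, the sketch below computes the upper limit from an expected coverage value. The function name and the coverage figure are hypothetical; only the multiplication by the -z value comes from the description above.

```python
def upper_cov_cutoff_limit(expected_coverage, multiplier=0.8):
    """Upper bound of the coverage cutoff search range."""
    # The -z value is applied as a multiplier of the expected coverage.
    return expected_coverage * multiplier


# With an expected coverage of 25x and the default -z of 0.8,
# cutoffs above roughly 20x would not be considered.
limit = upper_cov_cutoff_limit(25)
```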
BUGS
====
* None that I am aware of.
TO DO
=====
* Make the auto_XXX folders be in --dir_final when set, not in the current directory.
* Use velvetk.pl script to choose suitable -s and -e parameters.
CONTACT
=======
Simon Gladman <simon.gladman@csiro.au>
Torsten Seemann <torsten.seemann@monash.edu>