1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
|
#!/bin/bash
usage(){
echo "
Written by Brian Bushnell
Last modified April 1, 2019
Description: Generates random synthetic reads from a reference genome. Read names indicate their genomic origin.
Allows precise customization of things like insert size and synthetic mutation type, sizes, and rates.
Read names generated by this program are used by MakeRocCure (samtoroc.sh) and GradeSamFile (gradesam.sh).
They can also be used by BBMap (bbmap.sh) and BBMerge (bbmerge.sh) to automatically calculate
true and false positive rates, if the flag 'parsecustom' is used.
Usage: randomreads.sh ref=<file> out=<file> length=<number> reads=<number>
Basic parameters:
out=null Output file. If reads are paired and a single file name is
given, output will be interleaved. For paired reads in twin
files, set out1= and out2=
ref=null Reference file. Not needed if the reference is already indexed.
build=1 If multiple references are indexed in the same directory,
each needs a unique build ID.
midpad=300 Specifies space between scaffolds in packed index.
reads=0 Generate this many reads (or pairs).
coverage=-1 If positive, generate enough reads to hit this coverage
target, based on the genome size.
overwrite=t Set to false to disallow overwriting of existing files.
replacenoref=f Set to true to replace Ns in the reference sequence
with random letters.
simplenames=f Set to true to generate read names that clearly indicate
genomic origin, without BBMap internal coordinates.
illuminanames=f Set to true to have matching names for paired reads,
rather than naming by location.
renamebyinsert=f Insert the insert size into the name.
addpairnum=f Set true to add ' 1:' and ' 2:' to the end of read names.
addslash=f Set true to add '/1' and '/2' to the end of read names.
spaceslash=f Set true to add a space before slash read pairnum.
prefix=null Generated reads will start with this prefix,
rather than naming by location.
seed=0 Use this to set the random number generator seed;
use -1 for a random seed.
Length Parameters - normally only minlength and maxlength are needed.
minlength=150 Generate reads of up to this length.
maxlength=150 Generate reads of at least this length.
gaussianlength=f Use a gaussian length distribution (for PacBio).
Otherwise, the distribution is linear.
midlength=-1 Gaussian curve peaks at this point. Must be between
minlength and maxlength, in Gaussian mode.
readlengthsd=-1 Standard deviation of the Gaussian curve. Note that the
final curve is a sum of multiple curves, but this will affect
overall curve width. By default this is set to 1/4 of range.
Pairing parameters:
paired=f Set to true for paired reads.
mininsert= Controls minimum insert length. Default depends on read length.
maxinsert= Controls maximum insert length. Default depends on read length.
triangle=f Make a triangular insert size distribution.
flat=f Make a roughly flat insert size distribution..
superflat=f Make a perfectly flat insert size distribution.
gaussian=t Make a bell-shaped insert size distribution, with
standard deviation of (maxinsert-mininsert)/6.
samestrand=f Generate paired reads on the same strand.
Mutation parameters:
snprate=0 Add snps to reads with this probability (0-1).
insrate=0 Add insertions to reads with this probability (0-1).
delrate=0 Add deletions to reads with this probability (0-1).
subrate=0 Add contiguous substitutions to reads with this probability (0-1).
nrate=0 Add nocalls to reads with this probability (0-1).
Note: With a 'rate' of X, each read has an X chance of getting at least
1 mutation, X^2 chance of 2+ mutations, X^3 chance of 3+ mutations,
and so forth up to the maximum allowed number of mutations of that type.
maxsnps=3 Add at most this many snps per read.
maxinss=2 Add at most this many deletions per read.
maxdels=2 Add at most this many insertions per read.
maxsubs=2 Add at most this many contiguous substitutions per read.
maxns=0 Add at most this many blocks of Ns per read.
maxinslen=12 Max length of insertions.
maxdellen=400 Max length of deletions.
maxsublen=12 Max length of contiguous substitutions.
maxnlen=1 Min length of N blocks.
mininslen=1 Min length of insertions.
mindellen=1 Min length of deletions.
minsublen=2 Min length of contiguous substitutions.
minnlen=1 Min length of N blocks.
Illumina quality parameters:
maxq=36 Upper bound of quality values.
midq=28 Approximate average of quality values.
minq=20 Lower bound of quality values.
q= Sets maxq, midq, and minq to the same value.
adderrors=t Add substitution errors based on quality values,
after mutations.
qv=4 Vary the base quality of reads by up to this much
to simulate tile effects.
PacBio quality parameters:
pacbio=f Use a PacBio error model, rather than Illumina
error model, and add PacBio errors after mutations.
pbmin=0.13 Minimum rate of PacBio errors for a read.
pbmax=0.17 Maximum rate of PacBio errors for a read.
Other Parameters:
overlap=1 Require reads to overlap scaffold end by at least this much.
banns=f Do not generate reads over reference Ns.
metagenome=f Assign scaffolds a random exponential coverage level,
to simulate a metagenomic or RNA coverage distribution.
randomscaffold=f Choose random scaffolds without respect to length.
amp=1 Simulate highly-amplified MDA single-cell data by
setting this to a higher number like 1000.
replacenoref=f Replace intra- and inter-scaffold Ns with random bases.
pbadapter= Add adapter sequence to some reads using this literal string.
fragadapter= Add this sequence to paired reads with insert size
shorter than read length.
fragadapter2= Use this sequence for read 2.
Java Parameters:
-Xmx This will set Java's memory usage, overriding the
program's automatic memory detection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify
200 megs.
The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an out-of-memory
exception occurs. Requires Java 8u92+.
-da Disable assertions.
"
}
#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
cd "$(dirname "$DIR")"
DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null
#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"
z="-Xmx1g"
z2="-Xms1g"
set=0
if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
usage
exit
fi
calcXmx () {
source "$DIR""/calcmem.sh"
setEnvironment
parseXmx "$@"
if [[ $set == 1 ]]; then
return
fi
freeRam 3200m 84
z="-Xmx${RAM}m"
z2="-Xms${RAM}m"
}
calcXmx "$@"
randomreads() {
local CMD="java $EA $EOOM $z -cp $CP align2.RandomReads3 build=1 $@"
echo $CMD >&2
eval $CMD
}
randomreads "$@"
|