#!/bin/bash
usage(){
echo "
Written by Brian Bushnell
Last modified October 14, 2022

Description:  Counts duplicate sequences probabilistically,
using around 20 bytes per unique read. Read pairs are treated
as a single read. Reads are converted to a hashcode and only
the hashcode is stored when tracking duplicates, so (rare) hash
collisions will result in false positive duplicate detection.
Optionally outputs the deduplicated and/or duplicate reads.

Usage:  countduplicates.sh in=<input file>

Input may be fasta, fastq, or sam, compressed or uncompressed.
in2, out2, and outd2 are accepted for paired files.

Standard parameters:
in=<file>       Primary input, or read 1 input.
out=<file>      Optional output for deduplicated reads.
outd=<file>     Optional output for duplicate reads. An extension like .fq
                will output reads; .txt will output headers only.
stats=stdout    May be replaced by a filename to write stats to a file.
showspeed=t     (ss) Set to 'f' to suppress display of processing speed.

Processing parameters (these are NOT mutually exclusive):
bases=t          Include bases when generating hashcodes.
names=f          Include names (headers) when generating hashcodes.
qualities=f      Include qualities when generating hashcodes.
maxfraction=-1.0 Set to a positive number 0-1 to FAIL input
                 that exceeds this fraction of reads with duplicates.
maxrate=-1.0     Set to a positive number >=1 to FAIL input that exceeds this
                 average duplication rate (the number of copies per read).
failcode=0       Set to some other number like 1 to produce a
                 non-zero exit code for failed input.
samplerate=1.0   Fraction of reads to subsample, to conserve memory. Sampling
                 is deterministic - if a read is sampled, copies will be too.
                 Unsampled reads are not sent to any output stream or counted
                 in statistics.

Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
                specify 200 megs. The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs. Requires Java 8u92+.
-da             Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}
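
# Example invocations (a sketch only: file names such as reads.fq are
# hypothetical, and only parameters documented in usage() above are used).
#
#   Count duplicates and print stats to stdout:
#     countduplicates.sh in=reads.fq
#
#   Paired input, also writing the deduplicated reads:
#     countduplicates.sh in=r1.fq in2=r2.fq out=dedup1.fq out2=dedup2.fq
#
#   Fail with exit code 1 when more than 20% of reads have duplicates:
#     countduplicates.sh in=reads.fq maxfraction=0.2 failcode=1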
#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
	cd "$(dirname "$DIR")"
	DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null
#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"
z="-Xmx4g"
z2="-Xms4g"
set=0
if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
	usage
	exit
fi
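calcXmx () {
	source "$DIR""/calcmem.sh"
	setEnvironment
	parseXmx "$@"
	# parseXmx is expected to set 'set' to 1 when -Xmx was passed explicitly;
	# in that case keep the user's value and skip autodetection.
	if [[ $set == 1 ]]; then
		return
	fi
	# Otherwise size the heap from the RAM value computed by freeRam.
	freeRam 4000m 84
	z="-Xmx${RAM}m"
	z2="-Xms${RAM}m"
}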
calcXmx "$@"
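countDupes() {
	# EA and EOOM are not defined in this file; they are presumably set by
	# the sourced calcmem.sh.
	local CMD="java $EA $EOOM $z -cp $CP jgi.CountDuplicates $*"
	# Echo the command to stderr so stdout stays clean for stats.
	echo "$CMD" >&2
	eval "$CMD"
}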
countDupes "$@"