File: quickbin.sh

#!/bin/bash

usage(){
echo "
Written by Brian Bushnell
Last modified March 31, 2025

Description:  Bins contigs using coverage and kmer frequencies.
If reads or covstats are provided, coverage will be calculated from those;
otherwise, it will be parsed from contig headers.  Coverage can be parsed
from Spades or Tadpole contig headers; alternatively, renamebymapping.sh
can be used to annotate the headers with coverage from multiple sam files.
Any number of sam files may be used (from different samples of the same
environment, usually).  The more sam files, the more accurate, though
some stringency (depthratio and maxcovariance) may need to be relaxed
with large numbers of sam files (more than 4).  Ideally, sam files will
be generated from paired reads like this:
bbmap.sh ref=contigs.fa in=reads.fq ambig=random mateqtag minid=0.9 maxindel=10 out=mapped.sam
For PacBio-only metagenomes, it is best to generate synthetic paired 
reads from the PacBio CCS reads, and align those.

Usage:  quickbin.sh in=contigs.fa out=bins/bin_%.fa *.sam covout=cov.txt
or
quickbin.sh in=contigs.fa out=bins/bin_%.fa cov=cov.txt
or
quickbin.sh contigs.fa out=bins *.sam

File parameters:
in=<file>       Assembly input; only required parameter.  Files named *.fa
                or *.fasta do not need 'in='.
reads=<file>    Read input (fastq or sam).  Multiple sam files may be used,
                comma-delimited, or as plain arguments without 'reads='.
                Multiple files will be assumed to be independent samples.
cov=<file>      Cov file generated by QuickBin from sam files.  Files
                named cov*.txt do not need 'cov='.
out=<pattern>   Output pattern.  If this contains a % symbol, like bin%.fa,
                one file will be created per bin.  If not, all contigs will
                be written to the same file, with the name modified to
                indicate their bin number.  A term without a '.' symbol
                like 'out=output' will be considered a directory.

Size parameters:
mincluster=50k  Minimum output cluster size in base pairs; smaller clusters
                will share a residual file.
mincontig=100   Don't load contigs smaller than this; reduces memory usage.
minseed=3000    Minimum contig length to create a new cluster; reducing this
                can increase speed dramatically for large metagenomes,
                increase sensitivity for small contigs, and slightly increase
                contamination.  In particular, large metagenomes with only 
                1 sample will run slowly if this is below 2000; with 
                at least 3 samples the speed should not be affected much.
minresidue=200  Discard unclustered contigs shorter than this; reduces memory.

Stringency parameters:
normal          Default stringency is 'normal'.  All settings, in order of
                increasing sensitivity, are:  xstrict, ustrict, vstrict,
                strict, normal, loose, vloose, uloose, xloose.  'normal'
                aims at under 1% contamination; 'uloose' is more comparable
                in stringency to other binners.  To set a stringency just add
                that flag (without an = sign).  Acceptable shorthand is
                xs, us, vs, s, n, l, vl, ul, xl.

Quantization parameters:
gcwidth=0.02    Width of GC matrix gridlines.  Smaller is faster.
depthwidth=0.5  Width of depth matrix gridlines.  Smaller is faster.  This
                is on a log2 scale so 0.5 would mean 2 gridlines per power
                of 2 depth - lines at 0.707, 1, 1.414, 2, 2.828, 4, etc.
Note: Halving either quantization parameter can roughly double speed,
but will decrease recovery of shorter contigs.

Neural network parameters:
net=auto        Specify a neural network file to use.
cutoff=0.56     Neural network output threshold; higher increases specificity,
                lower increases sensitivity.

Edge-processing parameters:
e1=0                  Edge-first clustering passes; may increase speed
                      at the cost of purity.
e2=4                  Later edge-based clustering passes.
edgeStringency1=0.25  Stringency for edge-first clustering; 
                      lower is more stringent.
edgeStringency2=2     Stringency for later edge-based clustering.
maxEdges=3            Follow up to this many edges per contig.
minEdgeWeight=2       Ignore edges made from fewer read pairs.
minEdgeRatio=0.4      Ignore edges under this fraction of max edge weight.
minmapq=20            Ignore reads mapping with mapq below this for the
                      purpose of making edges.  They are still used for depth.
goodEdgeMult=1.4      Merge stringency multiplier for contigs joined by
                      an edge; lower is more stringent.

Other parameters:
sketchoutput=f        Use SendSketch to identify taxonomy of output clusters.
validate=f            If contig headers have a term such as 'tid_1234', this
                      will be parsed and used to evaluate correctness.

Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
                specify 200 megs. The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}
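
# Illustrative sketch only; nothing in this script calls it.  The depthwidth
# entry in the usage text places depth gridlines on a log2 scale, which this
# sketch assumes means one line at 2^(k*depthwidth) for each integer k.
# depth_gridlines is a hypothetical helper (not part of BBTools) that prints
# those values for k = -1..4 so the spacing for a given width can be checked.
depth_gridlines() {
	local width="${1:-0.5}"
	# exp(x*log(2)) is 2^x; LC_ALL=C keeps '.' as the decimal separator.
	LC_ALL=C awk -v w="$width" 'BEGIN { for (k = -1; k <= 4; k++) printf "%.3f\n", exp(k * w * log(2)) }'
}
# Example: depth_gridlines 0.5 prints 0.707, 1.000, 1.414, 2.000, 2.828, 4.000,
# one value per line; depth_gridlines 0.25 gives twice as many lines per octave.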

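#This block allows symlinked shellscripts to correctly set classpath.
#Each symlink is followed from its own directory, so relative link targets
#resolve correctly; DIR ends up as the real script directory plus a trailing /.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
  cd "$(dirname "$DIR")"
  DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null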

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"

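#Default heap flags; calcXmx below normally replaces them.
z="-Xmx4g"
z2="-Xms4g"
#Checked by calcXmx after parseXmx runs; 1 means memory was set explicitly.
set=0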

if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
	usage
	exit
fi

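#Sets z (-Xmx) and z2 (-Xms) using helper functions from calcmem.sh.
calcXmx () {
	source "$DIR""/calcmem.sh"
	setEnvironment
	parseXmx "$@"
	#If parseXmx found an explicit memory flag such as -Xmx20g, keep it.
	if [[ $set == 1 ]]; then
		return
	fi
	#Otherwise autodetect.  The arguments are assumed to be a 4000m fallback
	#and the percentage (84) of available memory to request.
	freeRam 4000m 84
	z="-Xmx${RAM}m"
	z2="-Xms${RAM}m"
}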
calcXmx "$@"

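#Builds the Java command line, echoes it to stderr, and runs it.
#EA, EOOM, and SIMD are not set in this file; they are expected to come from
#the calcmem.sh helpers and may be empty.  z2 (-Xms) is not passed to Java here.
quickbin() {
	local CMD="java $EA $EOOM $SIMD $z -cp $CP bin.QuickBin $@"
	echo $CMD >&2
	eval $CMD
}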
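
# Illustrative sketch only; nothing in this script calls it.  The usage text
# describes mapping several samples to the same contigs and handing all of the
# resulting sam files to QuickBin.  print_multisample_workflow is a
# hypothetical helper (not part of BBTools) that prints, without running
# anything, one bbmap.sh command per reads file followed by one quickbin.sh
# command.  The contigs and reads file names are caller-supplied placeholders;
# the flags are the ones shown in the usage text.
print_multisample_workflow() {
	local contigs="$1"
	shift
	local reads sam sams=""
	for reads in "$@"; do
		# Name each sam file after its reads file (a.fq -> a.sam).
		sam="${reads%.*}.sam"
		echo "bbmap.sh ref=$contigs in=$reads ambig=random mateqtag minid=0.9 maxindel=10 out=$sam"
		sams="$sams $sam"
	done
	echo "quickbin.sh in=$contigs out=bins/bin_%.fa$sams covout=cov.txt"
}
# Example: print_multisample_workflow contigs.fa a.fq b.fq
# prints two bbmap.sh lines and then a quickbin.sh line listing a.sam and b.sam.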

quickbin "$@"