#!/bin/bash
usage(){
echo "
Written by Brian Bushnell
Last modified March 31, 2025
Description: Bins contigs using coverage and kmer frequencies.
If reads or covstats are provided, coverage will be calculated from those;
otherwise, it will be parsed from contig headers. Coverage can be parsed
from Spades or Tadpole contig headers; alternatively, renamebymapping.sh
can be used to annotate the headers with coverage from multiple sam files.
Any number of sam files may be used (from different samples of the same
environment, usually). The more sam files, the more accurate, though
some stringency (depthratio and maxcovariance) may need to be relaxed
with large numbers of sam files (more than 4). Ideally, sam files will
be generated from paired reads like this:
bbmap.sh ref=contigs.fa in=reads.fq ambig=random mateqtag minid=0.9 maxindel=10 out=mapped.sam
For PacBio-only metagenomes, it is best to generate synthetic paired
reads from the PacBio CCS reads, and align those.
Usage: quickbin.sh in=contigs.fa out=bins/bin_%.fa *.sam covout=cov.txt
or
quickbin.sh in=contigs.fa out=bins/bin_%.fa cov=cov.txt
or
quickbin.sh contigs.fa out=bins *.sam
File parameters:
in=<file>       Assembly input; only required parameter. Files named *.fa
                or *.fasta do not need 'in='.
reads=<file>    Read input (fastq or sam). Multiple sam files may be used,
                comma-delimited, or as plain arguments without 'reads='.
                Multiple files will be assumed to be independent samples.
cov=<file>      Cov file generated by QuickBin from sam files. Files
                named cov*.txt do not need 'cov='.
out=<pattern>   Output pattern. If this contains a % symbol, like bin%.fa,
                one file will be created per bin. If not, all contigs will
                be written to the same file, with the name modified to
                indicate their bin number. A term without a '.' symbol
                like 'out=output' will be considered a directory.
Size parameters:
mincluster=50k  Minimum output cluster size in base pairs; smaller clusters
                will share a residual file.
mincontig=100   Don't load contigs smaller than this; reduces memory usage.
minseed=3000    Minimum contig length to create a new cluster; reducing this
                can increase speed dramatically for large metagenomes,
                increase sensitivity for small contigs, and slightly increase
                contamination. In particular, large metagenomes with only
                1 sample will run slowly if this is below 2000; with
                at least 3 samples the speed should not be affected much.
minresidue=200  Discard unclustered contigs shorter than this; reduces memory.
Stringency parameters:
normal          Default stringency is 'normal'. All settings, in order of
                increasing sensitivity, are: xstrict, ustrict, vstrict,
                strict, normal, loose, vloose, uloose, xloose. 'normal'
                aims at under 1% contamination; 'uloose' is more comparable
                in stringency to other binners. To set a stringency just add
                that flag (without an = sign). Acceptable shorthand is
                xs, us, vs, s, n, l, vl, ul, xl.
Quantization parameters:
gcwidth=0.02    Width of GC matrix gridlines. Smaller is faster.
depthwidth=0.5  Width of depth matrix gridlines. Smaller is faster. This
                is on a log2 scale so 0.5 would mean 2 gridlines per power
                of 2 depth - lines at 0.707, 1, 1.414, 2, 2.828, 4, etc.
Note: Halving either quantization parameter can roughly double speed,
but will decrease recovery of shorter contigs.
Neural network parameters:
net=auto        Specify a neural network file to use.
cutoff=0.56     Neural network output threshold; higher increases specificity,
                lower increases sensitivity.
Edge-processing parameters:
e1=0                  Edge-first clustering passes; may increase speed
                      at the cost of purity.
e2=4                  Later edge-based clustering passes.
edgeStringency1=0.25  Stringency for edge-first clustering;
                      lower is more stringent.
edgeStringency2=2     Stringency for later edge-based clustering.
maxEdges=3            Follow up to this many edges per contig.
minEdgeWeight=2       Ignore edges made from fewer read pairs.
minEdgeRatio=0.4      Ignore edges under this fraction of max edge weight.
minmapq=20            Ignore reads mapping with mapq below this for the
                      purpose of making edges. They are still used for depth.
goodEdgeMult=1.4      Merge stringency multiplier for contigs joined by
                      an edge; lower is more stringent.
Other parameters:
sketchoutput=f  Use SendSketch to identify taxonomy of output clusters.
validate=f      If contig headers have a term such as 'tid_1234', this
                will be parsed and used to evaluate correctness.
Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
                specify 200 megs. The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs. Requires Java 8u92+.
-da             Disable assertions.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}
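# Illustrative sketch only, not part of QuickBin: the depthwidth help text
# above describes gridlines on a log2 scale, which this assumes means
# depths of 2^(k*depthwidth) for integer k. depth_gridlines is a
# hypothetical helper added for explanation; nothing in this script calls it.
depth_gridlines() {
	# $1 = gridline width on a log2 scale, $2 = first index k, $3 = last index k
	awk -v w="$1" -v lo="$2" -v hi="$3" 'BEGIN {
		for (k = lo; k <= hi; k++) {
			# Separate values with single spaces, three decimals each.
			printf "%s%.3f", (k > lo ? " " : ""), 2 ^ (k * w)
		}
		printf "\n"
	}'
}
# Example: depth_gridlines 0.5 -1 4
# prints:  0.707 1.000 1.414 2.000 2.828 4.000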
#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
	cd "$(dirname "$DIR")"
	DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null
#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"
z="-Xmx4g"
z2="-Xms4g"
set=0
if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
	usage
	exit
fi
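calcXmx () {
	# Load the shared memory helpers used below from calcmem.sh.
	source "$DIR""/calcmem.sh"
	setEnvironment
	parseXmx "$@"
	# Keep the current memory flags if they were already marked as set.
	if [[ $set == 1 ]]; then
		return
	fi
	# Otherwise derive the heap size from the RAM value reported by freeRam.
	freeRam 4000m 84
	z="-Xmx${RAM}m"
	z2="-Xms${RAM}m"
}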
calcXmx "$@"
quickbin() {
	# Build the java command as one string, echo it to stderr, then run it.
	local CMD="java $EA $EOOM $SIMD $z -cp $CP bin.QuickBin $@"
	echo $CMD >&2
	eval $CMD
}
quickbin "$@"