File: polyfilter.sh

package info (click to toggle)
bbmap 39.20%2Bdfsg-3
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 26,024 kB
  • sloc: java: 312,743; sh: 18,099; python: 5,247; ansic: 2,074; perl: 96; makefile: 39; xml: 38
file content (144 lines) | stat: -rwxr-xr-x 5,444 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
#!/bin/bash

usage(){
echo "
Written by Brian Bushnell
Last modified October 1, 2024

Description:  Filters reads to remove those with suspicious homopolymers.

Usage:  polyfilter.sh in=<input reads> out=<filtered reads>

Example:
polyfilter.sh in=reads.fq out=clean.fq outb=bad.fq k=31 polymers=G


File parameters:
in=<file>       Primary input, or read 1 input.
in2=<file>      Read 2 input if reads are in two files.
out=<file>      Output for clean reads.
outb=<file>     Output for bad (homopolymer) reads.
extra=<file>    Comma-delimited list of additional sequence files.
                For depth-based filtering, set this to the same as the input.
overwrite=t     (ow) Set to false to force the program to abort rather than
                overwrite an existing file.

Hashing parameters:
k=31            Kmer length.
hashes=2        Number of hashes per kmer.  Higher generally reduces 
                false positives at the expense of speed.
sw=t            (symmetricwrite) Increases accuracy when bits>1 and hashes>1.
minprob=0.5     Ignore reference kmers with probability of being correct
                below this (affects fastq references only).
memmult=1.0     Fraction of free memory to use for Bloom filter.  1.0 should
                generally work; if the program crashes with an out of memory
                error, set this lower.  Higher increases specificity.
cells=          Option to set the number of cells manually.  By default this
                will be autoset to use all available memory.  The only reason
                to set this is to ensure deterministic output.
seed=0          This will change the hash function used.
bits=           Bits per cell; it is set automatically from mincount.

Filtering rules:
Reads will always be discarded if they fails ldf2, entropy2, or minpolymer2.
Reads will also be discarded if they fail (minpolymer AND (ldf OR entropy)).
A read pair will be discarded if either read is discarded.

Depth-filtering parameters:
mincount=2      Minimum number of times a read kmer must occur in the 
                read set to be considered 'high-depth'.
ldf=0.24        (lowdepthfraction) Consider a read low-depth if at least
                this fraction of kmers are low depth.  Setting this above 1
                will disable depth analysis (making the program run faster).
ldf2=1.1        Discard reads with at least this fraction of low-depth kmers.
                Values above 1 disables this filter (e.g., for metagenomes).

Entropy-filtering parameters:
entropy=0.67    Reads with average entropy below this are considered 
                low-entropy.
entropy2=0.2    Reads with average entropy below this are discarded.

Quality-filtering parameters (only useful if q-scores are correct):
quality=12.5    Reads with average quality below this are considered 
                low-quality.
quality2=7.5    Reads with average quality below this are discarded.

Homopolymer-filtering parameters:
polymers=GC     Look for homopolymers of these symbols.  e.g., polymers=GC
                would look for poly-G or poly-C (but not poly-GC).
minpolymer=20   Minimum length of homopolymers.
minpolymer2=29  Discard any read with a homopolymer of at least this length.
purity=0.85     Min fraction of the homopolymer region that is the correct
                symbol.  For example, GGGGGGAGGG is length 10 with 9 Gs, for
                a purity of 0.90 (insufficient in this case due to length).

Trimming parameters:
trimpolymers=   Homopolymers to use for trimming.  If unspecified, it will
                be the same as 'polymers'.
trimleft=6      Trim left ends where there is a homopolymer at least this
                long; 0 disables trimming.
trimright=6     Trim left ends where there is a homopolymer at least this
                long; 0 disables trimming.
trim=           Sets both trimleft and trimright.
maxnonpoly=2    Trim through up to this many consecutive mismatches.
minlen=50       Discard reads shorter than this after trimming.

Other parameters:
quantize=1      If greater than 1, bin the quality scores to reduce file size.
cardinality=t   Report estimated number of unique output kmers.

Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
                specify 200 megs. The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}

#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
  cd "$(dirname "$DIR")"
  DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"

z="-Xmx4g"
z2="-Xms4g"
set=0

if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
	usage
	exit
fi

calcXmx () {
	source "$DIR""/calcmem.sh"
	setEnvironment
	parseXmx "$@"
	if [[ $set == 1 ]]; then
		return
	fi
	freeRam 4000m 84
	z="-Xmx${RAM}m"
	z2="-Xms${RAM}m"
}
calcXmx "$@"

polyfilter() {
	local CMD="java $EA $EOOM $z $z2 -cp $CP bloom.PolyFilter $@"
	echo $CMD >&2
	eval $CMD
}

polyfilter "$@"