File: reformatpb.sh

package info (click to toggle)
bbmap 39.20%2Bdfsg-3
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 26,024 kB
  • sloc: java: 312,743; sh: 18,099; python: 5,247; ansic: 2,074; perl: 96; makefile: 39; xml: 38
file content (136 lines) | stat: -rwxr-xr-x 5,160 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
#!/bin/bash

usage(){
echo "
Written by Brian Bushnell
Last modified June 16, 2020

Description:  Provides some of Reformat's functionality in a ZMW-aware tool.

Usage:  reformatpb.sh in=<input file> out=<output file> outb=<bad reads>

File I/O parameters:
in=<file>       Primary input.
out=<file>      (outgood) Output for good reads.
outb=<file>     (outbad) Output for discarded reads.
stats=<file>    Print screen output here instead of to the screen.
json=f          Print stats as json.
schist=<file>   Subread count per ZMW histogram.
overwrite=f     (ow) Set to false to force the program to abort rather than
                overwrite an existing file.
ziplevel=2      (zl) Set to 1 (lowest) through 9 (max) to change compression
                level; lower compression is faster.

Processing parameters:
kzt=f           (keepzmwstogether) Send all reads from a ZMW to the same file,
                either good or bad output.
minlen=40       Do not output reads shorter than this, after trimming.
ccsin=f         Input reads are CCS, meaning they are all full-pass.
                Not currently used for anything.
trimpolya=f     Trim terminal poly-A and poly-T sequences, for some isoseq
                libraries.
minpolymer=5    Don't trim poly-A sequence shorter than this.
polyerror=0.2   Max error rate for trimming poly-A.
flaglongreads=f    True to flag reads longer than 1.5x median to be discarded.
longreadmult=1.5   Multiplier to consider a read suspiciously long.

Whitelists and Blacklists:
whitelist=      ZMW identifiers, as a comma-delimited list of integers,
                or files with one integer per line.  All ZMWs not in the
                list will be discarded.
blacklist=      All ZMWs in the list will be discarded.

Sampling parameters (avoid using more than one of these at a time):
reads=-1        If positive, quit after processing this many reads.
zmws=-1         If positive, quit after processing this many ZMWs.
bestpass=f      Set to true to keep only the best read per ZMW.  This is
                the median length read of the non-outermost reads.
                If there are 2 or fewer passes, the longest will be chosen.
longestpass=f   Set to true to keep only the longest read per ZMW.
samplerate=1.0  Retain this fraction of input reads.
samplereadstarget=-1  If positive, retain this many reads.
samplebasestarget=-1  If positive, retain this many bases.
samplezmwstarget=-1   If positive, retain this many ZMWs.
subsamplefromends=f   If true, eliminate outermost reads first, then inner.

CCS Parameters (Note: CCS is still experimental)
ccs=f           Make a single consensus read per ZMW (CPU-intensive).
minpasses=0     Discard ZMWs with fewer than this many passes (estimated;
                first and last subread are usually partial).
minsubreads=0   Discard ZMWs with fewer than this many subreads.
reorient=f      Try aligning both strands in case ZMW ordering is broken.
minshredid=0.6  Do not include shreds with identity below this in consensus.

Entropy parameters (recommended setting is 'entropy=t'):
minentropy=-1   Set to 0.4 or above to remove low-entropy reads;
                range is 0-1, recommended value is 0.55.  0.7 is too high.
                Negative numbers disable this function.
entropyk=3      Kmer length for entropy calculation.
entropylen=350  Reads with entropy below cutoff for at least this many
                consecutive bases will be removed.
entropyfraction=0.5     Alternative minimum length for short reads; the shorter
                        of entropylen and entfraction*readlength will be used.
entropywindow=50        Window size used for entropy calculation.
maxmonomerfraction=0.74 (mmf) Also require this fraction of bases in each
                        window to be the same base.

Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
                specify 200 megs. The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}

#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
  cd "$(dirname "$DIR")"
  DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"
JNI="-Djava.library.path=""$DIR""jni/"
#JNI=""

z="-Xmx2g"
z2="-Xms2g"
#z3="-Xss16m"
set=0

if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
	usage
	exit
fi

calcXmx () {
	source "$DIR""/calcmem.sh"
	setEnvironment
	parseXmx "$@"
	if [[ $set == 1 ]]; then
		return
	fi
	freeRam 2000m 42
	z="-Xmx${RAM}m"
	z2="-Xms${RAM}m"
}
calcXmx "$@"

reformatpb() {
	local CMD="java $EA $EOOM $z $z2 -cp $CP icecream.ReformatPacBio $@"
	if [[ $silent != 1 ]]; then
		echo $CMD >&2
	fi
	eval $CMD
}

reformatpb "$@"