File: train.sh

#!/bin/bash

usage(){
echo "
Written by Brian Bushnell
Last modified Jan 25, 2024

Description:  Trains or evaluates neural networks.

Usage:  train.sh in=<data> dims=<X,Y,Z> out=<trained network>

train.sh in=<data> netin=<network> evaluate

Input data is tab-delimited numeric vectors (see 'in' below), and may be
compressed or uncompressed.
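
Examples (file and network names are illustrative):

train.sh in=data.tsv dims=5,12,7,1 out=trained.net
train.sh in=data.tsv netin=trained.net evaluate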


I/O parameters:
in=<file>       Tab-delimited data vectors.  The first line should look like
                '#dims	5	1' with the number of inputs and outputs; the
                first X columns are inputs, and the last Y the desired result.
                Subsequent lines are tab-delimited floating point numbers.
                Can be created via seqtovec.sh.  An example appears after
                this section.
validate=<file> Optional validation dataset used exclusively for evaluation.
net=<file>      Optional input network to train.
out=<file>      Final output network after the last epoch.
outb=<file>     Best discovered network according to evaluation metrics.
overwrite=f     (ow) Set to false to force the program to abort rather than
                overwrite an existing file.
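
For reference, a data file with 5 inputs and 1 output might begin like this
(values are illustrative):

#dims	5	1
0.2	0.9	0.1	0.55	0.3	1
0.8	0.1	0.4	0.62	0.7	0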

Processing parameters:
evaluate=f      Don't do any training; just evaluate the network.
dims=           Set network dimensions.  E.g. dims=5,12,7,1
mindims,maxdims These allow random dimensions, but the number of inputs and
                outputs must agree between mindims and maxdims.
                E.g. mindims=5,6,3,1 maxdims=5,18,15,1
batches=400k    Number of batches to train.
alpha=0.08      Amount to adjust weights during backpropagation.  Larger 
                numbers train faster but may not converge.
balance=0.2     If the positive and negative samples are unequal, make copies
                of whichever has fewer until this ratio is met.  1.0 would
                make an equal number of positive and negative samples.
density=1.0     Retain at least this fraction of edges.
edges=-1        If positive, cap the maximum number of edges.
dense=t         Set dense=f (or sparse) to process as a sparse network.
                Dense mode is fastest for fully- or mostly-connected networks;
                sparse becomes faster below 0.25 density or so.

Advanced training parameters:
seed=-1         A positive seed will yield deterministic output;
                negative will use a random seed.  For multiple networks,
                each gets a different seed but you only need to set it once.
nets=1          Train this many networks concurrently (per cycle).  Only the
                best network will be reported, so training more networks will
                yield a better result.  Higher values increase memory use, but
                can also improve CPU utilization on many-threaded CPUs.
                See the example after this section.
cycles=1        Each cycle trains 'nets' networks in parallel.
setsize=60000   Iterate through subsets of at most this size while training;
                larger makes batches take longer.
fpb=0.08        Only train this fraction of the subset per batch, prioritizing
                samples with the most error; larger is slower.
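
For example (illustrative), to train 4 candidate networks per cycle with
randomized hidden-layer sizes:

train.sh in=data.tsv mindims=5,6,3,1 maxdims=5,18,15,1 nets=4 out=trained.net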

Evaluation parameters:
vfraction=0.1   If no validation file is given, split off this fraction of the
                input dataset to use exclusively for validation.
inclusive=f     Use the full training dataset for validation.  Note that
                'evaluate' mode automatically uses the input for validation.
cutoffeval=     Set the evaluation cutoff directly; any output above this
                cutoff will be considered positive, and below will be
                considered negative, when evaluating a sample.  This does not 
                affect training other than the printed results and the best 
                network selection.  Overrides fpr, fnr, and crossover.
crossover=1     Set 'cutoffeval' dynamically using the intersection of the
                FPR and FNR curves.  If false positives are 3x as detrimental
                as false negatives, set this at 3.0; if false negatives are 2x
                as bad as false positives, set this at 0.5, etc.  See the
                example after this section.
fpr=            Set 'cutoffeval' dynamically using this false positive rate.
fnr=            Set 'cutoffeval' dynamically using this false negative rate.
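
For example (illustrative), to evaluate a network while treating false
positives as 3x as detrimental as false negatives:

train.sh in=data.tsv netin=trained.net evaluate crossover=3.0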

Activation functions; fractions are relative and don't need to add to 1.
sig=0.6         Fraction of nodes using sigmoid function.
tanh=0.4        Fraction of nodes using tanh function.
rslog=0.02      Fraction of nodes using rotationally symmetric log.
msig=0.02       Fraction of nodes using mirrored sigmoid.
swish=0.0       Fraction of nodes using swish.
esig=0.0        Fraction of nodes using extended sigmoid.
emsig=0.0       Fraction of nodes using extended mirrored sigmoid.
bell=0.0        Fraction of nodes using a bell curve.
max=0.0         Fraction of nodes using a max function (TODO).
final=rslog     Type of function used in the final layer.

Exotic parameters:
scan=0          Test this many seeds initially before picking one to train.
scanbatches=1k  Evaluate scanned seeds at this point to choose the best.
simd=f          Use SIMD instructions for greater speed; requires Java 18+.
cutoffbackprop=0.5   Optimize around this point for separating positive and
                     negative results.  Unrelated to cutoffeval.
pem=1.0         Positive error mult; when value>target, multiply the error 
                by this number to adjust the backpropagation penalty.
nem=1.0         Negative error mult; when value<target, multiply the error 
                by this number to adjust the backpropagation penalty.
fpem=10.5       False positive error mult; when target<cutoffbackprop
                value>(cutoffbackprop-spread), multiply error by this.
fnem=10.5       False negative error mult; when target>cutoffbackprop
                value<(cutoffbackprop+spread), multiply error by this.
spread=0.05     Allows applying the fnem/fpem multipliers to values that are
                on the correct side of the cutoff, but too close to it.
epem=0.2        Excess positive error mult; error multiplier when 
                target>cutoff and value>target (overshot the target).
enem=0.2        Error multiplier when target<cutoff and value<target.
epm=0.2         Excess pivot mult; lower numbers give less priority to
                training samples that are excessively positive or negative.
cutoff=         Set both cutoffbackprop and cutoffeval.
ptriage=0.0001  Ignore this fraction of positive samples as untrainable.
ntriage=0.0005  Ignore this fraction of negative samples as untrainable.
anneal=0.003    Randomize weights by this much to avoid local minima.
annealprob=.225 Probability of any given weight being annealed per batch.
ebs=1           (edgeblocksize) A value of 8 gives the best performance with
                AVX256 in sparse networks; 4 may be useful for raw sequence
                data.

Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
                specify 200 megs. The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.
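
For example (illustrative), to train with a 20 gigabyte heap:

train.sh -Xmx20g in=data.tsv dims=5,12,7,1 out=trained.net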

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}

#This block allows symlinked shellscripts to correctly set classpath.
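#It follows any chain of symlinks to find the physical script directory.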
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
  cd "$(dirname "$DIR")"
  DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"

z="-Xmx8g"
z2="-Xms8g"
set=0

if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
	usage
	exit
fi

calcXmx () {
	source "$DIR""/calcmem.sh"
	setEnvironment
	parseXmx "$@"
	if [[ $set == 1 ]]; then
		return
	fi
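	#Otherwise auto-size the heap; freeRam's arguments are assumed to be a
	#default (8000m) and the percentage of free RAM to request (42).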
	freeRam 8000m 42
	z="-Xmx${RAM}m"
	z2="-Xms${RAM}m"
}
calcXmx "$@"

train() {
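	#Assemble the java command, echo it to stderr for the record, then run it.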
	local CMD="java $EA $EOOM $z $z2 $SIMD -cp $CP ml.Trainer $@"
	echo $CMD >&2
	eval $CMD
}

train "$@"