.TH "cmcalibrate" 1 "July 2016" "Infernal 1.1.2" "Infernal Manual"

.SH NAME
cmcalibrate - fit exponential tails for covariance model E-value determination

.SH SYNOPSIS
.B cmcalibrate
.I [options]
.I cmfile

.SH DESCRIPTION

.PP
.B cmcalibrate
determines exponential tail parameters for E-value determination by
generating random sequences, searching them with the CM and collecting
the scores of the resulting hits. A histogram of the bit scores of
the hits is fit to an exponential tail, and the parameters of the
fitted tail are saved to the CM file. The exponential tail parameters
are then used to estimate the statistical significance of hits found
in 
.B cmsearch
and
.B cmscan. 
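.PP
As a rough illustration of how fitted tail parameters translate into
significance estimates, the sketch below assumes the textbook
exponential survival function P(S >= x) = exp(-lambda * (x - mu)) for
scores x above the tail start mu. The function names, the
.I n_trials
scaling factor and the example numbers are hypothetical. This is not
Infernal's internal code, which may scale and extrapolate the fit
differently.
.PP
.nf
```python
import math

LN2 = math.log(2.0)  # natural log of 2, about 0.693147181


def tail_pvalue(score, mu, lam):
    """P(S >= score) under an assumed exponential tail that starts at mu."""
    if score <= mu:
        return 1.0  # at or below the tail start the model gives no information
    return math.exp(-lam * (score - mu))


def tail_evalue(score, mu, lam, n_trials):
    """Expected count of hits scoring >= score across n_trials trials."""
    return n_trials * tail_pvalue(score, mu, lam)


# Hypothetical numbers, not taken from any real CM file.
p = tail_pvalue(score=30.0, mu=10.0, lam=LN2)
print(p)  # close to 2 ** -20, since lambda = ln 2 halves P per extra bit
print(tail_evalue(score=30.0, mu=10.0, lam=LN2, n_trials=1000))
```
.fi
.PP
Under this assumption, a lambda equal to the natural log of 2 means
that each additional bit of score halves the estimated tail probability.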

.PP
A CM file must be calibrated with 
.B cmcalibrate
before it can be used in 
.B cmsearch 
or 
.B cmscan,
with a single exception: it is not necessary to calibrate CM files
that include only models with zero basepairs before running
.B cmsearch.


.PP
.B cmcalibrate
is very slow. It takes a couple of hours
to calibrate a single average-sized CM on a single CPU. 
.B cmcalibrate
will run in parallel on all available cores if Infernal was built on a
system that supports POSIX threading (see the Installation section of
the user guide for more information). Using 
.B <n>
cores will result in roughly
.BR <n> -fold
acceleration versus a single CPU.  MPI (Message Passing
Interface) can also be used for parallelization with the
.B --mpi 
option if Infernal was built with MPI enabled, but using more than 161
processors is not recommended because increasing past 161 won't
accelerate the calibration.  See the Installation section of the user
guide for more information.
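.PP
The scaling statements above can be turned into a back-of-the-envelope
estimate. The sketch below assumes perfectly linear scaling, which real
runs will only approximate, and treats the 161-processor limit as a
hard cap for MPI. The function and constant names are hypothetical;
the
.B --forecast
option remains the way to get an estimate from the program itself.
.PP
.nf
```python
MPI_USEFUL_LIMIT = 161  # from the text above: more processors do not help


def estimated_hours(serial_hours, n_workers, mpi=False):
    """Idealized wall-clock estimate assuming perfectly linear scaling."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Assumption: beyond the MPI limit, extra processors add nothing.
    effective = min(n_workers, MPI_USEFUL_LIMIT) if mpi else n_workers
    return serial_hours / effective


print(estimated_hours(2.0, 8))              # 0.25
print(estimated_hours(2.0, 500, mpi=True))  # same as with 161 processors
```
.fi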

.PP
The 
.B --forecast  
option can be used to estimate how long the program will take to run
for a given 
.I cmfile
on the current machine.
To predict the running time on
.I <n> 
processors with MPI, additionally use the
.BI --nforecast " <n>"
option.

.PP
The random sequences searched in 
.B cmcalibrate
are generated by an HMM that was trained on real genomic sequences
with various GC contents. The goal is to have the GC distributions in
the random sequences be similar to those in actual genomic sequences.

.PP
Four rounds of searches and subsequent exponential tail fits are
performed, one each for the four different CM algorithms that can be
used in 
.B cmsearch 
and 
.B cmscan:
glocal CYK, glocal Inside, local CYK and local Inside.

.PP
The E-value parameters determined by 
.B cmcalibrate
are only used by the
.B cmsearch 
and
.B cmscan 
programs.
If you are not going to use these programs then
do not waste time calibrating your models.

.SH OPTIONS

.TP
.B -h
Help; print a brief reminder of command line usage and available
options.

.TP
.BI -L " <x>"
Set the total length of random sequences to search 
to 
.I <x> 
megabases (Mb). By default, 
.I <x>
is 1.6 Mb. Increasing 
.I <x> 
will make the exponential tail fits more precise and 
E-values more accurate, but will take longer (doubling
.I <x> 
will roughly double the running time).
Decreasing 
.I <x> 
is not recommended as it will make the fits less
precise and the E-values less accurate.

.SH OPTIONS FOR PREDICTING REQUIRED TIME AND MEMORY

.TP
.B --forecast
Predict the running time of the calibration of 
.I cmfile 
(with provided options) on the current machine 
and exit. The calibration is not performed.
The predictions should be considered rough
estimates. If multithreading is enabled
(see Installation section of user guide), the timing 
will take into account the number of available cores.

.TP
.BI --nforecast " <n>"
With 
.B --forecast,
specify that 
.I <n>
processors will be used for the calibration.
This might be useful for predicting the running time of an MPI run 
with 
.I <n> 
processors.

.TP
.B --memreq
Predict the amount of required memory for calibrating
.I cmfile 
(with provided options) on the current machine 
and exit. The calibration is not performed.

.SH OPTIONS CONTROLLING EXPONENTIAL TAIL FITS

.TP
.BI --gtailn " <x>"
Fit the exponential tail for glocal Inside and glocal CYK to the 
.I <n> 
highest scores in the histogram tail, where
.I <n> 
is 
.I <x>
times the number of Mb searched. The default value of 
.I <x>
is 250. 
The value 250 was chosen because it works well empirically
relative to other values.

.TP
.BI --ltailn " <x>"
Fit the exponential tail for local Inside and local CYK to the 
.I <n> 
highest scores in the histogram tail, where
.I <n> 
is 
.I <x>
times the number of Mb searched. The default value of 
.I <x>
is 750. 
The value 750 was chosen because it works well empirically
relative to other values.

.TP
.BI --tailp " <x>"
Ignore the
.B --gtailn
and
.B --ltailn
options and fit an exponential tail to the top fraction
.I <x>
of the histogram, for all
search modes.
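.PP
The arithmetic behind the tail-size options can be sketched as follows.
The defaults (250, 750 and 1.6 Mb) come from this page; the function
names, the rounding to the nearest integer and the
.I total_hits
argument are assumptions made for illustration only.
.PP
.nf
```python
def tail_hits_from_multiplier(x, megabases=1.6):
    """Hits in the fitted tail when the tail size is x per Mb searched."""
    return int(round(x * megabases))  # rounding is an assumption


def tail_hits_from_fraction(fraction, total_hits):
    """Hits in the fitted tail when a fixed fraction of the histogram is fit."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(round(fraction * total_hits))


print(tail_hits_from_multiplier(250))  # glocal default: 400
print(tail_hits_from_multiplier(750))  # local default: 1200
```
.fi
.PP
Doubling the sequence length with
.B -L
doubles both tail sizes in this sketch, which is consistent with larger
searches giving more precise fits.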

.SH OPTIONAL OUTPUT FILES

.TP 
.BI --hfile " <f>"
Save the histograms fit to file
.I <f>.
The format of this file is two space delimited columns per line. The first column
is the x-axis values of bit scores of each bin. The second column is the y-axis
values of number of hits per bin. Each series is delimited by a line
with a single character "&". The file will contain one series for each
of the four exponential tail fits in the following order: glocal CYK,
glocal Inside, local CYK, and local Inside.

.TP 
.BI --sfile " <f>"
Save survival plot information to file
.I <f>.
The format of this file is two space delimited columns per line. The first column
is the x-axis values of bit scores of each bin. The second column is the y-axis
values of fraction of hits that meet or exceed the score for each
bin. Each series is delimited by a line with a single character "&". 
The file will contain three series of data for each
of the four CM search modes in the following order: glocal CYK,
glocal Inside, local CYK, and local Inside.
The first series is the empirical survival plot from the histogram of hits
to the random sequence. The second series is the exponential tail fit
to the empirical distribution. The third series is the exponential
tail fit if lambda were fixed and set as the natural log of 2 (0.693147181).

.TP 
.BI --qqfile " <f>"
Save quantile-quantile plot information to file
.I <f>.
The format of this file is two space delimited columns per line. The first column
is the x-axis values, and the second column is the y-axis
values. The distance of the points from the identity line (y=x) is a
measure of how good the exponential tail fit is: the closer the points
are to the identity line, the better the fit.
Each series is delimited by a line with a single character "&". 
The file will contain one series of empirical data for each of the
four exponential tail fits in the following order:
glocal CYK, glocal Inside, local CYK and local Inside.

.TP 
.BI --ffile " <f>"
Save space delimited statistics of different exponential tail fits to file
.I <f>.
The file will contain the lambda and mu values for exponential tails
fit to histogram tails of different sizes. The fields in the file are
labelled informatively.

.TP 
.BI --xfile " <f>"
Save a list of the scores in each fit histogram tail to file
.I <f>.
Each line of this file holds a single score, indicating that one hit
with that score existed in the tail.  Each series is
delimited by a line with a single character "&". The file will contain
one series for each of the four exponential tail fits in the following
order: glocal CYK, glocal Inside, local CYK, and local Inside.
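.PP
The two-column files written by
.B --hfile,
.B --sfile
and
.B --qqfile
share the same "&"-delimited layout, so one small reader can handle all
three. The sketch below is a minimal example, not part of Infernal.
All function names are hypothetical, and skipping blank lines and lines
starting with "#" is an assumption about the files rather than
documented behavior.
.PP
.nf
```python
MODES = ["glocal CYK", "glocal Inside", "local CYK", "local Inside"]


def parse_series(text):
    """Split text into series of (x, y) float pairs, delimited by '&' lines."""
    series, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no data
        if line == "&":
            if current:
                series.append(current)
            current = []
            continue
        x, y = line.split()[:2]
        current.append((float(x), float(y)))
    if current:
        series.append(current)  # tolerate a missing final '&'
    return series


def label_by_mode(series, per_mode=1):
    """Group consecutive series under the four modes, in documented order."""
    if len(series) != per_mode * len(MODES):
        raise ValueError("unexpected number of series")
    return {mode: series[i * per_mode:(i + 1) * per_mode]
            for i, mode in enumerate(MODES)}


def max_identity_deviation(points):
    """Largest |y - x| over QQ points; smaller means closer to y = x."""
    return max(abs(y - x) for x, y in points)


# Toy input with two series; real files hold four (or twelve for --sfile).
demo = """
10.0 5
11.0 2
&
9.5 7
&
"""
print(len(parse_series(demo)))  # 2
```
.fi
.PP
For a survival plot file, calling the hypothetical
label_by_mode(series, per_mode=3) groups the empirical curve, the
fitted tail and the fixed-lambda tail under each search mode.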

.SH OTHER OPTIONS

.TP
.BI --seed " <n>"
Seed the random number generator with
.I <n>,
an integer >= 0. 
If 
.I <n> 
is nonzero, stochastic simulations will be reproducible; the same
command will give the same results.
If 
.I <n>
is 0, the random number generator is seeded arbitrarily, and
stochastic simulations will vary from run to run of the same command.
The default seed is 181.

.TP
.BI --beta " <x>"
By default query-dependent banding (QDB)
is used to accelerate the CM search algorithms with a beta tail loss
probability of 1E-15.
This beta value can be changed to 
.I <x>
with
.BI --beta " <x>".
The beta parameter is the amount of probability mass excluded
during band calculation; higher values of beta give greater speedups
but sacrifice more accuracy than lower values. (For more information on QDB see 
Nawrocki and Eddy, PLoS Computational Biology 3(3): e56.) 

.TP
.B --nonbanded
Turn off QDB during E-value calibration. This will slow down
calibration.

.TP 
.B --nonull3 
Turn off the null3 post hoc additional null model. This is not
recommended unless you plan on using the same option to 
.B cmsearch 
and/or
.B cmscan.

.TP 
.B --random
Use the background null model of the CM to generate the random
sequences, instead of the more realistic HMM. Unless the CM 
was built using the 
.B --null
option to 
.B cmbuild,
the background null model will be 25% each A, C, G and U.

.TP 
.BI --gc " <f>" 
Generate the random sequences using the 
nucleotide distribution from the sequence file
.I <f>.

.TP
.BI --cpu " <n>"
Specify that 
.I <n>
parallel CPU workers be used. If 
.I <n> 
is set as "0", then the program will be run in serial mode, without
using threads. 
You can also control
this number by setting an environment variable, 
.I INFERNAL_NCPU.
This option will only be available if the machine on
which Infernal was built is capable of using POSIX threading (see the
Installation section of the user guide for more information).

.TP
.B --mpi
Run as an MPI parallel program. This option will only be available if
Infernal has been configured and built with the "--enable-mpi" flag
(see the Installation section of the user guide for more information).
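.PP
When calibrations are launched from a script, it can help to assemble
the command line in one place. The helper below is a hypothetical
sketch that uses only options described on this page and places them
before the CM file, as in the SYNOPSIS. It builds the argument list
without running anything, and "my.cm" is a placeholder file name.
.PP
.nf
```python
def cmcalibrate_args(cmfile, forecast=False, nforecast=None,
                     seed=None, cpu=None, mpi=False, length_mb=None):
    """Build an argument list using only options described on this page."""
    if nforecast is not None and not forecast:
        raise ValueError("--nforecast is only meaningful with --forecast")
    if seed is not None and seed < 0:
        raise ValueError("--seed expects an integer >= 0")
    args = ["cmcalibrate"]
    if length_mb is not None:
        args += ["-L", str(length_mb)]
    if forecast:
        args.append("--forecast")
    if nforecast is not None:
        args += ["--nforecast", str(nforecast)]
    if seed is not None:
        args += ["--seed", str(seed)]
    if cpu is not None:
        args += ["--cpu", str(cpu)]
    if mpi:
        args.append("--mpi")
    args.append(cmfile)  # options come before the CM file
    return args


# Ask for a time estimate for a planned 16-processor MPI run.
print(cmcalibrate_args("my.cm", forecast=True, nforecast=16))
```
.fi
.PP
The resulting list could be passed to Python's subprocess.run to
execute the command on a machine where Infernal is installed.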


.SH SEE ALSO 

See 
.B infernal(1)
for a master man page with a list of all the individual man pages
for programs in the Infernal package.

.PP
For complete documentation, see the user guide that came with your
Infernal distribution (Userguide.pdf); or see the Infernal web page
().


.SH COPYRIGHT

.nf
Copyright (C) 2016 Howard Hughes Medical Institute.
Freely distributed under a BSD open source license.
.fi

For additional information on copyright and licensing, see the file
called COPYRIGHT in your Infernal source distribution, or see the Infernal
web page 
().

.SH AUTHOR

.nf
The Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org
.fi