1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464
|
.TH "hmm2build" 1 "Oct 2003" "HMMER 2.3.2" "HMMER Manual"
.SH NAME
.TP
hmm2build - build a profile HMM from an alignment
.SH SYNOPSIS
.B hmm2build
.I [options]
.I hmmfile
.I alignfile
.SH DESCRIPTION
.B hmm2build
reads a multiple sequence alignment file
.I alignfile
, builds a new profile HMM, and saves the HMM in
.I hmmfile.
.PP
.I alignfile
may be in ClustalW, GCG MSF, SELEX, Stockholm, or aligned FASTA
alignment format. The format is automatically detected.
.PP
By default, the model is configured to find one or more
nonoverlapping alignments to the complete model: multiple
global alignments with respect to the model, and local with
respect to the sequence.
This
is analogous to the behavior of the
.B hmmls
program of HMMER 1.
To configure the model for multiple
.I local
alignments
with respect to the model and local with respect to
the sequence,
a la the old program
.B hmmfs,
use the
.B -f
(fragment) option. More rarely, you may want to
configure the model for a single
global alignment (global with respect to both
model and sequence), using the
.B -g
option;
or to configure the model for a single local/local alignment
(a la standard Smith/Waterman, or the old
.B hmmsw
program), use the
.B -s
option.
.SH OPTIONS
.TP
.B -f
Configure the model for finding multiple domains per sequence,
where each domain can be a local (fragmentary) alignment. This
is analogous to the old
.B hmmfs
program of HMMER 1.
.TP
.B -g
Configure the model for finding a single global alignment to
a target sequence, analogous to
the old
.B hmms
program of HMMER 1.
.TP
.B -h
Print brief help; includes version number and summary of
all options, including expert options.
.TP
.BI -n " <s>"
Name this HMM
.I <s>.
.I <s>
can be any string of non-whitespace characters (e.g. one "word").
There is no length limit (at least not one imposed by HMMER;
your shell will complain about command line lengths first).
.TP
.BI -o " <f>"
Re-save the starting alignment to
.I <f>,
in Stockholm format.
The columns which were assigned to match states will be
marked with x's in an #=RF annotation line.
If either the
.B --hand
or
.B --fast
construction options were chosen, the alignment may have
been slightly altered to be compatible with Plan 7 transitions,
so saving the final alignment and comparing to the
starting alignment can let you view these alterations.
See the User's Guide for more information on this arcane
side effect.
.TP
.B -s
Configure the model for finding a single local alignment per
target sequence. This is analogous to the standard Smith/Waterman
algorithm or the
.B hmmsw
program of HMMER 1.
.TP
.B -A
Append this model to an existing
.I hmmfile
rather than creating
.I hmmfile.
Useful for building HMM libraries (like Pfam).
.TP
.B -F
Force overwriting of an existing
.I hmmfile.
Otherwise HMMER will refuse to clobber your existing HMM files,
for safety's sake.
.SH EXPERT OPTIONS
.TP
.B --amino
Force the sequence alignment to be interpreted as amino acid
sequences. Normally HMMER autodetects whether the alignment is
protein or DNA, but sometimes alignments are so small that
autodetection is ambiguous. See
.B --nucleic.
.TP
.BI --archpri " <x>"
Set the "architecture prior" used by MAP architecture construction to
.I <x>,
where
.I <x>
is a probability between 0 and 1. This parameter governs a geometric
prior distribution over model lengths. As
.I <x>
increases, longer models are favored a priori.
As
.I <x>
decreases, it takes more residue conservation in a column to
make a column a "consensus" match column in the model architecture.
The 0.85 default has been chosen empirically as a reasonable setting.
.TP
.B --binary
Write the HMM to
.I hmmfile
in HMMER binary format instead of readable ASCII text.
.TP
.BI --cfile " <f>"
Save the observed emission and transition counts to
.I <f>
after the architecture has been determined (e.g. after residues/gaps
have been assigned to match, delete, and insert states).
This option is used in HMMER development for generating data files
useful for training new Dirichlet priors. The format of
count files is documented in the User's Guide.
.TP
.B --fast
Quickly and heuristically determine the architecture of the model by
assigning all columns will more than a certain fraction of gap
characters to insert states. By default this fraction is 0.5, and it
can be changed using the
.B --gapmax
option.
The default construction algorithm is a maximum a posteriori (MAP)
algorithm, which is slower.
.TP
.BI --gapmax " <x>"
Controls the
.I --fast
model construction algorithm, but if
.I --fast
is not being used, has no effect.
If a column has more than a fraction
.I <x>
of gap symbols in it, it gets assigned to an insert column.
.I <x>
is a frequency from 0 to 1, and by default is set
to 0.5. Higher values of
.I <x>
mean more columns get assigned to consensus, and models get
longer; smaller values of
.I <x>
mean fewer columns get assigned to consensus, and models get
smaller.
.I <x>
.TP
.B --hand
Specify the architecture of the model by hand: the alignment file must
be in SELEX or Stockholm format, and the reference annotation
line (#=RF in SELEX, #=GC RF in Stockholm) is used to specify
the architecture. Any column marked with a non-gap symbol (such
as an 'x', for instance) is assigned as a consensus (match) column in
the model.
.TP
.BI --idlevel " <x>"
Controls both the determination of effective sequence number and
the behavior of the
.I --wblosum
weighting option. The sequence alignment is clustered by percent
identity, and the number of clusters at a cutoff threshold of
.I <x>
is used to determine the effective sequence number.
Higher values of
.I <x>
give more clusters and higher effective sequence
numbers; lower values of
.I <x>
give fewer clusters and lower effective sequence numbers.
.I <x>
is a fraction from 0 to 1, and
by default is set to 0.62 (corresponding to the clustering level used
in constructing the BLOSUM62 substitution matrix).
.TP
.BI --informat " <s>"
Assert that the input
.I seqfile
is in format
.I <s>;
do not run Babelfish format autodection. This increases
the reliability of the program somewhat, because
the Babelfish can make mistakes; particularly
recommended for unattended, high-throughput runs
of HMMER. Valid format strings include FASTA,
GENBANK, EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,
CLUSTAL, and PHYLIP. See the User's Guide for a complete
list.
.TP
.B --noeff
Turn off the effective sequence number calculation, and use the
true number of sequences instead. This will usually reduce the
sensitivity of the final model (so don't do it without good reason!)
.TP
.B --nucleic
Force the alignment to be interpreted as nucleic acid sequence,
either RNA or DNA. Normally HMMER autodetects whether the alignment is
protein or DNA, but sometimes alignments are so small that
autodetection is ambiguous. See
.B --amino.
.TP
.BI --null " <f>"
Read a null model from
.I <f>.
The default for protein is to use average amino acid frequencies from
Swissprot 34 and p1 = 350/351; for nucleic acid, the default is
to use 0.25 for each base and p1 = 1000/1001. For documentation
of the format of the null model file and further explanation
of how the null model is used, see the User's Guide.
.TP
.BI --pam " <f>"
Apply a heuristic PAM- (substitution matrix-) based prior on match
emission probabilities instead of
the default mixture Dirichlet. The substitution matrix is read
from
.I <f>.
See
.B --pamwgt.
The default Dirichlet state transition prior and insert emission prior
are unaffected. Therefore in principle you could combine
.B --prior
with
.B --pam
but this isn't recommended, as it hasn't been tested. (
.B --pam
itself hasn't been tested much!)
.TP
.BI --pamwgt " <x>"
Controls the weight on a PAM-based prior. Only has effect if
.B --pam
option is also in use.
.I <x>
is a positive real number, 20.0 by default.
.I <x>
is the number of "pseudocounts" contriubuted by the heuristic
prior. Very high values of
.I <x>
can force a scoring system that is entirely driven by the
substitution matrix, making
HMMER somewhat approximate Gribskov profiles.
.TP
.BI --pbswitch " <n>"
For alignments with a very large number of sequences,
the GSC, BLOSUM, and Voronoi weighting schemes are slow;
they're O(N^2) for N sequences. Henikoff position-based
weights (PB weights) are more efficient. At or above a certain
threshold sequence number
.I <n>
.B hmm2build
will switch from GSC, BLOSUM, or Voronoi weights to
PB weights. To disable this switching behavior (at the cost
of compute time, set
.I <n>
to be something larger than the number of sequences in
your alignment.
.I <n>
is a positive integer; the default is 1000.
.TP
.BI --prior " <f>"
Read a Dirichlet prior from
.I <f>,
replacing the default mixture Dirichlet.
The format of prior files is documented in the User's Guide,
and an example is given in the Demos directory of the HMMER
distribution.
.TP
.BI --swentry " <x>"
Controls the total probability that is distributed to local entries
into the model, versus starting at the beginning of the model
as in a global alignment.
.I <x>
is a probability from 0 to 1, and by default is set to 0.5.
Higher values of
.I <x>
mean that hits that are fragments on their left (N or 5'-terminal) side will be
penalized less, but complete global alignments will be penalized more.
Lower values of
.I <x>
mean that fragments on the left will be penalized more, and
global alignments on this side will be favored.
This option only affects the configurations that allow local
alignments,
e.g.
.B -s
and
.B -f;
unless one of these options is also activated, this option has no effect.
You have independent control over local/global alignment behavior for
the N/C (5'/3') termini of your target sequences using
.B --swentry
and
.B --swexit.
.TP
.BI --swexit " <x>"
Controls the total probability that is distributed to local exits
from the model, versus ending an alignment at the end of the model
as in a global alignment.
.I <x>
is a probability from 0 to 1, and by default is set to 0.5.
Higher values of
.I <x>
mean that hits that are fragments on their right (C or 3'-terminal) side will be
penalized less, but complete global alignments will be penalized more.
Lower values of
.I <x>
mean that fragments on the right will be penalized more, and
global alignments on this side will be favored.
This option only affects the configurations that allow local
alignments,
e.g.
.B -s
and
.B -f;
unless one of these options is also activated, this option has no effect.
You have independent control over local/global alignment behavior for
the N/C (5'/3') termini of your target sequences using
.B --swentry
and
.B --swexit.
.TP
.B --verbose
Print more possibly useful stuff, such as the individual scores for
each sequence in the alignment.
.TP
.B --wblosum
Use the BLOSUM filtering algorithm to weight the sequences,
instead of the default.
Cluster the sequences at a given percentage identity
(see
.B --idlevel);
assign each cluster a total weight of 1.0, distributed equally
amongst the members of that cluster.
.TP
.B --wgsc
Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting
algorithm. This is already the default, so this option has no effect
(unless it follows another option in the --w family, in which case it
overrides it).
.TP
.B --wme
Use the Krogh/Mitchison maximum entropy algorithm to "weight"
the sequences. This supersedes the Eddy/Mitchison/Durbin
maximum discrimination algorithm, which gives almost
identical weights but is less robust. ME weighting seems
to give a marginal increase in sensitivity
over the default GSC weights, but takes a fair amount of time.
.TP
.B --wnone
Turn off all sequence weighting.
.TP
.B --wpb
Use the Henikoff position-based weighting scheme.
.TP
.B --wvoronoi
Use the Sibbald/Argos Voronoi sequence weighting algorithm
in place of the default GSC weighting.
.SH SEE ALSO
Master man page, with full list of and guide to the individual man
pages: see
.B hmmer2(1).
.PP
For complete documentation, see the user guide (ftp://selab.janelia.org/pub/software/hmmer/2.3.2/Userguide.pdf); or see the HMMER web page,
http://hmmer.janelia.org/.
.SH COPYRIGHT
.nf
Copyright (C) 1992-2003 HHMI/Washington University School of Medicine.
Freely distributed under the GNU General Public License (GPL).
.fi
See the file COPYING in your distribution for details on redistribution
conditions.
.SH AUTHOR
.nf
Sean Eddy
HHMI/Dept. of Genetics
Washington Univ. School of Medicine
4566 Scott Ave.
St Louis, MO 63110 USA
http://www.genetics.wustl.edu/eddy/
.fi
|