1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972
|
\chapter{Input files and formats}
\label{chapter:formats}
\setcounter{footnote}{0}
\section{Reading from files, compressed files, and pipes}
Generally, HMMER programs read their sequence and/or profile input
from files. Unix power users often find it convenient to string an
incantation of commands together with pipes.\sidenote{Indeed, such
wizardly incantations are a point of pride.} For example, you might
extract a subset of query sequences from a larger file using a
one-liner combination of scripting commands (python, perl, awk,
whatever). To facilitate the use of HMMER programs in such
incantations, you can almost always use an argument of '-' (dash) in
place of a filename, and the program will take its input from a
standard input pipe instead of opening a file.
For example, the following three commands are equivalent, and give
essentially identical output:
\vspace{1ex}
\user{\% hmmsearch globins4.hmm uniprot\_sprot.fasta} \\
\user{\% cat globins4.hmm | hmmsearch - uniprot\_sprot.fasta} \\
\user{\% cat uniprot\_sprot.fasta | hmmsearch globins4.hmm - } \\
\vspace{1ex}
Most Easel ``miniapp'' programs share the same pipe-reading ability.
Because the programs for profile HMM fetching (\mono{hmmfetch}) and
sequence fetching (\mono{esl-sfetch}) can fetch any number of profiles
or sequences by names/accessions given in a list, \emph{and} these
programs can also read these lists from a stdin pipe, you can craft
incantations that generate subsets of queries or targets on the
fly. For example:
\vspace{1ex}
\begin{fullwidth}
\user{\indent \% esl-sfetch --index uniprot\_sprot.fasta} \\
\user{\% cat mytargs.list | esl-sfetch -f uniprot\_sprot.fasta - | hmmsearch globins4.hmm -} \\
\end{fullwidth}
\vspace{1ex}
This takes a list of sequence names/accessions in
\mono{mytargets.list}, fetches them one by one from UniProt (note that
we index the UniProt file first, for fast retrieval; and note that
\mono{esl-sfetch} is reading its \mono{<namefile>} list of
names/accessions through a pipe using the '-' argument), and pipes
them to an \mono{hmmsearch}. It should be obvious from this that we
can replace the \mono{cat mytargets.list} with \emph{any} incantation
that generates a list of sequence names/accessions (including SQL
database queries).
Ditto for piping subsets of profiles. Supposing you have a copy of Pfam in Pfam-A.hmm:
\vspace{1ex}
\begin{fullwidth}
\user{\indent \% hmmfetch --index Pfam-A.hmm} \\
\user{\% cat myqueries.list | hmmfetch -f Pfam.hmm - | hmmsearch - uniprot\_sprot.fasta}\\
\end{fullwidth}
\vspace{1ex}
This takes a list of query profile names/accessions in
\mono{myqueries.list}, fetches them one by one from Pfam, and does an
hmmsearch with each of them against UniProt. As above, the \mono{cat
myqueries.list} part can be replaced by any suitable incantation
that generates a list of profile names/accessions.
There are three kinds of cases where using '-' is restricted or
doesn't work. A fairly obvious restriction is that you can only use
one '-' per command; you can't do a \mono{hmmsearch - -} that tries to
read both profile queries and sequence targets through the same stdin
pipe. Second, another case is when an input file must be obligately
associated with additional, separately generated auxiliary files, so
reading data from a single stream using '-' doesn't work because the
auxiliary files aren't present (in this case, using '-' will be
prohibited by the program). An example is \mono{hmmscan}, which needs
its \mono{<hmmfile>} argument to be associated with four auxiliary
files named \mono{<hmmfile>.h3\{mifp\}} that \mono{hmmpress} creates,
so \mono{hmmscan} does not permit a '-' for its \mono{<hmmfile>}
argument. Finally, when a command would require multiple passes over
an input file, the command will generally abort after the first pass
if you are trying to read that file through a standard input pipe
(pipes are nonrewindable in general; a few HMMER or Easel programs
will buffer input streams to make multiple passes possible, but this
is not usually the case). An example would be trying to search a file
containing multiple profile queries against a streamed target
database:
\vspace{1ex}
\begin{fullwidth}
\user{\indent \% cat myqueries.list | hmmfetch -f Pfam.hmm > many.hmms}\\
\user{\% cat mytargets.list | esl-sfetch -f uniprot\_sprot.fasta - | hmmsearch many.hmms -}\\
\end{fullwidth}
\vspace{1ex}
This will fail. Unfortunately the above business about how it will
``generally abort after the first pass'' means it fails weirdly. The
first query profile search will succeed, and its output will appear;
then an error message will be generated when \mono{hmmsearch} sees the
\emph{second} profile query and oops, suddenly realizes it is unable
to rewind the target sequence database stream. This is inherent in how
it reads the profile HMM query file sequentially as a stream (which is
what's allowing it to read input from stdin pipes in the first place),
one model at a time: it doesn't see there's more than one query model
in the file until it gets to the second model.
This case isn't too restricting because the same end goal can be
achieved by reordering the commands. In cases where you want to do
multiple queries against multiple targets, you always want to be
reading the \emph{queries} from a stdin pipe, not the targets:
\vspace{1ex}
\user{\% cat mytargets.list | esl-sfetch -f uniprot\_sprot.fasta > mytarget.seqs}\\
\user{\% cat myqueries.list | hmmfetch -f Pfam.hmm - | hmmsearch - mytarget.seqs}\\
\vspace{1ex}
So in this multiple queries/multiple targets case of using stdin
pipes, you just have to know, for any given program, which file it
considers to be queries and which it considers to be targets. (That
is, the logic in searching many queries against many targets is ``For
each query: search the target database; then rewind the target
database to the beginning.'') For \mono{hmmsearch}, the profiles are
queries and sequences are targets. For \mono{hmmscan}, the reverse.
In general, HMMER and Easel programs document in their man page
whether (and which) command line arguments can be replaced by '-'.
You can always check by trial and error, too. The worst that can
happen is a ``Failed to open file -'' error message, if the program
can't read from pipes.
\subsection{.gz compressed files}
In general, HMMER programs and Easel miniapps can also read \mono{.gz}
compressed files; they will uncompress them on the fly. You need to
have \mono{gunzip} installed on your system for this to work.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{HMMER profile HMM files}
\label{section:savefiles}
A HMMER profile file looks like this, with ...'s marking elisions made
for clarity and space:\sidenote{This is the \mono{globins4.hmm}
profile from the tutorial.}
\xsreoutput{inclusions/hmmbuild-globins.out2}
A profile file consists of one or more profiles. Each profile starts
with a format version identifier (here, \mono{HMMER3/\HMMERfmtversion{}}) and ends with
\mono{//} on a line by itself. The format version identifier allows
backward compatibility as the HMMER software evolves: it tells the
parser this file is from HMMER3's save file format version
\HMMERfmtversion{}.\footnote{HMMER 3.0 used 3/b format. HMMER 3.1 and 3.2 use 3/f
format. Some alpha test versions of 3.0 used 3/a format. Internal
development versions of 3.1 used 3/c, 3/d, and 3/e formats.} The
closing \mono{//} allows multiple profiles to be concatenated.
The format is divided into two regions. The first region contains
textual information and miscalleneous parameters in a roughly
tag-value scheme. This section ends with a line beginning with the
keyword \mono{HMM}. The second region is a tabular, whitespace-limited
format for the main model parameters.
All probability parameters are all stored as negative natural log
probabilities with five digits of precision to the right of the
decimal point, rounded. For example, a probability of $0.25$ is stored
as $-\log 0.25 = 1.38629$. The special case of a zero probability is
stored as '*'.
Spacing is arranged for human readability, but the parser only cares
that fields are separated by at least one space character.
A more detailed description of the format follows.
\subsection{header section}
The header section is parsed line by line in a tag/value format. Each
line type is either \textbf{mandatory} or \textbf{optional} as
indicated.
\begin{sreitems}{\monob{header}}
\item [\monob{HMMER3/\HMMERfmtversion{}}] Unique identifier for the save file format
version; the \mono{/\HMMERfmtversion{}} means that this is HMMER3 profile file format
version \HMMERfmtversion{}. When HMMER3 changes its save file format, the revision
code advances. This way, parsers can be backwards
compatible. The remainder of the line after the \mono{HMMER3/\HMMERfmtversion{}} tag
is free text that is ignored by the parser. HMMER currently writes
its version number and release date in brackets here,
e.g. \mono{\HMMERsavestamp{}} in this
example. \textbf{Mandatory.}
\item [\monob{NAME <s>}] Model name; \mono{<s>} is a single word
containing no spaces or tabs. The name is normally picked up from the
\mono{\#=GF ID} line from a Stockholm alignment file. If this is not
present, the name is created from the name of the alignment file by
removing any file type suffix. For example, an otherwise nameless HMM
built from the alignment file \mono{rrm.slx} would be named
\mono{rrm}. \textbf{Mandatory.}
\item [\monob{ACC <s>}] Accession number; \mono{<s>} is a one-word
accession number. This is picked up from the \mono{\#=GF AC} line in a
Stockholm format alignment. \textbf{Optional.}
\item [\monob{DESC <s>}] Description line; \mono{<s>} is a one-line
free text description. This is picked up from the \mono{\#=GF DE} line
in a Stockholm alignment file. \textbf{Optional.}
\item [\monob{LENG <d>}] Model length; \mono{<d>}, a positive nonzero
integer, is the number of match states in the model.
\textbf{Mandatory.}
\item [\monob{MAXL <d>}] Max instance length; \mono{<d>}, a positive
nonzero integer, is the upper bound on the length at which and instance
of the model is expected to be found. Used only by nhmmer and nhmmscan.
\textbf{Optional.}
\item [\monob{ALPH <s>}] Symbol alphabet type. For biosequence
analysis models, \mono{<s>} is \mono{amino}, \mono{DNA}, or \mono{RNA}
(case insensitive). There are also other accepted alphabets for
purposes beyond biosequence analysis, including \mono{coins},
\mono{dice}, and \mono{custom}. This determines the symbol alphabet
and the size of the symbol emission probability distributions. If
\mono{amino}, the alphabet size $K$ is set to 20 and the symbol
alphabet to ``ACDEFGHIKLMNPQRSTVWY'' (alphabetic order); if
\mono{DNA}, the alphabet size $K$ is set to 4 and the symbol alphabet
to ``ACGT''; if \mono{RNA}, the alphabet size $K$ is set to 4 and the
symbol alphabet to ``ACGU''. \textbf{Mandatory.}
\item [\monob{RF <s>}] Reference annotation flag; \mono{<s>} is
either \mono{no} or \mono{yes} (case insensitive). If \mono{yes}, the
reference annotation character field for each match state in the main
model (see below) is valid; if \mono{no}, these characters are
ignored. Reference column annotation is picked up from a Stockholm
alignment file's \mono{\#=GC RF} line. It is propagated to alignment
outputs, and also may optionally be used to define consensus match
columns in profile HMM construction. \textbf{Optional}; assumed to be
no if not present.
\item [\monob{MM <s>}] Model masked flag; \mono{<s>} is
either \mono{no} or \mono{yes} (case insensitive). If \mono{yes}, the
model mask annotation character field for each match state in the main
model (see below) is valid; if \mono{no}, these characters are
ignored. Indicates that the profile model was created such that
emission probabilities at masked positions are set to match the
background frequency, rather than being set based on observed frequencies
in the alignment. Position-specific insertion and deletion rates are not
altered, even in masked regions. \textbf{Optional}; assumed to be
no if not present.
\item [\monob{CONS <s>}] Consensus residue annotation flag;
\mono{<s>} is either \mono{no} or \mono{yes} (case insensitive). If
\mono{yes}, the consensus residue field for each match state in the
main model (see below) is valid. If \mono{no}, these characters are
ignored. Consensus residue annotation is determined when models are
built. For models of single sequences, the consensus is the same as
the query sequence. For models of multiple alignments, the consensus
is the maximum likelihood residue at each position. Upper case
indicates that the model's emission probability for the consensus
residue is $\geq$ an arbitrary threshold (0.5 for protein models,
0.9 for DNA/RNA models).
\item [\monob{CS <s>}] Consensus structure annotation flag;
\mono{<s>} is either \mono{no} or \mono{yes} (case insensitive). If
\mono{yes}, the consensus structure character field for each match
state in the main model (see below) is valid; if \mono{no} these
characters are ignored. Consensus structure annotation is picked up
from a Stockholm file's \mono{\#=GC SS\_cons} line, and propagated to
alignment displays. \textbf{Optional}; assumed to be no if not
present.
\item [\monob{MAP <s>}] Map annotation flag; \mono{<s>} is either
\mono{no} or \mono{yes} (case insensitive). If set to \mono{yes}, the
map annotation field in the main model (see below) is valid; if
\mono{no}, that field will be ignored. The HMM/alignment map
annotates each match state with the index of the alignment column from
which it came. It can be used for quickly mapping any subsequent
HMM alignment back to the original multiple alignment, via the model.
\textbf{Optional}; assumed to be no if not present.
\item [\monob{DATE <s>}] Date the model was constructed; \mono{<s>}
is a free text date string. This field is only used for logging
purposes.\footnote{HMMER does not use dates for any purpose other than
human-readable annotation, so it is no more prone than you are to Y2K,
Y2038, or any other date-related eschatology.} \textbf{Optional.}
\item [\monob{COM [<n>] <s>}] Command line log; \mono{<n>} counts
command line numbers, and \mono{<s>} is a one-line command. There may
be more than one \mono{COM} line per save file, each numbered starting
from $n=1$. These lines record every HMMER command that modified the
save file. This helps us reproducibly and automatically log how Pfam
models have been constructed, for example. \textbf{Optional.}
\item [\monob{NSEQ <d>}] Sequence number; \mono{<d>} is a nonzero
positive integer, the number of sequences that the HMM was trained on.
This field is only used for logging purposes.
\textbf{Optional.}
\item [\monob{EFFN <f>}] Effective sequence number; \mono{<f>} is a
nonzero positive real, the effective total number of sequences
determined by \mono{hmmbuild} during sequence weighting, for combining
observed counts with Dirichlet prior information in parameterizing the
model. This field is only used for logging purposes.
\textbf{Optional.}
\item [\monob{CKSUM <d>}] Training alignment checksum; \mono{<d>} is
a nonnegative unsigned 32-bit integer. This number is calculated
from the training sequence data, and used in conjunction with the
alignment map information to verify that a given alignment is indeed
the alignment that the map is for. \textbf{Optional.}
\item [\monob{GA <f> <f>}] Pfam gathering thresholds GA1 and GA2.
See Pfam documentation of GA lines. \textbf{Optional.}
\item [\monob{TC <f> <f>}] Pfam trusted cutoffs TC1 and TC2. See
Pfam documentation of TC lines. \textbf{Optional.}
\item [\monob{NC <f> <f>}] Pfam noise cutoffs NC1 and NC2. See Pfam
documentation of NC lines. \textbf{Optional.}
\item [\monob{STATS <s1> <s2> <f1> <f2>}] Statistical parameters
needed for E-value calculations. \mono{<s1>} is the model's
alignment mode configuration: currently only \mono{LOCAL} is
recognized. \mono{<s2>} is the name of the score distribution:
currently \mono{MSV}, \mono{VITERBI}, and \mono{FORWARD} are
recognized. \mono{<f1>} and \mono{<f2>} are two real-valued
parameters controlling location and slope of each distribution,
respectively; $\mu$ and $\lambda$ for Gumbel distributions for MSV
and Viterbi scores, and $\tau$ and $\lambda$ for exponential tails
for Forward scores. $\lambda$ values must be positive. All three
lines or none of them must be present: when all three are present,
the model is considered to be calibrated for E-value
statistics. \textbf{Optional.}
\item [\monob{HMM }] Flags the start of the main model
section. Solely for human readability of the tabular model data, the
symbol alphabet is shown on the \mono{HMM} line, aligned to the fields
of the match and insert symbol emission distributions in the main
model below. The next line is also for human readability, providing
column headers for the state transition probability fields in the main
model section that follows. Though unparsed after the \mono{HMM} tag,
the presence of two header lines is \textbf{mandatory:} the parser
always skips the line after the \mono{HMM} tag line.
\item [\monob{COMPO <f>*K}] The first line in the main model section
may be an optional line starting with \monob{COMPO}: these are the
model's overall average match state emission probabilities, which are
used as a background residue composition in the ``filter null''
model. The $K$ fields on this line are log probabilities for each
residue in the appropriate biosequence alphabet's
order. \textbf{Optional.}
\end{sreitems}
\subsection{main model section}
All the remaining fields are \textbf{mandatory}.
The first two lines in the main model section are
atypical.\footnote{That is, the first two lines after the optional
\mono{COMPO} line. Don't be confused by the presence of an optional \mono{COMPO}
line here. The \mono{COMPO} line is placed in the model section, below the
residue column headers, because it's an array of numbers much like
residue scores, but it's not really part of the model.} They
contain information for the core model's BEGIN node. This is stored as
model node 0, and match state 0 is treated as the BEGIN state. The
begin state is mute, so there are no match emission probabilities. The
first line is the insert 0 emissions. The second line contains the
transitions from the begin state and insert state 0. These seven
numbers are: $B \rightarrow M_1$, $B \rightarrow I_0$, $B \rightarrow
D_1$; $I_0 \rightarrow M_1$, $I_0 \rightarrow I_0$; then a 0.0 and a
'*', because by convention, nonexistent transitions from the
nonexistent delete state 0 are set to $\log 1 = 0$ and $\log 0 =
-\infty = $ `*'.
The remainder of the model has three lines per node, for $M$ nodes
(where $M$ is the number of match states, as given by the \mono{LENG}
line). These three lines are ($K$ is the alphabet size in residues):
\begin{sreitems}{\textbf{State transition line}}
\item [\textbf{Match emission line}] The first field is the node
number ($1 \ldots M$). The parser verifies this number as a
consistency check (it expects the nodes to come in order). The next
$K$ numbers for match emissions, one per symbol, in alphabetic order.
The next field is the \mono{MAP} annotation for this node. If
\mono{MAP} was \mono{yes} in the header, then this is an integer,
representing the alignment column index for this match state
(1..alen); otherwise, this field is `-'.
The next field is the \mono{CONS} consensus residue for this node. If
\mono{CONS} was \mono{yes} in the header, then this is a single
character, representing the consensus residue annotation for this
match state; otherwise, this field is `-'.
The next field is the \mono{RF} annotation for this node. If
\mono{RF} was \mono{yes} in the header, then this is a single
character, representing the reference annotation for this match state;
otherwise, this field is `-'.
The next field is the \mono{MM} mask value for this node. If
\mono{MM} was \mono{yes} in the header, then this is a single 'm'
character, indicating that the position was identified as a masked
position during model construction; otherwise, this field is `-'.
The next field is the \mono{CS} annotation for this node. If
\mono{CS} was \mono{yes}, then this is a single character,
representing the consensus structure at this match state; otherwise
this field is `-'.
\item [\textbf{Insert emission line}] The $K$ fields on this line are
the insert emission scores, one per symbol, in alphabetic order.
\item [\textbf{State transition line}] The seven fields on this line
are the transitions for node $k$, in the order shown by the transition
header line: $M_k \rightarrow M_{k+1}, I_{k}, D_{k+1}$; $ I_k
\rightarrow M_{k+1}, I_k$; $D_{k} \rightarrow M_{k+1}, D_{k+1}$.
For transitions from the final node $M$, match state $M+1$ is
interpreted as the END state $E$, and there is no delete state $M+1$;
therefore the final $M_k \rightarrow D_{k+1}$ and $D_k \rightarrow
D_{k+1}$ transitions are always * (zero probability), and the final
$D_k \rightarrow M_{k+1}$ transition is always 0.0 (probability 1.0).
\end{sreitems}
Finally, the last line of the format is the ``//'' record separator.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{Stockholm, the recommended multiple sequence alignment format}
\label{section:stockholm}
The Pfam and Rfam Consortiums have developed a multiple sequence
alignment format called ``Stockholm format'' that allows rich and
extensible annotation.
Most popular multiple alignment file formats can be changed into a
minimal Stockholm format file just by adding a Stockholm header line
and a trailing \mono{//} terminator:
\xsreoutput{inclusions/globins4.sto}
The first line in the file must be \mono{\# STOCKHOLM 1.x}, where
\mono{x} is a minor version number for the format specification (and
which currently has no effect on my parsers). This line allows a
parser to instantly identify the file format.
In the alignment, each line contains a name, followed by the aligned
sequence. A dash, period, underscore, or tilde (but not whitespace)
denotes a gap. If the alignment is too long to fit on one line, the
alignment may be split into multiple blocks, with blocks separated by
blank lines. The number of sequences, their order, and their names
must be the same in every block. Within a given block, each
(sub)sequence (and any associated \mono{\#=GR} and \mono{\#=GC} markup,
see below) is of equal length, called the \emph{block length}. Block
lengths may differ from block to block. The block length must be at
least one residue, and there is no maximum.
Other blank lines are ignored. You can add comments anywhere to the
file (even within a block) on lines starting with a \mono{\#}.
All other annotation is added using a tag/value comment style. The
tag/value format is inherently extensible, and readily made
backwards-compatible; unrecognized tags will simply be ignored. Extra
annotation includes consensus and individual RNA or protein secondary
structure, sequence weights, a reference coordinate system for the
columns, and database source information including name, accession
number, and coordinates (for subsequences extracted from a longer
source sequence) See below for details.
\subsection{syntax of Stockholm markup}
There are four types of Stockholm markup annotation, for per-file,
per-sequence, per-column, and per-residue annotation:
\begin{description}%{\monob{\#=GR <seqname> <tag> <..s..>}}
\item [\monob{\#=GF <tag> <s>}]
Per-file annotation. \mono{<s>} is a free format text line
of annotation type \mono{<tag>}. For example, \mono{\#=GF DATE
April 1, 2000}. Can occur anywhere in the file, but usually
all the \mono{\#=GF} markups occur in a header.
\item [\monob{\#=GS <seqname> <tag> <s>}]
Per-sequence annotation. \mono{<s>} is a free format text line
of annotation type \mono{tag} associated with the sequence
named \mono{<seqname>}. For example, \mono{\#=GS seq1
SPECIES\_SOURCE Caenorhabditis elegans}. Can occur anywhere
in the file, but in single-block formats (e.g. the Pfam
distribution) will typically follow on the line after the
sequence itself, and in multi-block formats (e.g. HMMER
output), will typically occur in the header preceding the
alignment but following the \mono{\#=GF} annotation.
\item [\monob{\#=GC <tag> <..s..>}]
Per-column annotation. \mono{<..s..>} is an aligned text line
of annotation type \mono{<tag>}.
\mono{\#=GC} lines are
associated with a sequence alignment block; \mono{<..s..>}
is aligned to the residues in the alignment block, and has
the same length as the rest of the block.
Typically \mono{\#=GC} lines are placed at the end of each block.
\item [\monob{\#=GR <seqname> <tag> <..s..>}]
Per-residue annotation. \mono{<..s..>} is an aligned text line
of annotation type \mono{<tag>}, associated with the sequence
named \mono{<seqname>}.
\mono{\#=GR} lines are
associated with one sequence in a sequence alignment block;
\mono{<..s..>}
is aligned to the residues in that sequence, and has
the same length as the rest of the block.
Typically
\mono{\#=GR} lines are placed immediately following the
aligned sequence they annotate.
\end{description}
\subsection{semantics of Stockholm markup}
Any Stockholm parser will accept syntactically correct files, but is
not obligated to do anything with the markup lines. It is up to the
application whether it will attempt to interpret the meaning (the
semantics) of the markup in a useful way. At the two extremes are the
Belvu alignment viewer and the HMMER profile hidden Markov model
software package.
Belvu simply reads Stockholm markup and displays it, without trying to
interpret it at all. The tag types (\mono{\#=GF}, etc.) are sufficient
to tell Belvu how to display the markup: whether it is attached to the
whole file, sequences, columns, or residues.
HMMER uses Stockholm markup to pick up a variety of information from
the Pfam multiple alignment database. The Pfam consortium therefore
agrees on additional syntax for certain tag types, so HMMER can parse
some markups for useful information. This additional syntax is imposed
by Pfam, HMMER, and other software of mine, not by Stockholm format
per se. You can think of Stockholm as akin to XML, and what my
software reads as akin to an XML DTD, if you're into that sort of
structured data format lingo.
The Stockholm markup tags that are parsed semantically by my software
are as follows:
\subsection{recognized \#=GF annotations}
\begin{sreitems}{\monob{TC <f> <f>}}
\item [\monob{ID <s>}]
Identifier. \monob{<s>} is a name for the alignment;
e.g. ``rrm''. One word. Unique in file.
\item [\monob{AC <s>}]
Accession. \monob{<s>} is a unique accession number for the
alignment; e.g.
``PF00001''. Used by the Pfam database, for instance.
Often a alphabetical prefix indicating the database
(e.g. ``PF'') followed by a unique numerical accession.
One word. Unique in file.
\item [\monob{DE <s>}]
Description. \monob{<s>} is a free format line giving
a description of the alignment; e.g.
``RNA recognition motif proteins''. One line. Unique in file.
\item [\monob{AU <s>}]
Author. \monob{<s>} is a free format line listing the
authors responsible for an alignment; e.g.
``Bateman A''. One line. Unique in file.
\item [\monob{GA <f> <f>}]
Gathering thresholds. Two real numbers giving HMMER bit score
per-sequence and per-domain cutoffs used in gathering the
members of Pfam full alignments.
\item [\monob{NC <f> <f>}]
Noise cutoffs. Two real numbers giving HMMER bit score
per-sequence and per-domain cutoffs, set according to the
highest scores seen for unrelated sequences when gathering
members of Pfam full alignments.
\item [\monob{TC <f> <f>}]
Trusted cutoffs. Two real numbers giving HMMER bit score
per-sequence and per-domain cutoffs, set according to the
lowest scores seen for true homologous sequences that
were above the GA gathering thresholds, when gathering
members of Pfam full alignments.
\end{sreitems}
\subsection{recognized \#=GS annotations}
\begin{sreitems}{\monob{WT <f>}}
\item [\monob{WT <f>}]
Weight. \monob{<f>} is a positive real number giving the
relative weight for a sequence, usually used to compensate
for biased representation by downweighting similar sequences.
Usually the weights average 1.0 (e.g. the weights sum to
the number of sequences in the alignment) but this is not
required. Either every sequence must have a weight annotated,
or none of them can.
\item [\monob{AC <s>}]
Accession. \monob{<s>} is a database accession number for
this sequence. (Compare the \mono{\#=GF AC} markup, which gives
an accession for the whole alignment.) One word.
\item [\monob{DE <s>}]
Description. \monob{<s>} is one line giving a description for
this sequence. (Compare the \mono{\#=GF DE} markup, which gives
a description for the whole alignment.)
\end{sreitems}
\subsection{recognized \#=GC annotations}
\begin{sreitems}{\monob{SA\_cons}}
\item [\monob{RF}]
Reference line. Any character is accepted as a markup for a
column. The intent is to allow labeling the columns with some
sort of mark.
\item [\monob{MM}]
Model mask line. An 'm' indicates that the column lies within a
masked range, so that \mono{hmmbuild} should produce emissions matching
the background for a match state corresponding to that alignment column.
Otherwise, a '.' is used.
\item [\monob{SS\_cons}]
Secondary structure consensus. For protein alignments,
DSSP codes or gaps are accepted as markup: [HGIEBTSCX.-\_], where
H is alpha helix, G is 3/10-helix, I is p-helix, E is extended
strand, B is a residue in an isolated b-bridge, T is a turn,
S is a bend, C is a random coil or loop, and X is unknown
(for instance, a residue that was not resolved in a crystal
structure).
\item [\monob{SA\_cons}]
Surface accessibility consensus. 0-9, gap symbols, or X are
accepted as markup. 0 means $<$10\% accessible residue surface
area, 1 means $<$20\%, 9 means $<$100\%, etc. X means unknown
structure.
\end{sreitems}
\subsection{recognized \#=GR annotations}
\begin{sreitems}{\monob{SA}}
\item [\monob{SS}]
Secondary structure consensus. See \mono{\#=GC SS\_cons} above.
\item [\monob{SA}]
Surface accessibility consensus. See \mono{\#=GC SA\_cons} above.
\item [\monob{PP}] Posterior probability for an aligned residue. This
represents the probability that this residue is assigned to the HMM
state corresponding to this alignment column, as opposed to some
other state. (Note that a residue can be confidently
\emph{unaligned}: a residue in an insert state or flanking N or C
state may have high posterior probability.) The posterior
probability is encoded as 11 possible characters \mono{0-9*+}: $0.0
\leq p < 0.05$ is coded as 0, $0.05 \leq p < 0.15$ is coded as 1,
(... and so on ...), $0.85 \leq p < 0.95$ is coded as 9, and $0.95
\leq p \leq 1.0$ is coded as '*'. Gap characters appear in the PP
line where no residue has been assigned.
\end{sreitems}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{A2M multiple alignment format}
% Adapted from Easel documentation's format_a2m.tex by:
% - including format_a2m.tex
% - add the section header below
% - revise first pp to be about HMMER.
HMMER's Easel library routines are capable of writing alignments in UC
Santa Cruz ``A2M'' (alignment to model) format, the native input
format for the UCSC SAM profile HMM software package.
To select A2M format, use the format code \mono{a2m}: for example,
to reformat a Stockholm alignment to A2M:
\user{esl-reformat a2m myali.sto}
Easel currently does not read A2M format, and it currently only writes
in what UCSC calls ``dotless'' A2M format.
The most official documentation for A2M format appears to be at
\url{http://compbio.soe.ucsc.edu/a2m-desc.html}. You may refer to that
document if anything in the brief description below is unclear.
\subsection{An example A2M file}
This alignment:
\begin{sreoutput}
seq1 ACDEF...GHIKLMNPQTVWY
seq2 ACDEF...GHIKLMNPQTVWY
seq3 ---EFmnrGHIKLMNPQT---
\end{sreoutput}
\noindent
is encoded in A2M format as:
\begin{sreoutput}
>seq1 Sequence 1 description
ACDEFGHIKLMNPQTVWY
>seq2 Sequence 2 description
ACDEFGHIKLMNPQTVWY
>seq3 Sequence 3 description
---EFmnrGHIKLMNPQT---
\end{sreoutput}
A2M format looks a lot like aligned FASTA format. A crucial difference
is that the aligned sequences in a ``dotless'' A2M file do not
necessarily all have the same number of characters. The format
distinguishes between ``consensus columns'' (where residues are in
upper case and gaps are a dash, `-') and ``insert columns'' (where
residues are in lower case and gaps are dots, `.', that aren't
explicitly shown in the format -- hence ``dotless'' A2M). The position
and number of gaps in insert columns (dots) is implicit in this
representation. An advantage of this format is its compactness.
This representation only works if all insertions relative to consensus
are considered to be unaligned characters. That is how insertions are
handled by profile HMM implementations like SAM and HMMER, and profile
SCFG implementations like Infernal.
Thus every sequence must have the same number of consensus columns
(upper case letters plus `-' characters), and may have additional lower
case letters for insertions.
\subsection{Legal characters}
A2M (and SAM) do not support some special characters such as `*'
(not-a-residue) or `\mono{\~}' (missing data). Easel outputs these
characters as gaps: either `-' in a consensus column, or nothing in an
insert column.
The SAM software parses only a subset of legal ambiguity codes for
amino acids and nucleotides. For amino acids, it only reads \{BXZ\} in
addition to the 20 standard one-letter codes. For nucleotides, it only
reads \{NRY\} in addition to \{ACGTU\}. With one crucial exception, it
treats all other letters as X or N.
The crucial exception is `O'. SAM reads an `O' as the position of a
``free insertion module'' (FIM), a concept specific to SAM-style
profile HMMs. This has no impact on nucleic acid sequences, where `O'
is not a legal character. But in amino acid sequences, `O' means
pyrrolysine, one of the unusual genetically-encoded amino acids. This
means that A2M format alignments must not contain pyrrolysine
residues, lest they be read as FIMs. For this reason, Easel converts
`O' residues to `X' when it writes an amino acid alignment in A2M
format.
\subsection{Determining consensus columns}
Writing A2M format requires knowing which alignment columns are
supposed to be considered consensus and which are considered
inserts. If the alignment was produced by HMMER or Infernal, then the
alignment has so-called ``reference annotation'' (what appears as a
\mono{\#=GC RF} line in Stockholm format) marking the consensus
columns.
Often, an alignment has no reference annotation; for example, if it
has been read from an alignment format that has no reference
annotation line (only Stockholm and SELEX formats support reference
annotation). In this case, Easel internally generates a ``reasonable''
guess at consensus columns, using essentially the same procedure that
HMMER's \mono{hmmbuild} program uses by default: sequence fragments
(sequences $<$50\% of the mean sequence length in the alignment
overall) are ignored, and for the remaining sequences, any column
containing $\geq$ 50\% residues is considered to be a consensus
column.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{hmmpgmd sequence database format}
The hmmpgmd sequence database format closely resembles the
FASTA format, with slight modification to support use within HMMER's
\mono{hmmpgmd} daemon.
The \mono{hmmpgmd} program enables search of one or more sequence
databases (e.g. NR, SwissProt, UniProt) within a single instance,
having loaded a single sequence file into memory. Because the set of
sequences found within the various databases may overlap, the hmmpgmd
format allows each sequence to be stored once, and includes a small piece of
metadata that indicates, for each sequence, the list of source databases in
which the sequence is found. When a search is performed in \mono{hmmpgmd}, a
single target database is specified, and search is restricted to sequences
belonging to that database.
Furthermore, because a single sequence might be found multiple times
within a single sequence database, \mono{hmmpgmd} is designed to compute
E-values not just on the total number of non-redundant sequences
within a database, but on the total number of sequences in the original
(possibly redundant) database, provided those redundant counts are
given in the hmmpgmd-formatted file.
The \mono{hmmpgmd} file begins with a single line containing various counts
describing the contents of the file, of the form
\monob{\#res\_cnt seq\_cnt db\_cnt cnt\_1 fullcnt\_1 cnt\_2 fullcnt\_2 $\ldots$ date\_stamp}
\subsection{Fields in header line}
\begin{sreitems}{\monob{header}}
\item [\monob{res\_cnt}] Number of residues in the sequence file.
\item [\monob{seq\_cnt}] Number of sequences in the sequence file.
\item [\monob{db\_cnt}] Number of databases in the sequence file.
\item [\monob{cnt\_i}] Of the sequnces in the file, the number that belong to
database \mono{i}. Note that if the file contains only a single
database, this will match \mono{seq\_cnt}.
\item [\monob{fullcnt\_i}] For database \mono{i}, the number of sequences
that should be used in computing E-values. If redundant
sequences were collapsed out of the original database, this may be
larger than \mono{cnt\_i}.
\end{sreitems}
\subsection{FASTA-like sequence format}
In the main body of the sequence file, database sequences are stored
sequentially, with each entry consisting of a one-line FASTA-like
header followed by one or more lines containing the amino acid sequence,
like
\begin{sreoutput}
>1 100
ACDEFGHIKLMNPQTVWY
>2 010
ACDKLMNPQTVWYEFGHI
>3 111
EFMNRGHIKLMNPQT
\end{sreoutput}
Note that the per-entry header line is made up of two parts. The first part
is a numeric, incrementing sequence identifier (the i'th entry has the
identifier \mono{i}). The second part is a bit string indicating database
membership. In this example, sequence 1 is found in database 1, sequence 2 is
found in database 2, and sequence 3 found in databases 1, 2, and 3. The number
of bits in each bit string should match \mono{db\_cnt}.
Because \mono{hmmpgmd} accepts only numeric sequence identifiers, it is
necessary to keep track of the mapping between each numeric sequence identifier
and the corresponding meta data (e.g. name, accession, description) external to
the hmmpgmd-format file, and post-process \mono{hmmpgmd} seach results to
apply those fields to the target sequence information.
Depending on the size of the map list, this might be easily acheived with a
simple perl array, or might require a more heavy-duty mapping backend such as
mongodb (\url{http://www.mongodb.org}).
\subsection{Creating a file in hmmpgmd format}
The HMMER-Easel tool \mono{esl-reformat} is able to convert a file in unaligned
fasta format into an hmmpgmd format file, such as
\user{esl-reformat --id\_map mydb.hmmpgmd.map hmmpgmd mydb.fasta > mydb.hmmpgmd}
The optional \mono{--id\_map} flag captures sequence name and description
information into a simple tabular file, to be used for mapping those values
back to \mono{hmmpgmd} query hits.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newpage
\section{Score matrix files}
Profile HMMs can be built from single sequences, not just from
multiple sequence alignments. For example, the \mono{phmmer} and
\mono{jackhmmer} search programs take a single sequence query as an
input, and \mono{nhmmer} can as well. For single sequence queries, the
probability parameters for residue alignments are derived from a
substitution score matrix, such as BLOSUM62. Scores are converted to
probabilities as described by Altschul (1991).\cite{Altschul91}. The
scores can be arbitrary, but they must satisfy a couple of conditions
so they can be interpreted as implicit log-odds probabilities: there
must be at least one positive score, and the expected score (on
nonhomologous alignments) must be negative.
The default score matrix for protein alignments is BLOSUM62; for DNA,
a matrix we call DNA1. Using the \mono{--mx} option (for programs that
can use score matrices), you can choose instead one of several
alternative protein score matrices: PAM30, PAM70, PAM120, PAM240,
BLOSUM45, BLOSUM50, BLOSUM80, and BLOSUM90. For example, you could use
\mono{--mx BLOSUM50}.
The \mono{--mxfile} option allows you to provide a score matrix as a
file. You can download many standard score matrice files from NCBI at
\url{ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/}. HMMER can read
\emph{almost} any of these files, with one exception: because it
requires the score matrix to be symmetrical (same number of residues
in rows and columns), the NCBI \mono{NUC.4.4} matrix for DNA works,
but the alternative short-form format \mono{NUC.4.2} does not
work. The scores in these two files are identical, so just use
\mono{NUC.4.4}, and files like it.
Here's a simple example file:
\begin{sreoutput}
# My score matrix
#
A T G C
A 1 -3 -3 -3
T -3 1 -3 -3
G -3 -3 1 -3
C -3 -3 -3 1
\end{sreoutput}
In more detail, the rules for the format are:
\begin{itemize}
\item Blank lines are ignored.
\item A \mono{\#} indicates a comment. Anything after it is
ignored. Lines that start with \mono{\#} are ignored like blank
lines.
\item The first data line is a header line, labeling each column with
$n$ single residue characters (case-insensitive). A nucleic matrix
has $4 \leq n \leq 16$: at least the four canonical residues
\mono{ACGT} (or \mono{U}, for you RNA zealots), and it may also
contain any or all ambiguity codes \mono{RYMKSWHBVDN*}. A protein
matrix has $20 \leq n 27$; it must contain at least the 20 canonical
residues \mono{ACDEFGHIKLMNPQRSTVWY}, and may contain any or all
ambiguity codes \mono{BJZOU}. These residues can be in any order,
but the rows must be in the same order as the columns.
\item The next $n$ data lines are the rows of a square $n \times n$
score matrix:
\begin{itemize}
\item Fields in the row are whitespace-delimited (tabs or
spaces).
\item Optionally, each row can start with its single residue label.
Therefore each row has either $n+1$ fields if there is a leading
label, or $n$ fields if not. Rows and columns are in the same order.
\item Each score is an integer. Plus signs are optional.
\end{itemize}
\item The file may not contain any additional lines (other than
comments or blank lines).
\end{itemize}
HMMER only uses the scores for canonical residues, and uses them to
calculate probabilities. If scores for ambiguous residue codes are
provided, HMMER ignores them; it has its own logic for dealing with
the probability of ambiguous residues, given the probability of
canonical residues.
|