\chapter{Tabular output formats}
\label{chapter:tabular}
\setcounter{footnote}{0}
\section{The target hits table}
The \mono{-{}-tblout} output option produces the \emph{target hits
table}. The target hits table consists of one line for each
different query/target comparison that met the reporting thresholds,
ranked by decreasing statistical significance (increasing E-value).
\paragraph{tblout fields for protein search programs}
In the protein search programs, each line consists of \textbf{18
space-delimited fields} followed by a free text target sequence description, as
follows:\marginnote{The \mono{tblout} format is deliberately space-delimited
(rather than tab-delimited) and justified into aligned columns, so these files
are suitable both for automated parsing and for human
examination. I feel that tab-delimited data files are difficult for humans to
examine and spot check. For this reason, I think tab-delimited
files are a minor evil in the world. Although I occasionally
receive shrieks of outrage about this, I still stubbornly feel that
space-delimited files are just as easily parsed as tab-delimited
files.}
\begin{description}
\item[\monob{(1) target name:}]
The name of the target sequence or profile.
\item[\monob{(2) accession:}]
The accession of the target sequence or profile, or '-' if none.
\item[\monob{(3) query name:}]
The name of the query sequence or profile.
\item[\monob{(4) accession:}]
The accession of the query sequence or profile, or '-' if none.
\item[\monob{(5) E-value (full sequence):}] The expectation value
(statistical significance) of the target. This is a \emph{per
query} E-value; i.e.\ calculated as the expected number of false
positives achieving this comparison's score for a \emph{single}
query against the $Z$ sequences in the target dataset. If you
search with multiple queries and if you want to control the
\emph{overall} false positive rate of that search rather than the
false positive rate per query, you will want to multiply this
per-query E-value by how many queries you're doing.
\item[\monob{(6) score (full sequence):}]
The score (in bits) for this target/query comparison. It includes
the biased-composition correction (the ``null2'' model).
\item[\monob{(7) Bias (full sequence):}] The biased-composition
correction: the bit score difference contributed by the null2
model. High bias scores may be a red flag for a false positive,
especially when the bias score is as large as or larger than the
overall bit score. It is difficult to correct for all possible ways
in which nonrandom but nonhomologous biological sequences can
appear to be similar, such as short-period tandem repeats, so there
are cases where the bias correction is not strong enough (creating
false positives).
\item[\monob{(8) E-value (best 1 domain):}] The E-value if only the
single best-scoring domain envelope were found in the sequence, and
none of the others. If this E-value isn't good, but the full
sequence E-value is good, this is a potential red flag. Weak hits,
none of which are good enough on their own, are summing up to lift
the sequence up to a high score. Whether this is Good or Bad is not
clear; the sequence may contain several weak homologous domains, or
it might contain a repetitive sequence that is hitting by chance
(i.e.\ once one repeat hits, all the repeats hit).
\item[\monob{(9) score (best 1 domain):}] The bit score if only the
single best-scoring domain envelope were found in the sequence, and
none of the others. (Inclusive of the null2 bias correction.)
\item[\monob{(10) bias (best 1 domain):}] The null2 bias correction
that was applied to the bit score of the single best-scoring domain.
\item[\monob{(11) exp:}] Expected number of domains, as calculated by
posterior decoding on the mean number of begin states used in the
alignment ensemble.
\item[\monob{(12) reg:}] Number of discrete regions defined, as
calculated by heuristics applied to posterior decoding of begin/end
state positions in the alignment ensemble. The number of regions
will generally be close to the expected number of domains. The more
different the two numbers are, the less discrete the regions appear
to be, in terms of probability mass. This usually means one of two
things. On the one hand, weak homologous domains may be difficult
for the heuristics to identify clearly. On the other hand,
repetitive sequence may appear to have a high expected domain number
(from lots of crappy possible alignments in the ensemble, no one of
which is very convincing on its own, so no one region is discretely
well-defined).
\item[\monob{(13) clu:}] Number of regions that appeared to be
multidomain, and therefore were passed to stochastic traceback
clustering for further resolution down to one or more
envelopes. This number is often zero.
\item[\monob{(14) ov:}] For envelopes that were defined by stochastic
traceback clustering, how many of them overlap other envelopes.
\item[\monob{(15) env:}]
The total number of envelopes defined, both by single envelope
regions and by stochastic traceback clustering into one or more
envelopes per region.
\item[\monob{(16) dom:}] Number of domains defined. In general, this
is the same as the number of envelopes: for each envelope, we find
an MEA (maximum expected accuracy) alignment, which defines the
endpoints of the alignable domain.
\item[\monob{(17) rep:}]
Number of domains satisfying reporting thresholds. If you've also
saved a \mono{-{}-domtblout} file, there will be one line in it
for each reported domain.
\item[\monob{(18) inc:}]
Number of domains satisfying inclusion thresholds.
\item[\monob{(19) description of target:}]
The remainder of the line is the target's description line, as free text.
\end{description}
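The field list above can be turned into a small parser. The following is a minimal Python sketch, not part of HMMER: the names \mono{TargetHit}, \mono{parse\_protein\_tblout\_line} and \mono{overall\_evalue} are hypothetical, and skipping lines that begin with \mono{\#} assumes that header and comment lines are marked that way. It splits on runs of whitespace at most 18 times, so the free text description survives intact as the final element.

```python
from typing import NamedTuple, Optional

N_FIELDS = 18  # fixed fields that precede the free-text description


class TargetHit(NamedTuple):
    target_name: str
    target_accession: str
    query_name: str
    query_accession: str
    full_evalue: float
    full_score: float
    full_bias: float
    best_dom_evalue: float
    best_dom_score: float
    best_dom_bias: float
    exp: float
    reg: int
    clu: int
    ov: int
    env: int
    dom: int
    rep: int
    inc: int
    description: str


def parse_protein_tblout_line(line: str) -> Optional[TargetHit]:
    """Parse one protein tblout line; return None for blank or '#' lines."""
    line = line.strip()
    # Assumption: header/comment lines start with '#'.
    if not line or line.startswith("#"):
        return None
    # Split on whitespace runs, never on column positions.
    parts = line.split(None, N_FIELDS)
    if len(parts) < N_FIELDS:
        raise ValueError(f"expected at least {N_FIELDS} fields, got {len(parts)}")
    names = parts[0:4]                         # fields 1-4
    reals = [float(x) for x in parts[4:11]]    # fields 5-11 (exp is a real number)
    counts = [int(x) for x in parts[11:18]]    # fields 12-18
    description = parts[18] if len(parts) > N_FIELDS else ""
    return TargetHit(*names, *reals, *counts, description)


def overall_evalue(per_query_evalue: float, n_queries: int) -> float:
    """Scale a per-query E-value by the number of queries searched (field 5)."""
    return per_query_evalue * n_queries
```

For a search with several queries, \mono{overall\_evalue(hit.full\_evalue, n\_queries)} applies the per-query to overall scaling described under field 5.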
\paragraph{tblout fields for DNA search programs}
In the DNA search programs, there is less concentration on domains, and more
focus on presenting the hit ranges. Each line consists of \textbf{15
space-delimited fields} followed by a free text target sequence description, as follows:
\begin{description}
\item[\monob{(1) target name:}]
The name of the target sequence or profile.
\item[\monob{(2) accession:}]
The accession of the target sequence or profile, or '-' if none.
\item[\monob{(3) query name:}]
The name of the query sequence or profile.
\item[\monob{(4) accession:}]
The accession of the query sequence or profile, or '-' if none.
\item[\monob{(5) hmmfrom:}]
The position in the hmm at which the hit starts.
\item[\monob{(6) hmm to:}]
The position in the hmm at which the hit ends.
\item[\monob{(7) alifrom:}]
The position in the target sequence at which the hit starts.
\item[\monob{(8) ali to:}]
The position in the target sequence at which the hit ends.
\item[\monob{(9) envfrom:}]
The position in the target sequence at which the surrounding envelope starts.
\item[\monob{(10) env to:}]
The position in the target sequence at which the surrounding envelope ends.
\item[\monob{(11) sq len:}]
The length of the target sequence.
\item[\monob{(12) strand:}]
The strand on which the hit was found (``-'' when alifrom $>$ ali to).
\item[\monob{(13) E-value:}] The expectation value
(statistical significance) of the target, as above.
\item[\monob{(14) score (full sequence):}]
The score (in bits) for this hit. It includes the biased-composition
correction.
\item[\monob{(15) Bias (full sequence):}] The biased-composition
correction, as above.
\item[\monob{(16) description of target:}]
The remainder of the line is the target's description line, as free text.
\end{description}
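The DNA table can be parsed the same way, with 15 fields before the description. The Python sketch below is illustrative only: \mono{DnaHit}, \mono{parse\_dna\_tblout\_line}, \mono{ali\_interval} and \mono{strand\_is\_consistent} are hypothetical names, and treating lines that begin with \mono{\#} as comments is an assumption. It also shows how the strand rule of field 12 relates to the alignment coordinates.

```python
from typing import NamedTuple, Optional, Tuple

N_FIELDS = 15  # fixed fields that precede the free-text description


class DnaHit(NamedTuple):
    target_name: str
    target_accession: str
    query_name: str
    query_accession: str
    hmm_from: int
    hmm_to: int
    ali_from: int
    ali_to: int
    env_from: int
    env_to: int
    sq_len: int
    strand: str
    evalue: float
    score: float
    bias: float
    description: str


def parse_dna_tblout_line(line: str) -> Optional[DnaHit]:
    """Parse one DNA tblout line; return None for blank or '#' lines."""
    line = line.strip()
    # Assumption: header/comment lines start with '#'.
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, N_FIELDS)
    if len(parts) < N_FIELDS:
        raise ValueError(f"expected at least {N_FIELDS} fields, got {len(parts)}")
    coords = [int(x) for x in parts[4:11]]      # fields 5-11
    stats = [float(x) for x in parts[12:15]]    # fields 13-15
    description = parts[15] if len(parts) > N_FIELDS else ""
    return DnaHit(*parts[0:4], *coords, parts[11], *stats, description)


def ali_interval(hit: DnaHit) -> Tuple[int, int]:
    """Return the aligned target range as (start, end) with start <= end."""
    return (min(hit.ali_from, hit.ali_to), max(hit.ali_from, hit.ali_to))


def strand_is_consistent(hit: DnaHit) -> bool:
    """Check field 12 against the rule: '-' exactly when alifrom > ali to."""
    return (hit.strand == "-") == (hit.ali_from > hit.ali_to)
```

Normalizing the interval with \mono{ali\_interval} is a convenience for downstream tools that expect start $\le$ end; the strand field keeps the orientation.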
These tables are columnated neatly for human readability, but do not
write parsers that rely on this columnation; rely on space-delimited
fields. The pretty columnation assumes fixed maximum widths for each
field. If a field exceeds its allotted width, it will still be fully
represented and space-delimited, but the columnation will be disrupted
on the rest of the row.
Note the use of target and query columns. A program like
\mono{hmmsearch} searches a query profile against a target sequence
database. In an \mono{hmmsearch} tblout file, the sequence (target)
name is first, and the profile (query) name is second. A program like
\mono{hmmscan}, on the other hand, searches a query sequence against a
target profile database. In a \mono{hmmscan} tblout file, the profile
name is first, and the sequence name is second. You might say, hey,
wouldn't it be more consistent to put the profile name first and the
sequence name second (or vice versa), so \mono{hmmsearch} and
\mono{hmmscan} tblout files were identical? Well, first of all, they
still wouldn't be identical, because the target database size used for
E-value calculations is different (number of target sequences for
\mono{hmmsearch}, number of target profiles for \mono{hmmscan}), and
it's good not to forget this. Second, what about programs like
\mono{phmmer} where the query is a sequence and the targets are also
sequences?
If the ``domain number estimation'' section of the protein table (exp, reg,
clu, ov, env, dom, rep, inc) makes no sense to you, it may help to
read the previous section of the manual, which describes the HMMER
processing pipeline, including the steps that probabilistically define
domain locations in a sequence.
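The warning signs mentioned in the protein field descriptions (a bias as large as the score, a good full-sequence E-value with a poor best-domain E-value, and disagreement between exp and reg) can be checked mechanically. The Python sketch below is only an illustration: the function name \mono{tblout\_red\_flags} and both default cutoffs are hypothetical choices, not values taken from HMMER.

```python
from typing import List


def tblout_red_flags(
    full_evalue: float,
    full_score: float,
    full_bias: float,
    best_dom_evalue: float,
    exp: float,
    reg: int,
    evalue_cutoff: float = 0.01,      # hypothetical significance cutoff
    exp_reg_tolerance: float = 1.0,   # hypothetical allowed |exp - reg|
) -> List[str]:
    """Return human-readable warnings for one protein tblout row."""
    flags = []
    # Field 7: bias as large as or larger than the overall bit score.
    if full_bias >= full_score:
        flags.append("bias is as large as or larger than the score")
    # Field 8: sequence is significant overall, but no single domain is.
    if full_evalue <= evalue_cutoff < best_dom_evalue:
        flags.append("no single domain is significant on its own")
    # Fields 11-12: expected domain number and region count disagree.
    if abs(exp - reg) > exp_reg_tolerance:
        flags.append("expected domain number and region count disagree")
    return flags
```

An empty list means none of these heuristics fired; it does not certify the hit, and a flagged hit may still be a true homolog with several weak domains.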
\section{The domain hits table (protein search only)}
In protein search programs, the \mono{-{}-domtblout} option produces the
\emph{domain hits table}. There is one line for each domain. There may be more than
one domain per sequence. The domain table has \textbf{22
whitespace-delimited fields} followed by a free text target sequence
description, as follows:
\begin{description}
\item[\monob{(1) target name:}] The name of the target sequence or profile.
\item[\monob{(2) target accession:}] Accession of the target sequence
or profile, or '-' if none is available.
\item[\monob{(3) tlen:}] Length of the target sequence or profile, in residues.
This (together with the query length) is useful for interpreting
where the domain coordinates (in subsequent columns) lie in the
sequence.
\item[\monob{(4) query name:}] Name of the query sequence or profile.
\item[\monob{(5) accession:}] Accession of the query sequence or
profile, or '-' if none is available.
\item[\monob{(6) qlen:}] Length of the query sequence or profile, in residues.
\item[\monob{(7) E-value:}] E-value of the overall sequence/profile
comparison (including all domains).
\item[\monob{(8) score:}] Bit score of the overall sequence/profile
comparison (including all domains), inclusive of the null2
biased-composition correction.
\item[\monob{(9) bias:}] The biased composition score correction that
was applied to the bit score.
\item[\monob{(10) \#:}] This domain's number (1..ndom).
\item[\monob{(11) of:}] The total number of domains reported in the
sequence, ndom.
\item[\monob{(12) c-Evalue:}] The ``conditional E-value'', a
permissive measure of how reliable this particular domain may be.
The conditional E-value is calculated on a smaller search space than
the independent E-value. The conditional E-value uses the number of
targets that pass the reporting thresholds. The null hypothesis test
posed by the conditional E-value is as follows. Suppose that we
believe that there is already sufficient evidence (from other
domains) to identify the set of reported sequences as homologs of
our query; now, how many \emph{additional} domains would we expect
to find with at least this particular domain's bit score, if the
rest of those reported sequences were random nonhomologous sequence
(i.e.\ outside the other domain(s) that were sufficient to
identify them as homologs in the first place)?
\item[\monob{(13) i-Evalue:}] The ``independent E-value'', the
E-value that the sequence/profile comparison would have received if
this were the only domain envelope found in it, excluding any
others. This is a stringent measure of how reliable this particular
domain may be. The independent E-value uses the total number of
targets in the target database.
\item[\monob{(14) score:}] The bit score for this domain.
\item[\monob{(15) bias:}] The biased composition (null2) score
correction that was applied to the domain bit score.
\item[\monob{(16) from (hmm coord):}]
The start of the MEA alignment of this domain with respect to the
profile, numbered 1..N for a profile of N consensus positions.
\item[\monob{(17) to (hmm coord):}]
The end of the MEA alignment of this domain with respect to the
profile, numbered 1..N for a profile of N consensus positions.
\item[\monob{(18) from (ali coord):}]
The start of the MEA alignment of this domain with respect to the
sequence, numbered 1..L for a sequence of L residues.
\item[\monob{(19) to (ali coord):}]
The end of the MEA alignment of this domain with respect to the
sequence, numbered 1..L for a sequence of L residues.
\item[\monob{(20) from (env coord):}] The start of the domain
envelope on the sequence, numbered 1..L for a sequence of L
residues. The \emph{envelope} defines a subsequence for which there
is substantial probability mass supporting a homologous domain,
whether or not a single discrete alignment can be identified.
The envelope may extend beyond the endpoints of the MEA alignment,
and in fact often does, for weakly scoring domains.
\item[\monob{(21) to (env coord):}] The end of the domain
envelope on the sequence, numbered 1..L for a sequence of L
residues.
\item[\monob{(22) acc:}] The mean posterior probability of aligned
residues in the MEA alignment; a measure of how reliable the overall
alignment is (from 0 to 1, with 1.00 indicating a completely
reliable alignment according to the model).
\item[\monob{(23) description of target:}] The remainder of the line
is the target's description line, as free text.
\end{description}
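One way to relate fields 12 and 13 is the following sketch. It rests on an assumption not stated in the field descriptions: that both E-values are obtained by scaling the same per-domain P-value $p$ by a search space size. The symbol $Z_{\mathrm{rep}}$ (the number of targets that pass the reporting thresholds) is introduced here for illustration; $Z$ is the total number of targets in the target database.
\[
  E_{\mathrm{ind}} = Z \, p, \qquad
  E_{\mathrm{cond}} = Z_{\mathrm{rep}} \, p, \qquad
  \mbox{so} \qquad
  E_{\mathrm{cond}} = E_{\mathrm{ind}} \, \frac{Z_{\mathrm{rep}}}{Z} \le E_{\mathrm{ind}} .
\]
Under that assumption, $Z_{\mathrm{rep}} \le Z$ is what makes the conditional E-value the permissive measure and the independent E-value the stringent one.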
As with the target hits table (above), this table is columnated neatly
for human readability, but you should not write parsers that rely on
this columnation; parse based on space-delimited fields instead.
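A parser for the domain hits table might look like the following Python sketch, which splits on whitespace at most 22 times so that the description remains one field. The names \mono{DomainHit}, \mono{parse\_domtblout\_line} and \mono{profile\_coverage} are hypothetical, and skipping lines that begin with \mono{\#} is an assumption about how header lines are marked. Because the profile is the query in \mono{hmmsearch} but the target in \mono{hmmscan}, the caller has to say which length applies.

```python
from typing import NamedTuple, Optional

N_FIELDS = 22  # fixed fields that precede the free-text description

# One converter per field, in column order (fields 1-22).
_CASTS = [str, str, int, str, str, int,
          float, float, float, int, int,
          float, float, float, float,
          int, int, int, int, int, int, float]


class DomainHit(NamedTuple):
    target_name: str
    target_accession: str
    tlen: int
    query_name: str
    query_accession: str
    qlen: int
    full_evalue: float
    full_score: float
    full_bias: float
    dom_index: int
    dom_total: int
    c_evalue: float
    i_evalue: float
    dom_score: float
    dom_bias: float
    hmm_from: int
    hmm_to: int
    ali_from: int
    ali_to: int
    env_from: int
    env_to: int
    acc: float
    description: str


def parse_domtblout_line(line: str) -> Optional[DomainHit]:
    """Parse one domtblout line; return None for blank or '#' lines."""
    line = line.strip()
    # Assumption: header/comment lines start with '#'.
    if not line or line.startswith("#"):
        return None
    parts = line.split(None, N_FIELDS)
    if len(parts) < N_FIELDS:
        raise ValueError(f"expected at least {N_FIELDS} fields, got {len(parts)}")
    # zip stops after the 22 converters, leaving the description untouched.
    values = [cast(text) for cast, text in zip(_CASTS, parts)]
    description = parts[22] if len(parts) > N_FIELDS else ""
    return DomainHit(*values, description)


def profile_coverage(hit: DomainHit, profile_is_query: bool = True) -> float:
    """Fraction of the profile's consensus positions spanned by the alignment."""
    profile_len = hit.qlen if profile_is_query else hit.tlen
    return (hit.hmm_to - hit.hmm_from + 1) / profile_len
```

Pass \mono{profile\_is\_query=False} when the target is the profile, as in an \mono{hmmscan} table.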