\documentclass{article}
\usepackage{amsmath}
\author{Martin C Frith}
\title{GLAM2 Methods}
\newcommand{\ud}{\mathrm d}
\DeclareMathOperator*{\argmax}{arg\,max}
\begin{document}
\maketitle
\section*{Overview}
The GLAM2 software aims to find the strongest motif that is present in
a set of protein or nucleotide sequences. In fact, it can analyze
sequences over any user-specified alphabet. The method is described
here in three stages: firstly, what exactly is meant by a ``motif'',
secondly, what is meant by ``strongest'', and finally, the algorithm
used to search the space of motifs for the strongest one. Most aspects
of the method have close analogs in the well known Gibbs Sampling
technique for motif discovery \cite{Science.:262:208,
ProteinSci.:4:1618, NucleicAcidsRes.:32:189}: GLAM2 is essentially a
generalization of Gibbs Sampling to allow insertions and deletions in
a fully general fashion.
Having found a motif, we often wish to scan it against a database of
sequences, to find other instances of the motif. GLAM2 includes such a
scanning method, which is also described here.
\section*{Motif Definition}
A motif in GLAM2 has a certain number, $W$, of ``key positions''. The
idea is that the key positions hold residues that are important for
the motif's function. An instance of the motif is a string of residues
(amino acids or nucleotides), where each residue either occupies one
of the key positions, or is inserted between key positions. More than
one residue may be inserted between key positions, and a key position
may lack a corresponding residue, meaning that it is deleted in this
motif instance.
What GLAM2 searches for is an alignment of substrings of the input
sequences to a series of key positions. The number of key positions is
optimized by the algorithm. Currently, each input sequence may
contribute at most one substring to the alignment, although this
restriction could be relaxed in principle. GLAM2 requires the
alignment to contain at least some minimum number of substrings, by
default 2. This lower limit is a useful generalization of the OOPS
(one occurrence per sequence) and ZOOPS (zero or one occurrence per
sequence) modes of previous motif discovery algorithms
\cite{Bailey7584439}. For nucleotide sequences, there is an option to
consider both strands of the input sequences (direct and
reverse-complement).
\section*{Scoring Scheme}
\subsection*{Overview}
In order to define the ``strongest'' motif (i.e. alignment of
substrings to key positions), GLAM2 uses a formula to assign a numeric
score to any given alignment. The task, then, is to find the alignment
with maximum score. While the precise formula is given below, some
desirable characteristics for any sensible formula are noted
here. Firstly, the formula should favour alignments where key
positions are occupied by identical or chemically similar
residues. Secondly, deletions and insertions should be penalized in
general. Thirdly, deletions and insertions should be penalized less
strongly when they are concentrated in a few positions in the
alignment, suggesting that those positions in the motif are more prone
to deletion or insertion, than when they are scattered across many
positions. The formula used by GLAM2 and described below has each of
these characteristics.
\subsection*{Motif Model}
The formula is motivated by a simple statistical model of a motif,
with position-specific residue probabilities, position-specific
deletion probabilities, and position-specific insertion
probabilities. Thus, each key position has characteristic
probabilities, $\theta_i$, of containing the $i$\textsuperscript{th}
residue type. ($1 \leq i \leq A$, where $A$ is the size of the
alphabet, 20 for proteins and 4 for nucleotides.) Also, each key
position has probability $\phi$ of being deleted. Finally, the
probability that $x$ residues are inserted between two key positions
is $\psi^x(1-\psi)$. The use of a geometric distribution for insertion
length is arbitrary, but it is the simplest choice, and since $\psi$
may vary with position, we doubt that more complex schemes would lead
to much benefit. This model can be regarded as a hidden Markov model,
where $\theta_i$ are emission probabilities and $\phi$ and $\psi$ are
transition probabilities.
Hence, in a set of $s$ independent motif instances, the probability
that a key position is deleted in a particular subset of $d$ instances
and present in the other $m = s-d$ instances is:
\begin{equation}
\phi^d (1-\phi)^m
\end{equation}
The probability that $r$ residues in total (in all instances) are
inserted between two key positions is:
\begin{equation}
\psi^r (1-\psi)^s
\end{equation}
The probability of observing particular residues in one key position,
where $c_i$ is the count of the $i$\textsuperscript{th} residue type
(so $m = \sum_{i=1}^A c_i$), is:
\begin{equation}
\prod_{i=1}^A \theta_i^{c_i}
\end{equation}
\subsection*{Motif Priors}
The values of $\theta_i$, $\phi$, and $\psi$ are, of course, unknown,
but prior probability distributions are assigned to these
parameters. Then, total probabilities can be obtained by integrating
over the parameter values. The total probability of observing $d$
deletions is:
\begin{equation}
\int_0^1 \phi^d (1-\phi)^m \cdot prior(\phi) \ud \phi
\end{equation}
The total probability of observing $r$ insertions is:
\begin{equation}
\int_0^1 \psi^r (1-\psi)^s \cdot prior(\psi) \ud \psi
\end{equation}
The total probability of observing $\vec c$ residue counts is:
\begin{equation}
\int \prod_{i=1}^A \theta_i^{c_i} \cdot prior(\vec \theta) \ud \vec \theta
\end{equation}
The prior distributions of $\phi$ and $\psi$ are assumed to be Beta
distributions (the simplest kind of Dirichlet distribution):
\begin{align}
prior(\phi) & = \phi^{D-1} (1-\phi)^{E-1} / Z
\\
prior(\psi) & = \psi^{I-1} (1-\psi)^{J-1} / Z'
\end{align}
The values of $D$, $E$, $I$, and $J$ are chosen so as to give GLAM2
the desired level of aversion to deletions and insertions. For reasons
that will become clearer in the description of the search algorithm,
these parameters are referred to as pseudocounts: $D$ is the deletion
pseudocount, $E$ is the no-deletion (or match) pseudocount, $I$ is the
insertion pseudocount, and $J$ is the no-insertion pseudocount. $Z$
and $Z'$ are normalization constants. The use of Beta distributions
here is ultimately arbitrary, but they are a natural choice since they
are conjugate priors of the binomial distributions for deletion and
insertion probabilities, and they allow the integrals to be solved
analytically. Also, the shapes of these distributions allow
considerable flexibility by varying the pseudocount values.
The prior distribution of $\vec \theta$ should incorporate prior
knowledge of the functional similarities between residues, at least in
the case of proteins. To accomplish this, a Dirichlet mixture
distribution is used (a weighted sum of Dirichlet distributions):
\begin{align}
prior(\vec \theta) & = \sum_{j=1}^C \frac{w_j}{Z_j} \prod_{i=1}^A \theta_i^{\alpha_{ji}-1}
\end{align}
$C$ is the number of components in the mixture, $w_j$ is the weight of
the $j$\textsuperscript{th} component, $Z_j$ is the normalization
constant of the $j$\textsuperscript{th} component, and $\alpha_{ji}$
is the $j$\textsuperscript{th} component's pseudocount for the
$i$\textsuperscript{th} residue type. So, this distribution has a
potentially huge number of parameters, which should be exquisitely
chosen so as to favour biologically realistic values of $\vec
\theta$. Fortunately, Dirichlet mixture priors for proteins have been
extensively investigated by a team at UC Santa Cruz, and we simply use
their parameters \cite{ComputApplBiosci.:12:327}. For the case of
nucleotides, we use a one-component mixture with all pseudocounts =
0.4 \cite{NucleicAcidsRes.:32:189}. These are the defaults:
alternative Dirichlet mixture parameters may be supplied as input to
GLAM2.
As far as we can tell, previous implementations of Gibbs Sampling have
only used simple Dirichlet priors rather than Dirichlet mixtures,
hence have lacked prior information about functional similarities
between amino acids. This ought to give GLAM2 a significant advantage
in finding protein motifs, even leaving aside the treatment of
insertions and deletions.
Using these prior distributions, the total probability integrals can
be solved analytically. The solutions contain Gamma functions, denoted
by $\Gamma(\cdot)$. The total probability of observing $d$ deletions
becomes:
\begin{equation}
\label{eq:totdel}
\frac{\Gamma(D+E) \Gamma(d+D) \Gamma(m+E)}{\Gamma(s+D+E) \Gamma(D) \Gamma(E)}
\end{equation}
The total probability of observing $r$ insertions becomes:
\begin{equation}
\label{eq:totins}
\frac{\Gamma(I+J) \Gamma(r+I) \Gamma(s+J)}{\Gamma(r+s+I+J) \Gamma(I) \Gamma(J)}
\end{equation}
The total probability of observing $\vec c$ residue counts becomes:
\begin{equation}
\label{eq:totmat}
\sum_{j=1}^C
\frac{w_j \cdot \Gamma(A_j)}{\Gamma(m+A_j)}
\prod_{i=1}^A \frac{\Gamma(c_i + \alpha_{ji})}{\Gamma(\alpha_{ji})}
\end{equation}
where $A_j = \sum_{i=1}^A \alpha_{ji}$.
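These three solved integrals are the building blocks of the alignment score. For illustration only (this is not GLAM2's own code), the following Python sketch evaluates them in log space with \texttt{lgamma}; all function and variable names are ours.
\begin{verbatim}
from math import exp, lgamma, log

def log_total_del(d, m, D, E):
    # Log of the total probability of d deletions and m matches,
    # under a Beta(D, E) prior (the first solved integral above).
    s = d + m
    return (lgamma(D + E) + lgamma(d + D) + lgamma(m + E)
            - lgamma(s + D + E) - lgamma(D) - lgamma(E))

def log_total_ins(r, s, I, J):
    # Log of the total probability of r inserted residues in s motif
    # instances, under a Beta(I, J) prior (the second integral).
    return (lgamma(I + J) + lgamma(r + I) + lgamma(s + J)
            - lgamma(r + s + I + J) - lgamma(I) - lgamma(J))

def log_total_match(counts, weights, alphas):
    # Log of the total probability of the residue counts under a
    # Dirichlet mixture prior (the third integral):
    # counts[i] = c_i, weights[j] = w_j, alphas[j][i] = alpha_ji.
    m = sum(counts)
    total = 0.0
    for w, alpha in zip(weights, alphas):
        A_j = sum(alpha)
        lp = lgamma(A_j) - lgamma(m + A_j)
        for c, a in zip(counts, alpha):
            lp += lgamma(c + a) - lgamma(a)
        total += w * exp(lp)
    return log(total)
\end{verbatim}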
\subsection*{Background Model}
To score alignments, the motif model is compared to a background model
of independent sequences. The background model is that the residues
occur randomly and independently, with probabilities $p_i$. For the
nucleotide alphabet, the default is $p_i = 1/4$. For the protein
alphabet, the amino acid abundances of Robinson and Robinson, which
are frequently cited in publications on NCBI BLAST, are used
\cite{ProcNatlAcadSciUSA.:88:8880, NucleicAcidsRes.:25:3389}. These
probabilities can be adjusted by the user. Thus, the probability of
observing $\vec c$ residue counts in one key position given the
background model is:
\begin{equation}
\prod_{i=1}^A p_i^{c_i}
\end{equation}
In the motif model, residues inserted between key positions are
assumed to occur with these same background frequencies.
\subsection*{The Formula}
The score $S$ of an alignment $X$ is: the likelihood of the alignment
given the motif model, divided by the likelihood of the substrings in
the alignment given the background model.
\begin{equation}\begin{split}
S(X) = & \prod_{k=1}^W
\sum_{j=1}^C
\frac{w_j \cdot \Gamma(A_j)}{\Gamma(m_k+A_j)}
\prod_{i=1}^A
\frac{\Gamma(c_{ki} + \alpha_{ji})}{\Gamma(\alpha_{ji}) \cdot p_i^{c_{ki}}}
\times \\
& \prod_{k=1}^W
\frac{\Gamma(D+E) \Gamma(d_k+D) \Gamma(m_k+E)}{\Gamma(s+D+E) \Gamma(D) \Gamma(E)}
\times \\
& \prod_{k=1}^{W-1}
\frac{\Gamma(I+J) \Gamma(r_k+I) \Gamma(s+J)}{\Gamma(r_k+s+I+J) \Gamma(I) \Gamma(J)}
\end{split}\end{equation}
where $d_k$ is the number of deletions of the $k$\textsuperscript{th}
key position, $m_k$ is the number of residues in the
$k$\textsuperscript{th} key position, $c_{ki}$ is the count of the
$i$\textsuperscript{th} residue type in the $k$\textsuperscript{th}
key position, and $r_k$ is the number of insertions between key
positions $k$ and $k+1$. Since scores calculated by this formula can
easily be of unwieldy orders of magnitude, GLAM2 actually reports
$\log_2 S(X)$.
\subsection*{Fitting the Insertion and Deletion Priors}
It is desirable to choose the insertion and deletion pseudocounts,
$D$, $E$, $I$, and $J$, so as to maximise the model's fit to actual
alignments. Suppose we are given a training set of alignments, with a
total of $\mathbf{W}_D$ key positions, and $\mathbf{W}_I$ potential
insertion sites between key positions. We seek the values of $D$, $E$,
$I$, and $J$ that maximise the total probabilities:
\begin{align}
\argmax_{D,E} \quad & \prod_{k=1}^{\mathbf{W}_D}
\frac{\Gamma(D+E) \Gamma(d_k+D) \Gamma(m_k+E)}
{\Gamma(s_k+D+E) \Gamma(D) \Gamma(E)}
\\
\argmax_{I,J} \quad & \prod_{k=1}^{\mathbf{W}_I}
\frac{\Gamma(I+J) \Gamma(r_k+I) \Gamma(s_k+J)}
{\Gamma(r_k+s_k+I+J) \Gamma(I) \Gamma(J)}
\end{align}
This can be accomplished, crudely but effectively, by plotting the
function over a two-dimensional grid of values for $D$ and $E$ (or $I$
and $J$). In every case that we examined, the plot indicates a simple
hill-shaped function with a unique maximum.
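As a concrete illustration of this grid-based fit, here is a minimal Python sketch (not GLAM2's own code); the training data are summarised as per-column tuples $(d_k, m_k, s_k)$, and the grid values are arbitrary examples rather than the ones actually used.
\begin{verbatim}
from math import lgamma

def log_deletion_fit(D, E, columns):
    # columns: list of (d_k, m_k, s_k) tuples from training alignments.
    total = 0.0
    for d, m, s in columns:
        total += (lgamma(D + E) + lgamma(d + D) + lgamma(m + E)
                  - lgamma(s + D + E) - lgamma(D) - lgamma(E))
    return total

def fit_deletion_pseudocounts(columns, grid):
    # Crude 2-D grid search for the (D, E) that maximise the fit;
    # the same scheme works for (I, J) with the insertion formula.
    return max(((log_deletion_fit(D, E, columns), D, E)
                for D in grid for E in grid))[1:]

grid = [0.01 * 1.5 ** k for k in range(25)]   # example log-spaced grid
\end{verbatim}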
\section*{Search Algorithm}
\subsection*{Overview}
The preceding sections have defined what is meant by a motif: an
alignment of substrings of the input sequences to a series of key
positions, and have suggested a formula to assign a score to any such
alignment. The aim is to find the alignment(s) with maximum
score. Unfortunately, the number of possible alignments will be
astronomically huge in most cases, and we do not know a practical
algorithm to guarantee finding the highest-scoring one. Therefore it
is necessary to use a heuristic algorithm.
GLAM2 uses simulated annealing, a very general technique for finding
the highest-scoring solution in a large ``search space'' of possible
solutions. Simulated annealing begins at some (presumably non-optimal)
location in the search space, and repeatedly performs stochastic moves
to ``nearby'' locations. Moves that increase the score are favoured,
but moves that decrease the score are also permitted, which allows the
algorithm to escape from local maxima. A temperature parameter, $T$,
controls the preference for higher scores: at low temperatures they
are strongly preferred, but at high temperatures they are only weakly
preferred. More precisely, the stochastic moves in simulated annealing
should satisfy the so-called detailed balance condition:
\begin{equation}
S(X)^{1/T} P(X \rightarrow Y) = S(Y)^{1/T} P(Y \rightarrow X)
\end{equation}
where $X$ and $Y$ indicate locations in the search space, $S(X)$ is
the score of location $X$, and $P(X \rightarrow Y)$ is the
probability, when in location $X$, of moving to location $Y$. Note
that detailed balance implies reversibility: if it is possible to move
from $X$ to $Y$, then it must be possible to move from $Y$ to $X$.
The temperature starts out high and is gradually reduced, giving the
process a chance to evade local maxima and settle into the global
maximum.
The stochastic moves can be generated in many different ways that
satisfy detailed balance: generating them in a way that explores the
search space effectively is key to a successful algorithm. For
example, some previous multiple alignment algorithms have employed
simulated annealing with simple moves that adjust the position of a
gap in one of the aligned sequences \cite{ComputApplBiosci.:9:267,
ComputApplBiosci.:10:419}. Thus, each move explores just one location
in the vast search space, and many such moves may separate a local
optimum from the global optimum. In contrast, GLAM2 uses an extremely
clever technique, the \emph{stochastic traceback} \cite{Eddy7584426,
Durbin00}, which allows a single move to explore efficiently an
astronomically large number of alignments (albeit an astronomically
small fraction of the total search space).
Moves using the stochastic traceback alone are prone to getting stuck
in a certain type of local optimum; so GLAM2 intersperses these with a
second type of move. The stochastic traceback moves are referred to as
site sampling, and the second type as column sampling, because they
are analogous to procedures with the same names in the original Gibbs
sampler \cite{ProteinSci.:4:1618}. At each iteration, one type of move
is chosen randomly with 50:50 probabilities. A modification of column
sampling allows the number of key positions in the motif to be
varied. Together, these moves can explore the space of all possible
motifs.
To assess the robustness of the result, GLAM2 performs several (by
default 10) annealing runs from different random starting points. Each
run continues until some number of iterations $n$ has passed without
finding a higher-scoring alignment. During each run, $T$ is reduced by
a fixed percentage at each iteration, by default starting at 1.2 and
decreasing 1.44-fold per $n$ iterations. The highest-scoring alignment
encountered in each run is reported at the end.
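For concreteness, the cooling schedule implied by these defaults can be sketched as follows; this is illustrative Python, not GLAM2's own code, and it assumes the ``1.44-fold per $n$ iterations'' reduction is applied as a constant per-iteration factor.
\begin{verbatim}
def temperatures(T0=1.2, fold=1.44, n=1000, T_min=0.1):
    # T shrinks by a constant factor per iteration, falling fold-fold
    # every n iterations; T_min is the lower bound described under
    # Numerical Considerations below.
    factor = fold ** (-1.0 / n)
    T = T0
    while True:
        yield max(T, T_min)
        T *= factor
\end{verbatim}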
\subsection*{Site Sampling}
\subsubsection*{Overview}
In site sampling, one of the input sequences is chosen at random and
re-aligned to the motif. All possible alignments of substrings of this
sequence to the motif are considered, including the option of
excluding this sequence from the alignment altogether. However, if
excluding the sequence would bring the number of substrings in the
alignment below the minimum permitted number (see above), then
exclusion is not considered. Both strands of the sequence are
considered, if the user selected this option. One alignment is chosen
at random, with probability proportional to the resulting alignment
score, as defined above, raised to the power of $1/T$.
\subsubsection*{Scoring}
The calculations can be simplified somewhat by dividing by a constant
factor: the score of the alignment of all the other sequences,
excluding the sequence that is being re-aligned. (This also scales the
values to moderate orders of magnitude that are likely to be
representable by the computer.) In the following, primed letters refer
to this alignment of all the other sequences: $s'$ is the number of
substrings in the alignment, $m'_k$ is the number of residues aligned
to key position $k$, $d'_k$ is the number of deletions of key position
$k$, $r'_k$ is the number of insertions between $k$ and $k+1$, and
$c'_{ki}$ is the count of residue type $i$ in position $k$.
Using the fact that $\Gamma(x) = (x-1) \Gamma(x-1)$, the score for
deleting key position $k$ (cf. Equation \ref{eq:totdel}) becomes:
\begin{equation}
\delta(k) = \frac{d'_k+D}{s'+D+E}
\end{equation}
The score for inserting $x$ residues between $k$ and $k+1$
(cf. Equation \ref{eq:totins}) becomes:
\begin{equation}
\iota(k,x) = \frac{s'+J}{r'_k+s'+I+J} \cdot
\prod_{i=0}^{x-1} \frac{r'_k+I+i}{r'_k+s'+I+J+i+1}
\end{equation}
The score for aligning a residue of type $X$ with key position $k$
(cf. Equation \ref{eq:totmat}) becomes:
\begin{equation}
\mu(k,X) = \frac{m'_k+E}{s'+D+E} \cdot
\sum_{j=1}^C v_{kj} \frac{c'_{kX}+\alpha_{jX}}{m'_k+A_j} \bigg/ p_X
\end{equation}
where:
\begin{equation}
v_{kj} = \frac{
\frac{w_j \cdot \Gamma(A_j)}{\Gamma(m'_k+A_j)}
\prod_{i=1}^A \frac{\Gamma(c'_{ki}+\alpha_{ji})}{\Gamma(\alpha_{ji})}
}{
\sum_{j=1}^C \frac{w_j \cdot \Gamma(A_j)}{\Gamma(m'_k+A_j)}
\prod_{i=1}^A \frac{\Gamma(c'_{ki}+\alpha_{ji})}{\Gamma(\alpha_{ji})}
}
\end{equation}
(Hence $D$, $E$, $I$, and $J$ are called pseudocounts, because they
get added to the observed counts: $d'_k$, $m'_k$, $r'_k$, and $s'$.)
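The following Python sketch (illustrative only) computes these per-move scores from the counts of the other sequences; the primed quantities appear as \texttt{dp}, \texttt{mp}, \texttt{rp}, \texttt{sp} and \texttt{cp}, and the mixture quantities $v_{kj}$, $\alpha_{jX}$ and $A_j$ are assumed to be precomputed.
\begin{verbatim}
def delta_score(dp_k, sp, D, E):
    # Score for deleting key position k (dp_k = d'_k, sp = s').
    return (dp_k + D) / (sp + D + E)

def iota_score(rp_k, sp, x, I, J):
    # Score for inserting x residues between key positions k and k+1.
    score = (sp + J) / (rp_k + sp + I + J)
    for i in range(x):
        score *= (rp_k + I + i) / (rp_k + sp + I + J + i + 1)
    return score

def mu_score(cp_kX, mp_k, sp, p_X, v_k, alpha_X, A, D, E):
    # Score for aligning a residue of type X to key position k:
    # v_k[j] = v_kj, alpha_X[j] = alpha_jX, A[j] = A_j.
    post = sum(v * (cp_kX + a) / (mp_k + Aj)
               for v, a, Aj in zip(v_k, alpha_X, A))
    return (mp_k + E) / (sp + D + E) * post / p_X
\end{verbatim}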
\subsubsection*{Dynamic Programming}
To set up the stochastic traceback, GLAM2 calculates values $M(i,j)$:
the sum of the scores of all alignments ending at the
$i$\textsuperscript{th} key position in the motif and the
$j$\textsuperscript{th} residue in the sequence. The calculation is
made slightly simpler by constructing intermediate values
$N(i,j)$. The quantities $\mu$, $\delta$, and $\iota$ are raised to
the power of $1/T$ prior to using them in this procedure. In the
following, $L$ is the length of the sequence. Boundary cases:
\begin{align}
N(0,j) & = 1 && (0 \leq j \leq L)
\\
M(i,0) & = \delta(i) \cdot N(i-1,0) && (1 \leq i \leq W)
\end{align}
Main cases:
\begin{align}
M(i,j) & = \mu(i,X_j) \cdot N(i-1,j-1) + \delta(i) \cdot N(i-1,j)
&&
\left( \begin{array}{l}
1 \leq j \leq L \\ 1 \leq i \leq W
\end{array} \right)
\\
\label{eq:Nij}
N(i,j) & = \sum_{x=0}^j \iota(i,x) \cdot M(i,j-x)
&&
\left( \begin{array}{l}
0 \leq j \leq L \\ 1 \leq i < W
\end{array} \right)
\end{align}
This procedure is similar to the standard algorithms for pair-wise
sequence alignment, but much slower: the summation in Equation
(\ref{eq:Nij}) causes the whole calculation to require $O(WL^2)$
operations rather than $O(WL)$ operations. Ways to speed up the
calculation are described below.
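A direct (unaccelerated) implementation of these recurrences might look like the sketch below; \texttt{mu}, \texttt{delta} and \texttt{iota} are assumed to be the scores defined above, already raised to the power $1/T$, and the code is illustrative rather than GLAM2's own.
\begin{verbatim}
def site_sampling_matrices(seq, W, mu, delta, iota):
    # seq is the sequence being re-aligned.  Returns the (W+1) x (L+1)
    # matrices M and N, filled in O(W L^2) operations.
    L = len(seq)
    M = [[0.0] * (L + 1) for _ in range(W + 1)]
    N = [[0.0] * (L + 1) for _ in range(W + 1)]
    N[0] = [1.0] * (L + 1)                      # boundary: N(0,j) = 1
    for i in range(1, W + 1):
        M[i][0] = delta(i) * N[i - 1][0]        # boundary: M(i,0)
        for j in range(1, L + 1):
            M[i][j] = (mu(i, seq[j - 1]) * N[i - 1][j - 1]
                       + delta(i) * N[i - 1][j])
        if i < W:                               # N(i,j) needed for i < W only
            for j in range(L + 1):
                N[i][j] = sum(iota(i, x) * M[i][j - x]
                              for x in range(j + 1))
    return M, N
\end{verbatim}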
\subsubsection*{Stochastic Traceback}
The values $M(i,j)$ are used to pick a random alignment, with
probabilities proportional to their scores to the power of $1/T$. Note
that $\sum_{j=0}^L M(W,j)$ is the sum of the scores of all alignments
of the sequence to the motif. If both strands are being considered, a
second matrix $M_R(i,j)$ is constructed using the reverse complement
of the sequence. GLAM2 randomly decides to align the sequence on the
forward strand, the reverse strand, or not at all, with probabilities
in the ratio $\sum_{j=0}^L M(W,j)$ : $\sum_{j=0}^L M_R(W,j)$ :
1. Supposing the forward strand is chosen, the endpoint of the
alignment in the sequence, $0 \leq j \leq L$, is then picked randomly,
with probabilities proportional to $M(W,j)$.
Having chosen the endpoint $j$, the alignment is determined by
iterating the following steps, with $i$ initially equal to
$W$. Firstly, GLAM2 randomly chooses to either align residue $j$ to
key position $i$, or delete key position $i$, with probabilities
defined in Equations (\ref{eq:tbm}) and (\ref{eq:tbd}). Then $i$ is
decremented, and unless deletion was chosen, $j$ is
decremented. Secondly, GLAM2 randomly picks a number of residues, $0
\leq x \leq j$, to insert between key positions $i$ and $i+1$, with
probabilities given in Equation (\ref{eq:tbi}). $x$ is then subtracted
from $j$. These steps are repeated until $i$ or $j$ reaches zero.
\begin{align}
\label{eq:tbm}
\textrm{prob}(\textrm{match})
& = \mu(i,X_j) \cdot N(i-1,j-1) \; / \; M(i,j)
\\
\label{eq:tbd}
\textrm{prob}(\textrm{delete})
& = \delta(i) \cdot N(i-1,j) \; / \; M(i,j)
\\
\label{eq:tbi}
\textrm{prob}(\textrm{insert }x\textrm{ residues})
& = \iota(i,x) \cdot M(i,j-x) \; / \; N(i,j)
\end{align}
The traceback can be performed in $O(W+L)$ operations.
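The traceback itself can be sketched as below (illustrative Python, not GLAM2's own code); \texttt{pick} draws an index with probability proportional to its weight, the returned path records matches, deletions and insertion lengths, and the forward strand and matrices are assumed to have been chosen and filled already.
\begin{verbatim}
import random

def pick(weights):
    # Draw an index with probability proportional to its weight.
    u = random.random() * sum(weights)
    for k, w in enumerate(weights):
        u -= w
        if u <= 0:
            return k
    return len(weights) - 1

def stochastic_traceback(seq, W, M, N, mu, delta, iota):
    L = len(seq)
    j = pick([M[W][j] for j in range(L + 1)])    # alignment endpoint
    i, path = W, []
    while i > 0 and j > 0:
        match = mu(i, seq[j - 1]) * N[i - 1][j - 1]
        delete = delta(i) * N[i - 1][j]
        if pick([match, delete]) == 0:
            path.append((i, 'match', j))
            i, j = i - 1, j - 1
        else:
            path.append((i, 'delete'))
            i -= 1
        if i > 0 and j > 0:
            x = pick([iota(i, y) * M[i][j - y] for y in range(j + 1)])
            path.append((i, 'insert', x))
            j -= x
    return list(reversed(path))
\end{verbatim}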
\subsubsection*{Faster Algorithms}
GLAM2 provides two options to speed up the dynamic programming
procedure. The first option, noting that Equation (\ref{eq:Nij}) is a
convolution, is to calculate it in $O(L \log L)$ operations using fast
Fourier transforms. Thus, the whole dynamic programming procedure
requires $O(W L \log L)$ operations. A potential disadvantage is loss
of accuracy for smaller values of $N(i,j)$. The values of $M(i,j)$ and
$N(i,j)$ (for fixed $i$) may differ by many orders of magnitude; since
the Fourier transform mixes these values, the smaller ones become
corrupted owing to limited precision of the computer's representation
of real numbers. Unfortunately, in some cases these smaller values of
$N(i,j)$ have large effects at later stages of the dynamic
programming.
The second option is to replace the residue insertion score,
$\iota(k,x)$, with a more convenient approximation:
\begin{equation}
\iota(k,x) = \iota_1(k) \cdot \iota_2(k)^x
\end{equation}
where:
\begin{align}
\iota_1(k) & = \frac{s'+J}{r'_k+s'+I+J} \\
\iota_2(k) & = \frac{r'_k+I}{r'_k+s'+I+J}
\end{align}
Here, $\iota_2$ and $\iota_1$ are the posterior mean estimators for
$\psi$ and $1-\psi$. With this change, $N(i,j)$ can be built up as
follows:
\begin{align}
N(i,0) & = \iota_1(i) \cdot M(i,0) \\
N(i,j) & = \iota_1(i) \cdot M(i,j) + \iota_2(i) \cdot N(i,j-1)
\end{align}
Thus, the whole dynamic programming procedure requires $O(WL)$
operations. This is the default algorithm for GLAM2. It is remarkable
that this algorithm samples from all gapped alignments using the same
number of operations, $O(WL)$, as the site sampling step of the
original Gibbs sampler \cite{Science.:262:208}, which only considers
ungapped alignments. The change to $\iota(k,x)$ introduces a slight
deviation from detailed balance, penalizing long insertions, but seems
to produce sensible alignments in practice.
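With this approximation, each row of $N$ can be filled in $O(L)$ operations, for example as in this sketch (illustrative Python, names ours):
\begin{verbatim}
def fill_N_row(M_row, iota1_i, iota2_i):
    # N(i,0) = iota1(i) * M(i,0);
    # N(i,j) = iota1(i) * M(i,j) + iota2(i) * N(i,j-1).
    N_row = [iota1_i * M_row[0]]
    for j in range(1, len(M_row)):
        N_row.append(iota1_i * M_row[j] + iota2_i * N_row[j - 1])
    return N_row
\end{verbatim}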
\subsubsection*{Numerical Considerations}
The values of $M(i,j)$ and $N(i,j)$ can become too large or too small
to be represented by the computer using floating point. A rescaling
procedure is employed to mitigate this problem. After calculating
$N(i-1,j)$ for all $j$, and before calculating $M(i,j)$, GLAM2
rescales $\mu(i,X)$ and $\delta(i)$ as follows:
\begin{align}
\mu(i,X) & \leftarrow \frac{\mu(i,X)}{\max_{j=0}^L N(i-1,j)} \\
\delta(i) & \leftarrow \frac{\delta(i)}{\max_{j=0}^L N(i-1,j)}
\end{align}
The product of the rescale values, $R$, is recorded (in $\log$ space to
avoid overflow):
\begin{equation}
R = \prod_{i=1}^W \max_{j=0}^L N(i-1,j)
\end{equation}
In the stochastic traceback, the sum of alignment scores $\sum_{j=0}^L
M(W,j)$ needs to be multiplied by $R$, and similarly for $M_R$, before
picking the strand. Note that this method is simpler than rescaling
procedures that have been described elsewhere \cite{Durbin00}: there
is no need to record an array of rescaling parameters, and the
traceback, after picking the strand, works properly without change.
This rescaling keeps the maximum values of $M(i,j)$ and $N(i,j)$ to
manageable orders of magnitude, but the minimum values may still
underflow. These small values ``usually'' correspond to extremely
improbable alignments, but in slightly pathological cases they can be
the seeds of highly probable alignments. GLAM2 attempts to warn when
this occurs, by checking whether the traceback passes through values
of $N(i,j)$ below the limit of accurately represented numbers
(DBL\_MIN). As noted above, the fast Fourier transform algorithm
entails a more severe loss of accuracy for small values: in this case,
the warning is triggered at values below DBL\_EPSILON. In addition,
illegitimate negative values of $N(i,j)$ emerging from the fast
Fourier transform are immediately reset to zero.
Finally, problems can occur with low values of $T$. For instance, an
entire row of $N(i,j)$ values (all $j$ for some $i$) can underflow to
zero. To avoid this, a lower bound is imposed on $T$, by default 0.1.
\subsection*{Column Sampling}
\subsubsection*{Motivation}
Site sampling alone finds high-scoring alignments, but often with
non-optimal distribution of the key positions. For example, there
might be a key position that is deleted in all substrings, while
further along the alignment there is a column of inserted residues
that are all identical. In this case, the score would increase if all
the letters between these two columns were shifted across by one
position, so that the identical residues become aligned to a key
position, and the deletions disappear. Unfortunately, site sampling
moves can only shift the letters of one substring at a time, which is
likely to decrease the score. So low-scoring intermediate alignments
must be traversed in order to reach the improved alignment. Column
sampling moves solve this difficulty by providing a more direct route
to the improved alignment.
\subsubsection*{Overview}
Column sampling has two steps. In the first step, one key position is
chosen at random, and removed from the alignment. In the second step,
a new key position is added to the alignment. These steps can be
viewed as moving a key position from one location to another in the
alignment.
Removal of a key position has the following effect. If an internal
key position (neither the leftmost nor the rightmost) is removed, the
residues that were aligned to it become insertions between the
preceding and following key positions. If the leftmost or rightmost
key position is chosen, the residues that were aligned to it cease to
be part of the alignment.
In general, the number of ways of adding a key position to an
alignment is vast, and the addition step needs to be restricted to a
manageable subset of these. Furthermore, this subset should be chosen
in a way that ensures reversibility and detailed balance of column
sampling moves. To achieve this, some properties of the key position
that was removed need to be preserved.
\subsubsection*{Details}
Before removing a key position, the following information is recorded
about it: whether it is deleted or matched in each substring, and the
number of insertions in each substring between this key position and
the one to either the left or the right. The direction, left or
right, is chosen randomly with 50:50 probability before choosing the
key position. When the direction is ``left'', the leftmost key
position is never chosen, and when the direction is ``right'', the
rightmost key position is never chosen. This procedure requires that
there are at least two key positions.
After removing the key position, GLAM2 considers all ways of adding a
key position with these properties to the alignment. The numbers of
insertions between it and the neighbouring key position are preserved
relatively rather than absolutely: in other words, they may be shifted
by a constant offset. The alignment score for each way of adding such
a key position is calculated, and one of them is chosen randomly with
probabilities proportional to their scores raised to the power of
$1/T$.
\subsection*{Changing the number of key positions}
\subsubsection*{Overview}
None of the moves described so far changes the number of key positions
in the motif, yet we wish to optimise this also. This is accomplished
by two modifications to the column sampling procedure. Firstly, in
the removal step, a key position is chosen and its properties are
recorded as usual, but it is not always removed. This allows the
number of key positions to grow. The decision to remove or not is
made stochastically, with probability $q$ of removal. The correct
value for $q$ is discussed below. Secondly, in the addition step, a
key position is not always added. This allows the number of key
positions to decrease.
\subsubsection*{Modified Addition Step}
Suppose that a key position was removed, and that there are no other
key positions with the same properties. If we do not add a key
position, there will be no way to reverse the move, in violation of
detailed balance. Thus, a key position must always be added, unless
there is another key position in the alignment with the same
properties. One way to think about this is that the key position to
be added can be ``absorbed'' by a key position with the same
properties. Specifically, GLAM2 counts the number, $V$, of key
positions with the same properties as the one being added, and adds
absorption by each one of them to the list of alignments that it
considers. Finally, as usual, one alignment is chosen stochastically
with probability proportional to the alignment score to the power of
$1/T$.
\subsubsection*{Probability of Removal}
The probability $q$ of removal should be chosen carefully to maintain
detailed balance. Consider the probabilities of moving between two
alignments $X$ and $Y$, where $Y$ has one extra key position compared
to $X$. In the following, $V_X$ is the number of key positions in $X$
with the same properties as the extra key position in $Y$.
\begin{equation}
P(X \rightarrow Y) = \frac{1}{2} \cdot \frac{1}{2} \cdot (1-q(X)) \cdot
\frac{V_X}{W_X-1} \cdot
\frac{S'(Y)}{V_X \cdot S'(X) + \sum_Z S'(Z)}
\end{equation}
where $W_X$ is the number of key positions in $X$ (one of which,
either the leftmost or rightmost, is never selected), $S'(X)$ is the
score of $X$ to the power of $1/T$, and $Z$ sums over all the ways of
adding a new key position. The first $1/2$ is the probability of
column sampling rather than site sampling; the second $1/2$ is the
probability of choosing the direction (left or right); $1-q(X)$ is the
probability of not removing the key position; $V_X / (W_X-1)$ is the
probability of choosing a key position with the right properties to
get to $Y$; the final fraction is the probability of adding a key
position in the right location to get to $Y$. Similarly:
\begin{equation}
P(Y \rightarrow X) = \frac{1}{2} \cdot \frac{1}{2} \cdot q(Y) \cdot
\frac{1}{W_Y-1} \cdot
\frac{V_X \cdot S'(X)}{V_X \cdot S'(X) + \sum_Z S'(Z)}
\end{equation}
Thus, to achieve detailed balance, the following must hold:
\begin{equation}
\frac{1-q(X)}{W_X-1} = \frac{q(Y)}{W_Y-1}
\end{equation}
where $W_X = W_Y-1$. This can be satisfied if $q$ is a function of
the number of key positions, $W$. To simplify things a bit, we
express $q$ as a function of the number of \emph{selectable} key
positions, $W' = W-1$, recalling that one of the two endmost key
positions is never selected. Then:
\begin{align}
& \frac{q(W'+1)}{W'+1} = \frac{1}{W'} - \frac{q(W')}{W'} \\
\label{eq:qW}
\Rightarrow & \frac{q(W')}{W'} =
(-1)^{W'-1} \left[ q(1) - \sum_{i=1}^{W'-1} \frac{(-1)^{i-1}}{i} \right]
\end{align}
The summation in Equation (\ref{eq:qW}) is a partial sum of the
alternating harmonic series, which converges to $\ln 2$, so the unique
solution that prevents $q(W')$ from diverging is, rather surprisingly,
$q(1) = \ln 2$.
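The recursion and the special role of $q(1) = \ln 2$ can be checked numerically with a few lines of Python (illustrative only):
\begin{verbatim}
from math import log

def removal_probabilities(max_Wp, q1=log(2)):
    # q(W'+1)/(W'+1) = 1/W' - q(W')/W', starting from q(1).
    q = {1: q1}
    for Wp in range(1, max_Wp):
        q[Wp + 1] = (Wp + 1) * (1.0 / Wp - q[Wp] / Wp)
    return q

# With q(1) = ln 2 the values decrease smoothly towards 1/2; other
# starting values eventually drive q(W') outside the interval [0, 1].
\end{verbatim}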
\subsection*{Initialization}
At the start of each annealing run, an initial alignment is
constructed in the following manner. The number of key positions is
set to an initial value chosen by the user (default: 20). Starting
with an empty alignment, the input sequences are taken one-by-one, in
a random order, and added to the alignment using a site sampling move
with $T = 1$.
\section*{Motif Scanning}
\subsection*{Overview}
In motif scanning, a sequence database is searched to find new
instances of a previously determined motif. An instance is a sequence
segment that has a high-scoring alignment to the key positions of the
motif. In this section, we first describe the scoring scheme for such
alignments. Secondly, we describe an algorithm to scan a motif
against one sequence, and find a guaranteed maximal-scoring alignment.
To scan a database, this algorithm can simply be repeated for each
sequence. However, some sequences may contain multiple motif
instances: we do not want to be limited to one hit per sequence.
Thus, we also describe a method for finding suboptimal matches. The
techniques used in this section are all standard \cite{Durbin00}.
\subsection*{Scoring Scheme}
\subsubsection*{Motif Model}
The scoring scheme derives from the statistical motif model described
earlier. To recap, the model has position-specific residue
probabilities: $\theta_i$, deletion probabilities: $\phi$, and
insertion probabilities: $\psi^x(1-\psi)$. For scanning, these
probabilities are estimated from the predetermined motif, and from
prior expectations. Specifically, the posterior mean estimators
\cite{Durbin00} are used.
We are given a motif with $W$ key positions and $s$ aligned sequences.
The $k$\textsuperscript{th} key position has $m_k$ residues and is
deleted $d_k$ times, so that $m_k + d_k = s$. There are $r_k$
insertions between key positions $k$ and $k+1$. $c_{kX}$ is the count
of residue type $X$ in key position $k$, so that $m_k = \sum_{X=1}^A
c_{kX}$. Using the same Beta and Dirichlet mixture priors as above,
the posterior mean estimators are:
\begin{align}
\hat\phi(k) & = \frac{d_k+D}{s+D+E}
\\
\hat\psi(k) & = \frac{r_k+I}{r_k+s+I+J}
\\
\hat\theta(k,X) & = \sum_{j=1}^C v_{kj} \frac{c_{kX}+\alpha_{jX}}{m_k+A_j}
\end{align}
where:
\begin{equation}
v_{kj} = \frac{
\frac{w_j \cdot \Gamma(A_j)}{\Gamma(m_k+A_j)}
\prod_{i=1}^A \frac{\Gamma(c_{ki}+\alpha_{ji})}{\Gamma(\alpha_{ji})}
}{
\sum_{j=1}^C \frac{w_j \cdot \Gamma(A_j)}{\Gamma(m_k+A_j)}
\prod_{i=1}^A \frac{\Gamma(c_{ki}+\alpha_{ji})}{\Gamma(\alpha_{ji})}
}
\end{equation}
\subsubsection*{Background Model}
We evaluate a potential motif instance by comparing the motif model to
a background model. As before, the background model is that the
residues occur randomly and independently, with probabilities $p_i$.
The $p_i$ values can be adjusted by the user. By default, for
proteins, the abundances of Robinson and Robinson are used
\cite{ProcNatlAcadSciUSA.:88:8880}, and for nucleotides, uniform
abundances are used.
\subsubsection*{Score Parameters}
The score of a sequence segment aligned to a motif is a log likelihood
ratio: the likelihood of the aligned segment given the motif model,
versus the likelihood of the segment given the background model.
Thus, the score for deleting key position $k$ is:
\begin{equation}
\delta(k) = \log_2 \left[ \hat\phi(k) \right]
\end{equation}
The score for aligning a residue of type $X$ with key position $k$ is:
\begin{equation}
\mu(k,X) = \log_2 \left[ (1 - \hat\phi(k)) \cdot \hat\theta(k,X) \; / \; p_X \right]
\end{equation}
The score for inserting $x$ residues between $k$ and $k+1$ is
$\iota_1(k) + \iota_2(k) \cdot x$, where:
\begin{align}
\iota_1(k) & = \log_2 \left[ 1 - \hat\psi(k) \right]
\\
\iota_2(k) & = \log_2 \left[ \hat\psi(k) \right]
\end{align}
\subsection*{Search Algorithm}
\subsubsection*{Overview}
We wish to find a maximal-scoring alignment of the motif's key
positions with a segment of the query sequence. This is a standard
alignment problem, which can be solved by dynamic programming followed
by a traceback.
\subsubsection*{Dynamic Programming}
Dynamic programming calculates values $M(i,j)$: the score of the
highest-scoring alignment ending at the $i$\textsuperscript{th} key
position in the motif and the $j$\textsuperscript{th} residue in the
sequence. The calculation is made slightly simpler by constructing
intermediate values $N(i,j)$. In the following, $L$ is the length of
the sequence. Boundary cases:
\begin{align}
N(0,j) & = 0 && (0 \leq j \leq L)
\\
M(i,0) & = \delta(i) + N(i-1,0) && (1 \leq i \leq W)
\\
N(i,0) & = \iota_1(i) + M(i,0) && (1 \leq i < W)
\end{align}
Main cases:
\begin{align}
M(i,j) & = \max \left\{ \begin{array}{l}
\mu(i,X_j) + N(i-1,j-1) \\
\delta(i) + N(i-1,j)
\end{array} \right.
&&
\left( \begin{array}{l}
1 \leq j \leq L \\ 1 \leq i \leq W
\end{array} \right)
\\
N(i,j) & = \max \left\{ \begin{array}{l}
\iota_1(i) + M(i,j) \\
\iota_2(i) + N(i,j-1)
\end{array} \right.
&&
\left( \begin{array}{l}
1 \leq j \leq L \\ 1 \leq i < W
\end{array} \right)
\end{align}
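A direct implementation of these recurrences (illustrative Python, with names matching the equations) is sketched below; the traceback described next then recovers the alignment.
\begin{verbatim}
NEG_INF = float('-inf')

def scan_matrices(seq, W, mu, delta, iota1, iota2):
    # Maximum-score version of the matrices defined above.
    L = len(seq)
    M = [[NEG_INF] * (L + 1) for _ in range(W + 1)]
    N = [[NEG_INF] * (L + 1) for _ in range(W + 1)]
    N[0] = [0.0] * (L + 1)                       # boundary: N(0,j) = 0
    for i in range(1, W + 1):
        M[i][0] = delta(i) + N[i - 1][0]
        for j in range(1, L + 1):
            M[i][j] = max(mu(i, seq[j - 1]) + N[i - 1][j - 1],
                          delta(i) + N[i - 1][j])
        if i < W:
            N[i][0] = iota1(i) + M[i][0]
            for j in range(1, L + 1):
                N[i][j] = max(iota1(i) + M[i][j],
                              iota2(i) + N[i][j - 1])
    return M, N
\end{verbatim}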
\subsubsection*{Traceback}
In this section, the values $M(i,j)$ and $N(i,j)$ are used to find a
highest-scoring alignment. First, a highest-scoring alignment
endpoint is found:
\begin{equation}
j = \argmax_{j=0}^L M(W,j)
\end{equation}
In case of ties, the lower value of $j$ is arbitrarily chosen.
Having chosen the endpoint $j$, the alignment is determined by
iterating the following steps, with $i$ initially equal to $W$.
Firstly, if $\delta(i) + N(i-1,j)$ is greater than $\mu(i,X_j) +
N(i-1,j-1)$, key position $i$ is deleted, else residue $j$ is aligned
to key position $i$. Then $i$ is decremented, and unless deletion was
chosen, $j$ is decremented. Secondly, if $\iota_2(i) + N(i,j-1)$ is
greater than $\iota_1(i) + M(i,j)$, a residue is inserted between key
positions $i$ and $i+1$, and $j$ is decremented. This is repeated
until $\iota_2(i) + N(i,j-1)$ ceases to be greater than $\iota_1(i) +
M(i,j)$. These steps are repeated until $i$ or $j$ reaches zero.
\subsection*{Suboptimal Matches}
\subsubsection*{Overview}
The previous algorithm finds one highest-scoring alignment of a motif
to a sequence, but we wish to be able to find more than one alignment.
The usual definitional problem arises: there are a vast number of
possible alignments, and the second-highest scoring one is likely to
be a minor variant of the highest-scoring one, but we are not
interested in such variants. Thus, we adopt the criterion of Waterman
and Eggert \cite{Waterman2448477}: subsequent alignments must not pair
residues and key positions that have been paired in any previous
alignment.
\subsubsection*{Method}
To satisfy this criterion, we maintain a matrix $F(i,j)$ that
indicates forbidden pairs. $F(i,j)$ is initialized to all zeros.
During the traceback, whenever residue $j$ is aligned to key position
$i$, $F(i,j)$ is set to 1. After the traceback, the $M$ and $N$
matrices are recalculated, with matches forbidden whenever $F(i,j) =
1$. Only small regions of $M$ and $N$ near the newly forbidden cells
need to be recalculated \cite{Waterman2448477}. Finally, a new
traceback is performed, avoiding old forbidden matches, and marking
new forbidden matches. These recalculation and traceback steps can be
repeated as often as desired.
\subsubsection*{Stopping Criteria}
Retrieval of suboptimal matches from one sequence stops when either of
two conditions is met.  Firstly, if an alignment with no matches
(only deletions and insertions) is found, it is discarded and the
procedure is terminated. Such an alignment does not add any forbidden
pairings, so it necessarily terminates the Waterman-Eggert procedure.
Secondly, the total number of database hits is limited to some value
$n$ chosen by the user. If $n$ hits have already been found, and the
new alignment has a score no greater than any of these, it is
discarded and the procedure is terminated. The database hits are
stored in a heap, which is a kind of partially-sorted array that
allows efficient removal of the lowest-scoring hit and insertion of
new hits.
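Python's \texttt{heapq} module provides exactly this kind of structure; a minimal sketch of keeping the top $n$ hits (illustrative only, names ours) is:
\begin{verbatim}
import heapq

def keep_top_hits(scored_hits, n):
    # scored_hits yields (score, hit) pairs; the heap root is always the
    # lowest-scoring stored hit, so replacement costs O(log n).
    heap = []
    for score, hit in scored_hits:
        if len(heap) < n:
            heapq.heappush(heap, (score, hit))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, hit))
    return sorted(heap, reverse=True)
\end{verbatim}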
\bibliographystyle{plain}
\bibliography{GLAM2_method}
\end{document}