File: SeedExtension.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (273 lines) | stat: -rw-r--r-- 10,593 bytes parent folder | download | duplicates (9)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
.. sidebar:: ToC

    .. contents::

.. _tutorial-algorithms-seed-extension:

Seed Extension
==============

Learning Objective
  You will learn how to do seed-and-extend with SeqAn, how to do local and global chaining of seeds.
  Finally, you will learn how to create a banded alignment around a seed chain.

Difficulty
  Average

Duration
  2 h

Prerequisites
  :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-datastructures-seeds`

Overview
--------

Many efficient heuristics to find high scoring, but inexact, local alignments between two sequences start with small exact (or at least highly similar) segments, so called **seeds**, and extend or combine them to get larger highly similar regions.
Probably the most prominent tool of this kind is BLAST :cite:`Altschul1990`, but there are many other examples like FASTA :cite:`Pearson1990` or LAGAN :cite:`Brudno2003`.

SeqAn's header file for all data structures and functions related to two-dimensional seeds is ``<seqan/seeds.h>``.

Seed Extension
--------------

Seeds are often created quickly using a *k*-mer index: When a *k*-mer of a given length is found in both sequences, we can use it as a seed.
However, the match can be longer than just *k* characters. To get longer matches, we use **seed extension**.

In the most simple case we simply look for matching characters in both sequences to the left and right end of the seed.
This is called **match extension** and available through the :dox:`Seed#extendSeed` function using the ``MatchExtend`` tag. Below example shows how to extend seeds to the right end.

.. includefrags:: demos/tutorial/seed_and_extend/example1.cpp
   :fragment: example

.. includefrags:: demos/tutorial/seed_and_extend/example1.cpp.stdout

Assignment 1
""""""""""""

.. container:: assignment

   Type
     Review

   Objective
     Change the example from above to extend the seed to both sides.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/seed_and_extend/solution1.cpp

A more complex case is the standard bioinformatics approach of **x-drop extension**.

In the ungapped case, we extend the seed by comparing the *i*-th character to the left/right of the seed of the horizontal sequence (subject sequence) with the *j*-th character to the left/right of the seed in the vertical sequence (query sequence).
Matches and mismatches are assigned with scores (usually matches are assigned with positive scores and mismatches are assigned with negative scores).
The scores are summed up.
When one or more mismatches occur, the running total will drop.
When the sum drops more than a value *x*, the extension is stopped.

This approach is also available in the gapped case in the SeqAn library.
Here, creating gaps is also possible but also assigned negative scores.

.. includefrags:: demos/tutorial/seed_and_extend/example2.cpp
   :fragment: example

.. includefrags:: demos/tutorial/seed_and_extend/example2.cpp.stdout

Assignment 2
""""""""""""

.. container:: assignment

   Type
     Review

   Objective
     Change the example from above to use gapped instead of ungapped x-drop extension.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/seed_and_extend/solution2.cpp

After extending a seed, we might wish to actually get the resulting alignment.
When using gapped x-drop extension, we need to perform a banded global alignment of the two sequence infixes that correspond to the seed.
This is shown in the following example:

.. includefrags:: demos/tutorial/seed_and_extend/example3.cpp
   :fragment: example

.. includefrags:: demos/tutorial/seed_and_extend/example3.cpp.stdout

Assignment 3
""""""""""""

.. container:: assignment

   Type
     Review

   Objective
     Change the example from above to a gap open score of ``-2`` and a gap extension score of ``-2``.

   Solution
     .. container:: foldable

	Note that we do not have to explicitely call Gotoh's algorithm in ``globalAlignment()``.
	The fact that the gap extension score is different from the gap opening score is enough.

        .. includefrags:: demos/tutorial/seed_and_extend/solution3.cpp

Local Chaining using Seed Sets
------------------------------

Usually, we quickly determine a large number of seeds.
When a seed is found, we want to find a "close" seed that we found previously and combine it to form a longer seed.
This combination is called **local chaining**. This approach has been pioneered in the CHAOS and BLAT programs.

SeqAn provides the :dox:`SeedSet` class as a data structure to efficiently store seeds and combine new seeds with existing ones.
The following example creates a :dox:`SeedSet` object ``seeds``, adds four seeds to it and then prints its contents.

.. includefrags:: demos/tutorial/seed_and_extend/example4.cpp
   :fragment: example

The output of the program above can be seen below.

.. includefrags:: demos/tutorial/seed_and_extend/example4.cpp.stdout

Note that we have used the ``Single()`` tag for adding the seeds.
This forces the seeds to be added independent of the current seed set contents.

By using different overloads of the :dox:`SeedSet#addSeed`, we can use various local chaining strategies when adding seed ``A``.

``Merge``
  If there is a seed ``B`` that overlaps with ``A`` and the difference in diagonals is smaller than a given threshold then ``A`` can be merged with ``B``.

``SimpleChain``
  If there is a seed ``B`` whose distance in both sequences is smaller than a given threshold then ``A`` can be chained to ``B``.

``Chaos``
  Following the strategy of CHAOS :cite:`Brudno2003b`, if ``A`` is within a certain distance to ``B`` in both sequences and the distance in diagonals is smaller than a given threshold then ``A`` can be chained to ``B``.

The :dox:`SeedSet#addSeed` function returns a boolean value indicating success in finding a suitable partner for chaining.
A call using the ``Single`` strategy always yields ``true``.

The following example shows how to use the ``SimpleChain`` strategy.

.. includefrags:: demos/tutorial/seed_and_extend/example5.cpp
   :fragment: example

As we can see, the seed ``TSeed(4, 2, 3)`` has been chained to ``TSeed(0, 0, 2)``.

.. includefrags:: demos/tutorial/seed_and_extend/example5.cpp.stdout

Assignment 4
""""""""""""

.. container:: assignment

   Type
     Review

   Objective
     Change the example above to use the ``Chaos`` strategy instead of the ``SimpleChain``.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/seed_and_extend/solution4.cpp

Global Chaining
---------------

.. image:: GlobalChaining.png
   :align: right
   :width: 250px

After one has determined a set of candidate seeds, a lot of these seeds will conflict.
The image to the right shows an example.
Some conflicting seeds might be spurious matches or come from duplication events.

Often, we need to find a linear ordering of the seeds such that each seed starts after all of its predecessor end in both sequences.
This can be done efficiently using dynamic sparse programming (in time :math:`\mathcal{O}(n log n)` where :math:`n` is the number of seeds) as described in :cite:`Gusfield1997`.
The red seeds in the image to the right show such a valid chain.

This functionality is available in SeqAn using the :dox:`chainSeedsGlobally` function.
The function gets a sequence container of :dox:`Seed` objects for the result as its first parameter and a :dox:`SeedSet` as its second parameter.
A subset of the seeds from the :dox:`SeedSet` are then selected and stored in the result sequence.

The following shows a simple example.

.. includefrags:: demos/tutorial/seed_and_extend/example6.cpp
   :fragment: example

Assignment 5
""""""""""""

.. container:: assignment

   Type
     Review

   Objective
      Change the example from above to use a different chain of seeds.
      The seeds should be ``TSeed(1, 1, 3)``, ``TSeed(6, 9, 2)``, ``TSeed(10, 13, 3)``, and ``TSeed(20, 22, 5)``.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/seed_and_extend/solution5.cpp

Banded Chain Alignment
----------------------

After obtaining such a valid seed chain, we would like to obtain an alignment along the chain.
For this, we can use the so-called banded chain alignment algorithm :cite:`Brudno2003`.
Around seeds, we can use banded DP alignment and the spaces between seeds can be aligned using standard DP programming alignment.

In SeqAn you can compute the banded chain alignment by calling the function :dox:`bandedChainAlignment`.
This function gets the structure in which the alignment should be stored as the first parameter.
This corresponds to the interface of the :dox:`globalAlignment` and allows the same input types.
Additionally, this function requires a non-empty, non-decreasing monotonic chain of seeds which is used as the rough global map for computing the global alignment.
The third required parameter is the :dox:`Score`.

Note, that there are a number of optional parameters that can be specified.
These include a second :dox:`Score` which, if specified, is used to evaluate the regions between two consecutive seeds differently than the regions around the seeds itself (for which then the first specified score is taken.).
As for the global alignment you can use the :dox:`AlignConfig` to specify the behavior for initial and end gaps.
The last optional parameter is the band extension.
This parameter specifies to which size the bands around the anchors should be extended.
The default value is 15 and conforms the default value in the LAGAN-algorithm :cite:`Brudno2003`.

.. important::

    At the moment the specified value for the band extension must be at least one.

.. includefrags:: demos/tutorial/seed_and_extend/example7.cpp
   :fragment: example

The output of the example above.

.. includefrags:: demos/tutorial/seed_and_extend/example7.cpp.stdout


Assignment 6
""""""""""""

.. container:: assignment

   Type
     Review

   Objective
     Change the example form above to use two different scoring schemes.
     The scoring scheme for the seeds should use the Levenshtein distance and the score for the gap regions should be an affine score with the following values: match = 2, mismatch = -1, gap open = -2, gap extend = -1.

     Furthermore, we are looking for a semi-global alignment here the initial and end gaps in the query sequence are free.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/seed_and_extend/solution6.cpp


.. TODO: LAGAN demo should be refered to from here when it's done.