File: OnlinePatternMatching.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (184 lines) | stat: -rw-r--r-- 7,472 bytes parent folder | download | duplicates (9)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
.. sidebar:: ToC

   .. contents::

.. _tutorial-algorithms-pattern-matching-online:

Online Pattern Matching
=======================

Learning Objective
  In this tutorial you will learn how to use the SeqAn classes finder and pattern to search a known pattern in a string or :dox:`StringSet`.

Difficulty
  Average

Duration
  40 min

Prerequisites
  :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-datastructures-indices`

For all online search algorithms, the :dox:`Finder` is an iterator that scans over the ``haystack``.
The :dox:`Pattern` is a search algorithm dependent data structure preprocessed from the ``needle``.
The second template argument of the :dox:`Pattern` selects the search algorithm.

Exact Search
------------

The following code snippet illustrates the usage of online search algorithms in SeqAn using the example of the Hoorspool algorithm :cite:`Horspool1980`.
We begin by creating two strings of type ``char`` containing the ``haystack`` and the ``needle``.

.. includefrags:: demos/tutorial/pattern_matching/find_exact.cpp
   :fragment: initialization

We then create :dox:`Finder` and :dox:`Pattern` objects of these strings and choose :dox:`HorspoolPattern Horspool` as the specialization in the second template argument of :dox:`Pattern`.

.. includefrags:: demos/tutorial/pattern_matching/find_exact.cpp
   :fragment: output

Program output:

.. includefrags:: demos/tutorial/pattern_matching/find_exact.cpp.stdout

Currently the following exact online algorithms for searching a single sequence are implemented in SeqAn:

:dox:`SimplePattern Simple`
  Brute force algorithm

:dox:`HorspoolPattern Horspool`
  :cite:`Horspool1980`

:dox:`BfamPattern Bfam`
  Backward Factor Automaton Matching

:dox:`BndmAlgoPattern BndmAlgo`
  Backward Nondeterministic DAWG Matching

:dox:`ShiftAndPattern ShiftAnd`
  Exact string matching using bit parallelism

:dox:`ShiftOrPattern ShiftOr`
  Exact string matching using bit parallelism

... and for multiple sequences:

:dox:`WuManberPattern WuManber`
  Extension of :dox:`HorspoolPattern Horspool`.

:dox:`MultiBfamPattern MultiBfam`
  Multiple version of :dox:`BfamPattern Bfam`, uses an automaton of reversed needles.

:dox:`SetHorspoolPattern SetHorspool`
  Another extension of :dox:`HorspoolPattern Horspool` using a trie of reversed needles.

:dox:`AhoCorasickPattern AhoCorasick`
  :cite:`Aho1975`

:dox:`MultipleShiftAndPattern MultipleShiftAnd`
  Extension of :dox:`ShiftAndPattern ShiftAnd`, should only be used if the sum of needle lengths doesn't exceed the machine word size.

Assignment 1
^^^^^^^^^^^^

.. container:: assignment

   Type
    Review

   Objective
    Use the given code example from below.
    Extend the code to search the given ``haystack`` simultaneously for "mo", "send" and "more".
    For every match output the begin and end position in the ``haystack`` and which ``needle`` has been found.

   Hint
     Online search algorithms for multiple sequences simply expect needles of type ``String<String<...> >``.

     .. includefrags:: demos/tutorial/pattern_matching/assignment1.cpp

     You can use the specialization :dox:`WuManberPattern WuManber`.

   Solution
      Click **more...** to see the solution.

      .. container:: foldable

         .. includefrags:: demos/tutorial/pattern_matching/assignment1_solution.cpp

	 We use a :dox:`Pattern` specialized with the :dox:`WuManberPattern WuManber` algorithm for the search and initialize it with our ``needles`` string.
	 For every match found by :dox:`Finder#find` we output the begin and end position and the match region in the ``haystack`` as well as the index of the found ``needle`` which is returned by ``position(pattern)``.

         .. includefrags:: demos/tutorial/pattern_matching/assignment1_solution.cpp.stdout

Approximate Search
------------------

The approximate search can be used to find segments in the ``haystack`` that are similar to a ``needle`` allowing errors, such as mismatches or indels.
Note that if only mismatches are allowed, the difference of the end and begin position of a match is the length of the found ``needle``.
However, in the case of indels this difference may vary and is only a rough estimate for the length.
Therefore, to find a begin position for a certain end position the :dox:`Finder#findBegin` interface should be used.
The usage is similar to :dox:`Finder#find` and is shown in the next example.
We want to find all semi-global alignments of a ``needle`` "more" with a :dox:`SimpleScore` of at least -2 using the scoring scheme (0,-2,-1) (match,mismatch,gap).

Again, we create ``haystack`` and ``needle`` strings first:

.. includefrags:: demos/tutorial/pattern_matching/find_approx.cpp
   :fragment: initialization

We then create :dox:`Finder` and :dox:`Pattern` objects of these strings and choose :dox:`DPSearchPattern DPSearch` as the specialization in the second template argument of :dox:`Pattern`.
:dox:`DPSearchPattern DPSearch` expects the scoring function as the first template argument which is :dox:`SimpleScore` in our example.
The pattern is constructed using the ``needle`` as a template and our scoring object is initialized with the appropriate scores for match, mismatch and gap.
As in the previous example, the main iteration uses :dox:`Finder#find` to iterate over all end positions with a minimum best score of -2.
If such a semi-global alignment end position is found the begin position is searched via :dox:`Finder#findBegin`.
Please note that we have to set the minimum score to the score of the match found (:dox:`LocalAlignmentEnumerator#getScore`) in order to find the begin of a best match.
We then output all begin and end positions and the corresponding ``haystack`` segment for each match found.

.. includefrags:: demos/tutorial/pattern_matching/find_approx.cpp
   :fragment: output

Program output:

.. includefrags:: demos/tutorial/pattern_matching/find_approx.cpp.stdout


The following specializations are available:

Specialization :dox:`DPSearchPattern DPSearch`
  Dynamic programming algorithm for many kinds of scoring scheme

Specialization :dox:`MyersPattern Myers`
  :cite:`Myers1999`, :cite:`Ukkonen1985`

Specialization :dox:`PexPattern Pex`
  :cite:`BaezaYates1999`

Specialization :dox:`AbndmAlgoPattern AbndmAlgo`
  Approximate Backward Nondeterministic DAWG Matching, adaption of :dox:`AbndmAlgoPattern AbndmAlgo`

Assignment 2
^^^^^^^^^^^^

.. container:: assignment

   Type
     Application

   Objective
     Use the example from above.
     Modify the code to search with the :dox:`MyersPattern Myers` algorithm for matches of ``"more"`` with an edit distance of at most 2.

   Solution
      Click **more...** to see the solution.

      .. container:: foldable

         .. includefrags:: demos/tutorial/pattern_matching/assignment2_solution.cpp

	 We again set the ``needle`` to ``"more"``.
	 We then change the specialization tag of the :dox:`Pattern` to :dox:`MyersPattern Myers` with default arguments.
	 As :dox:`MyersPattern Myers` algorithm is only applicable to edit distance searches it cannot be specialized or initialized with a scoring scheme.
	 In SeqAn, edit distance corresponds to the scoring scheme (0,-1,-1) (match, mismatch, gap) and an edit distance of 2 corresponds to a minimum score of -2 given to the :dox:`Finder#find` function.

	 The program's output is as follows.

         .. includefrags:: demos/tutorial/pattern_matching/assignment2_solution.cpp.stdout