1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184
|
.. sidebar:: ToC
.. contents::
.. _tutorial-algorithms-pattern-matching-online:
Online Pattern Matching
=======================
Learning Objective
In this tutorial you will learn how to use the SeqAn classes finder and pattern to search a known pattern in a string or :dox:`StringSet`.
Difficulty
Average
Duration
40 min
Prerequisites
:ref:`tutorial-datastructures-sequences`, :ref:`tutorial-datastructures-indices`
For all online search algorithms, the :dox:`Finder` is an iterator that scans over the ``haystack``.
The :dox:`Pattern` is a search algorithm dependent data structure preprocessed from the ``needle``.
The second template argument of the :dox:`Pattern` selects the search algorithm.
Exact Search
------------
The following code snippet illustrates the usage of online search algorithms in SeqAn using the example of the Hoorspool algorithm :cite:`Horspool1980`.
We begin by creating two strings of type ``char`` containing the ``haystack`` and the ``needle``.
.. includefrags:: demos/tutorial/pattern_matching/find_exact.cpp
:fragment: initialization
We then create :dox:`Finder` and :dox:`Pattern` objects of these strings and choose :dox:`HorspoolPattern Horspool` as the specialization in the second template argument of :dox:`Pattern`.
.. includefrags:: demos/tutorial/pattern_matching/find_exact.cpp
:fragment: output
Program output:
.. includefrags:: demos/tutorial/pattern_matching/find_exact.cpp.stdout
Currently the following exact online algorithms for searching a single sequence are implemented in SeqAn:
:dox:`SimplePattern Simple`
Brute force algorithm
:dox:`HorspoolPattern Horspool`
:cite:`Horspool1980`
:dox:`BfamPattern Bfam`
Backward Factor Automaton Matching
:dox:`BndmAlgoPattern BndmAlgo`
Backward Nondeterministic DAWG Matching
:dox:`ShiftAndPattern ShiftAnd`
Exact string matching using bit parallelism
:dox:`ShiftOrPattern ShiftOr`
Exact string matching using bit parallelism
... and for multiple sequences:
:dox:`WuManberPattern WuManber`
Extension of :dox:`HorspoolPattern Horspool`.
:dox:`MultiBfamPattern MultiBfam`
Multiple version of :dox:`BfamPattern Bfam`, uses an automaton of reversed needles.
:dox:`SetHorspoolPattern SetHorspool`
Another extension of :dox:`HorspoolPattern Horspool` using a trie of reversed needles.
:dox:`AhoCorasickPattern AhoCorasick`
:cite:`Aho1975`
:dox:`MultipleShiftAndPattern MultipleShiftAnd`
Extension of :dox:`ShiftAndPattern ShiftAnd`, should only be used if the sum of needle lengths doesn't exceed the machine word size.
Assignment 1
^^^^^^^^^^^^
.. container:: assignment
Type
Review
Objective
Use the given code example from below.
Extend the code to search the given ``haystack`` simultaneously for "mo", "send" and "more".
For every match output the begin and end position in the ``haystack`` and which ``needle`` has been found.
Hint
Online search algorithms for multiple sequences simply expect needles of type ``String<String<...> >``.
.. includefrags:: demos/tutorial/pattern_matching/assignment1.cpp
You can use the specialization :dox:`WuManberPattern WuManber`.
Solution
Click **more...** to see the solution.
.. container:: foldable
.. includefrags:: demos/tutorial/pattern_matching/assignment1_solution.cpp
We use a :dox:`Pattern` specialized with the :dox:`WuManberPattern WuManber` algorithm for the search and initialize it with our ``needles`` string.
For every match found by :dox:`Finder#find` we output the begin and end position and the match region in the ``haystack`` as well as the index of the found ``needle`` which is returned by ``position(pattern)``.
.. includefrags:: demos/tutorial/pattern_matching/assignment1_solution.cpp.stdout
Approximate Search
------------------
The approximate search can be used to find segments in the ``haystack`` that are similar to a ``needle`` allowing errors, such as mismatches or indels.
Note that if only mismatches are allowed, the difference of the end and begin position of a match is the length of the found ``needle``.
However, in the case of indels this difference may vary and is only a rough estimate for the length.
Therefore, to find a begin position for a certain end position the :dox:`Finder#findBegin` interface should be used.
The usage is similar to :dox:`Finder#find` and is shown in the next example.
We want to find all semi-global alignments of a ``needle`` "more" with a :dox:`SimpleScore` of at least -2 using the scoring scheme (0,-2,-1) (match,mismatch,gap).
Again, we create ``haystack`` and ``needle`` strings first:
.. includefrags:: demos/tutorial/pattern_matching/find_approx.cpp
:fragment: initialization
We then create :dox:`Finder` and :dox:`Pattern` objects of these strings and choose :dox:`DPSearchPattern DPSearch` as the specialization in the second template argument of :dox:`Pattern`.
:dox:`DPSearchPattern DPSearch` expects the scoring function as the first template argument which is :dox:`SimpleScore` in our example.
The pattern is constructed using the ``needle`` as a template and our scoring object is initialized with the appropriate scores for match, mismatch and gap.
As in the previous example, the main iteration uses :dox:`Finder#find` to iterate over all end positions with a minimum best score of -2.
If such a semi-global alignment end position is found the begin position is searched via :dox:`Finder#findBegin`.
Please note that we have to set the minimum score to the score of the match found (:dox:`LocalAlignmentEnumerator#getScore`) in order to find the begin of a best match.
We then output all begin and end positions and the corresponding ``haystack`` segment for each match found.
.. includefrags:: demos/tutorial/pattern_matching/find_approx.cpp
:fragment: output
Program output:
.. includefrags:: demos/tutorial/pattern_matching/find_approx.cpp.stdout
The following specializations are available:
Specialization :dox:`DPSearchPattern DPSearch`
Dynamic programming algorithm for many kinds of scoring scheme
Specialization :dox:`MyersPattern Myers`
:cite:`Myers1999`, :cite:`Ukkonen1985`
Specialization :dox:`PexPattern Pex`
:cite:`BaezaYates1999`
Specialization :dox:`AbndmAlgoPattern AbndmAlgo`
Approximate Backward Nondeterministic DAWG Matching, adaption of :dox:`AbndmAlgoPattern AbndmAlgo`
Assignment 2
^^^^^^^^^^^^
.. container:: assignment
Type
Application
Objective
Use the example from above.
Modify the code to search with the :dox:`MyersPattern Myers` algorithm for matches of ``"more"`` with an edit distance of at most 2.
Solution
Click **more...** to see the solution.
.. container:: foldable
.. includefrags:: demos/tutorial/pattern_matching/assignment2_solution.cpp
We again set the ``needle`` to ``"more"``.
We then change the specialization tag of the :dox:`Pattern` to :dox:`MyersPattern Myers` with default arguments.
As :dox:`MyersPattern Myers` algorithm is only applicable to edit distance searches it cannot be specialized or initialized with a scoring scheme.
In SeqAn, edit distance corresponds to the scoring scheme (0,-1,-1) (match, mismatch, gap) and an edit distance of 2 corresponds to a minimum score of -2 given to the :dox:`Finder#find` function.
The program's output is as follows.
.. includefrags:: demos/tutorial/pattern_matching/assignment2_solution.cpp.stdout
|