File: IndexedPatternMatching.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (143 lines) | stat: -rw-r--r-- 6,479 bytes parent folder | download | duplicates (9)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
.. sidebar:: ToC

    .. contents::

.. _tutorial-algorithms-pattern-matching-indexed:

Indexed Pattern Matching
========================

Learning Objective
  In this tutorial you will learn how to use the SeqAn classes finder and pattern to search a known pattern in a string or :dox:`StringSet`.

Difficulty
  Average

Duration
  30 min

Prerequisites
  :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-datastructures-indices`, :ref:`tutorial-algorithms-pattern-matching-online`

Overview
--------

The :dox:`Finder` is an object that stores all necessary information for searching for a pattern. We have learned how to use it in :ref:`tutorial-algorithms-pattern-matching-online` tutorial.

The following line of code shows how the :dox:`Finder` is initialized with an index. In this example, we search for the pattern ``ACGT``.

.. includefrags:: demos/tutorial/indices/base.cpp
      :fragment: finder

Calling the function :dox:`Finder#find` invokes the localization of all occurrences of a given pattern.
It works by modifying pointers of the ``Finder`` to tables of the index.
For example, the :dox:`Finder` of ``esaIndex`` stores two pointers, pointing to the first and last suffix array entry that stores an occurrence of the pattern. The return value of the :dox:`Finder#find` function tells us whether or not a given pattern occurs in the text. Furthermore, if there are several instances of a pattern, consecutive calls of :dox:`Finder#find` will modify the :dox:`Finder` such that it points to the next occurrence after each call:

.. includefrags:: demos/tutorial/indices/base.cpp
      :fragment: finder_multiple

The above code is not very useful, since we do not know the locations of the first, second or third pattern occurrence.
The function :dox:`Finder#position` will help here.
:dox:`Finder#position` called on a finder returns the location of the ``x``\ th pattern, where ``x`` can be the first, second, or any other occurrence of the pattern.

.. includefrags:: demos/tutorial/indices/base.cpp
      :fragment: finder_position

.. tip::

   Indices in SeqAn are built on demand.
   That means that the index tables are not build when the constructor is called, but when we search for a pattern for the first time.

Exact Search
------------

For the index based search the :dox:`Finder` needs to be specialized with an :dox:`Index` of the ``haystack`` in the first template argument.
The index itself requires two template arguments, the ``haystack`` type and a index specialization.
In contrast, since the ``needle`` is not preprocessed the second template argument of the :dox:`Pattern` has to be omitted.
The following source illustrates the usage of an index based search in SeqAn using the example of the :dox:`IndexEsa` index (an enhanced suffix array index).
This is the default index specialization if no second template argument for the index is given.
We begin to create an index object of our ``haystack`` ``"tobeornottobe"`` and a ``needle`` ``"be"``.

.. includefrags:: demos/tutorial/pattern_matching/find_index.cpp
   :fragment: initialization

We proceed to create a :dox:`Pattern` of the needle and conduct the search in the usual way.

.. includefrags:: demos/tutorial/pattern_matching/find_index.cpp
   :fragment: output

Instead of creating and using a pattern solely storing the ``needle`` we can pass the needle directly to :dox:`Finder#find`.
Please note that an :dox:`Index` based :dox:`Finder` has to be reset with :dox:`Finder#clear` before conducting another search.

.. includefrags:: demos/tutorial/pattern_matching/find_index.cpp
   :fragment: output_short

Program output:

.. includefrags:: demos/tutorial/pattern_matching/find_index.cpp.stdout


All indices also support :dox:`StringSet` texts and can therefore be used to search multiple ``haystacks`` as the following example shows.
We simply exchange the :dox:`CharString` of the haystack with a :dox:`StringSet` of :dox:`CharString` and append some strings to it.

.. includefrags:: demos/tutorial/pattern_matching/find_index_multiple.cpp
   :fragment: initialization

The rest of the program remains unchanged.

.. includefrags:: demos/tutorial/pattern_matching/find_index_multiple.cpp
   :fragment: output

.. includefrags:: demos/tutorial/pattern_matching/find_index_multiple.cpp.stdout


The following index specializations support the :dox:`Finder` interface as described above.

Specialization :dox:`IndexEsa`
  Enhanced suffix array based index.
  Supports arbitrary needles.

Specialization :dox:`IndexQGram`
  Q-gram index.
  Needle mustn't exceed the size of the q-gram.

Specialization :dox:`OpenAddressingQGramIndex Open Adressing QGram Index`
  Q-gram index with open addressing.
  Supports larger q-grams.
  Needle and q-gram must have the same size.

Besides the :dox:`Finder#find` interface there is another interface for indices using suffix tree iterators to search exact ``needle`` occurrences described in the tutorial :ref:`tutorial-datastructures-indices`.

Assignment 1
^^^^^^^^^^^^

.. container:: assignment

     Type
       Application

     Objective
       Modify the example above to search with a :dox:`OpenAddressingQGramIndex Open Adressing QGram Index` q-gram index for matches of "tobe" in "tobeornottobe".

     Solution
      Click **more...** to see the solution.

      .. container:: foldable

         .. includefrags:: demos/tutorial/pattern_matching/assignment3_solution.cpp

	 We simply add a second template argument to the definition of the :dox:`Index` as described in the documentation of the :dox:`OpenAddressingQGramIndex Open Adressing QGram Index`.
	 As shape we can use an :dox:`UngappedShape` of length 4.

	 Program output:

         .. includefrags:: demos/tutorial/pattern_matching/assignment3_solution.cpp.stdout

Approximate Filtration
----------------------

Currently there are no indices directly supporting an approximate search.
But nevertheless, there are approximate search filters available that can be used to filter out regions of the ``haystack`` that do not contain an approximate match, see :dox:`SwiftFinder` and :dox:`SwiftPattern`.
The regions found by these filters potentially contain a match and must be verified afterwards.
:dox:`Finder#beginPosition`, :dox:`Finder#endPosition` and :dox:`Finder#infix` can be used to return the boundaries or sequence of such a potential match.
For more details on using filters, see the article :ref:`how-to-recipes-filter-similar-sequences`.