File: OptimalSearchSchemes.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (72 lines) | stat: -rw-r--r-- 3,431 bytes parent folder | download | duplicates (7)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
.. sidebar:: ToC

    .. contents::

.. _tutorial-algorithms-optimal-search-schemes:

Optimal Search Schemes
======================

Learning Objective
  In this tutorial you will learn how to search a known pattern in a string or :dox:`StringSet` using a bidirectional index.
  The implemented algorithm in SeqAn is based on Optimal Search Schemes from the `FAMOUS paper <https://arxiv.org/abs/1711.02035>`_ which allows for searching for up to 4 errors based on Hamming distance or Edit distance.

Difficulty
  Average

Duration
  15 min

Prerequisites
  :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-datastructures-indices`

Overview
--------

Searching in an index with more than two errors is computationally very expensive and thus trivial approaches such as simple backtracking are not recommended.
Optimal Search Schemes make use of a bidirectional index, split the pattern to be searched into multiple pieces and enumerate it in a more efficient fashion.
To use this algorithm, you can simply call the ``find()`` method.

.. includefrags:: demos/tutorial/indices/find2_index_approx.cpp
      :fragment: SinglePattern

The first argument is the delegate function that will be called if a match of the pattern is found.
It gets an iterator of the bidirectional index and you can choose how to proceed with the hits.
Additionally it passes a reference to the original pattern searched as well as the number of errors for the current hit in the index.

.. includefrags:: demos/tutorial/indices/find2_index_approx.cpp
      :fragment: Delegate

The second argument has to be a bidirectional index, currently SeqAn offers only bidirectional FM indices.
(Please check the corresponding tutorial on :ref:`tutorial-datastructures-indices-fm-index` for more information how FM indices can be configured and constructed.)

The third argument is a :dox:`String` that you want to search.

The number of allowed errors (lower and upper bounds) are passed as template arguments.
Please note that these parameters have to be ``constexpr``.
The distance metric is passed as fourth argument. You can choose between ``HammingDistance()`` and ```EditDistance()`` (also known as Levenshtein distance).

You also have to possibility to pass a :dox:`StringSet` of patterns to search instead of a single :dox:`String`.
The search can then be parallized by passing a fifth parameter tag ``Parallel()`` instead of ``Serial()``.

.. includefrags:: demos/tutorial/indices/find2_index_approx.cpp
      :fragment: MultiplePatterns

.. warning::

      Please be aware that the same hits might be reported multiple times and you might have to filter these duplicated depending on your application, especially with larger errors numbers.

.. note::

      When searching a StringSet in parallel mode the delegate is likely being executed by different threads at the same time.
      You have to take care by yourself that the threads do not interfere, e.g. write not into a variable simultaneously.
      One way of achieving this is by using lock guards.
      The following example buffers all hits in a set (to filter duplicates) and prints them afterwards.

      .. includefrags:: demos/tutorial/indices/find2_index_approx.cpp
            :fragment: ParallelMode

Here is a complete example for searching a string or a stringset (in serial mode):

.. includefrags:: demos/tutorial/indices/find2_index_approx.cpp
      :fragment: Complete