File: Finder.dddoc

package info (click to toggle)
seqan 1.4.1%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: jessie, jessie-kfreebsd
  • size: 257,080 kB
  • ctags: 38,576
  • sloc: cpp: 301,711; python: 26,086; sh: 659; xml: 188; awk: 129; makefile: 53
file content (109 lines) | stat: -rw-r--r-- 6,007 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
.Page.Searching:
..XXXcat:Tutorials
..summary:Searching in SeqAn.

.Page.Searching.description:
..contents
..image:searching

..section:# Overview
..text:
This tutorial is about searching a known string (the @glos:needle@) in another string (the @glos:haystack@). 
The searching can be exact or inexact. 
SeqAn also offers methods for searching multiple @glos:needles@ at once.

..text:
In SeqAn, all searching is done by calling the function @Function.find@, which takes at least two arguments:
\br 1: A @Concept.Finder|finder@ that stores all necessary information about the @glos:haystack@ 
and the last found position of the @glos:needle@ within the @glos:haystack@.
\br 2: A @Concept.Pattern|pattern@ that stores all information about the @glos:needle@.
\br Some variants of @Function.find@ support further arguments.

..text:

Each call of @Function.find@ finds only one occurrence of the @glos:needle@ within the @glos:haystack@.
The @Concept.Finder@ can be asked for the @Function.position@ of the last found occurrence.
Subsequent calls of @Function.find@ can be used to find more occurrences of the @glos:needle@, 
until no more occurrences can be found and @Function.find@ returns $false$.

..text:
Search algorithms like @Spec.Horspool|horspool@ or @Spec.ShiftAnd|shift-And@ scan the @glos:haystack@ itself 
in order to find instances of the @glos:needle@.
Many of these algorithms preprocess the @glos:needle@ in some way. 
In this case, the preprocessing data is stored in the @Concept.Pattern|pattern object@.
..text:Example:
..code:
String<char> my_haystack = "Mississippi";
String<char> my_needle = "is";
Finder<String<char> > finder(my_haystack);
Pattern<String<char>, Horspool> pattern(my_needle);
while (find(finder, pattern)) 
{
	::std::cout << position(finder) << ",";
}
..output:1,4

..text:
Other searching methods preprocess the @glos:haystack@ instead of the @glos:needle@.
In this case, the @glos:haystack@ is a @glos:index@.
\br Example: See @Demo.Index Finder@.


..section:# Searching Algorithms
..subsection:#.# Exact String Matching
..text:Exact matching means to find all positions at which the @glos:haystack@ contains the complete @glos:needle@ without errors.
SeqAn offers several algorithms for exact searching for different purposes. 
The following table lists the @glos:specializations@ of @Class.Pattern@ that can be used for exact searching.

..tableheader:Algorithm|Description
..table:@Spec.Horspool@|A common and rather simple algorithm that uses a bad character rule to speed up the search.
Works very nice in practice for most applications, though the worst case running time is $O(n\squared)$.
Not optimal for small alphabets or very long patterns.
..table:@Spec.Bfam@|The Backward Oracle Matching algorithm that uses a special automaton to scan through the @glos:haystack@.
A good choice for long patterns.
..table:@Spec.BndmAlgo@|The Backward Nondeterministic DAWG (Directed Acyclic Word Graph) Matching algorithm uses 
another special automaton to scan through the @glos:haystack@. An alternative to @Spec.Bfam@ for long patterns.
..table:@Spec.ShiftOr@|An algorithm that uses bit parallelism. 
Should only be used for patterns that are not longer than a machine word, i.e. 32 or 64 characters.
Even for small patterns, it is outperformed by @Spec.Horspool@ for alphabets larger than @Spec.Dna@.
..table:@Spec.ShiftAnd@|An alternative to @Spec.ShiftOr@ with similar characteristics.

 
..subsection:#.# Exact Multiple Pattern Matching
..text:The following algorithms can be used to search several @glos:needles@ at once:
..tableheader:Algorithm|Description
..table:@Spec.SetHorspool@|An extension of the @Spec.Horspool@ algorithm that uses a @glos:trie@ to scan backwards through the @glos:haystack@.
..table:@Spec.AhoCorasick@|A multiple pattern search algorithm that uses a @glos:trie@ to scan forwards through the @glos:haystack@.
..table:@Spec.MultipleShiftAnd@|An extension of the @Spec.ShiftAnd@ algorithm for searching multiple @glos:needles@ at once.
Should only be used for patterns that are not longer than a machine word, i.e. 32 or 64 characters.
..XXXtable:@Spec.WuManber@|A string matching algorithm similar to @Spec.Horspool@ for multiple patterns.
Less advisable if the minimum length of the patterns is very small.


..subsection:#.# Approximative Searching
..text:Approximative searching means to find all occurrences of the @glos:needle@ not below a given score limit.
..tableheader:Algorithm|Description
..table:@Spec.DPSearch@|A dynamic programming algorithm for approximate string-matching.
Rather slow but can be applied to many kinds of @glos:scoring scheme@. 
..table:@Shortcut.MyersUkkonen@|A fast bit-parallel approximate string matching algorithm for edit distance.
..text:
The scoring model can be accessed by @Function.scoringScheme@ and @Function.setScoringScheme@,
the limit by @Function.scoreLimit@ and @Function.setScoreLimit@,
and the score of the last found match by @Function.getScore@.
All these functions are applied to the @Class.Pattern@ object that stores all information about the approximative search.
..text:Note: The position stored in the @Class.Finder@ is the end position, not the start position, of the last found match.
..text:Example: See @Demo.Approximate Searching@.
  

..subsection:#.# Wildcard Searching
..text:Wildcard searching algorithms make it possible to determine a more complex @glos:needle@. 
The used syntax depends on the algorithms used. 
A typical wildcard pattern could be "$ab+c$" that matches to "$abc$", "$abbc$", "$abbbc$", \ldots.
..tableheader:Algorithm|Description
..table:@Spec.WildShiftAnd@|An extension of @Spec.ShiftAnd@ supports a subset of regular expressions, namely "$*$", "$+$", "$?$", braces to specify the number of repeated characters, 
and character classes that are specified in square brackets.
..text:Example: See @Demo.Wildcard Searching@.


..section:# Index Searching
..text:For more information about searching with indices, see @Page.Indices|this tutorial@.