File: QgramIndex.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (189 lines) | stat: -rw-r--r-- 11,245 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
.. sidebar:: ToC

    .. contents::

.. _tutorial-datastrucures-indices-q-gram-index:

Q-gram Index
============

Learning Objective
  You will know the features of the q-gram Index, how it can be used for searching and how to access the different fibres.

Difficulty
  Average

Duration
  1 h

Prerequisites
  :ref:`tutorial-datastructures-sequences`

The Q-gram Index
----------------

A q-gram index can be used to efficiently retrieve all occurrences of a certain q-gram in the text.
It consists of various tables, called fibres (see :ref:`how-to-recipes-access-index-fibres`), to retrieve q-gram positions, q-gram counts, etc.
However, it has no support for suffix tree iterators.
A q-gram index must be specialized with a :dox:`Shape` type.
A :dox:`Shape` defines q, the number of characters in a q-gram and possibly gaps between these characters.
There are different specializations of :dox:`Shape` available:

+-----------------------+--------------------+----------------------+
| Specialization        | Modifiable         | Number of Gaps       |
+=======================+====================+======================+
| :dox:`UngappedShape`  | \-                 | 0                    |
+-----------------------+--------------------+----------------------+
| :dox:`SimpleShape`    | \+                 | 0                    |
+-----------------------+--------------------+----------------------+
| :dox:`OneGappedShape` | \+                 | 0/1                  |
+-----------------------+--------------------+----------------------+
| :dox:`GappedShape`    | \-                 | any                  |
+-----------------------+--------------------+----------------------+
| :dox:`GenericShape`   | \+                 | any                  |
+-----------------------+--------------------+----------------------+

* \- *fixed at compile time*, \+ *can be changed at runtime*

Each shape evaluates a gapped or ungapped sequence of q characters to a hash value by the Functions :dox:`Shape#hash`, :dox:`Shape#hashNext`, etc.
For example, the shape ``1101`` represents a 3-gram with one gap of length 1.
This shape overlayed with the :dox:`Dna` text ``"GATTACA"`` at the third position corresponds to ``"TT-C"``.
The function :dox:`Shape#hash` converts this 3-gram into :math:`61 = (\mathbf{3} \cdot 4 + \mathbf{3}) \cdot 4 + 1`.
4 is the alphabet size in this example (see :dox:`FiniteOrderedAlphabetConcept#ValueSize`).

With :dox:`Shape#hash` and :dox:`Shape#hash hashNext`, we can compute the hash values of arbitrary / adjacent q-grams and a loop that outputs the hash values of all overlapping ungapped 3-grams could look as follows:

.. includefrags:: demos/tutorial/q_gram_index/index_qgram_hash.cpp
   :fragment: hash_loop1

Note that the shape not only stores the length and gaps of a q-gram shape but also stores the hash value returned by the last hash/hashNext call.
This hash value can be retrieved by calling :dox:`Shape#value` on the shape.
However, one drawback of the example loop above is that the first hash value must be computed with :dox:`Shape#hash` while the hash values of the following overlapping q-grams can more efficiently be computed by :dox:`Shape#hashNext`.
This complicates the structure of algorithms that need to iterate all hash values, as they have to handle this first hash differently.
As a remedy, the :dox:`Shape#hashInit` function can be used first and then :dox:`Shape#hashNext` on the first and all following text positions in the same way:

.. includefrags:: demos/tutorial/q_gram_index/index_qgram_hash.cpp
   :fragment: hash_loop2

The q-gram index offers different functions to search or count occurrences of q-grams in an indexed text, see :dox:`IndexQGram#getOccurrences`, :dox:`IndexQGram#countOccurrences`.
A q-gram index over a :dox:`StringSet` stores occurrence positions in the same way and in the same fibre (FibreSA) as the ESA index.
If only the number of q-grams per sequence are needed the QGramCounts and QGramCountsDir fibres can be used.
They store pairs ``(seqNo, count)``, ``count``>0, for each q-gram that occurs ``counts`` times in sequence number ``seqNo``.

To efficiently retrieve all occurrence positions or all pairs ``(seqNo, count)`` for a given q-gram, these positions or pairs are stored in contiguous blocks (in QGramSA, QGramCounts fibres), called buckets.
The begin position of bucket i is stored in directory fibres (QGramDir, QGramCountsDir) at position i, the end position is the begin positions of the bucket i+1.
The default implementation of the :dox:`IndexQGram` index maps q-gram hash values 1-to-1 to bucket numbers.
For large q or large alphabets the :dox:`OpenAddressingQGramIndex Open Addressing QGram Index` can be more appropriate as its directories are additionally bound by the text length.
This is realized by a non-trivial mapping from q-gram hashes to bucket numbers that requires an additional fibre (QGramBucketMap).

For more details on q-gram index fibres see :ref:`how-to-recipes-access-index-fibres` or :dox:`QGramIndexFibres QGram Index Fibres`.

Example
-------

We want to construct the q-gram index of the string ``"CATGATTACATA"`` and output the occurrences of the ungapped 3-gram ``"CAT"``.
As 3 is fixed at compile-time and the shape has no gaps we can use an :dox:`UngappedShape` which is the first template argument of :dox:`IndexQGram`, the second template argument of :dox:`Index`.
Next we create the string ``"CATGATTACATA"`` and specialize the first index template argument with the type of this string.
The string can be given to the index constructor.

.. includefrags:: demos/tutorial/q_gram_index/index_qgram.cpp
   :fragment: initialization

To get all occurrences of a q-gram, we first have to hash it with a shape of the same type as the index shape (we can even use the index shape returned by :dox:`IndexQGram#indexShape`).
The hash value returned by :dox:`Shape#hash` or :dox:`Shape#hashNext` is also stored in the shape and is used by the function :dox:`IndexQGram#getOccurrences` to retrieve all occurrences of our 3-gram.

.. includefrags:: demos/tutorial/q_gram_index/index_qgram.cpp
   :fragment: output

Program output:

.. includefrags:: demos/tutorial/q_gram_index/index_qgram.cpp.stdout

Assignment 1
^^^^^^^^^^^^

.. container:: assignment

   Type
     Review

   Objective
     Write a program that outputs all occurrences of the gapped q-gram "AT-A" in "CATGATTACATA".

   Solution
     .. container:: foldable

	Before we can create a :dox:`DnaString` index of "CATGATTACATA", we have to choose an appropriate :dox:`Shape`.
	Because our shape ``1101`` is known at compile-time and contains only one gap we could choose :dox:`OneGappedShape`, :dox:`GappedShape`, or :dox:`GenericShape` (see the commented-out code).
	Although the :dox:`GenericShape` could be used for every possible shape, it is a good idea to choose a :dox:`Shape` with restrictions as its :dox:`Shape#hash` functions are more efficient in general.

	.. includefrags:: demos/tutorial/q_gram_index/index_assignment5.cpp
	   :fragment: initialization

	Please note that the :dox:`Shape` object that corresponds to the :dox:`IndexQGram` index is empty initially and has to be set by :dox:`Shape#stringToShape` or :dox:`Shape#resize`.
	This initialization is not necessary for :dox:`Shape` that are defined at compile-time, i.e. :dox:`UngappedShape` and :dox:`GappedShape`.
	To search for "AT-A" we first have to hash it with the index shape or any other :dox:`Shape` with the same bitmap.
	The we can use :dox:`IndexQGram#getOccurrences` to output all matches.

	.. includefrags:: demos/tutorial/q_gram_index/index_assignment5.cpp
          :fragment: output

	.. tip::

	   Instead of ``length(getOccurrences(...))`` we could have used :dox:`IndexQGram#countOccurrences`.
	   But beware that :dox:`IndexQGram#countOccurrences` requires only the ``QGram_Dir`` fibre, whereas :dox:`IndexQGram#getOccurrences` requires both ``QGram_Dir`` and  ``QGram_SA``, see :ref:`how-to-recipes-access-index-fibres`.
	   Because ``QGram_SA`` can be much more efficiently constructed during the construction of ``QGram_Dir``, ``QGram_Dir`` would be constructed twice.

	Program output:

	.. includefrags:: demos/tutorial/q_gram_index/index_assignment5.cpp.stdout

Assignment 2
^^^^^^^^^^^^

.. container:: assignment

   Type
     Review

   Objective
     Create and output a matrix M where M(i,j) is the number of common ungapped 5-grams between sequence i and sequence j for 3 random :dox:`Dna` sequences, each not longer than 200 characters.
     Optional: Run the matrix calculation twice, once for an :dox:`IndexQGram` and once for an :dox:`OpenAddressingQGramIndex Open Addressing QGram Index` and output the directory sizes (QGram_Dir, QGram_CountsDir fibre).

   Hint
     A common q-gram that occurs :math:`a` times in one and :math:`b` times in the other sequence counts for :math:`\min(a,b)`.

   Solution
     .. container:: foldable

        For generating random numbers we use the `std::mt19937 <https://www.cplusplus.com/reference/random/mt19937/>`_.
        The random numbers returned by the random number engine are arbitrary ``unsigned int`` values which we downscale to values between 0 and 3 and convert into :dox:`Dna` characters.
        The 3 generated strings are of random length and appended to a :dox:`StringSet`.
        The main algorithm is encapsulated in a template function ``qgramCounting`` to easily switch between the two :dox:`IndexQGram` specializations.

        .. includefrags:: demos/tutorial/q_gram_index/index_assignment6.cpp
           :fragment: initialization

        The main function expects the :dox:`StringSet` and the :dox:`Index` specialization as a tag.
        First, we define lots of types we need to iterate and access the fibres directly.
        We then notify the index about the fibres we require.
        For storing the common q-grams we use a 2-dimensional :dox:`Matrix` object whose lengths have to be set with ``setLength`` for each dimension.
        The matrix is initialized with zeros by :dox:`Matrix#resize`.

        .. includefrags:: demos/tutorial/q_gram_index/index_assignment6.cpp
           :fragment: matrix_init

        The main part of the function iterates over the CountsDir fibre.
        Each entry in this directory represents a q-gram bucket, a contiguous interval in the Counts fibre storing for every sequence the q-gram occurs in the number of occurrences in pairs (seqNo,count).
        The interval begin of each bucket is stored in the directory and the interval end is the begin of the next bucket.
        So the inner loops iterate over all non-empty buckets and two pairs (seqNo1,count1) and (seqNo2,count2) indicate that seqNo1 and seqNo2 have a common q-gram.
        At the end the matrix can simply be output by shifting it to the ``cout`` stream.

        .. includefrags:: demos/tutorial/q_gram_index/index_assignment6.cpp
           :fragment: matrix_calculation

        Please note that the :dox:`OpenAddressingQGramIndex open addressing` q-gram index directories are smaller than the :dox:`IndexQGram` index directories.

        Program output:

        .. includefrags:: demos/tutorial/q_gram_index/index_assignment6.cpp.stdout