File: AlignmentGaps.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (219 lines) | stat: -rw-r--r-- 11,013 bytes parent folder | download | duplicates (9)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
.. sidebar:: ToC

    .. contents::

.. _tutorial-datastructures-alignment-alignment-gaps:

Alignment Representation (Gaps)
=================================

Learning Objective
  This tutorial introduces you to the gaps data structures that can be used to represent an alignment in SeqAn.
  You will learn basic techniques to create and modify such data structures and how to access certain information from these data structures.

Difficulty
  Basic

Duration
  15 min

Prerequisites
  :ref:`tutorial-getting-started-first-steps-in-seqan`, :ref:`tutorial-datastructures-sequences`

------------------------------------------

The :dox:`Align` data structure is simply a set of multiple :dox:`Gaps` data structures.
A Gaps data structure is a container storing gap information for a given source sequence.
The gap information is put on top of the source sequence (coordinates of the gapped sequence refer to the **gap space**) without directly applying them to the source (coordinates of the ungapped sequence refer to the **source space**).
This way operating with gaps sustains very flexible.

Gaps data structures
^^^^^^^^^^^^^^^^^^^^^^^^^

There are two specializations available for the Gaps data structures:
:dox:`ArrayGaps Array Gaps` and :dox:`AnchorGaps Anchor Gaps`.
They differ in the way they implement the gap space.

.. note::
   In general, using :dox:`ArrayGaps Array Gaps` is sufficient for most applications.
   This specialization is also the default one if nothing else is specified.
   It simply uses an array which stores the counts of gaps and characters in an alternating order.
   Thus, it is quite efficient to extend existing gaps while it is more expensive to search within the gapped sequence or insert new gaps.
   Alternatively, one should prefer :dox:`AnchorGaps Anchor Gaps` if many conversions between coordinates of the gap and the source space are needed as binary search can be conducted to search for specific positions.


Constructing an alignment
^^^^^^^^^^^^^^^^^^^^^^^^^

Now, let's start by constructing our first alignment.
Before we can make use of any of the mentioned data structures, we need to tell the program where to find the definitions.
This can be achieved by including the header file ``<seqan/align.h>`` which contains the necessary data structures and functions associated with the alignments.
The next steps would be to implement the main function of our program and to define the types that we want to use.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: main

We first define the type of the input sequences (``TSequence``).
Then we can define the type of our actual Align object we want to use.
In an Align object, the gapped sequences are arranged in rows.
You can use the Metafunction :dox:`Align#Row` to get the correct type of the used Gaps objects.
In the following we use the term ``row`` to explicitly refer to a single gapped sequence in the Align object.
We will use the term ``gapped sequence`` to describe functionalities that are related to the Gaps data structure independent of the Align object.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: typedefs

After defining the types, we can continue to actually construct our own Align object.
Therefore, we need to resize the alignment object in order to reserve space for the sequences we want to add.
In our case, we assume a pairwise alignment, hence we reserve space for 2 sequences.
With the function :dox:`Align#row`, we get access to the gapped sequence at a specific row in the alignment object.
This is similar to the :dox:`RandomAccessContainerConcept#value` function used in :dox:`StringSet String Sets`.
Now, we can assign the source to the corresponding gapped sequence.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: init

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_init

.. note::

   The second string ``CDEFGAHGC`` of the alignment is cropped in the output to
   ``CDEFGA``, such that they are of equal length. Note that the string itself
   is not modified, i.e. not shortened.

After assigning the sources to the gapped sequences, we need to add some gaps to make it look like a real alignment.
You can use the functions :dox:`Gaps#insertGap insertGap()` and :dox:`Gaps#removeGap removeGap()` to insert and delete one gap or :dox:`Gaps#insertGaps insertGaps()` and :dox:`Gaps#removeGaps removeGaps()` to insert and delete multiple gaps in a gapped sequence.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: manipulation

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_manipulation

Congratulations!
You have created your first alignment.
Note that we used a reference declaration ``TRow &`` for the variables ``row1`` and ``row2``.
Without the reference, we would only modify copies of rows and the changes would not effect our ``align`` object.


Gap Space vs. Source Space
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../../../../../dox/images/docs2/gaps_illustration.png

In the next steps, we want to dig a little deeper to get a feeling for the gap space and the source space.
As mentioned above, the gaps are not inserted into the source but put on top of it in a separate space, the gap space.
When inserting gaps, the gap space is modified and all coordinates right of the inserted gap are shifted to the right by the size of the gap.
At the same time, the coordinates of the source remain unchanged.
Using the function :dox:`Gaps#toSourcePosition toSourcePosition()`, we can determine which position of the current gapped sequence (gap space) corresponds to the position in the source space.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: printingSourcePos

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_source_positions

If the position in the gap space is actually a gap, then :dox:`Gaps#toSourcePosition toSourcePosition()` returns the source position of the next character to the right that is not a gap.
Vice versa, we can determine where our current source position maps into the gap space using the function :dox:`Gaps#toViewPosition toViewPosition()`.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: printingViewPos

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_view_positions

In the first alignment, it seems that the end of the second row is cropped off to match the size of the first one.
This effect takes place only in the visualization but is not explicitly applied to the gapped sequence.
The second alignment is the one we manually constructed.
Here, you can see that the second row is expanded to its full size while it matches the size of the first row.
However, it is possible to explicitly crop off the ends of a gapped sequence by using the functions :dox:`Gaps#setClippedBeginPosition setClippedBeginPosition()` and :dox:`Gaps#setClippedEndPosition setClippedEndPosition()`.
These functions shrink the gap space and can be understood as defining an infix of the gapped sequence.
After the clipping, the relative view position changes according to the clipping and so does the mapping of the source positions to the gap space.
The mapping of the view positions to the source space does not change.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: clipping

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_clipping

Here the output of the clipping procedure.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: clipping_positions

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_clipping_positions

.. note::
   It is important to understand the nature of the clipping information.
   It virtually shrinks the gap space not physically.
   That means the information before/after the begin/end of the clipping still exists and the physical gap space remains unchanged.
   To the outer world it seems the alignment is cropped off irreparably.
   But you can expand the alignment again by resetting the clipping information.

Iterating over Gapped Sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the last part of this section, we are going to iterate over a :dox:`Gaps` object.
This can be quite useful if one needs to parse the alignment rows to access position specific information.
First, we have to define the type of the ``Iterator``, which can be easily done by using the metafunction :dox:`ContainerConcept#Iterator`.
Remember that we iterate over an ``TRow`` object.
Then we have to construct the iterators ``it`` which points to the begin of ``row1`` using the :dox:`ContainerConcept#begin begin()` function and ``itEnd`` which points behind the last value of ``row1`` using the :dox:`ContainerConcept#end end()` function.
If you need to refresh the **Iterator Concept** you can read the iterator section :ref:`tutorial-datastructures-sequences-strings-and-segments-iterators`.
While we iterate over the gapped sequence, we can ask if the current value, at which the iterator ``it`` points to, is a gap or not by using the function :dox:`Gaps#isGap isGap()`.
Use :dox:`AlphabetWithGapsConcept#gapValue` to print the correct gap symbol.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: iteratingRowClipped

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_iteratingRowClipped

We will now reset the clipping of ``row1`` using :dox:`Gaps#clearClipping` and iterate again over it to see its effect.

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: iteratingRowClipped2

.. includefrags:: demos/tutorial/alignment/align.cpp
   :fragment: return

.. includefrags:: demos/tutorial/alignment/align.cpp.stdout
   :fragment: output_iteratingRowClipped2

Here you can see how resetting the clipping positions brings back our complete row.

Assignment 1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. container:: assignment

   Type
     Review

   Objective
     Construct an alignment using the Align data structure for the sequences ``"ACGTCACCTC"`` and ``"ACGGGCCTATC"``.
     Insert two gaps at the second position and insert one gap at the fifth position of the first sequence.
     Insert one gap at the ninth position of the second sequence.
     Iterate over the rows of your Align object and print the total count of gaps that exist within the alignment.

   Hints
     .. container :: foldable

        You can use the function :dox:`Gaps#countGaps` to count the number of consecutive gaps starting from the current position of the iterator.

        The resulting alignment should look like:
         .. code-block:: console

            AC--GTC-ACCTC

            ACGGGCCTA--TC

   Solution
       .. container:: foldable

          .. includefrags :: demos/tutorial/alignment/align_assignment1.cpp
             :fragment: solution

          .. includefrags :: demos/tutorial/alignment/align_assignment1.cpp.stdout