1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299
|
.. sidebar:: ToC
.. contents::
.. _how-to-use-cases-simple-genome-alignment:
Simple Genome Alignment
=======================
Learning Objective
You will learn how to write a simple genome aligner for large-scale sequences.
Difficulty
Medium
Duration
4h
Prerequisites
:ref:`tutorial-getting-started-parsing-command-line-arguments`, :ref:`tutorial-algorithms-alignment-pairwise-sequence-alignment`, :ref:`tutorial-algorithms-seed-extension`, :ref:`tutorial-datastrucures-indices-q-gram-index`, :ref:`tutorial-io-sequence-io`
Introduction
""""""""""""
Pairwise sequence alignment is the tool of choice if one wants to compare two biological sequences.
However, for genome sized sequences the standard approach is to inefficient as it requires quadratic time and space to compute an alignment.
Alternatively, one could compute only a small band but it is unclear whether evolutionary events such as insertions or deletions would shift the band away from the main diagonal.
Thus the alignment would not reflect the actual alignment of the sequences.
To cope with this issue different methods have been developed that can efficiently deal with large-scale sequences.
One of these applications was the LAGAN (Limited Area Global Alignment of Nucleotides) [1.].
It is an iterative algorithm with the following three steps.
#. Generate local alignments with CHAOS chaining [2.].
#. Construct a rough global map
#. Compute final alignment along the global map
The CHAOS chaining finds local alignments depending on three parameters: the seed size (of the q-grams), the distance and the gap parameter.
The distance parameter limits the distance between two endpoints of two neighbouring seeds along the diagonal.
The band limits the shift of two seeds in vertical and horizontal direction of the matrix.
In order to find a good tradeoff between speed and sensitivity the algorithm performs step 1 and 2 recursively until two neighbouring anchors are less than a threshold apart.
In the first iteration the parameters are set more restrictive (large seed size, smaller distance and gap size.)
For every gap that is bigger than the given threshold, the parameters are set more permissive (smaller seed size and larger distance and gap sizes).
Finally, after the global map has been constructed, the algorithm performs a limited alignment around the anchors and connects the anchors with a standard alignment algorithm.
Main method and ArgumentParser
""""""""""""""""""""""""""""""
In this tutorial we want to implement a simplified version of the LAGAN approach using containing only of the first iteration.
At first we will look at the basic main method and setup the argument parser.
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: initial_main
We create a ``LaganOption`` class where we store the arguments passed to our tool:
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: lagan_option
After that, we parse given command line arguments using SeqAn's :dox:`ArgumentParser`.
Firstly, we include the source code for the argument parser by adding the following include directory:
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: include_arg_parse
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: parse_arguments
.. hint::
If you want to learn more about parsing arguments with SeqAn read the :ref:`tutorial-getting-started-parsing-command-line-arguments` tutorial.
With this, we have set up our initial tool. Let's start implementing the algorithm.
To do so, we need to first load the sequences using the class :dox:`SeqFileIn`.
We can get access to the data structures and methods by including the following modules:
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: include_seq_io
Assignment 1
^^^^^^^^^^^^
.. container:: assignment
Type
Application
Objective
Use the code template below (click **more...**) and implement the function ``loadSequence`` to load a single sequence file from the specified path.
Use the file paths given in the options object and report an error if the files could not be opened.
.. container:: foldable
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: load_sequence_template
Hint
* :dox:`SeqFileIn` constructor accepts a c-style string.
* Use `string::c_str <https://www.cplusplus.com/reference/string/string/c_str>`_ to convert the option strings into C-style strings.
* The function :dox:`SeqFileIn#readRecord` expects the input file, a sequence, e.g. :dox:`Dna5String` and an id, e.g. :dox:`CharString`.
Solution
.. container:: foldable
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: load_sequence_solution
Finally, we can update our main method and use our ``loadSequence`` function to load sequence 1 ...
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: load_seq_1
and sequence 2 ...
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: load_seq_2
Filtering seeds
"""""""""""""""
After we read the sequences from the command line it is time to write our actual algorithm.
A naive algorithm would scan for every q-gram of sequence 2 the complete sequence 1 to find all possible positions.
But this approach of course is to slow for large-scale sequences and we need a better strategy.
SeqAn provides for this kind of task indexes which can be queried efficiently to find all occurrences of a pattern in an indexed text.
.. hint::
There are several index implementations and we recommend to read :ref:`tutorial-datastructures-indices` to learn more about the available index data structures.
In this tutorial we are going to use a :dox:`IndexQGram` (see :ref:`tutorial-datastrucures-indices-q-gram-index` for more information), which can be included with the following module:
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: include_index
This index type will create a directory with all distinct q-grams and stores the positions of the indexed sequence, where a specific q-gram occurs.
It therefor will generate a suffix array, which is sorted by the first ```q`` symbols of every suffix.
The following line declares our q-gram index type:
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: declare_index
The ``Index`` type is a template class which requires two type template parameters.
The first type template parameter names the sequence type that this index is constructed for.
In our case this will be a `Dna5String`.
The second type template parameter is a tag or also known a policy, that defines the type of index to use.
In our case we use the :dox:`IndexQGram` policy, which itself can be further specialized through two type template parameters.
We need to select the policy used for the q-gram shape and the policy for managing the q-grams.
In our example we will need a :dox:`SimpleShape`, which is a variable length ungapped shape.
Thus, we are able to change the size of the q-gram at runtime.
Note, there also constant length shapes, whose sizes are fixed at compile time.
And as a storage policy we use :dox:`OpenAdressingTags#OpenAdressing`, which allows us to use longer values for our q-gram.
Alternatively, we could leave the parameter unspecified and would therefor enable the default behavior which is direct addressing.
Direct addressing, however, will create a lookup table for every possible q-mer (:math:`\Sigma^{q}`), which can become quite large for small q already.
In the next step we are going to initialize the index.
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: init_index
First the index is constructed with the sequence it should be created for.
Note, that this will not yet create the index.
The creation will be triggered in a lazy manner, which means it will be first created when an access to the index is requested.
Before the index can be created we need to give the index the size of the q-gram shape.
This is done in the second line of the above snippet.
The method :dox:`IndexQGram#indexShape` returns a reference to the shape stored within the index.
We resize this shape to the requested length.
The last line initializes the index shape with the first q-gram of the second sequence.
Assignment 2
^^^^^^^^^^^^
.. container:: assignment
Type
Application
Objective
Write a loop over the second sequence and write the number of occurrences per q-gram to the console.
Hint
* Use the function :dox:`Shape#hashNext` to update the shape for the current q-gram.
* Use the function :dox:`IndexQGram#getOccurrences` to get a list of hits.
Solution
.. container:: foldable
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: solution_assignment2
Local chaining
""""""""""""""
Now we can stream over the second sequence and can extract all locations of a given q-gram in the indexed sequence.
To implement the second step of the LAGAN algorithm, we need to apply local chaining to the filtered q-grams and extend them to longer anchors.
SeqAn offers a data structure called :dox:`SeedSet` for this.
The following snippet shows the declaration of a ``SeedSet``:
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: include_seeds
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: declare_seed_set
The first line declares the type of the seed we want to use.
We can define the chaining policy as type template parameter.
In our case we use the :dox:`ChainedSeed` policy, which enables us to locally chain the seeds.
In addition we define a :dox:`SeedSet` with our ``ChainedSeed`` as type template parameter.
Now we create an instance of the seed set and of a scoring scheme, which we will need to score the local chain.
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: init_seed_set
Assignment 3
^^^^^^^^^^^^
.. container:: assignment
Type
Application
Objective
Update the loop from assignment 2 and fill the previously created seed set.
Use the CHAOS chaining method to chain seeds locally, using the scoring scheme, the gap and distance criteria
and the current position of the matching q-grams.
Hint
* The method :dox:`SeedSet#addSeed` has different overloads for various chaining policies.
* If the seed could not be combined to any other in the set it must be added as a single seed to set.
Solution
.. container:: foldable
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: solution_seeding
Global chaining
"""""""""""""""
After scanning the second sequence and filling the seed set the highest scoring global chain must be computed,
which gives a map of good matching anchors.
Assignment 4
^^^^^^^^^^^^
.. container:: assignment
Type
Application
Objective
Read the documentation to :dox:`chainSeedsGlobally` and build a global chain over
the local anchors stored in the current seed set.
Solution
.. container:: foldable
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: chain_seeds
Final alignment
"""""""""""""""
In the original algorithm, the steps from above would be repeated for the gaps between the anchors selected by the global chaining algorithm.
In this tutorial we skip the iterative step and directly compute the final alignment along the global map produced by the chaining algorithm.
SeqAn already offers an alignment function for filling the gaps and connecting them with the anchors, which is available in the ``seeds`` module.
Assignment 5
^^^^^^^^^^^^
.. container:: assignment
Type
Application
Objective
Compute an alignment around the global anchors identified by the chaining algorithm.
Hint
* Use the :dox:`bandedChainAlignment` function to compute the alignment.
Solution
.. container:: foldable
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: build_alignment
Finally we can output the alignment in the specified output file.
.. includefrags:: demos/tutorial/simple_genome_alignment/lagan.cpp
:fragment: output_alignment
Congratulation, you wrote a simple genome aligner!!!
You can use the code to add iterative steps to make it more sensitive for large indels.
References
""""""""""
#. Brudno M. and Morgenstern, B., 2002. Fast and sensitive alignment of large genomic sequences. In Proceeding of the IEEE Computer Society Bioinformatics Conference (CSB).
#. Brudno, M., Do, C. B., Cooper, G. M., Kim, M. F., Davydov, E., Green, E. D., Sidow, A., Batzoglou, S., and and NISC Comparative Sequencing Program. (2003). LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome research, 13(4), 721-731.
|