1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241
|
# Implementing your own read mapper with SeqAn {#tutorial_read_mapper}
<b>Learning Objective:</b><br>
In this tutorial you will learn how to combine the components of previous tutorials to create your very first
SeqAn application: a read mapper!
\tutorial_head{High, 90 Minutes, All,}
[TOC]
---
# Introduction
Read mapping is a common task in bioinformatics and is often the first step of an in-depth analysis of
[Next Generation Sequencing](https://en.wikipedia.org/wiki/DNA_sequencing#High-throughput_methods) data. Its aim
is to identify positions where a query sequence (read) matches with up to `e` errors to a reference sequence.
In this example we will implement a read mapper step by step and make use of what we have learned in
the previous tutorials.
As it is common practice with read mappers, we will first create an *indexer* that creates an index from the reference
and stores it to disk. After this, we will implement the actual read mapper that will use the stored index and
map the reads.
# Agenda
* Implementing an indexer
* Parse arguments
* Read input files
* Create and store index
* Implementing a read mapper
* Parse arguments
* Read and load input, search for approximate matches
* Align the search results
* Write final results into a SAM file
# The data
We provide an example [reference](https://ftp.imp.fu-berlin.de/pub/SeqAn/seqan3_read_mapper/reference.fasta)
and an example [query](https://ftp.imp.fu-berlin.de/pub/SeqAn/seqan3_read_mapper/query.fastq) file.
# The indexer
## Step 1 - Parsing arguments
As a first step, we want to parse command line arguments for our indexer. If you get into trouble,
you can take a peek at the \ref tutorial_argument_parser "Argument Parser Tutorial" or the API documentation of
the seqan3::argument_parser for help.
\assignment{Assignment 1: Parsing arguments}
Let's start our application by setting up the argument parser with the following options:
* The path to the reference file
* An output path for the index
Follow the best practice and create:
* A function `run_program` that prints the parsed arguments
* A struct `cmd_arguments` that stores the arguments
* A function `initialise_argument_parser` to add meta data and options to the parser
* A `main` function that parses the arguments and calls `run_program`
Use validators where applicable!
Your `main` may look like this:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_indexer_step1.cpp main
\endhint
\endassignment
\solution
\include doc/tutorial/read_mapper/read_mapper_indexer_step1.cpp
\endsolution
## Step 2 - Reading the input
As a next step, we want to use the parsed file name to read in our reference data. This will be done
using seqan3::sequence_file_input class. As a guide, you can take a look at the
\ref tutorial_sequence_file "Sequence I/O Tutorial".
\assignment{Assignment 2: Reading the input}
Extend your program to store the sequence information contained in the reference file into a struct.
To do this, you should create:
* A struct `reference_storage_t` that stores the sequence information for both reference and query information
within member variables
* A function `read_reference` that fills a `reference_storage_t` object with information from the files and
prints the reference IDs
You should also perform the following changes in `run_program`:
* Construct of an object `storage` of type `reference_storage_t`
* Add a call to `read_reference`
This is the signature of `read_reference`:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_indexer_step2.cpp read_reference
\endhint
This is the `reference_storage_t`:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_indexer_step2.cpp reference_storage_t
\endhint
\endassignment
\solution
\snippet doc/tutorial/read_mapper/read_mapper_indexer_step2.cpp solution
Here is the complete program:
\hint
\include doc/tutorial/read_mapper/read_mapper_indexer_step2.cpp
\endhint
\endsolution
## Step 3 - Index
Now that we have the necessary sequence information, we can create an index and store it. Read up on the
\ref tutorial_index_search "Index Tutorial" if you have any questions.
\assignment{Assignment 3: Index}
We want to create a new function `create_index`:
* It takes `index_path` and `storage` as parameters
* Creates a bi_fm_index
* Stores the bi_fm_index
We also need to change:
* `run_program` to now also call `create_index`
* `run_program` and `read_reference` to not print the debug output anymore
This is the signature of `create_index`:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_indexer_step3.cpp create_index
\endhint
\endassignment
\solution
\snippet doc/tutorial/read_mapper/read_mapper_indexer_step3.cpp solution
Here is the complete program:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_indexer_step3.cpp complete
\endhint
\endsolution
# The read mapper
## Step 1 - Parsing arguments
Again, we want to parse command line arguments for our read mapper as a first step. If you get into trouble,
you can take a peek at the [Argumet Parser Tutorial](#tutorial_argument_parser) or the API documentation of
the seqan3::argument_parser for help.
\assignment{Assignment 4: Parsing arguments}
Let's start our application by setting up the argument parser with the following options:
* The path to the reference file
* The path to the query file
* The path to the index file
* An output path
* The maximum number of errors we want to allow (between 0 and 4)
Follow the best practice and create:
* A function `run_program` that prints the parsed arguments
* A struct `cmd_arguments` that stores the arguments
* A function `initialise_argument_parser` to add meta data and options to the parser
* A `main` function that parses the arguments and calls `run_program`
Use validators where applicable!
Your `main` may look like this:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_step1.cpp main
\endhint
\endassignment
\solution
\include doc/tutorial/read_mapper/read_mapper_step1.cpp
\endsolution
## Step 2 - Reading the input and searching
We also want to read the reference in the read mapper. This is done the same way as for the indexer.
We can now load the index and conduct a search. Read up on the \ref tutorial_index_search "Search Tutorial"
if you have any questions.
\assignment{Assignment 5: Reading the input}
Extend your program to read the reference file the same way the indexer does.
After this you can load the index and print results of a search.
To do this, you should:
* Carry over the `read_reference` function and the `reference_storage_t` struct from the indexer
* Create a function `map_reads` that loads the index and prints the results of the search (allowing all error types)
for the first 20 queries
You should also perform the following changes in `run_program`:
* Remove the debug output
* Construct an object `storage` of type `reference_storage_t`
* Add a call to `read_reference` and `map_reads`
This is the signature of `map_reads`:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_step2.cpp map_reads
\endhint
\endassignment
\solution
\snippet doc/tutorial/read_mapper/read_mapper_step2.cpp solution
Here is the complete program:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_step2.cpp complete
\endhint
\endsolution
## Step 3 - Alignment
We can now use the obtained positions to align each query against the reference. Refer to the
\ref tutorial_pairwise_alignment "Alignment Tutorial" if you have any questions.
\assignment{Assignment 6: Alignment}
We want to extend `map_reads` to:
* Use the output of the search to align the query against the reference
* Print the query ID, alignment score, subrange of the reference sequence and the query (for the first 20 queries)
This is the alignment config:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_step3.cpp alignment_config
\endhint
\endassignment
\solution
\snippet doc/tutorial/read_mapper/read_mapper_step3.cpp solution
Here is the complete program:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_step3.cpp complete
\endhint
\endsolution
## Step 4 - Alignment output
Finally, we can write our results into a SAM file.
\assignment{Assignment 7: SAM out}
We further need to extend `map_reads` to write the alignment results into a SAM file.
Additionally, there should be no more debug output.
Try to write all available information into the SAM file. We can introduce a naive mapping quality by using
`mapping quality = 60 + alignment score`.
This is the alignment_file_output construction:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_step4.cpp alignment_file_output
\endhint
\endassignment
\solution
\snippet doc/tutorial/read_mapper/read_mapper_step4.cpp solution
Here is the complete program:
\hint
\snippet doc/tutorial/read_mapper/read_mapper_step4.cpp complete
\endhint
\endsolution
|