File: SimpleRnaSeq.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (437 lines) | stat: -rw-r--r-- 20,919 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
.. sidebar:: ToC

    .. contents::

.. _how-to-use-cases-simple-rna-seq:

Simple RNA-Seq
==============

Learning Objective
 You will learn how to write a simple gene quantification tool based on RNA-Seq data.

Difficulty
  Hard

Duration
  2h

Prerequisites
  :ref:`tutorial-datastructures-store-genome-annotations`, :ref:`tutorial-datastructures-store-fragment-store`, experience with OpenMP (optional)

RNA-Seq refers to high-throughput sequencing of cDNA in order to get information about the RNA molecules available in a sample.
Knowing the sequence and abundance of mRNA allows to determine the (differential) expression of genes, to detect alternative splicing variants, or to annotate yet unknown genes.

In the following tutorial you will develop a simple gene quantification tool.
It will load a file containing gene annotations and a file with RNA-Seq read alignments, compute abundances, and output RPKM values for each expressed gene.

Albeit its simplicity, this example can be seen as a starting point for more complex applications, e.g. to extend the tool from quantification of genes to the quantification of (alternatively spliced) isoforms, or to de-novo detect yet unannotated isoforms/genes.

You will learn how to use the :dox:`FragmentStore` to access gene annotations and alignments and how to use the :dox:`IntervalTree` to efficiently determine which genes overlap a read alignment.

Introduction to the used Data Structures
----------------------------------------

This section introduces the :dox:`FragmentStore` and the :dox:`IntervalTree`, which are the fundamental data structures used in this tutorial to represent annotations and read alignments and to efficiently find overlaps between them.
You may skip one or both subsections if you are already familiar with one or both data structures.

Fragment Store
^^^^^^^^^^^^^^

The :dox:`FragmentStore` is a data structure specifically designed for read mapping, genome assembly or gene annotation.
These tasks typically require lots of data structures that are related to each other such as

* reads, mate-pairs, reference genome
* pairwise alignments, and
* genome annotation.

The Fragment Store subsumes all these data structures in an easy to use interface.
It represents a multiple alignment of millions of reads or mate-pairs against a reference genome consisting of multiple contigs.
Additionally, regions of the reference genome can be annotated with features like 'gene', 'mRNA', 'exon', 'intron' (see :dox:`FragmentStore::PredefinedAnnotationTypes`) or custom features.
The Fragment Store supports I/O functionality to read/write a read alignment in `SAM <https://samtools.sourceforge.net/>`_ or `AMOS <https://amos.sourceforge.net/wiki/index.php/AMOS>`_ format and to read/write annotations in `GFF <https://genome.ucsc.edu/FAQ/FAQformat.html#format3>`_ or `GTF <https://genome.ucsc.edu/FAQ/FAQformat.html#format4>`_ format.

The Fragment Store can be compared with a database where each table (called "store") is implemented as a :dox:`String` member of the :dox:`FragmentStore` class.
The rows of each table (implemented as structs) are referred by their ids which are their positions in the string and not stored explicitly.
The only exception is the alignedReadStore whose elements of type :dox:`AlignedReadStoreElement` contain an id-member as they may be rearranged in arbitrary order, e.g. by increasing genomic positions or by readId.
Many stores have an associated name store to store element names.
Each name store is a :dox:`StringSet` that stores the element name at the position of its id.
All stores are present in the Fragment Store and empty if unused.
The concrete types, e.g. the position types or read/contig alphabet, can be easily changed by defining a custom config struct which is a template parameter of the Fragment Store class.

Annotation Tree
^^^^^^^^^^^^^^^

Annotations are represented as a tree that at least contains a root node where all annotations are children or grandchildren of.
A typical annotation tree looks as follows:

.. figure:: ../../DataStructures/Store/AnnotationTree.png

   Annotation tree example


In the Fragment Store the tree is represented by :dox:`FragmentStore::annotationStore`, :dox:`FragmentStore::annotationTypeStore`, :dox:`FragmentStore::annotationKeyStore`, and others.
Instead of accessing these tables directly, the :dox:`AnnotationTreeIterator AnnotationTree Iterator` provides a high-level interface to traverse and access the annotation tree.

Interval Tree
^^^^^^^^^^^^^

The :dox:`IntervalTree` is a data structure that stores one-dimensional intervals in a balanced tree and efficiently answers `range queries <https://en.wikipedia.org/wiki/Range_query>`_.
A range query is an operation that returns all tree intervals that overlap a given query point or interval.

The interval tree implementation provided in SeqAn is based on a :dox:`Tree` which is balanced if all intervals are given at construction time.
Interval tree nodes are objects of the :dox:`IntervalAndCargo` class and consist of 2 interval boundaries and additional user-defined information, called cargo.
To construct the tree on a set of given interval nodes use the function :dox:`IntervalTree#createIntervalTree`.
The functions :dox:`IntervalTree#addInterval` and :dox:`IntervalTree#removeInterval` should only be used if the interval tree needs to be changed dynamically (as they not yet balance the tree).

Import Alignments and Gene Annotations from File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

At first, our application should create an empty ``FragmentStore`` object into which we import a gene annotation file and a file with RNA-Seq alignments.
An empty ``FragmentStore`` can simply be created with:

.. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
      :fragment: store

Files can be read from disk with the function :dox:`File#read` that expects an open stream (e.g. a STL `ifstream <https://www.cplusplus.com/reference/iostream/ifstream>`_), a ``FragmentStore`` object, and a :dox:`FileFormats File Format` tag.
The contents of different files can be loaded with subsequent calls of ``read``.
As we want the user to specify the files via command line, our application will parse them using the :dox:`ArgumentParser` and store them in an option object.

In your first assignment you need to complete a given code template and implement a function that loads a SAM file and a GTF file into the ``FragmentStore``.

Assignment 1
""""""""""""

.. container:: assignment

   Type
     Application

   Objective
     Use the code template below (click **more...**) and implement the function ``loadFiles`` to load the annotation and alignment files.
     Use the file paths given in the options object and report an error if the files could not be opened.

     .. container:: foldable

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment1.cpp

   Hint
     * Open STL `std::fstream <https://www.cplusplus.com/reference/iostream/ifstream>`_ objects and use the function :dox:`File#read` with a SAM or GTF tag.
     * `ifstream::open <https://www.cplusplus.com/reference/iostream/ifstream/open>`_ requires the file path to be given as a C-style string (``const char *``).
     * Use `string::c_str <https://www.cplusplus.com/reference/string/string/c_str>`_ to convert the option strings into C-style strings.
     * The function :dox:`File#read` expects a stream, a :dox:`FragmentStore` and a tag, i.e. ``Sam()`` or ``Gtf()``.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_solution1.cpp
           :fragment: solution

Extract Gene Intervals
^^^^^^^^^^^^^^^^^^^^^^

Now that the Fragment Store contains the whole annotation tree, we want to traverse the genes and extract the genomic ranges they span.
In the annotation tree, genes are (the only) children of the root node.
To efficiently retrieve the genes that overlap read alignments later, we want to use interval trees, one for each contig.
To construct an interval tree, we first need to collect :dox:`IntervalAndCargo` objects in a string and pass them to :dox:`IntervalTree#createIntervalTree`.
See the interval tree demo in ``demos/interval_tree.cpp`` for more details.
As cargo we use the gene's annotation id to later retrieve all gene specific information.
The strings of ``IntervalAndCargo`` objects should be grouped by ``contigId`` and stored in an (outer) string of strings.
For the sake of simplicity we don't differ between genes on the forward or reverse strand and instead always consider the corresponding intervals on the forward strand.

To define this string of strings of ``IntervalAndCargo`` objects, we first need to determine the types used to represent an annotation.
All annotations are stored in the :dox:`FragmentStore::annotationStore` which is a Fragment Store member and whose type is :dox:`FragmentStore::TAnnotationStore`.
The value type of the annotation store is the class :dox:`AnnotationStoreElement`.
Its member typedefs :dox:`AnnotationStoreElement::TPos` and :dox:`AnnotationStoreElement::TId` define the types it uses to represent a genomic position or the annotation or contig id:

.. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
      :fragment: typedefs

The string of strings of intervals can now be defined as:

.. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
      :fragment: interval

In your second assignment you should use an :dox:`AnnotationTreeIterator AnnotationTree Iterator` to traverse all genes in the annotation tree.
For each gene, determine its genomic range (projected to the forward strand) and add a new ``TInterval`` object to the ``intervals[contigId]`` string, where ``contigId`` is the id of the contig containing that gene.

Assignment 2
""""""""""""

.. container:: assignment

   Type
     Application

   Objective
     Use the code template below (click **more..**).
     Implement the function ``extractGeneIntervals`` that should extract genes from the annotation tree (see :dox:`AnnotationTreeIterator AnnotationTree Iterator`) and create strings of :dox:`IntervalAndCargo` objects - one for each config - that contains the interval on the forward contig strand and the gene's annotation id.

     .. container:: foldable

        Extend the definitions:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment2.cpp
           :fragment: definitions

        Add a function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment2.cpp
           :fragment: yourcode

        Extend the ``main`` function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment2.cpp
           :fragment: main

        and

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment2.cpp
           :fragment: main2

   Hint
     .. container:: foldable

        You can assume that all genes are children of the root node, i.e. create an :dox:`AnnotationTreeIterator AnnotationTree Iterator`, :dox:`AnnotationTreeIterator#goDown go down` to the first gene and :dox:`AnnotationTreeIterator#goRight go right` to visit all other genes.
        Use :dox:`AnnotationTreeIterator#getAnnotation` to access the gene annotation and :dox:`IteratorAssociatedTypesConcept#value` to get the annotation id.

     .. container:: foldable

        Make sure that you append :dox:`IntervalAndCargo` objects, where ``i1`` < ``i2`` holds, as opposed to annotations where ``beginPos`` > ``endPos`` is possible.
        Remember to ensure that ``intervals`` is of appropriate size, e.g. with

        .. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
              :fragment: resize

        Use :dox:`StringConcept#appendValue` to add a new ``TInverval`` object to the inner string, see :dox:`IntervalAndCargo::IntervalAndCargo IntervalAndCargo constructor` for the constructor.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_solution2.cpp
           :fragment: solution

Construct Interval Trees
^^^^^^^^^^^^^^^^^^^^^^^^

With the strings of gene intervals - one for each contig - we now can construct interval trees.
Therefore, we specialize an :dox:`IntervalTree` with the same position and cargo types as used for the :dox:`IntervalAndCargo` objects.
As we need an interval tree for each contig, we instantiate a string of interval trees:

.. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
      :fragment: tree

Your third assignment is to implement a function that constructs the interval trees for all contigs given the string of interval strings.

Assignment 3
""""""""""""

.. container:: assignment

   Type
     Application

   Objective
     Use the code template below (click **more...**).
     Implement the function ``constructIntervalTrees`` that uses the interval strings to construct for each contig an interval tree.
     **Optional:** Use OpenMP to parallelize the construction over the contigs, see :dox:`SEQAN_OMP_PRAGMA`.

     .. container:: foldable


        Extend the definitions:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment3.cpp
           :fragment: definitions

        Add a function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment3.cpp
           :fragment: yourcode

        Extend the ``main`` function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment3.cpp
           :fragment: main

        and

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment3.cpp
           :fragment: main2

   Hint
     .. container:: foldable

        First, resize the string of interval trees accordingly:

        .. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
              :fragment: resize_tree

   Hint
     .. container:: foldable

        Use the function :dox:`IntervalTree#createIntervalTree`.

        **Optional:** Construct the trees in parallel over all contigs with an OpenMP parallel for-loop, see `here <https://developers.sun.com/solaris/articles/openmp.html>`_ for more information about OpenMP.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_solution3.cpp
           :fragment: solution

Compute Gene Coverage
^^^^^^^^^^^^^^^^^^^^^

To determine gene expression levels, we first need to compute the read coverage, i.e. the total number of reads overlapping a gene.
Therefore we use a string of counters addressed by the annotation id.

.. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
      :fragment: reads

For each read alignment we want to determine the overlapping genes by conducting a range query via :dox:`IntervalTree#findIntervals` and then increment their counters by 1.
To address the counter of a gene, we use its annotation id stored as cargo in the interval tree.

Read alignments are stored in the :dox:`FragmentStore::alignedReadStore`, a string of :dox:`AlignedReadStoreElement AlignedReadStoreElements` objects.
Their actual type can simply be determined as follows:

.. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
      :fragment: read_alignment_type

Given the :dox:`AlignedReadStoreElement::contigId`, :dox:`AlignedReadStoreElement::beginPos`, and :dox:`AlignedReadStoreElement::endPos` we will retrieve the annotation ids of overlapping genes from the corresponding interval tree.

Your fourth assignment is to implement the count function that performs all the above described steps.
Optionally, use OpenMP to parallelize the counting.

Assignment 4
""""""""""""

.. container:: assignment

   Type
     Application

   Objective
     Use the code template below (click **more...**).
     Implement the function ``countReadsPerGene`` that counts for each gene the number of overlapping reads.
     Therefore determine for each :dox:`AlignedReadStoreElement` begin and end positions (on forward strand) of the alignment and increment the ``readsPerGene`` counter for each overlapping gene.

     **Optional:** Use OpenMP to parallelize the function, see :dox:`SEQAN_OMP_PRAGMA`.

     .. container:: foldable

        Extend the definitions:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment4.cpp
           :fragment: definitions

        Add a function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment4.cpp
           :fragment: yourcode

        Extend the ``main`` function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment4.cpp
           :fragment: main

        and

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment4.cpp
           :fragment: main2

   Hint
     .. container:: foldable
        First, resize and zero the string of counters accordingly:

        .. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
              :fragment: resize_reads

        Make sure that you search with :dox:`IntervalTree#findIntervals` where ``query_begin < query_end`` holds, as opposed to read alignments where ``beginPos`` > ``endPos`` is possible.

   Hint
     .. container:: foldable

        The result of a range query is a string of annotation ids given to :dox:`IntervalTree#findIntervals` by-reference:

        .. includefrags:: demos/tutorial/simple_rna_seq/base.cpp
              :fragment: result

        Reuse the result string for multiple queries (of the same thread, use ``private(result)`` for OpenMP).

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_solution4.cpp
           :fragment: solution


Output RPKM Values
^^^^^^^^^^^^^^^^^^

In the final step, we want to output the gene expression levels in a normalized measure.
We therefore use **RPKM** values, i.e. the number of **r**\ eads **p**\ er **k**\ ilobase of exon model per **m**\ illion mapped reads (1).
One advantage of RPKM values is their independence of the sequencing throughput (normalized by total mapped reads), and that they allow to compare the expression of short with long transcripts (normalized by exon length).

The exon length of an mRNA is the sum of lengths of all its exons.
As a gene may have multiple mRNA, we will simply use the maximum of all their exon lengths.

Your final assignment is to output the RPKM value for genes with a read counter ``> 0``.
To compute the exon length of the gene (maximal exon length of all mRNA) use an :dox:`AnnotationTreeIterator AnnotationTree Iterator` and iterate over all mRNA (children of the gene) and all exons (children of mRNA).
For the number of total mapped reads simply use the number of alignments in the :dox:`FragmentStore::alignedReadStore`.
Output the gene names and their RPKM values separated by tabs as follows:

.. includefrags:: demos/tutorial/simple_rna_seq/genequant_solution5.cpp.stdout


.. todo: Move the files to somewhere else.

Download and decompress the attached mouse annotation (`Mus_musculus.NCBIM37.61.gtf.zip <https://ftp.seqan.de/manual_files/seqan-1.4/Mus_musculus.NCBIM37.61.gtf.zip>`_ and the alignment file of RNA-Seq reads aligned to chromosome Y (`sim40mio_onlyY.sam.zip <https://ftp.seqan.de/manual_files/seqan-1.4/sim40mio_onlyY.sam.zip>`_).
Test your program and compare your output with the output above.

Assignment 5
""""""""""""

.. container:: assignment

   Type
     Application

   Objective
     Use the code template below (click **more...**).
     Implement the function ``outputGeneCoverage`` that outputs for each expressed gene the gene name and the expression level as RPKM as tab-separated values.

     .. container:: foldable

        Add a function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment5.cpp
           :fragment: yourcode

        Extend the ``main`` function:

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_assignment5.cpp
           :fragment: main

   Hint
     .. container:: foldable

        To compute the maximal exon length use three nested loops: (1) enumerate all genes, (2) enumerate all mRNA of the gene, and (3) enumerate all exons of the mRNA and sum up their lengths.

   Hint
     .. container:: foldable

        Remember that exons are not the only children of mRNA.

   Solution
     .. container:: foldable

        .. includefrags:: demos/tutorial/simple_rna_seq/genequant_solution5.cpp
           :fragment: solution

Next Steps
----------

* See :cite:`Mortazavi2008` for further reading.
* Read the :ref:`tutorial-io-sam-bam-io` Tutorial and change your program to stream a SAM file instead of loading it as a whole.
* Change the program such that it attaches the RPKM value as a key-value pair (see :dox:`AnnotationTreeIterator#assignValueByKey`) to the annotation of each gene and output a GFF file.
* Continue with the :ref:`tutorial` rest of the tutorials]].