File: BlastIO.rst

package info (click to toggle)
seqan2 2.5.2-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 228,748 kB
  • sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (554 lines) | stat: -rw-r--r-- 22,550 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
.. sidebar:: ToC

    .. contents::

.. _tutorial-io-blast-io:

Blast I/O
=========

Learning Objective
  In this tutorial, you will learn about different the Blast file formats and how to interact with them in SeqAn.

Difficulty
  Average

Duration
  1h30min - 2h30min

Prerequisite Tutorials
  :ref:`tutorial-datastructures-sequences`, :ref:`tutorial-io-input-output-overview`, :ref:`tutorial-datastructures-alignment`, :ref:`tutorial-algorithms-alignment-pairwise-sequence-alignment`

Other recommended reading
  `Basics of Blast Statistics <https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html>`_

Technical requirements
  Full C++11 support required in compiler
  (GCC :math:`\ge` 4.9, Clang :math:`\ge` 3.4 or MSVC :math:`\ge` 2015)


The Basic local alignment search tool (BLAST) by the NCBI is one of the most widely used tools in Bioinformatics.
It supports a variety of formats which, although widely used, are not standardized and partly even declared as
"subject to unannounced and undocumented change". This makes implementing them very hard and is one of the reasons
dealing with Blast IO is more difficult than with other file formats.

Furthermore it is important to distinguish between the Blast version written in C (known by its `blastall` executable)
and the C++ version (individual `blastn`, `blastp`... executables), also known as BLAST+.
The formats and their identifiers changed between versions.
The following table gives an overview:

+---------------------------------------------------+------------------+-------------+
| Description                                       | `blastall`       |  `blast*`   |
+===================================================+==================+=============+
| pairwise                                          |  -m 0            |  -outfmt 0  |
+---------------------------------------------------+------------------+-------------+
| query-anchored showing identities                 |  -m 1            |  -outfmt 1  |
+---------------------------------------------------+------------------+-------------+
| query-anchored no identities                      |  -m 2            |  -outfmt 2  |
+---------------------------------------------------+------------------+-------------+
| flat query-anchored, show identities              |  -m 3            |  -outfmt 3  |
+---------------------------------------------------+------------------+-------------+
| flat query-anchored, no identities                |  -m 4            |  -outfmt 4  |
+---------------------------------------------------+------------------+-------------+
| query-anchored no identities and blunt ends       |  -m 5            |             |
+---------------------------------------------------+------------------+-------------+
| flat query-anchored, no identities and blunt ends |  -m 6            |             |
+---------------------------------------------------+------------------+-------------+
| XML Blast output                                  |  -m 7            |  -outfmt 5  |
+---------------------------------------------------+------------------+-------------+
| tabular                                           |  -m 8            |  -outfmt 6  |
+---------------------------------------------------+------------------+-------------+
| tabular with comment lines                        |  -m 9            |  -outfmt 7  |
+---------------------------------------------------+------------------+-------------+
| Text ASN.1                                        |  -m 10           |  -outfmt 8  |
+---------------------------------------------------+------------------+-------------+
| Binary ASN.1                                      |  -m 11           |  -outfmt 9  |
+---------------------------------------------------+------------------+-------------+
| Comma-separated values                            |                  |  -outfmt 10 |
+---------------------------------------------------+------------------+-------------+
| BLAST archive format (ASN.1)                      |                  |  -outfmt 11 |
+---------------------------------------------------+------------------+-------------+

The files written by `blastall` are considered the
"legacy"-format in SeqAn. SeqAn has support for the following formats:

+-----------------------------+----------------+----------------+--------------------+
| Format                      | read support   |  write support | based on version   |
+=============================+================+================+====================+
| pairwise                    |                |       ✓        |  Blast-2.2.26+     |
+-----------------------------+----------------+----------------+--------------------+
| tabular                     |      ✓         |       ✓        |  Blast-2.2.26+     |
+-----------------------------+----------------+----------------+--------------------+
| tabular (legacy)            |      ✓         |       ✓        |  Blast-2.2.26      |
+-----------------------------+----------------+----------------+--------------------+
| tabular w comments          |      ✓         |       ✓        |  Blast-2.2.26+     |
+-----------------------------+----------------+----------------+--------------------+
| tabular w comments (legacy) |      ✓         |       ✓        |  Blast-2.2.26      |
+-----------------------------+----------------+----------------+--------------------+

.. caution::

    Please note that *Blast-2.2.26+* is **not the same** as *Blast-2.2.26*! One is version 2.2.26 of the C++
    application suite (BLAST+) and the other is version 2.2.26 of the legacy application suite. There still are software
    releases for both generations.

There are different *program modes* in Blast which also influence the file format. In the legacy application suite
these where specified with the ``-p`` parameter, in the BLAST+ suite they each have their own executable.

+-----------------------------+------------------+--------------------+
| Program mode                | query alphabet   |  subject alphabet  |
+=============================+==================+====================+
| BlastN                      | nucleotide       |  nucleotide        |
+-----------------------------+------------------+--------------------+
| BlastP                      | protein          |  protein           |
+-----------------------------+------------------+--------------------+
| BlastX                      | translated nucl. |  protein           |
+-----------------------------+------------------+--------------------+
| TBlastN                     | protein          |  translated nucl.  |
+-----------------------------+------------------+--------------------+
| TBlastX                     | translated nucl. |  translated nucl.  |
+-----------------------------+------------------+--------------------+

Tabular formats
---------------

The tabular formats are tab-seperated-value formats (TSV), with twelve columns by default.
Each line represents one match (or *high scoring pair* in Blast terminology).
The twelve default columns are:

    1. Query sequence ID (truncated at first whitespace)
    2. Subject sequence ID (truncated at first whitespace)
    3. Percentage of identical positions
    4. Alignment length
    5. Number of mismatches
    6. Number of gap openings
    7. Start position of alignment on query sequence
    8. End position of alignment on query sequence
    9. Start position of alignment on subject sequence
    10. End position of alignment on subject sequence
    11. Expect value (length normalized bit score)
    12. Bit score (statistical significance indicator)

.. note:: Alignment positions in Blast

   #. **Interval notation:** Blast uses 1-based closed intervals for positions, i.e. a match from the 100th position to
      the 200th position of a sequence will be shown as ``100  200`` in the file. SeqAn internally uses 0-based
      half open intervals, i.e. it starts counting at position 0 and stores the first position behind the sequence
      as "end", e.g. position ``99`` and ``200`` for our example.
   #. **Reverse strands:** For matches found on the reverse complement strand the positions are counted backwards from
      the end of the sequence, e.g. a match from the 100th position to the 200th position on a reverse complement strand
      of a sequence of length 500 will be shown as ``400 300`` in the file.
   #. **Translation frames:** Positions given in the file are always on the original untranslated sequence!

   The ``writeRecord()`` function automatically does all of these conversions!


A **tabular** file could look like this (matches per query are sorted by e-value):

.. literalinclude:: ../../../../tests/blast/defaultfields.m8

The **tabular with comment lines** format additionally prefixes every block belonging to one query sequence with
comment lines that include the program version, the database name and column labels. The above example would look
like this:

.. literalinclude:: ../../../../tests/blast/defaultfields.m9
    :lines: 5-37

As you can see, comment lines are also printed for query sequences which don't have any matches.
The major
difference of these formats in BLAST+ vs the legacy application are that the *mismatches* column used to
include the number of gap characters, but it does not in BLAST+. The comments also look slightly different
in the **tabular with comment lines (legacy)** format:

.. literalinclude:: ../../../../tests/blast/defaultfields_legacy.m9
    :lines: 5-35



Pairwise format
---------------

The pairwise format is the default format in Blast. It is more verbose than the tabular formats and very human readable.

This is what the last record from above would look like (the other queries are omitted):

.. literalinclude:: ../../../../demos/tutorial/blast_io/plus_sub.m0


Blast formats in SeqAn
----------------------

There are three blast format related tags in SeqAn:

  #. :dox:`BlastReport` for the pairwise format with the :dox:`FormattedFile` output specialization
     :dox:`BlastReportFileOut`.
  #. :dox:`BlastTabular` for the tabular formats with the :dox:`FormattedFile` output and input specializations
     :dox:`BlastTabularFileOut` and :dox:`BlastTabularFileIn`.
  #. :dox:`BlastTabularLL` which provides light-weight, but very basic tabular IO.

The third tag can be used for simple file manipulation, e.g. filtering or column rearrangement, it is not covered in
this tutorial (see the dox for a simple example).

To work with the first two formats you need to understand at least the following data structures:
  * :dox:`BlastRecord`: the record covers all :dox:`BlastMatch` es belonging to one query sequence.
  * :dox:`FormattedFile`: one of :dox:`BlastReportFileOut`, :dox:`BlastTabularFileOut` and :dox:`BlastTabularFileIn`.
  * :dox:`BlastIOContext`: the context of the FormattedFile.

The context contains file-global data like the name of the database and can also be used to read/write certain file
format properties, e.g. "with comment lines" or "legacyFormat".

.. caution::
    Due to the structure of blast tabular files lots of information is repeated in every block of comment lines, e.g.
    the database name. Because it is expected that these stay the same they are saved in the context and not the record.
    You may still, however, check every time you ``readRecord()`` if you want to make sure.

File reading example
--------------------

Only tabular formats are covered in this example, because no input support is available for the pairwise format.

Copy the contents of the **tabular with comment lines** example above into a file and give it to the following
program as the only parameter. Please use ``.m9`` as file type extension.

.. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
    :language: c++
    :lines: 1-18, 74-84

Assignment 1
""""""""""""

.. container:: assignment

  Objective
    Complete the above example by reading the file according to :dox:`BlastTabularFileIn`.
    For every record print the query ID, the number of contained matches and the bit-score of the best match.

  Solution
      Top
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 1-18

      New code
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 19-34, 57-60

      Bottom
          .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 74-84

Assignment 2
""""""""""""

.. container:: assignment

  Objective
    Study the documentation of :dox:`BlastIOContext`. How can you adapt the previous program to check if there were any
    problems reading a record? If you have come up with a solution, try to read the file at
    ``tests/blast/defaultfields.m9``. What does the program print and why?

  Solution
      Top
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 1-34

      New code
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 41-60

            The program will print conformancyErrors for the last record, because there is a typo in the file
            ( ``Datacase`` instead of ``Database`` ).

      Bottom
          .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 74-84

Assignment 3
""""""""""""

.. container:: assignment

  Objective
    Now that you have a basic understanding of :dox:`BlastIOContext`, also print the following information after
    reading the records:

      * file format (with comment lines or without, BLAST+ or legacy?)
      * blast program and version
      * name of database

    Verify that the results are as expected on the files ``tests/blast/defaultfields.m8``,
    ``tests/blast/defaultfields.m9`` and ``tests/blast/defaultfields_legacy.m9``.

  Solution
      Top
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 1-34, 41-60

      New code
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 61-73

      Bottom
          .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 74-84

Assignment 4
""""""""""""

As previously mentioned, twelve columns are printed by default.
This can be changed in BLAST+, also by means of the ``--outfmt`` parameter.
A standards compliant **file with comment lines** and custom column composition can be read
without further configuration in SeqAn.

.. tip::
    Don't believe it? Look at ``tests/blast/customfields.m9``, as as you can see the bit score is in the 13th column
    (instead of the twelfth). If you run your program on this file, it should still print the correct bit-scores!

.. container:: assignment

  Objective
    Read :dox:`BlastIOContext` again focusing on :dox:`BlastIOContext::fields` and also read :dox:`BlastMatchField`.
    Now adapt the previous program to print for every record the ``optionLabel`` of each field used.

    Verify that the results are as expected on the files ``tests/blast/defaultfields.m9`` and
    ``tests/blast/customfields.m9``.

  Solution
      Top
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 1-34

      New code
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 35-40

      Bottom
          .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/read_assignment.cpp
                :language: c++
                :lines: 41-84

      If this was too easy, you can also try the same for tabular files without comment lines!

File writing example
--------------------

The following program stub creates three query sequences and two subject sequences in amino acid alphabet.
We will later generate records with matches and print these to disk.

.. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
    :language: c++
    :lines: 1-49, 105-114

Assignment 5
""""""""""""

.. container:: assignment

  Objective
    Before we can begin to align, certain properties of the :dox:`BlastIOContext` have to be set.
    Which ones? Add a block with the necessary initialization to the above code, use blast defaults where possible.
    Although not strictly required at this point: include a call to ``writeHeader()``.

    .. caution::
        Alignment score computation works slightly different in Blast and in SeqAn, please have a look at
        :dox:`BlastScoringScheme`. For the task at hand it should suffice to simply use the corresponding
        ``set*()`` functions on :dox:`BlastIOContext::scoringScheme`.

  Solution
      Top
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 1-49

      New code
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 50-64

      Bottom
          .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 106-116

Assignment 6
""""""""""""

.. container:: assignment

  Objective
        Next create a record for every query sequence, and in each record a match for every query-subject pair.
        Compute the local alignment for each of those matches. Use the align member of the match object.

  Solution
      Top
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 1-64

      New Code
       .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 65-82, 97-99, 103

      Bottom
       .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 106-116

Assignment 7
""""""""""""

.. container:: assignment

  Objective
        Now that you have the align member computed for every match, also save the begin and end positions, as well
        as the lengths. Blast Output needs to now about the number of gaps, mismatches... of every match, how can
        they be computed?

  Solution
      Top
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 1-82

      New Code
       .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 83-92

      Bottom
       .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 97-99, 103, 106-116

Assignment 8
""""""""""""

.. container:: assignment

  Objective
        Finally add e-value statistics and print the results to a file:
         * compute the bit score and e-value for every match
         * discard matches with an e-value greater than 1
         * for every record, sort the matches by bit-score
         * write each record
         * write the footer before exiting the program

  Solution
      Top
       .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 1-92

      New Code
       .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 93-96

      Bottom
       .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.cpp
                :language: c++
                :lines: 97-116

      Your output file should look like this:
        .. container:: foldable

            .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m9

Assignment 9
""""""""""""

.. container:: assignment

  Objective
    Up until now you have only printed the **tabular with comment lines** format. What do you have to do to print
    without comment lines? In legacy format? What about the pairwise format?

  Solution
    Tabular without comment lines
      .. container:: foldable

        Add ``context(outfile).tabularSpec = BlastTabularSpec::NO_COMMENTS`` at l.53.
        Remember to use ``.m8`` as file extension!

        The result should look like this:

        .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m8

    Tabular without comment lines (legacy)
      .. container:: foldable

        To print in legacy tabular format (with or without comment lines), add ``context(outfile).legacyFormat = true`` at l.53.

        The result should look like this(legacy and NO_COMMENTS):

        .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment_legacy.m8

    Pairwise format
      .. container:: foldable

        To print in the pairwise format replace l.52 with ``BlastReportFileOut<TContext> outfile(argv[1]);``.

        Remember to use ``.m0`` as file extension!

        The result should look like this:

        .. literalinclude:: ../../../../demos/tutorial/blast_io/write_assignment.m0