File: dealing_with_hts_data.rst

*********************
Dealing with HTS data
*********************

FASTQ formatted files
=====================

Parsing
-------

FASTQ format can be exported by Illumina's pipeline software.

.. doctest::
    
    >>> from cogent.parse.fastq import MinimalFastqParser
    >>> for label, seq, qual in MinimalFastqParser('data/fastq.txt'):
    ...     print label
    ...     print seq
    ...     print qual
    GAPC_0015:6:1:1259:10413#0/1
    AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
    ````Y^T]`]c^cabcacc`^Lb^ccYT\T\Y\WF
    GAPC_0015:6:1:1283:11957#0/1
    TATGTATATATAACATATACATATATACATACATA
    ]KZ[PY]_[YY^```ac^\\`bT``c`\aT``bbb...

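To make the record layout concrete, here is a minimal sketch in plain Python of how four-line FASTQ records can be split into ``(label, seq, qual)`` tuples. It is not the ``cogent`` implementation: the function name ``iter_fastq_records`` and the example reads are hypothetical, and the sketch assumes that every record occupies exactly four lines (no wrapped sequences).

```python
def iter_fastq_records(lines):
    """Yield (label, seq, qual) tuples from an iterable of FASTQ lines.

    Hypothetical helper for illustration only. Assumes each record is
    exactly four lines: '@label', sequence, '+' separator, quality string.
    """
    record = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line and not record:
            continue  # ignore blank lines between records
        record.append(line)
        if len(record) == 4:
            header, seq, plus, qual = record
            record = []
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError("malformed record near %r" % header)
            if len(seq) != len(qual):
                raise ValueError("sequence/quality length mismatch in %r" % header)
            yield header[1:], seq, qual
    if record:
        raise ValueError("truncated record at end of input")


# Made-up example data, not taken from data/fastq.txt.
example_lines = [
    "@read1\n", "ACGT\n", "+\n", "hhhh\n",
    "@read2\n", "TTGA\n", "+\n", "`a^]\n",
]
records = list(iter_fastq_records(example_lines))
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list.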

Converting quality scores to numeric data
-----------------------------------------

In FASTQ format, each base-call quality score is encoded as a single ASCII character. Unfortunately, vendors differ in the range of characters used. According to their documentation, Illumina uses characters with ASCII codes 64 to 104, so subtracting 64 from a character's code gives a quality score between 0 and 40. We parse the sequence file and convert the characters into integers on the fly.

.. doctest::
    
    >>> from cogent.parse.fastq import MinimalFastqParser
    >>> for label, seq, qual in MinimalFastqParser('data/fastq.txt'):
    ...     qual = map(lambda x: ord(x)-64, qual)
    ...     print label
    ...     print seq
    ...     print qual
    GAPC_0015:6:1:1259:10413#0/1
    AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
    [32, 32, 32, 32, 25, 30, 20, 29, 32, 29, 35, 30, 35, 33, 34, 35, 33, ...
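
Since vendors differ in the character range, it can help to make the offset an explicit parameter. The following is a minimal sketch in plain Python, not part of ``cogent``: the function name ``decode_quality`` is hypothetical, the default offset of 64 follows the Illumina range described above, and the offset of 33 in the second call is an assumption based on the Sanger-style encoding.

```python
def decode_quality(qual, offset=64):
    """Convert a FASTQ quality string into a list of integer scores.

    Hypothetical helper for illustration only. `offset` is the ASCII code
    that represents a score of 0 (64 for the Illumina range described in
    the text).
    """
    scores = [ord(char) - offset for char in qual]
    if scores and min(scores) < 0:
        # A negative score suggests the wrong offset was chosen.
        raise ValueError("character below offset %d in %r" % (offset, qual))
    return scores


# First seven quality characters of the first read shown above.
illumina_scores = decode_quality("````Y^T")

# Assumed Sanger-style encoding, where the offset is 33.
sanger_scores = decode_quality("II5", offset=33)
```

The first call reproduces the leading values of the doctest output above (32, 32, 32, 32, 25, 30, 20). Raising an error on negative scores is a design choice that flags a wrong offset early.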