File: fastx_reader.rst

package info (click to toggle)
htseq 2.0.9%2Bdfsg-1
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 103,476 kB
sloc: python: 6,280; sh: 211; cpp: 147; makefile: 80
file content (114 lines) | stat: -rw-r--r-- 5,183 bytes
parent folder | download | duplicates (2)
.. _tutorial_fastx:

**********************************************
Tutorial: Using Fasta/Fastq parsers
**********************************************

.. currentmodule:: HTSeq

Fasta and fastq are the de facto standard formats for raw reads generated by
high-throuput sequencing machines. Fasta files do not have a quality score associated
with each base, while fastq do, the so-called "phred score". ``HTSeq`` has dedicated
parsers for them and their gzipped counterparts.

This tutorial will walk you through a few routine operations with fasta/fastq files.
The example files used here are all in the ``example_data`` folder of the ``HTSeq``
repository on GitHub.

Parsing a fasta file
--------------------
Let's start by reading a fasta file using :class:`FastaReader`::

   >>> import HTSeq
   >>> f = HTSeq.FastaReader('fastaEx.fa')

You can then iterate over the sequences::

   >>> for read in f:
   ...     print(read)
   AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC
   AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG

Each read is a :class:`Sequence` object with a few properties, notably ``name``, ``seq``,
and ``descr`` (for "description"). Notice that ``seq`` is not a Python string but a
*bytestring* instead, i.e. each character only uses exactly 1 byte. This is for efficiency
purposes. To convert each ``read.seq`` into a string, you can use the standard ``decode``
method::

   >>> read.seq
   b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
   >>> read.seq.decode()
   'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'

Although opening files like this is possible, :class:`FastaReader` supports context
management, which is safer because it automatically closes the file for you::

   >>> with HTSeq.FastaReader('fastaEx.fa') as f:
   ...     for read in f:
   ...         print(read)

.. note:: ``HTSeq`` does not load the whole file in memory, so you can parse files
   much larger than your computer RAM as long as you don't store the reads along the way.

Parsing a fastq file
--------------------
Fastq files are handled approximately in the same way, but use :class:`FastqReader`::

   >>> with HTSeq.FastqReader('fastqEx.fastq') as f:
   ...     for read in f:
   ...         print(read.seq)
   ...         print(read.qualstr)
   b'AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC'
   b'GGGGGGGGGGGGGGGGGGGGGGGGGCGBBBBBBBBBBBBBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH'
   b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
   b'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBDDDDDDDDD???????????????????????'

Of course, each read has now quality scores (that's the whole point of fast**Q**),
which are also a bytestring (again, for efficiency reason). It is also customary to
use numeric values for the base-by-base qualities or phred scores::

   >>> read.qual
   array([33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
       33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33, 33,
       33, 33, 33, 33, 35, 35, 35, 35, 35, 35, 35, 35, 35, 30, 30, 30, 30,
       30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
       30, 30], dtype=uint8)

Instead of returning a bytestring, ``read.qual`` returns a numpy array of unsigned
integers (i.e. bytes) containing the numeric value of the quality score in the
usual phred or 33-based ASCII encoding. This number ``q`` can be related to the likelihood of
a mistake in the base calling by the formula :math:`q = -10 * log_{10}[ P_{mistake} ]`.
Of course, the higher the quality, the lower the chance of it being a mistake (since
the log is a monotonically increasing function).

Parsing gzip-compressed files
-----------------------------
Both parsers handle gzipped files, which are commonly used to save disk space,
transparently::

   >>> with HTSeq.FastqReader('fastqExgzip.fastq.gz') as f:
   ...     for read in f:
   ...         print(read.seq)
   ...         print(read.qualstr)
   ... 
   ... 
   b'AGTACGTAGTCGCTGCTGCTACGGGCGCTAGCTAGTACGTCACGACGTAGATGCTAGCTGACTAAACGATGC'
   b'GGGGGGGGGGGGGGGGGGGGGGGGGCGBBBBBBBBBBBBBBBHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH'
   b'AAACGATCGATCGTACTCGACTGATGTAGTATATACGTCGTACGTAGCATCGTCAGTTACTGCATGCGGG'
   b'BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBDDDDDDDDD???????????????????????'

.. note:: Although some technologies produce reads with constant length (e.g. Illumina),
   in other scenarios (e.g. Oxford Nanopore) the read length might be heterogeneous. Be
   careful if you rely on read length for your computations.

Comparison with other libraries
-------------------------------
- `Biopython`_  is a set of freely available tools for biological computation that includes
  fasta/fastq manipulation. Compared to ``HTSeq``, which is focused on high-throughput
  sequencing analyses on large genomes, it has a broader scope. If you are having trouble
  running a specific analysis with our parsers, you should take a look at it.

- `scikit-bio`_ is also broader in scope and implements parsers for fasta/fastq.

.. _`Biopython`: https://biopython.org/
.. _`scikit-bio`: http://scikit-bio.org/