pbcore.io
=========
The ``pbcore.io`` package provides a number of lightweight interfaces
to PacBio data files and other standard bioinformatics file formats.
Preferred usage is to import classes directly from the ``pbcore.io``
package, e.g.::

    >>> from pbcore.io import CmpH5Reader

The classes within ``pbcore.io`` adhere to a few conventions, in order
to provide a uniform API:

- Each data file type is thought of as a container of a `Record`
  type; all `Reader` classes support streaming access by iterating on
  the reader object, and `CmpH5Reader`, `BasH5Reader` and
  `IndexedBamReader` additionally provide random access to
  alignments/reads.

  For example::

      from pbcore.io import *
      with IndexedBamReader(filename) as f:
          for r in f:
              process(r)

  To make scripts a bit more user friendly, a progress bar can be
  easily added using the `tqdm` third-party package::

      from pbcore.io import *
      from tqdm import tqdm
      with IndexedBamReader(filename) as f:
          for r in tqdm(f):
              process(r)

- The constructor argument needed to instantiate `Reader` and
  `Writer` objects can be either a filename (which can be suffixed
  by ".gz" for all but the h5 file types) or an open file handle.
  The reader/writer classes handle either form transparently.

- The reader/writer classes all support the context manager idiom.
  That is, if you write::

      >>> with CmpH5Reader("aligned_reads.cmp.h5") as r:
      ...     print(r[0].read())

  the `CmpH5Reader` object will be automatically closed after the
  block within the "with" statement is executed.

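
As an illustration of these conventions (not the actual ``pbcore``
implementation), the sketch below defines a hypothetical ``LineReader``
class that accepts either a filename, optionally ending in ".gz", or an
open file handle, supports iteration, and works as a context manager.
The class name and its line-per-record behavior are assumptions made
for this example::

    import gzip
    import io


    class LineReader:
        """Hypothetical reader following the conventions above."""

        def __init__(self, f):
            if isinstance(f, str):
                # A filename: open it, transparently handling ".gz".
                opener = gzip.open if f.endswith(".gz") else io.open
                self.file = opener(f, "rt")
            else:
                # Otherwise assume an already-open, text-mode file handle.
                self.file = f

        def __iter__(self):
            # Streaming access: yield one record (here, a line) at a time.
            for line in self.file:
                yield line.rstrip("\n")

        def close(self):
            self.file.close()

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.close()


    # Usage with an in-memory handle instead of a file on disk.
    with LineReader(io.StringIO("first\nsecond\n")) as reader:
        records = list(reader)

Because ``__exit__`` calls ``close()``, the underlying handle is
released as soon as the "with" block ends, even if the block raises.
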
BAM/cmp.h5 compatibility: quick start
-------------------------------------
If you have an application that uses the `CmpH5Reader` and you want to
start using BAM files, your best bet is to use the following generic
factory functions:

.. autofunction:: pbcore.io.openIndexedAlignmentFile

.. autofunction:: pbcore.io.openAlignmentFile

.. note::

   Since BAM files contain a subset of the information that was
   present in cmp.h5 files, you will need to provide these functions
   an indexed FASTA file for your reference. For *full*
   compatibility, you need the `openIndexedAlignmentFile` function,
   which requires the existence of a `bam.pbi` file (PacBio BAM index
   companion file).

`bas.h5` / `bax.h5` Formats (PacBio basecalls file)
---------------------------------------------------
The `bas.h5`/ `bax.h5` file formats are container formats for PacBio
reads, built on top of the HDF5 standard. Originally there was just
one `bas.h5`, but eventually "multistreaming" came along and we had to
split the file into three `bax.h5` *parts* and one `bas.h5` file
containing pointers to the *parts*. Use ``BasH5Reader`` to read any
kind of `bas.h5` file, and ``BaxH5Reader`` to read a `bax.h5`.

.. note::

   In contrast to GFF, for example, the `bas.h5` read coordinate
   system is 0-based and start-inclusive/end-exclusive, i.e. the same
   convention as Python and the C++ STL.

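
To make the difference concrete, the following sketch converts between
a 1-based, end-inclusive interval (as used by GFF) and the 0-based,
end-exclusive convention described above. The helper names are
hypothetical and are not part of ``pbcore``::

    def gff_to_half_open(start, end):
        """1-based, end-inclusive -> 0-based, end-exclusive."""
        return start - 1, end


    def half_open_to_gff(start, end):
        """0-based, end-exclusive -> 1-based, end-inclusive."""
        return start + 1, end


    read = "GATTACA"
    # GFF-style interval covering the first three bases.
    start, end = gff_to_half_open(1, 3)
    # The converted interval can be used directly as a Python slice.
    subsequence = read[start:end]

With this convention the length of an interval is simply
``end - start``, and adjacent intervals share a boundary value without
overlapping.
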
.. autoclass:: pbcore.io.BasH5Reader
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.BasH5IO.Zmw
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.BasH5IO.ZmwRead
    :members:
    :undoc-members:

BAM format
----------
The BAM format is a standard format for describing aligned and unaligned
reads. PacBio is transitioning from the cmp.h5 format to the BAM
format. For basic functionality, one should use :class:`BamReader`;
for full compatibility with the :class:`CmpH5Reader` API (including
alignment index functionality) one should use
:class:`IndexedBamReader`, which requires the auxiliary *PacBio BAM
index file* (``bam.pbi`` file).

.. autoclass:: pbcore.io.BamAlignment
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.BamReader
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.IndexedBamReader
    :members:
    :undoc-members:

`cmp.h5` format (legacy PacBio alignment file)
----------------------------------------------
The `cmp.h5` file format is an alignment format built on top of the HDF5
standard. It is a simple container format for PacBio alignment records.

.. note::

   In contrast to GFF, for example, all `cmp.h5` coordinate systems
   (reference, read) are 0-based and start-inclusive/end-exclusive,
   i.e. the same convention as Python and the C++ STL.

.. autoclass:: pbcore.io.CmpH5Reader
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.CmpH5Alignment
    :members:
    :undoc-members:

FASTA Format
------------
FASTA is a standard format for sequence data. We recommend using the
`FastaTable` class, which provides random access to indexed FASTA
files (using the conventional SAMtools "fai" index).
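
To show why an "fai" index enables random access, here is a minimal
sketch that fetches a subsequence by seeking directly to the right byte
offset. It relies on the SAMtools index columns (name, length, offset,
bases per line, bytes per line). The ``fetch`` function is a
hypothetical helper written for this example and does not reflect how
`FastaTable` is implemented; it assumes uncompressed FASTA with
Unix newlines::

    import os
    import tempfile


    def fetch(fasta_path, name, start, end):
        """Return bases [start, end) of contig `name` using the .fai index."""
        with open(fasta_path + ".fai") as index:
            for line in index:
                contig, _, offset, line_bases, line_width = line.split("\t")[:5]
                if contig == name:
                    break
            else:
                raise KeyError(name)
        offset = int(offset)
        line_bases = int(line_bases)
        line_width = int(line_width)

        def byte_position(i):
            # Skip whole lines (newline bytes included), then the remainder.
            return offset + (i // line_bases) * line_width + i % line_bases

        with open(fasta_path, "rb") as fasta:
            fasta.seek(byte_position(start))
            raw = fasta.read(byte_position(end) - byte_position(start))
        # Drop the newlines that fall inside the requested range.
        return raw.decode("ascii").replace("\n", "")


    # Build a tiny FASTA file and its index in a temporary directory.
    workdir = tempfile.mkdtemp()
    fasta_path = os.path.join(workdir, "example.fasta")
    with open(fasta_path, "wb") as f:
        f.write(b">chr1\nACGT\nACGT\nAC\n")
    with open(fasta_path + ".fai", "w") as f:
        f.write("chr1\t10\t6\t4\t5\n")

    subsequence = fetch(fasta_path, "chr1", 2, 6)

Only the requested bytes are read from disk, which is what makes
indexed access practical for large reference files.
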

.. autoclass:: pbcore.io.FastaTable
    :members:

.. autoclass:: pbcore.io.FastaRecord
    :members:

.. autoclass:: pbcore.io.FastaReader
    :members:

.. autoclass:: pbcore.io.FastaWriter
    :members:

FASTQ Format
------------
FASTQ is a standard format for sequence data with associated quality scores.
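
As a rough illustration of how sequence and quality data are paired in
a FASTQ record, the sketch below parses simple four-line records and
decodes the quality string. It assumes the common Phred+33 encoding and
no line wrapping; ``read_fastq`` is a hypothetical helper, not the
`FastqReader` implementation::

    import io


    def read_fastq(handle):
        """Yield (name, sequence, qualities) from four-line FASTQ records."""
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the "+" separator line
            quality_string = handle.readline().rstrip("\n")
            # Phred+33: quality value = ASCII code - 33.
            qualities = [ord(c) - 33 for c in quality_string]
            yield header[1:], sequence, qualities


    example = io.StringIO("@read1\nGATTACA\n+\nIIII!!5\n")
    records = list(read_fastq(example))

Each base has exactly one quality value, so the sequence and the
decoded quality list always have the same length.
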

.. autoclass:: pbcore.io.FastqRecord
    :members:

.. autoclass:: pbcore.io.FastqReader
    :members:

.. autoclass:: pbcore.io.FastqWriter
    :members:

GFF Format (Version 3)
----------------------
The GFF format is an open and flexible standard for representing genomic features.
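
For orientation, a GFF3 feature line has nine tab-separated columns
(seqid, source, type, start, end, score, strand, phase, attributes).
The sketch below splits one such line into a dictionary. The
``parse_gff3_line`` helper is hypothetical and deliberately simplified:
it ignores comments, directives and URL-escaping, and is not how
`GffReader` works internally::

    def parse_gff3_line(line):
        """Split one GFF3 feature line into a dictionary of fields."""
        fields = line.rstrip("\n").split("\t")
        (seqid, source, type_, start, end,
         score, strand, phase, attributes) = fields
        return {
            "seqid": seqid,
            "source": source,
            "type": type_,
            # GFF coordinates are 1-based and end-inclusive.
            "start": int(start),
            "end": int(end),
            "score": score,
            "strand": strand,
            "phase": phase,
            # Attributes are "key=value" pairs separated by semicolons.
            "attributes": dict(
                pair.split("=", 1) for pair in attributes.split(";") if pair
            ),
        }


    line = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=demo\n"
    feature = parse_gff3_line(line)

The free-form attributes column is what gives the format its
flexibility, since tools can attach arbitrary key/value annotations to
a feature.
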

.. autoclass:: pbcore.io.Gff3Record
    :members:

.. autoclass:: pbcore.io.GffReader
    :members:

.. autoclass:: pbcore.io.GffWriter
    :members: