pbcore.io
=========

The ``pbcore.io`` package provides a number of lightweight interfaces
to PacBio data files and other standard bioinformatics file formats.
Preferred usage is to import classes directly from the ``pbcore.io``
package, e.g.::

    >>> from pbcore.io import CmpH5Reader

The classes within ``pbcore.io`` adhere to a few conventions, in order
to provide a uniform API:

  - Each data file type is thought of as a container of a `Record`
    type; all `Reader` classes support streaming access by iterating
    on the reader object, and `CmpH5Reader`, `BasH5Reader` and
    `IndexedBamReader` additionally provide random access to
    alignments/reads.
    
    For example::
    
      from pbcore.io import IndexedBamReader
      # `filename` and `process` stand in for your own path and function
      with IndexedBamReader(filename) as f:
          for r in f:
              process(r)
    
    To make scripts a bit more user friendly, a progress bar can be
    easily added using the `tqdm` third-party package::
    
      from pbcore.io import IndexedBamReader
      from tqdm import tqdm
      with IndexedBamReader(filename) as f:
          for r in tqdm(f):
              process(r)
    

  - The constructor argument needed to instantiate `Reader` and
    `Writer` objects can be either a filename (which can be suffixed
    by ".gz" for all but the h5 file types) or an open file handle.
    The reader/writer classes will do what you would expect.


  - The reader/writer classes all support the context manager idiom.
    That is, if you write::

      >>> with CmpH5Reader("aligned_reads.cmp.h5") as r:
      ...   print(r[0].read())

    the `CmpH5Reader` object will be closed automatically once the
    block within the "with" statement has executed.
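
The conventions above can be illustrated with a small stand-alone
sketch.  ``LineRecordReader`` is a hypothetical class invented for this
example; it is not part of ``pbcore.io`` and does not reflect how the
pbcore readers are implemented.  It only shows one way a reader can
accept either a filename (optionally ending in ".gz") or an open file
handle, stream records by iteration, and close itself at the end of a
"with" block.

```python
import gzip
import io


class LineRecordReader(object):
    """Hypothetical reader: one record per non-empty line of text."""

    def __init__(self, f):
        if isinstance(f, str):
            # A filename: open it, transparently decompressing ".gz".
            if f.endswith(".gz"):
                self.file = gzip.open(f, "rt")
            else:
                self.file = open(f, "r")
        else:
            # Anything else is assumed to be an open, readable text handle.
            self.file = f

    def __iter__(self):
        # Streaming access: yield records one at a time.
        for line in self.file:
            line = line.rstrip("\n")
            if line:
                yield line

    def close(self):
        self.file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Close the underlying file even if the block raised.
        self.close()
        return False


# Usage with an in-memory handle instead of a real file.
handle = io.StringIO(u"read1\nread2\n\nread3\n")
with LineRecordReader(handle) as reader:
    records = list(reader)
```

After the "with" block, ``records`` holds the three non-empty lines and
``handle`` has been closed.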

BAM/cmp.h5 compatibility: quick start
-------------------------------------

If you have an application that uses the `CmpH5Reader` and you want to
start using BAM files, your best bet is to use the following generic
factory functions:

.. autofunction:: pbcore.io.openIndexedAlignmentFile

.. autofunction:: pbcore.io.openAlignmentFile

.. note::

   Since BAM files contain a subset of the information that was
   present in cmp.h5 files, you will need to provide these functions
   with an indexed FASTA file for your reference.  For *full*
   compatibility, you need the `openIndexedAlignmentFile` function,
   which requires the existence of a `bam.pbi` file (PacBio BAM index
   companion file).
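
A script that must cope with both indexed and unindexed BAM files could
check for the companion file before choosing a factory function.  The
sketch below uses only the standard library; ``pick_factory_name`` is a
hypothetical helper, and it assumes the index is stored next to the BAM
file with ``.pbi`` appended to its name.  It returns the name of the
function to call rather than calling into pbcore.

```python
import os
import tempfile


def pick_factory_name(bam_path):
    """Return the name of the pbcore.io factory function to try."""
    # Assumption: the PacBio BAM index lives at "<bam_path>.pbi".
    if os.path.exists(bam_path + ".pbi"):
        return "openIndexedAlignmentFile"
    return "openAlignmentFile"


# Demonstration with empty placeholder files in a temporary directory.
tmpdir = tempfile.mkdtemp()
bam = os.path.join(tmpdir, "aligned_reads.bam")
open(bam, "wb").close()
without_index = pick_factory_name(bam)

open(bam + ".pbi", "wb").close()
with_index = pick_factory_name(bam)
```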




`bas.h5` / `bax.h5` Formats (PacBio basecalls file)
---------------------------------------------------

The `bas.h5`/`bax.h5` file formats are container formats for PacBio
reads, built on top of the HDF5 standard.  Originally there was just
one `bas.h5`, but eventually "multistreaming" came along and we had to
split the file into three `bax.h5` *parts* and one `bas.h5` file
containing pointers to the *parts*.  Use ``BasH5Reader`` to read any
kind of `bas.h5` file, and ``BaxH5Reader`` to read a `bax.h5`.

.. note::

    In contrast to GFF, for example, the `bas.h5` read coordinate
    system is 0-based and start-inclusive/end-exclusive, i.e. the same
    convention as Python and the C++ STL.
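
To make the difference between the two conventions concrete, here is a
minimal sketch of converting between 0-based, end-exclusive intervals
and the 1-based, end-inclusive intervals used by GFF.  The helper names
``to_gff`` and ``from_gff`` are made up for this example and are not
part of pbcore.

```python
def to_gff(start, end):
    """Convert 0-based [start, end) to 1-based inclusive (start, end)."""
    return start + 1, end


def from_gff(start, end):
    """Convert 1-based inclusive (start, end) to 0-based [start, end)."""
    return start - 1, end


read = "GATTACAGATTACA"
start, end = 3, 7

# Python slicing already uses the 0-based, end-exclusive convention,
# so the interval can be used directly and its length is end - start.
subread = read[start:end]

gff_start, gff_end = to_gff(start, end)
# In the inclusive convention the length is end - start + 1.
gff_length = gff_end - gff_start + 1
```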

.. autoclass:: pbcore.io.BasH5Reader
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.BasH5IO.Zmw
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.BasH5IO.ZmwRead
    :members:
    :undoc-members:


BAM format
----------

The BAM format is a standard format for describing aligned and unaligned
reads.  PacBio is transitioning from the cmp.h5 format to the BAM
format.  For basic functionality, one should use :class:`BamReader`;
for full compatibility with the :class:`CmpH5Reader` API (including
alignment index functionality) one should use
:class:`IndexedBamReader`, which requires the auxiliary *PacBio BAM
index file* (``bam.pbi`` file).

.. autoclass:: pbcore.io.BamAlignment
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.BamReader
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.IndexedBamReader
    :members:
    :undoc-members:



`cmp.h5` format (legacy PacBio alignment file)
----------------------------------------------

The `cmp.h5` file format is an alignment format built on top of the HDF5
standard.  It is a simple container format for PacBio alignment records.

.. note::

    In contrast to GFF, for example, all `cmp.h5` coordinate systems
    (reference, read) are 0-based and start-inclusive/end-exclusive,
    i.e. the same convention as Python and the C++ STL.


.. autoclass:: pbcore.io.CmpH5Reader
    :members:
    :undoc-members:

.. autoclass:: pbcore.io.CmpH5Alignment
    :members:
    :undoc-members:


FASTA Format
------------

FASTA is a standard format for sequence data.  We recommend using the
`FastaTable` class, which provides random access to indexed FASTA
files (using the conventional SAMtools "fai" index).
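
As background on why an index enables random access, the following
stand-alone sketch builds an in-memory table with the same four numbers
a "fai" index stores per sequence (length, byte offset of the first
base, bases per line, bytes per line) and uses them to seek straight to
a subsequence.  ``build_fai`` and ``fetch`` are hypothetical helpers
written for this example; they are not the `FastaTable` implementation,
and they assume well-formed input with uniform line widths.

```python
import os
import tempfile


def build_fai(fasta_path):
    """Return {name: (length, offset, line_bases, line_width)}."""
    index = {}
    name = None
    with open(fasta_path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The name is the first word of the header line.
                name = line[1:].split()[0].decode("ascii")
                # After reading the header, f.tell() is the first base.
                index[name] = [0, f.tell(), 0, 0]
            elif name is not None and line.strip():
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry[2] == 0:
                    entry[2], entry[3] = bases, len(line)
                entry[0] += bases
    return {k: tuple(v) for k, v in index.items()}


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) of `name` (0-based, end-exclusive)."""
    length, offset, line_bases, line_width = index[name]
    if not 0 <= start <= end <= length:
        raise ValueError("interval out of range")
    if start == end:
        return ""
    # Byte position of a base: whole lines skipped, plus the remainder.
    first = offset + (start // line_bases) * line_width + start % line_bases
    last = offset + (end // line_bases) * line_width + end % line_bases
    with open(fasta_path, "rb") as f:
        f.seek(first)
        chunk = f.read(last - first)
    # Drop the line terminators that fall inside the byte range.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Demonstration on a small temporary FASTA file.
fd, path = tempfile.mkstemp(suffix=".fasta")
with os.fdopen(fd, "wb") as out:
    out.write(b">chr1 test\nACGTACGTAC\nGTACGTACGT\nAC\n>chr2\nTTTTGGGG\n")

fai = build_fai(path)
piece1 = fetch(path, fai, "chr1", 8, 13)
piece2 = fetch(path, fai, "chr2", 4, 8)
os.remove(path)
```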

.. autoclass:: pbcore.io.FastaTable
    :members:

.. autoclass:: pbcore.io.FastaRecord
    :members:

.. autoclass:: pbcore.io.FastaReader
    :members:

.. autoclass:: pbcore.io.FastaWriter
    :members:


FASTQ Format
------------

FASTQ is a standard format for sequence data with associated quality scores.
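
For readers unfamiliar with the layout, the sketch below parses the
common four-line FASTQ record (header, sequence, "+" separator,
quality string) and decodes the quality characters.  ``read_fastq`` is
a hypothetical function for illustration, not the `FastqReader`
implementation, and it assumes the Phred+33 quality encoding and
records that are not wrapped across multiple lines.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) tuples from a text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumed Phred+33: quality value = ASCII code minus 33.
        qualities = [ord(c) - 33 for c in quality]
        yield header[1:].rstrip("\n"), sequence, qualities


fastq_text = io.StringIO(u"@read1\nGATT\n+\n!I5+\n")
parsed = list(read_fastq(fastq_text))
```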

.. autoclass:: pbcore.io.FastqRecord
    :members:

.. autoclass:: pbcore.io.FastqReader
    :members:

.. autoclass:: pbcore.io.FastqWriter
    :members:



GFF Format (Version 3)
----------------------

The GFF format is an open and flexible standard for representing genomic features.
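
A GFF3 feature line has nine tab-separated columns, the last of which
holds ``key=value`` attributes separated by semicolons.  The sketch
below parses one such line using only the standard library.  The
``Feature`` tuple and ``parse_gff3_line`` function are hypothetical
names for this example; they are not the `Gff3Record` API, and the
parser ignores multi-valued attributes and comment lines.

```python
from collections import namedtuple
from urllib.parse import unquote

Feature = namedtuple(
    "Feature",
    "seqid source type start end score strand phase attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a Feature tuple."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields

    attributes = {}
    if attrs != ".":
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # Reserved characters are percent-encoded in GFF3.
                attributes[unquote(key)] = unquote(value)

    return Feature(
        seqid, source, ftype,
        int(start), int(end),  # 1-based, end-inclusive coordinates
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes)


feature = parse_gff3_line(
    "chr1\t.\tgene\t101\t105\t.\t+\t.\tID=gene1;Note=A%3BB\n")
```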

.. autoclass:: pbcore.io.Gff3Record
    :members:

.. autoclass:: pbcore.io.GffReader
    :members:

.. autoclass:: pbcore.io.GffWriter
    :members: