1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
|
Notes on BAM file semantics
---------------------------
These are some implementation notes about how BAM files are handled by
mod:`pybedtools` for those interested in the implementation.
The initial creation of a :class:`BedTool` that points to a file will
trigger a check on the first 15 bytes of a file to see if it's a BAM file.
If so, then the BedTool's `_isbam` attribute is set to `True`. If the
:class:`BedTool` is a stream, then the check will not be made, and it is up
to the creator (whether it's the user on the command line or a method or
function) to set the BAM-streaming BedTool's `._isbam` attribute to `True`.
This is handled automatically for wrapped BEDTools programs (described
below).
Some BEDTools programs natively handle BAM files. The `@_wraps` decorator
that is used to wrap each method has a `bam` kwarg that specifies what
input argument the wrapped tool will accept as BAM (for example, the
wrapper for `intersectBed` has the kwarg `bam="abam"`).
If `self._isbam == True`, then `self.fn` is passed to the `bam` input arg
instead of the default implicit input arg (so `intersectBed`, `self.fn` is
passed as `abam` instead of `-a`).
Trying to call a method that does not have a `bam` kwarg registered will
result in a ValueError, along with a message that says to use
:meth:`BedTool.bam_to_bed()` first. For example, `subtractBed` currently
doesn't accept BAM files as input, so this doesn't work::
>>> a = pybedtools.example_bedtool('gdc.bam')
>>> b = pybedtools.example_bedtool('gdc.gff')
>>> # doesn't work:
>>> c = a.subtract(b)
However, converting to `a` to BED format first (and setting `stream=True`
to save on disk I/O) works fine::
>>> # works:
>>> c = a.bam_to_bed(stream=True).subtract(b)
Iterating over a file-based BedTool that points to a BAM will call
`samtools view` and yields lines which sent to `IntervalIterator`, which
splits the lines and passes them to `create_interval_from_list` which in
turn decides on the fly whether it's gff, bed, or sam.
However, we can't easily check the first 15 bytes of a streaming BedTool,
because that would consume those bytes. The `@_wraps` decorator needs to
know some information about which arguments to a wrapped program result in
BAM output and which result in non-BAM output.
Given `a = BedTool('x.bam')`:
* `c = a.intersect(b)` creates BAM output, so it returns a new BedTool with
`c._isbam = True`.
* `a.intersect(b, bed=True)` returns BED output. `@_wraps` needs to know, if the
input was BAM, which kwarg[s] disable BAM output. For example, if `-bed`
is passed to `intersectBed`, the output will NOT be BAM. This is
implemented with the `nonbam` kwarg for :func:`_wraps`. In this case,
the resulting BED file is treated like any other BED file.
* `c = a.intersect(b, stream=True)` returns streaming BAM output. In this
case, iterating over `c` will send the BAM stream to stdin of a samtools
call
|