File: topical-bam.rst

package info (click to toggle)
python-pybedtools 0.10.0-5
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 16,620 kB
  • sloc: python: 10,030; cpp: 899; makefile: 142; sh: 57
file content (148 lines) | stat: -rw-r--r-- 5,096 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
.. include:: includeme.rst

.. _bam:

Working with BAM files
======================
Some BEDTools programs, like `intersecteBed`, support BAM files as input.
From the command line, you would need to specify the `-abam`
argument to do so.  However, :mod:`pybedtools` auto-detects BAM files and
passes the `abam` argument automatically for you.  That means if you create
a :class:`BedTool` out of a BAM file, like this:

.. doctest::

    x = pybedtools.example_bedtool('gdc.bam')

you can intersect it with a BED file without doing anything special:

.. doctest::

    b = pybedtools.example_bedtool('gdc.gff')
    y = x.intersect(b)

The output of this operation follows the semantics of BEDTools.  That is,
for programs like `intersectBed`, if `abam` is used then the output will be
BAM format as well.  But if the `-bed` argument is passed, then the output
will be BED format. Similarly, in :mod:`pybedtools`, if a BAM file is used
to create the :class:`BedTool` then the results will also be in BAM
format.  If the `bed=True` kwarg is passed, then the results be in BED
format.

As an example, let's intersect a BAM file of reads with annotations using
files that ship with :mod:`pybedtools`.  First, we create the
:class:`BedTool` objects:

.. doctest::

    >>> a = pybedtools.example_bedtool('x.bam')
    >>> b = pybedtools.example_bedtool('dm3-chr2L-5M.gff.gz')

The first call below will return BAM results, and the second will return
BED results.

.. doctest::

    >>> bam_results = a.intersect(b)
    >>> str(bam_results.file_type) == 'bam'
    True

    >>> bed_results = a.intersect(b, bed=True)
    >>> str(bed_results.file_type) == 'bed'
    True


We can iterate over BAM files to get :class:`Interval` objects just like
iterating over BED or GFF files.  Indexing works, too:

.. doctest::
    :options: +ELLIPSIS +NORMALIZE_WHITESPACE

    >>> for i in bam_results[:2]:
    ...     print(i)
    HWUSI-NAME:2:69:512:1017#0	16	chr2L	9330	3	36M	*	0	0	TACAAATCTTACGTAAACACTCCAAGCATGAATTCG	Y`V_a_TM[\_V`abb`^^Q]QZaaaaa_aaaaaaa	NM:i:0	NH:i:2	CC:Z:chrX	CP:i:19096815
    <BLANKLINE>
    HWUSI-NAME:2:91:1201:1113#0	16	chr2L	10213	255	36M	*	0	0	TGTAGAATGCAAAAATTACATTTGTGAGTATCATCA	UV[aY`]\VZ`baaaZa`_aab_`_`a`ab``b`aa	NM:i:0	NH:i:1
    <BLANKLINE>

    >>> bam_results[0]
    Interval(chr2L:9329-9365)

    >>> bam_results[:10]
    <itertools.islice object at ...>

    >>> cigar_string = i[5]

There are several things to watch out for here.

First, note that :mod:`pybedtools` uses the convention that BAM features in
plain text format are considered SAM features, so these SAM features are
**one-based and include the stop coordinate** as illustrated below. (Note that
there is some additional complexity here due to supporting Python 2 and
3 simultaneously in this tested documentation)

.. doctest::

    >>> bam_results[0].start
    9329

    >>> isinstance(bam_results[0][3], str)
    True

    >>> print(bam_results[0][3])
    9330


Second, the stop coordinate is defined as the *start coord plus the
length of the sequence*; eventually a more sophisticated, CIGAR-aware
approach may be used.  Similarly, the length is defined to be `stop -
start` -- again, not CIGAR-aware at the moment.  For more sophisticated
low-level manipulation of BAM features, you might want to consider using
HTSeq_.

Third, while we can iterate over a BAM file and manipulate the features as
shown above, *calling BEDTools programs on a BAM-based generator is not
well-supported*.

Specifically::

    >>> a = pybedtools.example_bed('gdc.bam')
    >>> b = pybedtools.example_bed('b.bed')

    >>> # works, gets BAM results
    >>> results = a.intersect(b)

    >>> # make a generator of features in `a`
    >>> a2 = pybedtools.BedTool(i for i in a)

    >>> # this does NOT work
    >>> a2.intersect(b)

When we specified the `bed=True` kwarg above, the intersected BAM results
are converted to BED format.  We can use those like a normal BED file.
Note that since we are viewing BED output, *the start and stops are 0-based*:

.. doctest::
    :options: +NORMALIZE_WHITESPACE

    >>> d = a.intersect(b, bed=True)
    >>> d.head(3)
    chr2L	9329	9365	HWUSI-NAME:2:69:512:1017#0	3	-	9329	9365	0,0,0	1	36,	0,
    chr2L	9329	9365	HWUSI-NAME:2:69:512:1017#0	3	-	9329	9365	0,0,0	1	36,	0,
    chr2L	9329	9365	HWUSI-NAME:2:69:512:1017#0	3	-	9329	9365	0,0,0	1	36,	0,

Consistent with BEDTools programs, BAM files are **not** supported as the
second input argument.  In other words, `intersectBed` does not have both
`-abam` and `-bbam` arguments, so :mod:`pybedtools` will not not allow this
either.

However, :mod:`pybedtools` does allow streaming BAM files to be the input of
methods that allow BAM input as the first input. In this [trivial] example, we
can stream the first intersection to save disk space, and then send that
streaming BAM to the next :meth:`BedTool.intersect` call. Since it's not
streamed, the second intersection will be saved as a temp BAM file on disk::


    >>> a.intersect(b, stream=True).intersect(b)

.. _HTSeq: http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html