File: long_range_interaction.py

package info (click to toggle)
python-pybedtools 0.8.0-5
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 16,140 kB
  • sloc: python: 9,589; cpp: 899; makefile: 149; sh: 116
file content (403 lines) | stat: -rw-r--r-- 16,488 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
from __future__ import print_function
import os
import sys
import itertools
import six
import time
import pysam
import pybedtools


def tag_bedpe(bedpe, queries, verbose=False):
    """
    Tag each end of a BEDPE with a set of (possibly many) query BED files.

    For example, given a BEDPE of interacting fragments from a Hi-C experiment,
    identify the contacts between promoters and ChIP-seq peaks. In this case,
    promoters and ChIP-seq peaks of interest would be provided as BED files.

    The strategy is to split the BEDPE into two separate files.  Each file is
    intersected independently with the set of queries.  The results are then
    iterated through in parallel to tie the ends back together. It is this
    iterator that is returned (see example below).

    Parameters
    ----------

    bedpe : str
        BEDPE-format file. Must be name-sorted.

    queries : dict
        Dictionary of BED/GFF/GTF/VCF files to use. After splitting the BEDPE,
        these query files (values in the dictionary) will be passed as the `-b`
        arg to `bedtools intersect`. The keys are passed as the `names`
        argument for `bedtools intersect`

        *Features in each file must have unique names*. Use
        :func:`pybedtools.featurefuncs.UniqueID` to help fix this.

        *Each file must be BED3 to BED6*.

    Returns
    -------
    Tuple of (iterator, n, extra).

    `iterator` is described below. `n` is the total number of lines in the
    BEDPE file, which is useful for calculating percentage complete for
    downstream work. `extra` is the number of extra fields found in the BEDPE
    (also useful for downstream processing).

    `iterator` yields tuples of (label, end1_hits, end2_hits) where `label` is
    the name field of one line of the original BEDPE file. `end1_hits` and
    `end2_hits` are each iterators of BED-like lines representing all
    identified intersections across all query BED files for end1 and end2 for
    this pair.

    Recall that BEDPE format defines a single name and a single score for each
    pair. For each item in `end1_hits`, the fields are::

        chrom1
        start1
        end1
        name
        score
        strand1
        [extra fields]
        query_label
        fields_from_query_intersecting_end1

    where `[extra fields]` are any additional fields from the original BEDPE,
    `query_label` is one of the keys in the `beds` input dictionary, and the
    remaining fields in the line are the intersecting line from the
    corresponding BED file in the `beds` input dictionary.

    Similarly, each item in `end2_hits` consists of:

        chrom2
        start2
        end2
        name
        score
        strand2
        [extra fields]
        query_label
        fields_from_query_intersecting_end2

    At least one line is reported for every line in the BEDPE file. If there
    was no intersection, the standard BEDTools null fields will be shown. In
    `end1_hits` and `end2_hits`, a line will be reported for each hit in each
    query.

    Example
    -------

    Consider the following BEDPE (where "x1" is an aribtrary extra field).

    >>> bedpe = pybedtools.example_bedtool('test_bedpe.bed')
    >>> print(bedpe) # doctest: +NORMALIZE_WHITESPACE
    chr1  1  10  chr1  50   90   pair1  5  +  -  x1
    chr1  2  15  chr1  200  210  pair2  1  +  +  y1
    <BLANKLINE>


    And the following transcription start sites (TSSes) in BED4 format:

    >>> tsses = pybedtools.example_bedtool('test_tsses.bed')
    >>> print(tsses) # doctest: +NORMALIZE_WHITESPACE
    chr1  5   6   gene1
    chr1  60  61  gene2
    chr1  88  89  gene3
    <BLANKLINE>

    And the following called peaks as BED6:

    >>> peaks = pybedtools.example_bedtool('test_peaks.bed')
    >>> print(peaks) # doctest: +NORMALIZE_WHITESPACE
    chr1  3  4  peak1  50  .
    <BLANKLINE>

    Then we can get the following iterator, n, and extra. Note that the
    OrderedDict is only for testing to ensure output is always consistend; in
    practice a regular dictionary is fine:

    >>> from pybedtools.contrib.long_range_interaction import tag_bedpe
    >>> from collections import OrderedDict
    >>> queries = OrderedDict()
    >>> queries['tss'] = tsses
    >>> queries['pk'] = peaks
    >>> iterator, n, extra = tag_bedpe(bedpe, queries)
    >>> print(n)
    2
    >>> print(extra)
    1

    The following illustrates that each item in the iterator represents one
    pair, and each item in each group represents an intersection with one end.
    Note that the sorting is necessary only for the doctests to be output in
    consistent format; this not typically needed:

    >>> for (label, end1_hits, end2_hits) in iterator:
    ...    end1_hits = sorted(end1_hits, key=lambda x: str(x))
    ...    end2_hits = sorted(end2_hits, key=lambda x: str(x))
    ...    print('PAIR = {}'.format(label))
    ...    print('end1_hits:')
    ...    for i in end1_hits:
    ...        print(i, end='')
    ...    print('end2_hits:')
    ...    for i in end2_hits:
    ...        print(i, end='')  # doctest: +NORMALIZE_WHITESPACE
    PAIR = pair1
    end1_hits:
    chr1       1       10      pair1   5       +       x1      pk      chr1    3       4       peak1   50      .       1
    chr1       1       10      pair1   5       +       x1      tss     chr1    5       6       gene1   1
    end2_hits:
    chr1       50      90      pair1   5       -       x1      tss     chr1    60      61      gene2   1
    chr1       50      90      pair1   5       -       x1      tss     chr1    88      89      gene3   1
    PAIR = pair2
    end1_hits:
    chr1       2       15      pair2   1       +       y1      pk      chr1    3       4       peak1   50      .       1
    chr1       2       15      pair2   1       +       y1      tss     chr1    5       6       gene1   1
    end2_hits:
    chr1       200     210     pair2   1       +       y1      .       .       -1      -1      .       0

    See the `cis_trans_interactions()` function for one way of summarizing
    these data.
    """
    b = pybedtools.BedTool(bedpe)

    # Figure out if the supplied bedpe had any extra fields. If so, the fields
    # are repeated in each of the split output files.
    observed = b.field_count()
    extra = observed - 10
    extra_inds = [10 + i for i in range(extra)]


    end1_fn = pybedtools.BedTool._tmp()
    end2_fn = pybedtools.BedTool._tmp()

    # Performance notes:
    # We don't need the overhead of converting every line into
    # a pybedtools.Interval object just so we can grab the fields. Doing so
    # takes 3.5x more time than simply splitting each line on a tab.
    if verbose:
        print('splitting BEDPE into separate files.')
        print('end1 is going to %s' % end1_fn)
        print('end2 is going to %s' % end2_fn)


    n = 0
    with open(end1_fn, 'w') as end1_out, open(end2_fn, 'w') as end2_out:
        for line in open(b.fn):
            n += 1
            f = line.strip().split('\t')
            end1_out.write(
                '\t'.join(
                    (f[i] for i in [0, 1, 2, 6, 7, 8] + extra_inds)) + '\n')
            end2_out.write(
                '\t'.join(
                    (f[i] for i in [3, 4, 5, 6, 7, 9] + extra_inds)) + '\n')

    # Performance notes:
    #
    # For small BEDPE and large set of query files, it would be faster to sort
    # these independently, intersect with sorted=True, and then re-sort by name
    # for the grouping. For large BEDPE, I don't think the sorted=True
    # performance gain outweighs the hit from sorting twice.
    #
    # On the other hand, if BEDPE was coord-sorted in the first place, only
    # end2 would need to be sorted and re-sorted. On the other (third!?) hand,
    # BEDPE creation from BAM implies name-sorting, so it's probably not
    # reasonable to assume coord-sorted.
    #
    # In the end: don't do any sorting.

    end1_bt = pybedtools.BedTool(end1_fn)
    end2_bt = pybedtools.BedTool(end2_fn)
    names, fns = [], []
    for name, fn in queries.items():
        names.append(name)
        if isinstance(fn, pybedtools.BedTool):
            fns.append(fn.fn)
        else:
            fns.append(fn)

    if verbose:
        print('intersecting end 1')
    end1_hits = end1_bt.intersect(list(fns), names=names, wao=True)
    if verbose:
        print('intersecting end 2')
    end2_hits = end2_bt.intersect(list(fns), names=names, wao=True)
    if verbose:
        print('intersection with end1 is in %s' % (end1_hits.fn))
        print('intersection with end2 is in %s' % (end2_hits.fn))

    grouped_end1 = itertools.groupby(end1_hits, lambda f: f[3])
    grouped_end2 = itertools.groupby(end2_hits, lambda f: f[3])

    def gen():
        for (label1, group1), (label2, group2) \
            in six.moves.zip(grouped_end1, grouped_end2):
            assert label1 == label2
            yield label1, group1, group2

    return gen(), n, extra


def cis_trans_interactions(iterator, n, extra, verbose=True):
    """
    Converts the output from `tag_bedpe` into a pandas DataFrame containing
    information about regions that contact each other in cis (same fragment) or
    trans (different fragments).

    For example, given a BEDPE file representing 3D interactions in the genome,
    we want to identify which transcription start sites are connected to distal
    regions containing a peak.

    >>> bedpe = pybedtools.example_bedtool('test_bedpe.bed')
    >>> print(bedpe) # doctest: +NORMALIZE_WHITESPACE
    chr1  1  10  chr1  50   90   pair1  5  +  -  x1
    chr1  2  15  chr1  200  210  pair2  1  +  +  y1
    <BLANKLINE>

    >>> tsses = pybedtools.example_bedtool('test_tsses.bed')
    >>> print(tsses) # doctest: +NORMALIZE_WHITESPACE
    chr1  5   6   gene1
    chr1  60  61  gene2
    chr1  88  89  gene3
    <BLANKLINE>

    >>> peaks = pybedtools.example_bedtool('test_peaks.bed')
    >>> print(peaks) # doctest: +NORMALIZE_WHITESPACE
    chr1  3  4  peak1  50  .
    <BLANKLINE>

    Here's what the tracks look like. Note that pair1 is evidence of
    a gene1-gene2 interaction and a gene1-gene3 interaction::

        TRACKS:

                      1         2  /         5         6    / 8         9     / 20
            0123456789012345678901 / 2345678901234567890123 / 012345678901234 / 0123456789
       pair1 |||||||||------------ / --------|||||||||||||| / ||||||||||||||| /
       pair2  |||||||||||||------- / ---------------------- / --------------- / ||||||||||
       tsses     1                 /                   2    /         3
       peaks   1


    >>> from collections import OrderedDict
    >>> queries = OrderedDict()
    >>> queries['tss'] = tsses
    >>> queries['pk'] = peaks
    >>> iterator, n, extra = tag_bedpe(bedpe, queries)
    >>> for (label, group1, group2) in iterator:
    ...    group1 = sorted(group1, key=lambda x: str(x))
    ...    group2 = sorted(group2, key=lambda x: str(x))
    ...    for i in group1:
    ...        print(i, end='')  # doctest: +NORMALIZE_WHITESPACE
    ...    for i in group2:
    ...        print(i, end='')  # doctest: +NORMALIZE_WHITESPACE
    chr1       1       10      pair1   5       +       x1      pk      chr1    3       4       peak1   50      .       1
    chr1       1       10      pair1   5       +       x1      tss     chr1    5       6       gene1   1
    chr1       50      90      pair1   5       -       x1      tss     chr1    60      61      gene2   1
    chr1       50      90      pair1   5       -       x1      tss     chr1    88      89      gene3   1
    chr1       2       15      pair2   1       +       y1      pk      chr1    3       4       peak1   50      .       1
    chr1       2       15      pair2   1       +       y1      tss     chr1    5       6       gene1   1
    chr1       200     210     pair2   1       +       y1      .       .       -1      -1      .       0

    Now we run the same thing, but now aggregate it. Note that each piece of
    interaction evidence has its own line. The first line shows that pair1 has
    gene1 and peak1 in the same fragment, and that they are connected to gene2.
    The second line shows again that gene1 and peak1 are in the same fragmet
    and that they are also connected to gene3:

    >>> import pandas; pandas.set_option('display.max_columns', 10)
    >>> iterator, n, extra = tag_bedpe(bedpe, {'tss': tsses, 'pk': peaks})
    >>> df =  cis_trans_interactions(iterator, n, extra)
    >>> print(df.sort_values(list(df.columns)).reset_index(drop=True))
      target_label target_name cis_label cis_name distal_label distal_name  label
    0           pk       peak1       tss    gene1            .           .  pair2
    1           pk       peak1       tss    gene1          tss       gene2  pair1
    2           pk       peak1       tss    gene1          tss       gene3  pair1
    3          tss       gene1        pk    peak1            .           .  pair2
    4          tss       gene1        pk    peak1          tss       gene2  pair1
    5          tss       gene1        pk    peak1          tss       gene3  pair1
    6          tss       gene2       tss    gene3           pk       peak1  pair1
    7          tss       gene2       tss    gene3          tss       gene1  pair1
    8          tss       gene3       tss    gene2           pk       peak1  pair1
    9          tss       gene3       tss    gene2          tss       gene1  pair1

    If we only care about genes:

    >>> print((df[df.target_label == 'tss']).sort_values(list(df.columns)).reset_index(drop=True))
      target_label target_name cis_label cis_name distal_label distal_name  label
    0          tss       gene1        pk    peak1            .           .  pair2
    1          tss       gene1        pk    peak1          tss       gene2  pair1
    2          tss       gene1        pk    peak1          tss       gene3  pair1
    3          tss       gene2       tss    gene3           pk       peak1  pair1
    4          tss       gene2       tss    gene3          tss       gene1  pair1
    5          tss       gene3       tss    gene2           pk       peak1  pair1
    6          tss       gene3       tss    gene2          tss       gene1  pair1


    Note that in pair2, there is no evidence of interaction between gene1 and
    gene2.

    What interacts distally with gene2's TSS?

    >>> assert set(df.loc[df.target_name == 'gene2', 'distal_name']).difference('.') == set([u'gene1', u'peak1'])

    """
    try:
        import pandas
    except ImportError:
        raise ImportError("pandas must be installed to use this function")
    c = 0
    lines = []
    for label, end1_hits, end2_hits in iterator:
        c += 1
        if c % 1000 == 0:
            print('%d (%.1f%%)\r' % (c, c / float(n) * 100), end='')
            sys.stdout.flush()

        # end1_hits has the full lines of all intersections with end1
        end1_hits = list(end1_hits)
        end2_hits = list(end2_hits)

        def get_name_hits(f):
            """
            Returns the key (from which file the interval came) and the name
            (of the individual feature).
            """
            # this is the "name" reported if there was no hit.
            if f[6 + extra] == '.':
                return ('.', '.')
            interval = pybedtools.create_interval_from_list(f[7 + extra:])
            return ([f[6 + extra], interval.name])

        names1 = set(
            six.moves.map(tuple, six.moves.map(get_name_hits, end1_hits)))
        names2 = set(
            six.moves.map(tuple, six.moves.map(get_name_hits, end2_hits)))

        for cis, others in [(names1, names2), (names2, names1)]:
            for target in cis:
                if target == ('.', '.'):
                    continue
                non_targets = set(cis).difference([target])
                if len(non_targets) == 0:
                    non_targets = [('.', '.')]
                for non_target in non_targets:
                    for other in others:
                        line = []
                        line.extend(target)
                        line.extend(non_target)
                        line.extend(other)
                        line.append(label)
                        lines.append(line)

    df = pandas.DataFrame(
        lines,
        columns=['target_label', 'target_name', 'cis_label', 'cis_name',
                 'distal_label', 'distal_name', 'label'])
    df = df.drop_duplicates()
    return df