File: README.rst

package info (click to toggle)
python-pybedtools 0.10.0-4
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 16,620 kB
  • sloc: python: 10,030; cpp: 899; makefile: 142; sh: 57
file content (66 lines) | stat: -rw-r--r-- 2,242 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66

Overview
--------

The `BEDTools suite of programs <http://bedtools.readthedocs.org/>`_ is widely
used for genomic interval manipulation or "genome algebra".  `pybedtools` wraps
and extends BEDTools and offers feature-level manipulations from within
Python.

See full online documentation, including installation instructions, at
http://daler.github.io/pybedtools/.

Why `pybedtools`?
-----------------

Here is an example to get the names of genes that are <5 kb away from
intergenic SNPs:

.. code-block:: python

    from pybedtools import BedTool

    snps = BedTool('snps.bed.gz')  # [1]
    genes = BedTool('hg19.gff')    # [1]

    intergenic_snps = snps.subtract(genes)                       # [2]
    nearby = genes.closest(intergenic_snps, d=True, stream=True) # [2, 3]

    for gene in nearby:             # [4]
        if int(gene[-1]) < 5000:    # [4]
            print gene.name         # [4]

Useful features shown here include:

* `[1]` support for all BEDTools-supported formats (here gzipped BED and GFF)
* `[2]` wrapping of all BEDTools programs and arguments (here, `subtract` and `closest` and passing
  the `-d` flag to `closest`);
* `[3]` streaming results (like Unix pipes, here specified by `stream=True`)
* `[4]` iterating over results while accessing feature data by index or by attribute
  access (here `[-1]` and `.name`).

In contrast, here is the same analysis using shell scripting.  Note that this
requires knowledge in Perl, bash, and awk.  The run time is identical to the
`pybedtools` version above:

.. code-block:: bash

    snps=snps.bed.gz
    genes=hg19.gff
    intergenic_snps=/tmp/intergenic_snps

    snp_fields=`zcat $snps | awk '(NR == 2){print NF; exit;}'`
    gene_fields=9
    distance_field=$(($gene_fields + $snp_fields + 1))

    intersectBed -a $snps -b $genes -v > $intergenic_snps

    closestBed -a $genes -b $intergenic_snps -d \
    | awk '($'$distance_field' < 5000){print $9;}' \
    | perl -ne 'm/[ID|Name|gene_id]=(.*?);/; print "$1\n"'

    rm $intergenic_snps

See the `Shell script comparison <http://daler.github.io/pybedtools/sh-comparison.html>`_ in the docs
for more details on this comparison, or keep reading the full documentation at
http://daler.github.io/pybedtools.