Under the hood
==============

This section documents some details about what happens when a :class:`BedTool`
object is created and exactly what happens when a BEDTools command is called.
It's mostly useful for developers or for debugging.


There are three kinds of sources/sinks for BedTool objects:

* filename
* open file object
* iterator of Interval objects
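
The distinction between these three kinds can be sketched as a small dispatch
function.  This is only an illustration, not the pybedtools implementation:
`classify_source` is a hypothetical helper, and the checks it uses (a string
is a filename, anything with a `read()` method is an open file) are
assumptions::

    import io


    def classify_source(source):
        """Return which of the three kinds of source `source` looks like."""
        if isinstance(source, str):
            return "filename"
        # Open file objects (including subprocess pipes) expose read().
        if hasattr(source, "read"):
            return "open file object"
        # Anything else is assumed to be an iterable of Interval objects.
        return "iterator"


    kinds = [
        classify_source("a.bed"),
        classify_source(io.StringIO("chr1\t1\t100\n")),
        classify_source(iter([])),
    ]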


Iterator "protocol"
-------------------
BedTool objects yield an Interval object on each `next()` call.  Where this
Interval comes from depends on how the BedTool was created and what format the
underlying data are in, as follows.

Filename-based
~~~~~~~~~~~~~~
If the file is in BED/GTF/GFF/VCF format, an `IntervalFile` object is used for
Cython/C++ speed.

If the file is in SAM format, an `IntervalIterator` is used.  This is a Cython
object that reads individual lines and passes them to
`create_interval_from_list`, a Cython function.  `create_interval_from_list`
does much of the work of detecting the format of each line, which is how SAM
Interval objects are supported.

If the file is in BAM format, `samtools view` is first called via Popen, and an
`IntervalIterator` is created from the resulting subprocess.PIPE, as for SAM
format.

Open file-based
~~~~~~~~~~~~~~~
All formats are passed to an `IntervalIterator`, which reads one line at
a time and yields an `Interval` object.

If it's a BAM file (specifically, a detected bgzip stream), then it's actually
first sent to the stdin of a `samtools` Popen call, and then the
subprocess.PIPE from that Popen's stdout is sent to an `IntervalIterator`.
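
The line-at-a-time behavior described here can be sketched in pure Python.
This is a simplified stand-in for `IntervalIterator`, not its actual Cython
code: `SimpleInterval` and `iter_intervals` are hypothetical names, and only
three BED-like fields are parsed::

    import io
    from collections import namedtuple

    # Hypothetical stand-in for an Interval, with only three fields.
    SimpleInterval = namedtuple("SimpleInterval", ["chrom", "start", "end"])


    def iter_intervals(handle):
        """Lazily yield one SimpleInterval per tab-delimited line."""
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            yield SimpleInterval(fields[0], int(fields[1]), int(fields[2]))


    handle = io.StringIO("chr1\t1\t100\nchr1\t150\t500\n")
    intervals = iter_intervals(handle)
    first = next(intervals)  # only the first line has been parsed so far

Because the function is a generator, nothing is parsed until `next()` is
called, which is what allows the same approach to work on a pipe.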

Iterator or generator-based
~~~~~~~~~~~~~~~~~~~~~~~~~~~
If it's neither of the above, then the assumption is that it's already an
iterable of `Interval` objects.  This is the case if a `BedTool` is created
with something like::

    a = pybedtools.example_bedtool('a.bed')
    b = pybedtools.BedTool((i for i in a))


In this case, `(i for i in a)` creates a generator of intervals from an
`IntervalFile`, since `a` is a filename-based BedTool.  Because the first
argument to the BedTool constructor is neither a filename nor an open file, the
`.fn` attribute of the new BedTool `b` is set directly to this generator, so
`b` is a generator-based BedTool.

Calling BEDTools programs
-------------------------
Depending on the type of BedTool (filename, open file, or iterator), the method
of calling BEDTools programs differs.

In all cases, BEDTools commands are called via a `subprocess.Popen` call
(hereafter called "the Popen" for convenience).  Depending on the type of
BedTool objects being operated on, the Popen will be passed different objects
as stdin and/or stdout.

In general, using a filename as input is the most straightforward -- nothing is
passed to the Popen's stdin because the filenames are embedded in the BEDTools
command.

Using non-filename-based BedTools means that they are passed, one line at
a time, to the stdin of the Popen.  The commands for the BEDTools call
will specify "stdin" in these cases, as is standard for the BEDTools suite.
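
Feeding lines to the stdin of a Popen can be sketched as follows.  This is an
illustration only: the child process is a small Python script standing in for
a BEDTools program (an assumption, so that the example runs without BEDTools
installed), and it simply upper-cases whatever it reads::

    import subprocess
    import sys

    # Stand-in for a BEDTools program reading from "stdin".
    cmd = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().upper())"]

    # A generator of lines, as a non-filename-based source would provide.
    lines = (f"chr1\t{start}\t{start + 10}\n" for start in (1, 50, 100))

    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    for line in lines:
        proc.stdin.write(line)  # one line at a time
    proc.stdin.close()          # signal end of input to the child
    output = proc.stdout.read()
    proc.stdout.close()
    proc.wait()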

The default is for the output to be file-based.  In this case, an open tempfile
object is provided as the Popen's stdout.
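
A minimal sketch of file-based output, assuming a stand-in child process in
place of a real BEDTools program: the open tempfile is handed to the Popen as
stdout, and afterwards only the tempfile's name is needed to read the
result::

    import os
    import subprocess
    import sys
    import tempfile

    # Stand-in for a BEDTools program that writes one BED line to stdout.
    cmd = [sys.executable, "-c", "print('chr1\\t1\\t100')"]

    # delete=False so the file survives and can be reopened by name.
    with tempfile.NamedTemporaryFile(mode="w", delete=False) as out:
        proc = subprocess.Popen(cmd, stdout=out)
        proc.wait()
        result_fn = out.name

    # A filename-based result only needs to remember result_fn.
    with open(result_fn) as handle:
        contents = handle.read()

    os.remove(result_fn)  # clean up the example's tempfile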

If the returned BedTool is requested to be a "streaming" BedTool, then the
Popen's stdout will be subprocess.PIPE, and the new BedTool object will be
open-file based (which is what subprocess.PIPE acts like).
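
The streaming case can be sketched in the same way, again with a stand-in
child process rather than a real BEDTools program (an assumption made so the
example is self-contained).  The Popen's stdout is a pipe that can be read
like an open file, one line at a time::

    import subprocess
    import sys

    # Stand-in for a BEDTools program that writes two BED lines to stdout.
    cmd = [sys.executable, "-c",
           "print('chr1\\t1\\t100'); print('chr1\\t150\\t500')"]

    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    stream = proc.stdout        # behaves like an open file
    first_line = next(stream)   # consume lazily, one line at a time
    remaining = list(stream)    # the rest of the output
    stream.close()
    proc.wait()

Unlike the tempfile case, the output can be consumed only once: after the
pipe has been read, the lines are gone.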

Specifically, here is the information flow of stdin/stdout for the various
conversions between BedTool types:


:filename -> filename:
    The calling BedTool is filename-based and `stream=False`.

    * `stdin`: `None` (the filenames are provided in the BEDTools command)
    * `stdout`: open tempfile object
    * new BedTool: filename-based BedTool pointing to the tempfile's filename

:filename -> open file object:
    The calling BedTool is filename-based and `stream=True` is requested.

    * `stdin`: `None` (the filenames are provided in the BEDTools command)
    * `stdout`: open file object -- specifically, subprocess.PIPE
    * new BedTool: open-file-based BedTool that is consumed as an iterator.
      Each `next()` call retrieves the next line from subprocess.PIPE

:open file object -> filename:
    The calling BedTool is based on, e.g., subprocess.PIPE, and a `saveas()`
    call "renders" it to a file.

    * `stdin`: each line in the open file object is written to subprocess.PIPE
    * `stdout`: open file object -- either a tempfile or new file created from
      supplied filename
    * new BedTool: filename-based BedTool

:open file object -> iterator:
    The calling BedTool is usually based on subprocess.PIPE, and the output
    will *also* come from subprocess.PIPE.

    * `stdin`: each line from the open file is written to subprocess.PIPE
    * `stdout`: open file object, subprocess.PIPE
    * new BedTool: open-file-based BedTool reading from subprocess.PIPE
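
The cases above can be summarized in a short sketch.  `plan_popen_io` is a
hypothetical helper, not part of pybedtools.  It only restates the rules
described in this section, it uses the string `"tempfile"` as a placeholder
for an open tempfile object, and it labels streamed results as open-file
based, following the description of streaming output above::

    import subprocess


    def plan_popen_io(input_kind, stream):
        """Summarize what the Popen receives as stdin and stdout.

        `input_kind` is "filename", "open file object" or "iterator".
        """
        # Filenames are embedded in the command; anything else is fed
        # to the Popen through a pipe.
        stdin = None if input_kind == "filename" else subprocess.PIPE
        # Streaming output comes from a pipe; otherwise it goes to a
        # tempfile (placeholder string used here).
        stdout = subprocess.PIPE if stream else "tempfile"
        result_kind = "open file object" if stream else "filename"
        return {"stdin": stdin, "stdout": stdout, "result": result_kind}


    plan = plan_popen_io("filename", stream=False)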