File: genomecov.rst

package info (click to toggle)
bedtools 2.27.1%2Bdfsg-4
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 54,804 kB
  • sloc: cpp: 38,072; sh: 7,307; makefile: 2,241; python: 163
file content (361 lines) | stat: -rwxr-xr-x 13,673 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
.. _genomecov:

###############
*genomecov*
###############

|

.. image:: ../images/tool-glyphs/genomecov-glyph.png 
    :width: 600pt 

|

``bedtools genomecov`` computes histograms (default), per-base reports (``-d``) 
and BEDGRAPH (``-bg``) summaries of feature coverage (e.g., aligned sequences) 
for a given genome. 

.. note::

  1. If using BED/GFF/VCF, the input (``-i``) file must be grouped by 
  chromosome. A simple  ``sort -k 1,1 in.bed > in.sorted.bed`` will suffice.
  Also, if using BED/GFF/VCF, one must provide a genome file via the ``-g``
  argument.

  2. If the input is in BAM (-ibam) format, the BAM file must be sorted 
  by position.  Using ``samtools sort aln.bam aln.sorted`` will suffice.


===============================
Usage and option summary
===============================
**Usage**:
::

  bedtools genomecov [OPTIONS] [-i|-ibam] -g (iff. -i)

**(or)**:
::

  genomeCoverageBed [OPTIONS] [-i|-ibam] -g (iff. -i)



===========================      ===============================================================================================================================================================================================================
 Option                           Description
===========================      ===============================================================================================================================================================================================================
**-ibam**                        | BAM file as input for coverage. Each BAM alignment in A added to the total coverage for the genome. 
                                 | Use "stdin" or simply "-" if passing it with a UNIX pipe: For example:
                                 | ``samtools view -b <BAM> | genomeCoverageBed -ibam stdin -g hg18.genome``
**-d**                           Report the depth at each genome position with 1-based coordinates.
**-dz**                          Report the depth at each genome position with 0-based coordinates.
                                 Unlike, `-d`, this reports only non-zero positions.
**-bg**                          Report depth in BedGraph format. For details, see: http://genome.ucsc.edu/goldenPath/help/bedgraph.html
**-bga**                         Report depth in BedGraph format, as above (i.e., -bg). However with this option, regions with zero coverage are also reported. This allows one to quickly extract all regions of a genome with 0 coverage by applying: "grep -w 0$" to the output.
**-split**                       Treat "split" BAM or BED12 entries as distinct BED intervals when computing coverage. For BAM files, this uses the CIGAR "N" and "D" operations to infer the blocks for computing coverage. For BED12 files, this uses the BlockCount, BlockStarts, and BlockEnds fields (i.e., columns 10,11,12).
**-strand**                      Calculate coverage of intervals from a specific strand. With BED files, requires at least 6 columns (strand is column 6).
**-5**                           Calculate coverage of 5' positions (instead of entire interval).
**-3**                           Calculate coverage of 3' positions (instead of entire interval).
**-max**                         Combine all positions with a depth >= max into a single bin in the histogram.
**-scale**                       | Scale the coverage by a constant factor.
                                 | Each coverage value is multiplied by this factor before being reported.
                                 | Useful for normalizing coverage by, e.g., reads per million (RPM).
                                 | ``Default is 1.0; i.e., unscaled.``
**-trackline**                   | Adds a UCSC/Genome-Browser track line definition in the first line of the output.
                                 | See `here <http://genome.ucsc.edu/goldenPath/help/bedgraph.html>`_ for more details about track line definition:
**-trackopts**                   Writes additional track line definition parameters in the first line.
**-pc**                          | Calculates coverage of intervals from left point of a pair reads to the right point.
                                 | Works for BAM files only
**-fs**                          | Forces to use fragment size instead of read length
                                 | Works for BAM files only

===========================      ===============================================================================================================================================================================================================




==========================================================================
Default behavior
==========================================================================
By default, ``bedtools genomecov`` will compute a histogram of coverage for 
the genome file provided. The default output format is as follows:

1. chromosome (or entire genome)
2. depth of coverage from features in input file
3. number of bases on chromosome (or genome) with depth equal to column 2.
4. size of chromosome (or entire genome) in base pairs
5. fraction of bases on chromosome (or entire genome) with depth equal to column 2.

For example:

.. code-block:: bash

  $ cat A.bed
  chr1  10  20
  chr1  20  30
  chr2  0   500

  $ cat my.genome
  chr1  1000
  chr2  500

  $ bedtools genomecov -i A.bed -g my.genome
  chr1   0  980  1000  0.98
  chr1   1  20   1000  0.02
  chr2   1  500  500   1
  genome 0  980  1500  0.653333
  genome 1  520  1500  0.346667

 
==========================================================================
``-max`` Controlling the histogram's maximum depth 
==========================================================================
Using the ``-max`` option, ``bedtools genomecov`` will "lump" all positions in
the genome having feature coverage greater than or equal to ``-max`` into 
the ``-max`` histogram bin. For example, if one sets ``-max``
equal to 50, the max depth reported in the output will be 50 and all positions 
with a depth >= 50 will be represented in bin 50.


==========================================================================
``-d`` Reporting "per-base" genome coverage 
==========================================================================
Using the ``-d`` option, ``bedtools genomecov`` will compute the depth of 
feature coverage for each base on each chromosome in genome file provided.

The "per-base" output format is as follows:

1. chromosome
2. chromosome position
3. depth (number) of features overlapping this chromosome position.

For example:

.. code-block:: bash

  $ cat A.bed
  chr1  10  20
  chr1  20  30
  chr2  0   500

  $ cat my.genome
  chr1  1000
  chr2  500

  $ bedtools genomecov -i A.bed -g my.genome -d | \
        head -15 | \
        tail -n 10
  chr1  6  0
  chr1  7  0
  chr1  8  0
  chr1  9  0
  chr1  10 0
  chr1  11 1
  chr1  12 1
  chr1  13 1
  chr1  14 1
  chr1  15 1


==========================================================================
``-bg`` Reporting genome coverage in BEDGRAPH format.
==========================================================================
Whereas the ``-d`` option reports an output line describing the observed 
coverage at each and every position in the genome, the ``-bg`` option instead
produces genome-wide coverage output in 
`BEDGRAPH <http://genome.ucsc.edu/goldenPath/help/bedgraph.html>`_ format. 
This is a much more concise representation since consecutive positions with the
same coverage are reported as a single output line describing the start and end
coordinate of the interval having the coverage level, followed by the coverage 
level itself.


For example, below is a snippet of BEDGRAPH output of the coverage from a 1000
Genome Project BAM file:

.. code-block:: bash
  
  $ bedtools genomecov -ibam NA18152.bam -bg | head
  chr1	554304	554309	5
  chr1	554309	554313	6
  chr1	554313	554314	1
  chr1	554315	554316	6
  chr1	554316	554317	5
  chr1	554317	554318	1
  chr1	554318	554319	2
  chr1	554319	554321	6
  chr1	554321	554323	1
  chr1	554323	554334	7

Using this format, one can quickly identify regions of the genome with
sufficient coverage (in this case, 10 or more reads) by piping the 
output to an ``awk`` filter.

.. code-block:: bash

  $ bedtools genomecov -ibam NA18152.bam -bg | \
      awk '$4 > 9' | \
      head
  chr1	554377	554381	11
  chr1	554381	554385	12
  chr1	554385	554392	16
  chr1	554392	554408	17
  chr1	554408	554410	19
  chr1	554410	554422	20
  chr1	554422	554423	19
  chr1	554423	554430	22
  chr1	554430	554440	24
  chr1	554440	554443	25


==========================================================================
``-bga`` Reporting genome coverage for *all* positions in BEDGRAPH format.
==========================================================================
The ``-bg`` option reports coverage in BEDGRAPH format only for those regions
of the genome that actually have coverage.  But what about the uncovered portion
of the genome?  By using the ``-bga`` option, one receives a complete report
including the regions with zero coverage.

For example, compare the output from ``-bg``:

.. code-block:: bash
  
  $ bedtools genomecov -ibam NA18152.bam -bg | head
  chr1	554304	554309	5
  chr1	554309	554313	6
  chr1	554313	554314	1
  chr1	554315	554316	6
  chr1	554316	554317	5
  chr1	554317	554318	1
  chr1	554318	554319	2
  chr1	554319	554321	6
  chr1	554321	554323	1
  chr1	554323	554334	7
  
to the output from ``-bga``:

.. code-block:: bash

  # Note the first record reports that the first 554304 
  # base pairs of chr1 had zero coverage
  $ bedtools genomecov -ibam NA18152.bam -bga | head
  chr1	0	554304	0
  chr1	554304	554309	5
  chr1	554309	554313	6
  chr1	554313	554314	1
  chr1	554314	554315	0
  chr1	554315	554316	6
  chr1	554316	554317	5
  chr1	554317	554318	1
  chr1	554318	554319	2
  chr1	554319	554321	6


==========================================================================
``-strand`` Reporting genome coverage for a specific strand.
==========================================================================
Whereas the default is to count coverage regardless of strand, the ``-strand`` 
option allows one to report the coverage observed for a specific strand. 

Compare:

.. code-block:: bash
  
  $ bedtools genomecov -ibam NA18152.bam -bg | head
  chr1	554304	554309	5
  chr1	554309	554313	6
  chr1	554313	554314	1
  chr1	554315	554316	6
  chr1	554316	554317	5
  chr1	554317	554318	1
  chr1	554318	554319	2
  chr1	554319	554321	6
  chr1	554321	554323	1
  chr1	554323	554334	7
  
to

.. code-block:: bash
  
  $ bedtools genomecov -ibam NA18152.bam -bg -strand + | head
  chr1	554385	554392	4
  chr1	554392	554408	5
  chr1	554408	554430	6
  chr1	554430	554451	7
  chr1	554451	554455	8
  chr1	554455	554490	9
  chr1	554490	554495	10
  chr1	554495	554496	9
  chr1	554496	554574	10
  chr1	554574	554579	11
  

==========================================================================
``-scale`` Scaling coverage by a constant factor.
==========================================================================
The ``-scale`` option allows one to scale the coverage observed in an interval
file by a constant factor. Each coverage value is multiplied by this factor 
before being reported. This can be useful for normalizing coverage by, 
e.g., metrics such as reads per million (RPM). 

Compare:

.. code-block:: bash
  
  $ bedtools genomecov -ibam NA18152.bam -bg | head
  chr1	554304	554309	5
  chr1	554309	554313	6
  chr1	554313	554314	1
  chr1	554315	554316	6
  chr1	554316	554317	5
  chr1	554317	554318	1
  chr1	554318	554319	2
  chr1	554319	554321	6
  chr1	554321	554323	1
  chr1	554323	554334	7
  
to

.. code-block:: bash
  
  $ bedtools genomecov -ibam NA18152.bam -bg -scale 10.0 | head
  chr1	554304	554309	50
  chr1	554309	554313	60
  chr1	554313	554314	10
  chr1	554315	554316	60
  chr1	554316	554317	50
  chr1	554317	554318	10
  chr1	554318	554319	20
  chr1	554319	554321	60
  chr1	554321	554323	10
  chr1	554323	554334	70
  

==============================================================================
``-split`` Reporting coverage with spliced alignments or blocked BED features 
==============================================================================
``bedtools genomecov`` will, by default, screen for overlaps against the
entire span of a spliced/split BAM alignment or blocked BED12 feature. When 
dealing with RNA-seq reads, for example, one typically wants to only screen 
for overlaps for the portions of the reads that come from exons (and ignore the 
interstitial intron sequence). The ``-split`` command allows for such
overlaps to be performed.


==============================================================================
Coverage by fragment
==============================================================================

For the Debian package the image barski_binding_site.png from Jothi et al, 2008, doi:10.1093/nar/gkn488
was removed from this doc since it is licensed under no-DFSG CC-BY-NC 2.0 and thus not distributable.

In ChiP-Seq the binding site is usually not at the coordinate where reads map,
but in the middle of the fragment. For this reason we often try to estimate average fragment size
for single-read experiment and extend the reads in the 5’-3’ direction up to the estimated fragment length.
The coverage "by estimated fragments" or by actual pair-end fragments graph is expected to peak at the actual binding site.


``-fs`` Forces to use provided fragment size.


``-pc`` Calculates coverage for paired-end reads, coverage is calculated as the number of fragments covering each base pair