File: overview.rst

package info (click to toggle)
bedtools 2.26.0%2Bdfsg-3
links: PTS, VCS
area: main
in suites: stretch
size: 55,328 kB
sloc: cpp: 37,989; sh: 6,930; makefile: 2,225; python: 163
file content (328 lines) | stat: -rwxr-xr-x 22,048 bytes
parent folder | download | duplicates (2)
############
Overview
############

==========
Background
==========

The development of bedtools was motivated by a need for fast, flexible tools with which to compare large sets of genomic
features. Answering fundamental research questions with existing tools was either too slow or required modifications to the
way they reported or computed their results. We were aware of the utilities on the UCSC Genome Browser and Galaxy websites, as
well as the elegant tools available as part of Jim Kent’s monolithic suite of tools (“Kent source”). However, we found that
the web-based tools were too cumbersome when working with large datasets generated by current sequencing technologies.
Similarly, we found that the Kent source command line tools often required a local installation of the UCSC Genome Browser.
These limitations, combined with the fact that we often wanted an extra option here or there that wasn’t available with
existing tools, led us to develop our own from scratch. The initial version of bedtools was publicly released in the spring of
2009. The current version has evolved from our research experiences and those of the scientists using the suite over the last
year. The bedtools suite enables one to answer common questions of genomic data in a fast and reliable manner. The fact that
almost all the utilities accept input from “stdin” allows one to “stream / pipe” several commands together to facilitate more
complicated analyses. Also, the tools allow fine control over how output is reported. The initial version of bedtools
supported solely 6-column `BED <http://genome.ucsc.edu/FAQ/FAQformat#format1>`_ files. *However, we have subsequently added support for sequence alignments in* `BAM <http://samtools.sourceforge.net/>`_
*format, as well as for features in* `GFF <http://genome.ucsc.edu/FAQ/FAQformat#format3>`_ , *“blocked” BED format, and*
`VCF <http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41>`_ *format*. 
The tools are quite fast and typically finish in a matter of a few seconds, even for large datasets. This manual seeks to describe the behavior and
available functionality for each bedtool. Usage examples are scattered throughout the text, and formal examples are
provided in the last two sections, we hope that this document will give you a sense of the flexibility of
the toolkit and the types of analyses that are possible with bedtools. If you have further questions, please join the bedtools
discussion group, visit the Usage Examples on the Google Code site (usage, advanced usage), or take a look at the nascent
“Usage From the Wild” page.

===========================
Summary of available tools.
===========================

bedtools support a  wide range of operations for  interrogating and manipulating genomic features. The table below summarizes
the tools available in the suite.

===========================      =========================================================================================================================================================
Utility	                         Description
===========================      =========================================================================================================================================================
**annotate**                     Annotate coverage of features from multiple files.
**bamtobed**                     Convert BAM alignments to BED (& other) formats.
**bamtofastq**                   Convert BAM records to FASTQ records.
**bed12tobed6**                  Breaks BED12 intervals into discrete BED6 intervals.
**bedpetobam**                   Convert BEDPE intervals to BAM records.
**bedtobam**                     Convert intervals to BAM records.
**closest**                      Find the closest, potentially non-overlapping interval.
**cluster**                      Cluster (but don't merge) overlapping/nearby intervals.
**complement**                   Extract intervals _not_ represented by an interval file.
**coverage**                     Compute the coverage over defined intervals.
**expand**                       Replicate lines based on lists of values in columns.
**flank**                        Create new intervals from the flanks of existing intervals.
**genomecov**                    Compute the coverage over an entire genome.
**getfasta**                     Use intervals to extract sequences from a FASTA file.
**groupby**                      Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
**igv**                          Create an IGV snapshot batch script.
**intersect**                    Find overlapping intervals in various ways.
**jaccard**                      Calculate the Jaccard statistic b/w two sets of intervals.
**links**                        Create a HTML page of links to UCSC locations.
**makewindows**                  Make interval "windows" across a genome.
**map**                          Apply a function to a column for each overlapping interval.
**maskfasta**                    Use intervals to mask sequences from a FASTA file.
**merge**                        Combine overlapping/nearby intervals into a single interval.
**multicov**                     Counts coverage from multiple BAMs at specific intervals.
**multiinter**                   Identifies common intervals among multiple interval files.
**nuc**                          Profile the nucleotide content of intervals in a FASTA file.
**overlap**                      Computes the amount of overlap from two intervals.
**pairtobed**                    Find pairs that overlap intervals in various ways.
**pairtopair**                   Find pairs that overlap other pairs in various ways.
**random**                       Generate random intervals in a genome.
**reldist**                      Calculate the distribution of relative distances b/w two files.
**shift**                        Adjust the position of intervals.
**shuffle**                      Randomly redistribute intervals in a genome.
**slop**                         Adjust the size of intervals.
**sort**                         Order the intervals in a file.
**subtract**                     Remove intervals based on overlaps b/w two files.
**tag**                          Tag BAM alignments based on overlaps with interval files.
**unionbedg**                    Combines coverage intervals from multiple BEDGRAPH files.
**window**                       Find overlapping intervals within a window around an interval.
===========================      =========================================================================================================================================================






===========================
Fundamental concepts.
===========================
------------------------------------------------------
What are genome features and how are they represented?
------------------------------------------------------
Throughout this manual, we will discuss how to use bedtools to manipulate, compare and ask questions of genome “features”. Genome features can be functional elements (e.g., genes), genetic polymorphisms (e.g.
SNPs, INDELs, or structural variants), or other annotations that have been discovered or curated by genome sequencing groups or genome browser groups. In addition, genome features can be custom annotations that
an individual lab or researcher defines (e.g., my novel gene or variant). 

The basic characteristics of a genome feature are the chromosome or scaffold on which the feature “resides”, the base pair on which the
feature starts (i.e. the “start”), the base pair on which feature ends (i.e. the “end”), the strand on which the feature exists (i.e. “+” or “-“), and the name of the feature if one is applicable. 

The two most widely used formats for representing genome features are the BED (Browser Extensible Data) and GFF (General Feature Format) formats. bedtools was originally written to work exclusively with genome features
described using the BED format, but it has been recently extended to seamlessly work with BED, GFF and VCF files. 

Existing annotations for the genomes of many species can be easily downloaded in BED and GFF
format from the UCSC Genome Browser’s “Table Browser” (http://genome.ucsc.edu/cgi-bin/hgTables?command=start) or from the “Bulk Downloads” page (http://hgdownload.cse.ucsc.edu/downloads.html). In addition, the
Ensemble Genome Browser contains annotations in GFF/GTF format for many species (http://www.ensembl.org/info/data/ftp/index.html)

-------------------------------------
Overlapping / intersecting features.
-------------------------------------
Two genome features (henceforth referred to as “features”) are said to overlap or intersect if they share at least one base in common. 
In the figure below, Feature A intersects/overlaps Feature B, but it does not intersect/overlap Feature C.

**TODO: place figure here**

--------------------------------------------
Comparing features in file “A” and file “B”.
--------------------------------------------
The previous section briefly introduced a fundamental naming convention used in bedtools. Specifically, all bedtools that compare features contained in two distinct files refer to one file as feature set “A” and the other file as feature set “B”. This is mainly in the interest of brevity, but it also has its roots in set theory.
As an example, if one wanted to look for SNPs (file A) that overlap with exons (file B), one would use bedtools intersect in the following manner::

  bedtools intersect –a snps.bed –b exons.bed

There are two exceptions to this rule: 1) When the “A” file is in BAM format, the “-abam” option must bed used. For example::

  bedtools intersect –abam alignedReads.bam –b exons.bed 

And 2) For tools where only one input feature file is needed, the “-i” option is used. For example::

  bedtools merge –i repeats.bed

-----------------------------------------------------
BED starts are zero-based and BED ends are one-based.
-----------------------------------------------------
bedtools users are sometimes confused by the way the start and end of BED features are represented. Specifically, bedtools uses the UCSC Genome Browser’s internal database convention of making the start position 0-based and the end position 1-based: (http://genome.ucsc.edu/FAQ/FAQtracks#tracks1)
In other words, bedtools interprets the “start” column as being 1 basepair higher than what is represented in the file. For example, the following BED feature represents a single base on chromosome 1; namely, the 1st base::

  chr1   0	  1    first_base

Why, you might ask? The advantage of storing features this way is that when computing the length of a feature, one must simply subtract the start from the end.	Were the start position 1-based, 
the calculation would be (slightly) more complex (i.e. (end-start)+1). Thus, storing BED features this way reduces the computational burden.

-----------------------------------------------------
GFF starts and ends are one-based.
-----------------------------------------------------
In contrast, the GFF format uses 1-based coordinates for both the start and the end positions. bedtools is aware of this and adjusts the positions accordingly. 
In other words, you don’t need to subtract 1 from the start positions of your GFF features for them to work correctly with bedtools.

-----------------------------------------------------
VCF coordinates are one-based.
-----------------------------------------------------
The VCF format uses 1-based coordinates. As in GFF, bedtools is aware of this and adjusts the positions accordingly. 
In other words, you don’t need to subtract 1 from the start positions of your VCF features for them to work correctly with bedtools.

-----------------------------------------------------
File B is loaded into memory (most of the time).
-----------------------------------------------------
Whenever a bedtool compares two files of features, the “B” file is loaded into memory. By contrast, the “A” file is processed line by line and compared with the features from B. 
Therefore to minimize memory usage, one should set the smaller of the two files as the B file. One salient example is the comparison of aligned sequence reads from a 
current DNA sequencer to gene annotations.	In this case, the aligned sequence file (in BED format) may have tens of millions of features (the sequence alignments), 
while the gene annotation file will have tens of thousands of features. In this case, it is wise to sets the reads as file A and the genes as file B.

-----------------------------------------------------
Feature files *must* be tab-delimited.
----------------------------------------------------- 
This is rather self-explanatory. While it is possible to allow BED files to be space-delimited, we have decided to require tab delimiters for three reasons:

1. By requiring one delimiter type, the processing time is minimized. 
2. Tab-delimited files are more amenable to other UNIX utilities. 
3. GFF files can contain spaces within attribute columns. This complicates the use of space-delimited files as spaces must therefore be treated specially depending on the context.

-------------------------------------------------------------
All bedtools allow features to be “piped” via standard input.
-------------------------------------------------------------

In an effort to allow one to combine multiple bedtools and other UNIX utilities into more complicated “pipelines”, all bedtools allow features 
to be passed to them via standard input. Only one feature file may be passed to a bedtool via standard input. 
The convention used by all bedtools is to set either file A or file B to “stdin” or "-". For example::

  cat snps.bed | bedtools intersect –a stdin –b exons.bed 
  cat snps.bed | bedtools intersect –a - –b exons.bed 

In addition, all bedtools that simply require one main input file (the -i file) will assume that input is
coming from standard input if the -i parameter is ignored. For example, the following are equivalent::

  cat snps.bed | bedtools sort –i stdin 
  cat snps.bed | bedtools sort

------------------------------------------------------
Most bedtools write their results to standard output.
------------------------------------------------------
To allow one to combine multiple bedtools and other UNIX utilities into more complicated “pipelines”, 
most bedtools report their output to standard output, rather than to a named file. If one wants to write the output to a named file, one can use the UNIX “file redirection” symbol “>” to do so.
Writing to standard output (the default)::

   bedtools intersect –a snps.bed –b exons.bed
   chr1 100100 100101 rs233454
   chr1 200100 200101 rs446788
   chr1 300100 300101 rs645678

Writing to a file::

  bedtools intersect –a snps.bed –b exons.bed > snps.in.exons.bed

  cat snps.in.exons.bed
  chr1 100100 100101 rs233454
  chr1 200100 200101 rs446788
  chr1 300100 300101 rs645678

------------------------
What is a “genome” file?
------------------------
Some of the bedtools (e.g., ``genomecov``, ``complement``, ``slop``) need to know the size of
the chromosomes for the organism for which your BED files are based. When using the UCSC Genome
Browser, Ensemble, or Galaxy, you typically indicate which species / genome build you are working.
The way you do this for bedtools is to create a “genome” file, which simply lists the names of the
chromosomes (or scaffolds, etc.) and their size (in basepairs).
Genome files must be tab-delimited and are structured as follows (this is an example for C. elegans)::

  chrI 15072421
  chrII 15279323
  ...
  chrX 17718854
  chrM 13794

bedtools includes predefined genome files for human and mouse in the /genomes directory included
in the bedtools distribution. Additionally, the “chromInfo” files/tables available from the UCSC
Genome Browser website are acceptable. For example, one can download the hg19 chromInfo file here:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromInfo.txt.gz

------------------------------------
Paired-end BED files (BEDPE files).
------------------------------------
We have defined a new file format (BEDPE) to concisely describe disjoint genome features, such as
structural variations or paired-end sequence alignments. We chose to define a new format because the
existing BED block format (i.e. BED12) does not allow inter-chromosomal feature definitions. Moreover,
the BED12 format feels rather bloated when one want to describe events with only two blocks. 

------------------------------------------
Use “-h” for help with any bedtool.
------------------------------------------
Rather straightforward. If you use the “-h” option with any bedtool, a full menu of example usage
and available options (when applicable) will be reported.

--------------------------------------------------
BED features must not contain negative positions.
--------------------------------------------------
bedtools will typically reject BED features that contain negative positions. In special cases, however,
BEDPE positions may be set to -1 to indicate that one or more ends of a BEDPE feature is unaligned.

---------------------------------------------------
The start position must be <= to the end position.
---------------------------------------------------
bedtools will reject BED features where the start position is greater than the end position.

-----------------------------------------
Headers are allowed in GFF and BED files
-----------------------------------------
bedtools will ignore headers at the beginning of BED and GFF files. Valid header lines begin with a
“#” symbol, the work “track”, or the word “browser”. For example, the following examples are valid
headers for BED or GFF files::

  track name=aligned_read description="Illumina aligned reads”
  chr5 100000 500000 read1 50 +
  chr5 2380000 2386000 read2 60 -

  #This is a fascinating dataset
  chr5 100000 500000 read1 50 +
  chr5 2380000 2386000 read2 60 -

  browser position chr22:1-20000
  chr5 100000 500000 read1 50 +
  chr5 2380000 2386000 read2 60 -

-------------------------------------------------------------
GZIP support: BED, GFF, VCF, and BEDPE file can be “gzipped”
-------------------------------------------------------------
bedtools will process gzipped BED, GFF, VCF and BEDPE files in the same manner as
uncompressed files. Gzipped files are auto-detected thanks to a helpful contribution from Gordon
Assaf.

----------------------------------------------------------------------------
Support for “split” or “spliced” BAM alignments and “blocked” BED features
----------------------------------------------------------------------------
As of Version 2.8.0, five bedtools (``intersect``, ``coverage``, ``genomecob``,
``bamToBed``, and ``bed12ToBed6``) can properly handle “split”/”spliced” BAM alignments (i.e., having an
“N” CIGAR operation) and/or “blocked” BED (aka BED12) features.

``intersect``, ``coverage``, and ``genomecov`` will optionally handle “split” BAM and/or
“blocked” BED by using the ``-split`` option. This will cause intersects or coverage to be computed only
for the alignment or feature blocks. In contrast, without this option, the intersects/coverage would be
computed for the entire “span” of the alignment or feature, regardless of the size of the gaps between
each alignment or feature block. For example, imagine you have a RNA-seq read that originates from
the junction of two exons that were spliced together in a mRNA. In the genome, these two exons
happen to be 30Kb apart. Thus, when the read is aligned to the reference genome, one portion of the
read will align to the first exon, while another portion of the read will align ca. 30Kb downstream to the
other exon. The corresponding CIGAR string would be something like (assuming a 76bp read):
30M*3000N*46M. In the genome, this alignment “spans” 3076 bp, yet the nucleotides in the sequencing
read only align “cover” 76bp. Without the ``-split`` option, coverage or overlaps would be reported for the
entire 3076bp span of the alignment. However, with the ``-split`` option, coverage or overlaps will only
be reported for the portions of the read that overlap the exons (i.e. 30bp on one exon, and
46bp on the other).


Using the -split option with bamToBed causes “spliced/split” alignments to be reported in BED12
format. Using the -split option with ``bed12tobed6`` causes “blocked” BED12 features to be reported in
BED6 format.

--------------------------------
Writing uncompressed BAM output.
--------------------------------
When working with a large BAM file using a complex set of tools in a pipe/stream, it is advantageous
to pass uncompressed BAM output to each downstream program. This minimizes the amount of time
spent compressing and decompressing output from one program to the next. All bedtools that create
BAM output (e.g. ``intersect``, ``window``) will now optionally create uncompressed BAM output
using the ``-ubam`` option.



=====================================
Implementation and algorithmic notes.
=====================================
bedtools was implemented in C++ and makes extensive use of data structures and fundamental
algorithms from the Standard Template Library (STL). Many of the core algorithms are based upon the
genome binning algorithm described in the original UCSC Genome Browser paper (Kent et al, 2002).
The tools have been designed to inherit core data structures from central source files, thus allowing
rapid tool development and deployment of improvements and corrections. Support for BAM files is
made possible through Derek Barnett’s elegant C++ API called BamTools.