File: answers.html

package info (click to toggle)
bedtools 2.31.1%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: trixie
  • size: 57,304 kB
  • sloc: ansic: 38,507; cpp: 29,721; sh: 8,001; makefile: 663; python: 240; javascript: 16
file content (148 lines) | stat: -rw-r--r-- 8,325 bytes parent folder | download | duplicates (6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <meta name="generator" content="pandoc" />
  <meta name="author" content="Aaron Quinlan" />
  <title>bedtools Tutorial</title>
  <style type="text/css">code{white-space: pre;}</style>
  <link rel="stylesheet" href="bootstrap.css" type="text/css" />
</head>
<body>
    <div class="navbar navbar-static-top">
    <div class="navbar-inner">
      <div class="container">
        <span class="doc-title">bedtools Tutorial</span>
        <ul class="nav pull-right doc-info">
                    <li><p class="navbar-text">Aaron Quinlan</p></li>
                            </ul>
      </div>
    </div>
  </div>
    <div class="container">
    <div class="row">
      <div class="span12">
            <h1 id="puzzles-to-help-teach-you-more-bedtools.">Puzzles to help teach you more bedtools.</h1>
<ol style="list-style-type: decimal">
<li>Create a BED file representing all of the intervals in the genome that are NOT exonic and are not Promoters (based on the promoters in the hESC file).</li>
</ol>
<p>Answer:</p>
<pre><code>grep Promoter hesc.chromHmm.bed &gt; hesc.promoters.bed

cat exons.bed hesc.promoters.bed | sort -k1,1 -k2,2n | exons.and.promoters.bed

bedtools complement -i exons.and.promoters.bed -g genome.txt &gt; notexonsorpromoters.bed</code></pre>
<ol start="2" style="list-style-type: decimal">
<li>What is the average distance from GWAS SNPs to the closest exon? (Hint - have a look at the <a href="http://bedtools.readthedocs.org/en/latest/content/tools/closest.html">closest</a> tool.)</li>
</ol>
<p>Answer:</p>
<pre><code>bedtools closest -a gwas.bed -b exons.bed -d | head
chr1    1005805 1005806 rs3934834   chr1    1007125 1007955 NM_001205252_exon_0_0_chr1_1007126_r    0   -   1320
chr1    1079197 1079198 rs11260603  chr1    1078118 1079434 NR_038869_exon_2_0_chr1_1078119_f   0   +   0
chr1    1247493 1247494 rs12103 chr1    1247397 1247527 NM_001256456_exon_1_0_chr1_1247398_r    0   -   0
chr1    1247493 1247494 rs12103 chr1    1247397 1247527 NM_001256460_exon_1_0_chr1_1247398_r    0   -   0
chr1    1247493 1247494 rs12103 chr1    1247397 1247527 NM_001256462_exon_1_0_chr1_1247398_r    0   -   0
chr1    1247493 1247494 rs12103 chr1    1247397 1247527 NM_001256463_exon_1_0_chr1_1247398_r    0   -   0
chr1    1247493 1247494 rs12103 chr1    1247397 1247527 NM_017871_exon_1_0_chr1_1247398_r   0   -   0
chr1    2069171 2069172 rs425277    chr1    2066700 2066786 NM_001033581_exon_1_0_chr1_2066701_f    0   +   2386
chr1    2069171 2069172 rs425277    chr1    2066700 2066786 NM_001033582_exon_1_0_chr1_2066701_f    0   +   2386
chr1    2069171 2069172 rs425277    chr1    2066700 2066786 NM_001242874_exon_1_0_chr1_2066701_f    0   +   2386

bedtools closest -a gwas.bed -b exons.bed -d \
  | awk &#39;{ sum += $11 } END { if (NR &gt; 0) print sum / NR }&#39;
46713.1</code></pre>
<ol start="3" style="list-style-type: decimal">
<li>Count how many exons occur in each 500kb interval (“window”) in the human genome. (Hint - have a look at the <code>makewindows</code> tool.)</li>
</ol>
<p>Answer:</p>
<pre><code>bedtools makewindows -g genome.txt -w 500000 &gt; genome.windows.bed
bedtools intersect -a genome.windows.bed -b exons.bed -c &gt; genome.windows.exoncount.bedg</code></pre>
<p>or…</p>
<pre><code>bedtools makewindows -g genome.txt -w 500000 \
  | bedtools intersect -a - -b exons.bed -c \
&gt; genome.windows.exoncount.bedg</code></pre>
<ol start="4" style="list-style-type: decimal">
<li>Are there any exons that are completely overlapped by an enhancer? If so, how many?</li>
</ol>
<p>Answer:</p>
<pre><code>bedtools intersect -a exons.bed \
                   -b &lt;(grep Enhancer hesc.chromHmm.bed) \
                   -wa -wb -f 1.0 \
| head
chr1    948846  948956  NM_005101_exon_0_0_chr1_948847_f    0   +   chr1    948337  949337  4_Strong_Enhancer
chr1    1051439 1051736 NM_017891_exon_9_0_chr1_1051440_r   0   -   chr1    1051337 1051737 6_Weak_Enhancer
chr1    1109285 1109306 NM_001130045_exon_0_0_chr1_1109286_f    0   +   chr1    1108537 1109537 6_Weak_Enhancer
chr1    1109803 1109869 NM_001130045_exon_2_0_chr1_1109804_f    0   +   chr1    1109737 1109937 6_Weak_Enhancer
chr1    1219357 1219470 NM_001130413_exon_4_0_chr1_1219358_f    0   +   chr1    1219137 1220137 7_Weak_Enhancer
chr1    1219357 1219470 NR_037668_exon_4_0_chr1_1219358_f   0   +   chr1    1219137 1220137 7_Weak_Enhancer
chr1    1229202 1229313 NM_030649_exon_1_0_chr1_1229203_r   0   -   chr1    1228937 1229937 6_Weak_Enhancer
chr1    1229469 1229579 NM_030649_exon_2_0_chr1_1229470_r   0   -   chr1    1228937 1229937 6_Weak_Enhancer
chr1    1234724 1234736 NM_030649_exon_14_0_chr1_1234725_r  0   -   chr1    1234137 1234937 7_Weak_Enhancer
chr1    1245060 1245231 NM_153339_exon_4_0_chr1_1245061_f   0   +   chr1    1244937 1245337 4_Strong_Enhancer

bedtools intersect -a exons.bed \
                   -b &lt;(grep Enhancer hesc.chromHmm.bed) \
                   -wa -wb -f 1.0 -u \
| wc -l
13746</code></pre>
<ol start="5" style="list-style-type: decimal">
<li>What fraction of the GWAS SNPs are exonic?</li>
</ol>
<p>Answer (Any idea why we need -u?):</p>
<pre><code>wc -l gwas.bed
17680 gwas.bed

bedtools intersect -a gwas.bed -b exons.bed -u | wc -l
1625

echo &quot;foo&quot; | awk &#39;{print 1625/17680}&#39;
0.0919118</code></pre>
<ol start="6" style="list-style-type: decimal">
<li>What fraction of the GWAS SNPs are lie in either enhancers or promoters in the hESC data we have?</li>
</ol>
<p>Answer (Any idea why we need -u?):</p>
<pre><code>bedtools intersect -a gwas.bed -b &lt;(egrep &quot;Enhancer|Promoter&quot; hesc.chromHmm.bed) -u \
| wc -l
1285

echo &quot;foo&quot; | awk &#39;{print 1285/17680}&#39;
0.072681</code></pre>
<ol start="7" style="list-style-type: decimal">
<li>Create intervals representing the canonical 2bp splice sites on either side of each exon (don’t worry about excluding splice sites at the first or last exon). (Hint - have a look at the <a href="http://bedtools.readthedocs.org/en/latest/content/tools/flank.html">flank</a> tool.)</li>
</ol>
<p>Answer:</p>
<pre><code>bedtools flank -l 2 -r 2 -i exons.bed -g genome.txt &gt; splice-sites.bed</code></pre>
<p>Or:</p>
<pre><code>bedtools slop -b 2 -i exons.bed -g genome.txt &gt; exons.plus2.bed
bedtools subtract -a exons.plus2.bed -b exons.bed &gt; splice-sites.bed</code></pre>
<ol start="8" style="list-style-type: decimal">
<li>What is the Jaccard statistic between CpG and hESC enhancers? Compare that to the Jaccard statistic between CpG and hESC promoters. Does the result make sense? (Hint - you will need <code>grep</code>).</li>
</ol>
<p>Answer:</p>
<pre><code>bedtools jaccard -a cpg.bed -b &lt;(grep Enhancer hesc.chromHmm.bed)
intersection    union-intersection  jaccard n_intersections
1148180 132977386   0.0086344   4969

bedtools jaccard -a cpg.bed -b &lt;(grep Promoter hesc.chromHmm.bed)
intersection    union-intersection  jaccard n_intersections
15661111    53551816    0.292448    20402</code></pre>
<ol start="9" style="list-style-type: decimal">
<li>What would you expect the Jaccard statistic to look like if promoters were randomly distributed throughout the genome? (Hint - you will need the <a href="http://bedtools.readthedocs.org/en/latest/content/tools/shuffle.html">shuffle</a> tool.)</li>
</ol>
<p>Answer:</p>
<pre><code>bedtools shuffle -i &lt;(grep Promoter hesc.chromHmm.bed) -g genome.txt \
  | sort -k1,1 -k2,2n \
&gt; promoters.shuffled.bed

bedtools jaccard -a cpg.bed -b promoters.shuffled.bed
intersection    union-intersection  jaccard n_intersections
294071  68556207    0.00428949  78</code></pre>
<ol start="10" style="list-style-type: decimal">
<li>Which hESC ChromHMM state (e.g., 11_Weak_Txn, 10_Txn_Elongation) represents the most number of base pairs in the genome? (Hint: you will need to use <code>awk</code> or <code>perl</code> here, as well as the <a href="http://bedtools.readthedocs.org/en/latest/content/tools/groupby.html">groupby</a> tool.)</li>
</ol>
            </div>
    </div>
  </div>
</body>
</html>