.. _getfasta:

###############
*getfasta*
###############

|

.. image:: ../images/tool-glyphs/getfasta-glyph.png 
    :width: 600pt 


``bedtools getfasta`` extracts sequences from a FASTA file for each of the 
intervals defined in a BED/GFF/VCF file. 

.. tip::
    
    1. The headers in the input FASTA file must *exactly* match the chromosome 
       column in the BED file.
    
    2. You can use the UNIX ``fold`` command to set the line width of the 
       FASTA output.  For example, ``fold -w 60`` will make each line of the FASTA
       file have at most 60 nucleotides for easy viewing.
    
    3. A BED file containing a single region requires a newline character at 
       the end of the line; otherwise, a blank output file is produced.
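
The first and third tips can be checked before running ``getfasta``. The 
following is a minimal sketch, not part of bedtools: the helper names 
``ensure_trailing_newline`` and ``missing_chroms`` and the ``demo.*`` file 
names are hypothetical, and the BED file is assumed to be tab-delimited with 
no header lines.

.. code-block:: bash

  # Hypothetical helpers, not part of bedtools.

  # Append a newline to a file only if its last byte is not already one.
  ensure_trailing_newline() {
    # Command substitution drops a trailing newline, so a non-empty result
    # means the file ends with some other character.
    if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
      printf '\n' >> "$1"
    fi
  }

  # List chromosome names in a BED file that have no identical FASTA header.
  missing_chroms() {
    names="$(mktemp)"
    grep '^>' "$1" | sed 's/^>//' | sort -u > "$names"
    cut -f 1 "$2" | sort -u | comm -23 - "$names"
    rm -f "$names"
  }

  printf '>chr1\nACGTACGT\n' > demo.fa
  printf 'chr1\t0\t4\nchr2\t0\t4' > demo.bed    # no final newline
  ensure_trailing_newline demo.bed
  missing_chroms demo.fa demo.bed               # prints chr2

An empty result from ``missing_chroms`` means every chromosome name in the 
BED file has a matching FASTA header.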

.. seealso::

    :doc:`../tools/maskfasta`

    
==========================================================================
Usage and option summary
==========================================================================
**Usage**

.. code-block:: bash

  $ bedtools getfasta [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF> 
  
**(or):**

.. code-block:: bash

  $ getFastaFromBed [OPTIONS] -fi <input FASTA> -bed <BED/GFF/VCF>



===========================      ===============================================================================================================================================================================================================
 Option                           Description
===========================      ===============================================================================================================================================================================================================
**-fo**                          Specify an output file name. By default, output goes to stdout.
**-name**                        Use the "name" column in the BED file for the FASTA headers in the output FASTA file.
**-tab**                         Report extracted sequences in a tab-delimited format instead of in FASTA format.
**-bedOut**                      Report extracted sequences in a tab-delimited BED format instead of in FASTA format.
**-s**                           Force strandedness. If the feature occupies the antisense strand, the sequence will be reverse complemented. *Default: strand information is ignored*.
**-split**                       Given BED12 input, extract and concatenate the sequences from the BED "blocks" (e.g., exons).
===========================      ===============================================================================================================================================================================================================


==========================================================================
Default behavior
==========================================================================
``bedtools getfasta`` will extract the sequence defined by the coordinates 
in a BED interval and create a new FASTA entry in the output file for each 
extracted sequence. By default, the FASTA header for each
extracted sequence is formatted as ``<chrom>:<start>-<end>``.

.. code-block:: bash

  $ cat test.fa
  >chr1
  AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

  $ cat test.bed
  chr1 5 10

  $ bedtools getfasta -fi test.fa -bed test.bed 
  >chr1:5-10
  AAACC

  # optionally write to an output file
  $ bedtools getfasta -fi test.fa -bed test.bed -fo test.fa.out

  $ cat test.fa.out
  >chr1:5-10
  AAACC
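
The interval ``chr1 5 10`` yields five bases because BED coordinates are 
0-based and the end position is not included. As a rough illustration of that 
arithmetic only (``bed_slice`` is a hypothetical helper, not a bedtools 
command, and it handles a single unwrapped sequence line), the same slice can 
be taken with ``cut``:

.. code-block:: bash

  # Hypothetical helper: slice one sequence string using BED-style
  # coordinates (0-based start, end not included).
  bed_slice() {
    seq="$1"; start="$2"; end="$3"
    # cut counts characters from 1 and includes both ends, so only the
    # start needs to be shifted by one.
    printf '%s\n' "$seq" | cut -c "$((start + 1))-$end"
  }

  bed_slice AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG 5 10
  # AAACC

The extracted length is always ``end - start``.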



  
==========================================================================
``-name`` Using the BED "name" column as a FASTA header.
==========================================================================
Using the ``-name`` option, one can set the FASTA header for each extracted 
sequence to be the "name" column from the BED feature.

.. code-block:: bash

  $ cat test.fa
  >chr1
  AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

  $ cat test.bed
  chr1 5 10 myseq

  $ bedtools getfasta -fi test.fa -bed test.bed -name
  >myseq
  AAACC



==========================================================================
``-tab`` Creating a tab-delimited output file in lieu of FASTA output.
==========================================================================
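Using the ``-tab`` option, the output (stdout, or the ``-fo`` output file) 
will be tab-delimited instead of in FASTA format: one line per interval, with 
the header and the sequence separated by a tab.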

.. code-block:: bash

  $ cat test.fa
  >chr1
  AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

  $ cat test.bed
  chr1 5 10 myseq

  $ bedtools getfasta -fi test.fa -bed test.bed -name -tab
  myseq AAACC
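
The two layouts carry the same information. As a small sketch of the 
relationship (``fasta_to_tab`` is a hypothetical helper, and it assumes every 
sequence sits on a single line, as in the examples above), FASTA records can 
be turned into the tab-delimited form with ``sed`` and ``paste``:

.. code-block:: bash

  # Hypothetical helper: convert two-line FASTA records (header, then one
  # sequence line) to "name<TAB>sequence" lines.
  fasta_to_tab() {
    # Strip the leading ">" from headers, then join each pair of lines
    # with a tab (the default delimiter of paste).
    sed 's/^>//' "$@" | paste - -
  }

  printf '>myseq\nAAACC\n' | fasta_to_tab
  # myseq<TAB>AAACC

This does not work on FASTA that has been wrapped with ``fold``, because a 
record then spans more than two lines.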
  

==========================================================================
``-bedOut`` Creating a tab-delimited BED file in lieu of FASTA output.
==========================================================================
Using the ``-bedOut`` option, each input BED record is reported as a 
tab-delimited line with the extracted sequence appended as the final column, 
instead of in FASTA format.

.. code-block:: bash

  $ cat test.fa
  >chr1
  AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

  $ cat test.bed
  chr1 5 10 myseq

  $ bedtools getfasta -fi test.fa -bed test.bed -bedOut
  chr1 5 10 myseq AAACC

  
==========================================================================
``-s`` Forcing the extracted sequence to reflect the requested strand 
==========================================================================
``bedtools getfasta`` will extract the sequence in the orientation defined in 
the strand column when the ``-s`` option is used. Features on the "-" strand 
are reverse complemented.

.. code-block:: bash

  $ cat test.fa
  >chr1
  AAAAAAAACCCCCCCCCCCCCGCTACTGGGGGGGGGGGGGGGGGG

  $ cat test.bed
  chr1 20 25 forward 1 +
  chr1 20 25 reverse 1 -

  $ bedtools getfasta -fi test.fa -bed test.bed -s -name
  >forward
  CGCTA
  >reverse
  TAGCG
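
Reverse complementing means reading the sequence backwards and swapping each 
base for its complement (A/T, C/G). The following sketch reproduces the 
``reverse`` record above by hand; ``revcomp`` is a hypothetical helper, not a 
bedtools command, and it assumes the ``rev`` utility is installed (it is 
common on Linux and BSD systems but is not required by POSIX).

.. code-block:: bash

  # Hypothetical helper: reverse-complement one DNA string, preserving case.
  # Characters other than A, C, G and T (e.g., N) are left unchanged.
  revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
  }

  revcomp CGCTA
  # TAGCG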
  

==========================================================================
``-split`` Extracting BED "blocks". 
==========================================================================
One can optionally request that each FASTA record be built by extracting and 
concatenating the blocks of a BED12 record.  For example, consider a BED12 
record describing a transcript.  By default, ``getfasta`` will extract the 
sequence representing the entire transcript (introns, exons, UTRs).  Using the 
``-split`` option, ``getfasta`` will instead produce a single FASTA record per 
transcript that splices together each BED12 block (e.g., exons
and UTRs in the case of genes described with BED12).
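
To see which intervals would be stitched together, the BED12 block fields can 
be expanded by hand. The following is a minimal sketch, not part of bedtools: 
``bed12_blocks`` is a hypothetical helper, and it assumes tab-delimited input 
in the standard BED12 layout (column 10 is the block count, column 11 the 
block sizes, and column 12 the block starts relative to column 2).

.. code-block:: bash

  # Hypothetical helper (not a bedtools command): print one BED6 line per
  # BED12 block.
  bed12_blocks() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
      split($11, sizes, ",")    # block sizes
      split($12, starts, ",")   # block starts, relative to chromStart
      for (i = 1; i <= $10; i++) {
        s = $2 + starts[i]
        print $1, s, s + sizes[i], $4, $5, $6
      }
    }' "$@"
  }

  printf 'chr1\t317810\t328455\tENST00000426316.1\t0\t+\t328455\t328455\t0\t2\t323,145,\t0,10500,\n' |
    bed12_blocks
  # chr1  317810  318133  ENST00000426316.1  0  +
  # chr1  328310  328455  ENST00000426316.1  0  +
  # (the real output is tab-delimited)

The two block lengths (323 + 145) add up to 468 bases, which is the length 
expected for the spliced record of this transcript.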

.. code-block:: bash

  $ cat genes.bed12
  chr1	164404	173864	ENST00000466557.1	0	-	173864	173864	0	6	387,59,66,216,132,112,	0,1479,3695,4644,8152,9348,
  chr1	235855	267253	ENST00000424587.1	0	-	267253	267253	0	4	2100,150,105,158,	0,2562,23161,31240,
  chr1	317810	328455	ENST00000426316.1	0	+	328455	328455	0	2	323,145,	0,10500,
  
  $ bedtools getfasta -fi chr1.fa -bed genes.bed12 -split -name
  >ENST00000466557.1
  gaggcgggaagatcacttgatatcaggagtcgaggcgggaagatcacttgacgtcaggagttcgagactggcccggccaacatggtgaaaccgcatctccactaaaaatacaaaaattagcctggtatggtggtgggcacctgtaatcccagtgacttgggaggctaaggcaggagaatttcttgaacccaggaggcagaggttgcagtgaccagcaaggttgcgccattgcaccccagcctgggcgataagagtgaaaactccatctcaaaaaaaaaaaaaaaaaaaaaaTTCCTTTGGGAAGGCCTTCTACATAAAAATCTTCAACATGAGACTGGAAAAAAGGGTATGGGATCATCACCGGACCTTTGGCTTTTACAGCTCGAGCTGACAAAGTTGATTTATCAAGTTGTAAATCTTCACCTGTTGAATTCATAAGTTCATGTCATATTTTCTTTCAGACAATTCTTCAGTTTGTTTACGTAGATCAGCGATACGATGATTCCATTTCTtcggatccttgtaagagcagagcaggtgatggagagggtgggaggtgtagtgacagaagcaggaaactccagtcattcgagacgggcagcacaagctgcggagtgcaggccacctctacggccaggaaacggattctcccgcagagcctcggaagctaccgaccctgctcccaccttgactcagtaggacttactgtagaattctggccttcagacCTGAGCCTGGCAGCTCTCTCCAACTTTGGAAGCCCAGGGGCATGGCCCCTGTCCACAGATGCACCTGGCATGAGGCGTGCCCAGAGGGACAGAGGCAGATGAGTttcgtctcctccactggattgtgagggcCAGAGTTGAACTCCCTCATTTTCCGTTCCCCAGCATTGGCAGGTTCTGGGACTGGTGGCTGTGGTGGCTCGTTGGTCTTTGTCTCTTAGAAGGTGGGGAATAATCATCATCT
  >ENST00000424587.1
  ccaggaagtgaaaatgacactttactgttttaatttgcatttctctgcttacaagtggattacacacattttcgtgtgctgttggctacttatTCATTCAGAAAACATACTAAGTGCTGGCTCTTTTTCATGTCCTTTATCAAGTTTGGATCATGTCATTTGCTATTTTCTTTCTGATGTAAACTCTCAAAGTCTGAAGTGTATTGTCTTTTCCTGACACATATGTTGTAAATAATTTTCTGGCTTACATTTTGACTTTTAATTTCATTCACGATGTTTTTAATGAATAATTTTAATTTTTATGAATGCAAGTTAAAATAATTCTTTCATTGTGGTCTCTGACATGTCATGCCAATAAGGGTCTTCTCCTCCAAGAGCACAGAAATATTTGCCAATACTGTCCTTAAAATCGGTCACAGTTTCATTTTTTATATATGCATTTTACTTCAATTGGGGCTTCATTTTACTGAATGCCCTATTTGAAGCAAGTTTCTCAGTTAATTCTTTTCTCAAAGGGCTAAGTATGGTAGATTGCAAACATAAGTGGCCACATAATGCTCTCACCTCctttgcctcctctcccaggaggagatagcgtccatctttccactccttaatctgggcttggccgtgtgacttgcactggccaatgggatattaacaagtctgatgtgcacagaggctgtagaatgtgcacgggggcttggtctctcttgctgccctggagaccagctgccCCACGAAGGAACCAGAGCCAACCTGCTGCTTCCTGGAGGAAGACAGTCCCTCTGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGGAGACAGTCCCTCAGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGAAGACAGTCACTCTGTCTCTGccaacccagttgaccgcagacatgcaggtctgctcaggtaagaccagcacagtccctgccctgtgagccaaaccaaatggtccagccacagaatcgtgagcaaataagtgatgcttaagtcactaagatttgggCAAAAGCTGAGCATTTATCCCAATCCCAATACTGTTTGTCCTTCTGTTTATCTGTCTGTCCTTCCCTGCTCATTTAAAATGCCCCCACTGCATCTAGTACATTTTTATAGGATCAGGGATCTGCTCTTGGATTAATGTTGTGTTCCCACCTCGAGGCAGCTTTGTAAGCTTCTGAGCACTTCCCAATTCCGGGTGACTTCAGGCACTGGGAGGCCTGTGCATCAGCTGCTGCTGTCTGTAGCTGACTTCCTTCACCCCTCTGCTGTCCTCAGCTCCTTCACCCCTGGGCCTCAGGAAATCAATGTCATGCTGACATCACTCTAGATCTAAAAGTTGGGTTCTTGgaccaggcgtggtggctcacacctgtaatcccagcactttgggaggccgaggcgggtggatcacaaggtcaggagatcaagacgattctggctaacacggtgaaaccccgtctctactaaaaatacaaaaaaattagccgggtgtggtggcaggtgcctgtagccccagctacttgggaggctgaggcaggagaatggcttgaacctgggaggtggagcttgcagtgagccaagatcacgccactgcactccagaatgggagagagagcgagactttctcaaaaaaaaaaaaaaaaCTTAGGTTCTTGGATGTTCGGGAAAGGGGGTTATTATCTAGGATCCTTGAAGCACCCCCAAGGGCATCTTCTCAAAGTTGGATGTGTGCATTTTCCTGAGAGGAAAGCTTTCCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCAGCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTACAGCAGATTCACTCTGTT
  CTGTTTCATTGTTGTTTAGTTTGCGTTGTGTTTCTCCAACTTTGTGCCTCATCAGGAAAAGCTTTGGATCACAATTCCCAGtgctgaagaaaaggccaaactcttggttgtgttctttgattAGTgcctgtgacgcagcttcaggaggtcctgagaacgtgtgcacagtttagtcggcagaaacttagggaaatgtaagaccaccatcagcacataggagttctgcattggtttggtctgcattggtttggtCTTTTCCTGGATACAGGTCTTGATAGGTCTCTTGATGTCATTTCACTTCAGATTCTTCTTTAGAAAACTTGGACAATAGCATTTGCTGTCTTGTCCAAATTGTTACTTCAAGTTTGCTCTTAGCAAGTAATTGTTTCAGTATCTATATCAAAAATGGCTTAAGCCTGCAACATGTTTCTGAATGATTAACAAGGTGATAGTCAGTTCTTCATTGAATCCTGGATGCTTTATTTTTCTTAATAAGAGGAATTCATATGGATCAG
  >ENST00000426316.1
  AATGATCAAATTATGTTTCCCATGCATCAGGTGCAATGGGAAGCTCTTctggagagtgagagaagcttccagttaaggtgacattgaagccaagtcctgaaagatgaggaagagttgtatgagagtggggagggaagggggaggtggagggaTGGGGAATGGGCCGGGATGGGATAGCGCAAACTGCCCGGGAAGGGAAACCAGCACTGTACAGACCTGAACAACGAAGATGGCATATTTTGTTCAGGGAATGGTGAATTAAGTGTGGCAGGAATGCTTTGTAGACACAGTAATTTGCTTGTATGGAATTTTGCCTGAGAGACCTCATTCCTCACGTCGGCCATTCCAGGCCCCGTTTTTCCCTTCCGGCAGCCTCTTGGCCTCTAATTTGTTTATCTTTTGTGTATAAATCCCAAAATATTGAATTTTGGAATATTTCCACCATTATGTAAATATTTTGATAGGTAA
  
  # use the UNIX fold command to wrap the FASTA sequence such that each line
  # has at most 60 characters
  $ bedtools getfasta -fi chr1.fa -bed genes.bed12 -split -name | \
        fold -w 60
  >ENST00000466557.1
  gaggcgggaagatcacttgatatcaggagtcgaggcgggaagatcacttgacgtcaggag
  ttcgagactggcccggccaacatggtgaaaccgcatctccactaaaaatacaaaaattag
  cctggtatggtggtgggcacctgtaatcccagtgacttgggaggctaaggcaggagaatt
  tcttgaacccaggaggcagaggttgcagtgaccagcaaggttgcgccattgcaccccagc
  ctgggcgataagagtgaaaactccatctcaaaaaaaaaaaaaaaaaaaaaaTTCCTTTGG
  GAAGGCCTTCTACATAAAAATCTTCAACATGAGACTGGAAAAAAGGGTATGGGATCATCA
  CCGGACCTTTGGCTTTTACAGCTCGAGCTGACAAAGTTGATTTATCAAGTTGTAAATCTT
  CACCTGTTGAATTCATAAGTTCATGTCATATTTTCTTTCAGACAATTCTTCAGTTTGTTT
  ACGTAGATCAGCGATACGATGATTCCATTTCTtcggatccttgtaagagcagagcaggtg
  atggagagggtgggaggtgtagtgacagaagcaggaaactccagtcattcgagacgggca
  gcacaagctgcggagtgcaggccacctctacggccaggaaacggattctcccgcagagcc
  tcggaagctaccgaccctgctcccaccttgactcagtaggacttactgtagaattctggc
  cttcagacCTGAGCCTGGCAGCTCTCTCCAACTTTGGAAGCCCAGGGGCATGGCCCCTGT
  CCACAGATGCACCTGGCATGAGGCGTGCCCAGAGGGACAGAGGCAGATGAGTttcgtctc
  ctccactggattgtgagggcCAGAGTTGAACTCCCTCATTTTCCGTTCCCCAGCATTGGC
  AGGTTCTGGGACTGGTGGCTGTGGTGGCTCGTTGGTCTTTGTCTCTTAGAAGGTGGGGAA
  TAATCATCATCT
  >ENST00000424587.1
  ccaggaagtgaaaatgacactttactgttttaatttgcatttctctgcttacaagtggat
  tacacacattttcgtgtgctgttggctacttatTCATTCAGAAAACATACTAAGTGCTGG
  CTCTTTTTCATGTCCTTTATCAAGTTTGGATCATGTCATTTGCTATTTTCTTTCTGATGT
  AAACTCTCAAAGTCTGAAGTGTATTGTCTTTTCCTGACACATATGTTGTAAATAATTTTC
  TGGCTTACATTTTGACTTTTAATTTCATTCACGATGTTTTTAATGAATAATTTTAATTTT
  TATGAATGCAAGTTAAAATAATTCTTTCATTGTGGTCTCTGACATGTCATGCCAATAAGG
  GTCTTCTCCTCCAAGAGCACAGAAATATTTGCCAATACTGTCCTTAAAATCGGTCACAGT
  TTCATTTTTTATATATGCATTTTACTTCAATTGGGGCTTCATTTTACTGAATGCCCTATT
  TGAAGCAAGTTTCTCAGTTAATTCTTTTCTCAAAGGGCTAAGTATGGTAGATTGCAAACA
  TAAGTGGCCACATAATGCTCTCACCTCctttgcctcctctcccaggaggagatagcgtcc
  atctttccactccttaatctgggcttggccgtgtgacttgcactggccaatgggatatta
  acaagtctgatgtgcacagaggctgtagaatgtgcacgggggcttggtctctcttgctgc
  cctggagaccagctgccCCACGAAGGAACCAGAGCCAACCTGCTGCTTCCTGGAGGAAGA
  CAGTCCCTCTGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGGAGAC
  AGTCCCTCAGTCCCTCTGTCTCTGCCAACCAGTTAACCTGCTGCTTCCTGGAGGAAGACA
  GTCACTCTGTCTCTGccaacccagttgaccgcagacatgcaggtctgctcaggtaagacc
  agcacagtccctgccctgtgagccaaaccaaatggtccagccacagaatcgtgagcaaat
  aagtgatgcttaagtcactaagatttgggCAAAAGCTGAGCATTTATCCCAATCCCAATA
  CTGTTTGTCCTTCTGTTTATCTGTCTGTCCTTCCCTGCTCATTTAAAATGCCCCCACTGC
  ATCTAGTACATTTTTATAGGATCAGGGATCTGCTCTTGGATTAATGTTGTGTTCCCACCT
  CGAGGCAGCTTTGTAAGCTTCTGAGCACTTCCCAATTCCGGGTGACTTCAGGCACTGGGA
  GGCCTGTGCATCAGCTGCTGCTGTCTGTAGCTGACTTCCTTCACCCCTCTGCTGTCCTCA
  GCTCCTTCACCCCTGGGCCTCAGGAAATCAATGTCATGCTGACATCACTCTAGATCTAAA
  AGTTGGGTTCTTGgaccaggcgtggtggctcacacctgtaatcccagcactttgggaggc
  cgaggcgggtggatcacaaggtcaggagatcaagacgattctggctaacacggtgaaacc
  ccgtctctactaaaaatacaaaaaaattagccgggtgtggtggcaggtgcctgtagcccc
  agctacttgggaggctgaggcaggagaatggcttgaacctgggaggtggagcttgcagtg
  agccaagatcacgccactgcactccagaatgggagagagagcgagactttctcaaaaaaa
  aaaaaaaaaCTTAGGTTCTTGGATGTTCGGGAAAGGGGGTTATTATCTAGGATCCTTGAA
  GCACCCCCAAGGGCATCTTCTCAAAGTTGGATGTGTGCATTTTCCTGAGAGGAAAGCTTT
  CCCACATTATACAGCTTCTGAAAGGGTTGCTTGACCCACAGATGTGAAGCTGAGGCTGAA
  GGAGACTGATGTGGTTTCTCCTCAGTTTCTCTGTGCAGCACCAGGTGGCAGCAGAGGTCA
  GCAAGGCAAACCCGAGCCCGGGGATGCGGAGTGGGGGCAGCTACGTCCTCTCTTGAGCTA
  CAGCAGATTCACTCTGTTCTGTTTCATTGTTGTTTAGTTTGCGTTGTGTTTCTCCAACTT
  TGTGCCTCATCAGGAAAAGCTTTGGATCACAATTCCCAGtgctgaagaaaaggccaaact
  cttggttgtgttctttgattAGTgcctgtgacgcagcttcaggaggtcctgagaacgtgt
  gcacagtttagtcggcagaaacttagggaaatgtaagaccaccatcagcacataggagtt
  ctgcattggtttggtctgcattggtttggtCTTTTCCTGGATACAGGTCTTGATAGGTCT
  CTTGATGTCATTTCACTTCAGATTCTTCTTTAGAAAACTTGGACAATAGCATTTGCTGTC
  TTGTCCAAATTGTTACTTCAAGTTTGCTCTTAGCAAGTAATTGTTTCAGTATCTATATCA
  AAAATGGCTTAAGCCTGCAACATGTTTCTGAATGATTAACAAGGTGATAGTCAGTTCTTC
  ATTGAATCCTGGATGCTTTATTTTTCTTAATAAGAGGAATTCATATGGATCAG
  >ENST00000426316.1
  AATGATCAAATTATGTTTCCCATGCATCAGGTGCAATGGGAAGCTCTTctggagagtgag
  agaagcttccagttaaggtgacattgaagccaagtcctgaaagatgaggaagagttgtat
  gagagtggggagggaagggggaggtggagggaTGGGGAATGGGCCGGGATGGGATAGCGC
  AAACTGCCCGGGAAGGGAAACCAGCACTGTACAGACCTGAACAACGAAGATGGCATATTT
  TGTTCAGGGAATGGTGAATTAAGTGTGGCAGGAATGCTTTGTAGACACAGTAATTTGCTT
  GTATGGAATTTTGCCTGAGAGACCTCATTCCTCACGTCGGCCATTCCAGGCCCCGTTTTT
  CCCTTCCGGCAGCCTCTTGGCCTCTAATTTGTTTATCTTTTGTGTATAAATCCCAAAATA
  TTGAATTTTGGAATATTTCCACCATTATGTAAATATTTTGATAGGTAA
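
Wrapped output like this can still be checked against the BED12 input: each 
spliced record should be as long as the sum of its block sizes. The following 
is a small sketch for tallying record lengths; ``fasta_lengths`` is a 
hypothetical helper, not a bedtools command, and the toy input stands in for 
real ``getfasta`` output.

.. code-block:: bash

  # Hypothetical helper: print "name<TAB>length" for every FASTA record,
  # whether or not the sequence lines have been wrapped.
  fasta_lengths() {
    awk '/^>/ { if (seen) print name "\t" len
                name = substr($0, 2); len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print name "\t" len }' "$@"
  }

  printf '>a\nACGT\nAC\n>b\nGG\n' | fasta_lengths
  # a<TAB>6
  # b<TAB>2

To apply it to the example above, pipe the ``fold -w 60`` output into 
``fasta_lengths`` and compare each length with the block sizes in column 11 of 
``genes.bed12``.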