
# any2fasta
Convert various sequence formats to FASTA
## Motivation
You may wonder why this tool even exists. Well, I tried to do the right
thing and use established tools like `readseq` and `seqret` from EMBOSS, but
they both mangled IDs containing `|` or `.` characters, and
there is no way to fix this behaviour. This resulted in inconsistencies
between the `.gbk` and `.fna` versions of files in my pipelines.

Then you may wonder why I didn't use BioPerl or Biopython. Well, they are
heavyweight libraries, and actually very slow at parsing Genbank files.
This script uses only core Perl modules, has no other dependencies, and
runs very quickly.

It supports the following input formats:
1. Genbank flat file, typically `.gb`, `.gbk`, `.gbff` (starts with `LOCUS`)
2. EMBL flat file, typically `.embl` (starts with `ID`)
3. GFF with sequence, typically `.gff`, `.gff3` (starts with `##gff`)
4. FASTA DNA, typically `.fasta`, `.fa`, `.fna`, `.ffn` (starts with `>`)
5. FASTQ DNA, typically `.fastq`, `.fq` (starts with `@`)
6. CLUSTAL alignments, typically `.clw`, `.clu` (starts with `CLUSTAL` or `MUSCLE`)
7. STOCKHOLM alignments, typically `.sth` (starts with `# STOCKHOLM`)
8. GFA assembly graph, typically `.gfa` (starts with `^[A-Z]\t`)
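
The "starts with" signatures above suggest that the format can be guessed from the first line of the input alone. The following is a minimal Python sketch of that idea, not the tool's actual Perl code: the function name `guess_format`, the format labels, and the exact regular expressions are assumptions made for illustration.

```python
import re

# Hypothetical signature table mirroring the list above. The real tool is
# written in Perl and may use different patterns or a different test order.
SIGNATURES = [
    ("genbank", re.compile(r"^LOCUS\s")),
    ("embl", re.compile(r"^ID\s")),
    ("gff", re.compile(r"^##gff")),
    ("fasta", re.compile(r"^>")),
    ("fastq", re.compile(r"^@")),
    ("clustal", re.compile(r"^(CLUSTAL|MUSCLE)")),
    ("stockholm", re.compile(r"^# STOCKHOLM")),
    ("gfa", re.compile(r"^[A-Z]\t")),
]


def guess_format(first_line):
    """Return a format label for the first line of a file, or None."""
    for name, pattern in SIGNATURES:
        if pattern.search(first_line):
            return name
    return None


if __name__ == "__main__":
    for line in ["LOCUS       demo", ">contig1", "S\t1\tACGT", "hello"]:
        print(repr(line), "->", guess_format(line))
```

Checking only the first line keeps detection cheap, and it works on stdin, where no file extension is available.
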
Files may be compressed with:
1. gzip, typically `.gz`
2. bzip2, typically `.bz2`
3. zip, typically `.zip`
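
This README does not say whether `any2fasta` recognises compression from the file extension or from the file contents. As one possible approach, the Python sketch below identifies the three compression formats by their well-known leading magic bytes and decompresses in memory. The function names, and the choice to read only the first member of a zip archive, are assumptions for illustration.

```python
import bz2
import gzip
import io
import zipfile


def sniff_compression(head: bytes) -> str:
    """Classify a byte string by its leading magic bytes."""
    if head.startswith(b"\x1f\x8b"):
        return "gzip"
    if head.startswith(b"BZh"):
        return "bzip2"
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    return "none"


def decompress(data: bytes) -> bytes:
    """Return the decompressed payload, or the data unchanged if it is plain."""
    kind = sniff_compression(data[:4])
    if kind == "gzip":
        return gzip.decompress(data)
    if kind == "bzip2":
        return bz2.decompress(data)
    if kind == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Assumption: only the first member of the archive is of interest.
            return archive.read(archive.namelist()[0])
    return data


if __name__ == "__main__":
    packed = gzip.compress(b">seq1\nACGT\n")
    print(sniff_compression(packed[:4]))
    print(decompress(packed).decode(), end="")
```

Sniffing the contents rather than trusting the extension also covers data arriving on stdin.
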
## Installation
`any2fasta` has no dependencies except [Perl 5.10](https://www.perl.org/)
or higher. It only uses core modules, so no CPAN needed.
### Direct script download
```
% cd /usr/local/bin # choose a folder in your $PATH
% wget https://raw.githubusercontent.com/tseemann/any2fasta/master/any2fasta
% chmod +x any2fasta
```
### Homebrew
```
% brew install brewsci/bio/any2fasta # COMING SOON
```
### Conda
```
% conda install -c bioconda any2fasta # COMING SOON
```
### GitHub
```
% git clone https://github.com/tseemann/any2fasta.git
% cp any2fasta/any2fasta /usr/local/bin # choose a folder in your $PATH
```
## Test Installation
```
% ./any2fasta -v
any2fasta 0.4.2
% ./any2fasta -h
NAME
any2fasta 0.4.2
SYNOPSIS
Convert various sequence formats into FASTA
USAGE
any2fasta [options] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip] > output.fasta
OPTIONS
-h Print this help
-v Print version and exit
-q No output while running, only errors
-n Replace ambiguous IUPAC letters with 'N'
-l Lowercase the sequence
-u Uppercase the sequence
END
```
## Examples
```
% any2fasta ref.gbk > ref.fna
% any2fasta in.fasta > out.fasta # should behave like "cat"
% any2fasta prokka.gff > prokka.fna # only if GFF has FASTA appended
% any2fasta - < file.gb > file.fasta # '-' means stdin
% any2fasta genes.gff.gz > genes.ffn # automatically decompresses
% any2fasta 1.gb 2.fa.gz 3.gff.bz2 - > out.fa # multiple files and stdin
% any2fasta R1.fq.gz | bzip2 > R1.fa.bz2 # 'seqtk seq -A' is much faster
% any2fasta -q 23S.clw > 23S.aln # gaps '-' will be preserved
% any2fasta pfam4321.sth > pfam4321.aln # '.' gaps will become '-'
```
## Options
* `-n` replaces any characters that aren't A,C,G,T with N (gaps preserved)
* `-l` will lowercase all the letters
* `-u` will uppercase all the letters
* `-q` will prevent logging messages being printed
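
To make the effect of `-n`, `-l` and `-u` concrete, here is a rough Python sketch of the transformations described above. It is not the tool's Perl code. The function name `apply_options` is hypothetical, and several details are assumptions: that lowercase `a`, `c`, `g`, `t` are also left alone by `-n`, that only `-` counts as a gap, and that `-n` is applied before any case change.

```python
import re


def apply_options(seq, n=False, lower=False, upper=False):
    """Apply sketches of the -n, -l and -u options to one sequence string."""
    if n:
        # Anything that is not A, C, G, T (either case) or a '-' gap becomes N.
        seq = re.sub(r"[^ACGTacgt-]", "N", seq)
    if lower:
        seq = seq.lower()
    if upper:
        seq = seq.upper()
    return seq


if __name__ == "__main__":
    print(apply_options("ACGTRYKM-ACGT", n=True))
    print(apply_options("ACGTRYKM-ACGT", lower=True))
```
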
## Issues
Submit feedback to the [Issue Tracker](https://github.com/tseemann/any2fasta/issues)
## License
[GPL v3](https://raw.githubusercontent.com/tseemann/any2fasta/master/LICENSE)
## Author
[Torsten Seemann](http://tseemann.github.io/)