1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218
|
# Benchmark
## Softwares
1. [seqkit](https://github.com/shenwei356/seqkit). (Go).
Version [v0.3.1.1](https://github.com/shenwei356/seqkit/releases/tag/v0.3.1.1).
Compiled with Go 1.7rc5.
1. [fasta_utilities](https://github.com/jimhester/fasta_utilities). (Perl).
Version [3dcc0bc](https://github.com/jimhester/fasta_utilities/tree/3dcc0bc6bf1e97839476221c26984b1789482579).
Lots of dependencies to install.
1. [fastx_toolkit](http://hannonlab.cshl.edu/fastx_toolkit/). (Perl).
Version [0.0.13](http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2).
Can't handle multi-line FASTA files.
1. [seqmagick](http://seqmagick.readthedocs.io/en/latest/index.html#installation). (Python).
Version 0.6.1
1. [seqtk](https://github.com/lh3/seqtk). (C).
Version [1.1-r92-dirty](https://github.com/lh3/seqtk/tree/fb85aad4ce1fc7b3d4543623418a1ae88fe1cea6).
Not used:
1. [pyfaidx](https://github.com/mdshw5/pyfaidx). (Python).
Version [0.4.7.1](https://pypi.python.org/packages/source/p/pyfaidx/pyfaidx-0.4.7.1.tar.gz#md5=f33604a3550c2fa115ac7d33b952127d). *Not used, because it exhausted my memory (10G) when computing reverse-complement on a 5GB fasta file of 250 bp.*
A Python script [memusg](https://github.com/shenwei356/memusg) was used
to compute running time and peak memory usage of a process.
## Features
Categories |Features |seqkit |fasta_utilities|fastx_toolkit|pyfaidx|seqmagick|seqtk
:-------------------|:----------------------|:------:|:-------------:|:-----------:|:-----:|:-------:|:---:
**Formats support** |Multi-line FASTA |Yes |Yes |-- |Yes |Yes |Yes
|FASTQ |Yes |Yes |Yes |-- |Yes |Yes
|Multi-line FASTQ |Yes |Yes |-- |-- |Yes |Yes
|Validating sequences |Yes |-- |Yes |Yes |-- |--
|Supporting RNA |Yes |Yes |-- |-- |Yes |Yes
**Functions** |Searching by motifs |Yes |Yes |-- |-- |Yes |--
|Sampling |Yes |-- |-- |-- |Yes |Yes
|Extracting sub-sequence|Yes |Yes |-- |Yes |Yes |Yes
|Removing duplicates |Yes |-- |-- |-- |Partly |--
|Splitting |Yes |Yes |-- |Partly |-- |--
|Splitting by seq |Yes |-- |Yes |Yes |-- |--
|Shuffling |Yes |-- |-- |-- |-- |--
|Sorting |Yes |Yes |-- |-- |Yes |--
|Locating motifs |Yes |-- |-- |-- |-- |--
|Common sequences |Yes |-- |-- |-- |-- |--
|Cleaning bases |Yes |Yes |Yes |Yes |-- |--
|Transcription |Yes |Yes |Yes |Yes |Yes |Yes
|Translation |-- |Yes |Yes |Yes |Yes |--
|Filtering by size |Indirect|Yes |-- |Yes |Yes |--
|Renaming header |Yes |Yes |-- |-- |Yes |Yes
**Other features** |Cross-platform |Yes |Partly |Partly |Yes |Yes |Yes
|Reading STDIN |Yes |Yes |Yes |-- |Yes |Yes
|Reading gzipped file |Yes |Yes |-- |-- |Yes |Yes
|Writing gzip file |Yes |-- |-- |-- |Yes |--
**Note 2**: See [usage](http://bioinf.shenwei.me/seqkit/usage/) for detailed options of seqkit.
## Datasets
All test data is available here: [seqkit-benchmark-data.tar.gz](http://app.shenwei.me/data/seqkit/seqkit-benchmark-data.tar.gz) (2.2G)
### dataset_A.fa - large number of short sequences
Dataset A is reference genomes DNA sequences of gastrointestinal tract from
[NIH Human Microbiome Project](http://hmpdacc.org/):
[`Gastrointestinal_tract.nuc.fsa`](http://downloads.hmpdacc.org/data/reference_genomes/body_sites/Gastrointestinal_tract.nuc.fsa) (FASTA format, ~2.7G).
### dataset_B.fa - small number of large sequences
Dataset B is Human genome from [ensembl](http://uswest.ensembl.org/info/data/ftp/index.html).
- Genome DNA: [`Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz`](ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz) (Gzipped FASTA file, ~900M)
. Decompress it and rename to dataset_B.fa (~2.9G).
- GTF file: [`Homo_sapiens.GRCh38.84.gtf.gz`](ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz) (~44M)
- BED file: `Homo_sapiens.GRCh38.84.bed.gz` was converted from `Homo_sapiens.GRCh38.84.gtf.gz` by [`gtf2bed`](http://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/gtf2bed.html?highlight=gtf2bed) with command
$ zcat Homo_sapiens.GRCh38.84.gtf.gz | gtf2bed --do-not-sort | gzip -c > Homo_sapiens.GRCh38.84.bed.gz
### dataset_C.fq – Illumina single end reads (SE100)
Dataset C is Illumina single end (SE 100bp) reads file (~2.2G).
Summary
$ seqkit stat *.fa
file format type num_seqs sum_len min_len avg_len max_len
dataset_A.fa FASTA DNA 67,748 2,807,643,808 56 41,442.5 5,976,145
dataset_B.fa FASTA DNA 194 3,099,750,718 970 15,978,096.5 248,956,422
dataset_C.fq FASTQ DNA 9,186,045 918,604,500 100 100 100
### Sequence ID list
Parts of sequences IDs was sampled and shuffled from original data.
They were used in test of extracting sequences by ID list.
Commands:
$ seqkit sample -p 0.3 dataset_A.fa | seqkit seq --name --only-id | shuf > ids_A.txt
$ seqkit sample -p 0.3 dataset_B.fa | seqkit seq --name --only-id | shuf > ids_B.txt
$ seqkit sample -p 0.03 dataset_C.fq | seqkit seq --name --only-id | shuf > ids_C.txt
Numbers:
$ wc -l ids*.txt
20138 ids_A.txt
58 ids_B.txt
2754516 ids_C.txt
### BED file
Only BED data of chromosome 19 was used in test of subsequence with BED file:
$ zcat Homo_sapiens.GRCh38.84.bed.gz | grep -E "^19" | gzip -c > chr19.bed.gz
## Platform
PC:
- CPU: Intel Core i5-3320M @ 2.60GHz, two cores/4 threads
- RAM: DDR3 1600MHz, 12GB
- SSD: SAMSUNG 850 EVO 250G, SATA-3
- OS: Fedora 24 (Scientific KDE spin), Kernal: 4.6.4-301.fc24.x86_64
Softwares:
- Perl: perl 5, version 22, subversion 2 (v5.22.2) built for x86_64-linux-thread-multi
- Python: Python 2.7.11 (default, Jul 10 2016, 20:58:20) [GCC 6.1.1 20160621 (Red Hat 6.1.1-3)] on linux2
## Tests
Automatic benchmark and plotting scripts are available at: [https://github.com/shenwei356/seqkit/tree/master/benchmark](https://github.com/shenwei356/seqkit/tree/master/benchmark).
All tests were repeated 3 times,
and average time and peak memory ware used for plotting.
All data were readed once before tests began to minimize the influence of page cache.
Output sequences of all softwares were not wrapped to fixed length.
### Test 1. Reverse Complement
`revcom_biogo` ([source](https://github.com/shenwei356/seqkit/blob/master/benchmark/revcom_biogo.go),
[binary](https://github.com/shenwei356/seqkit/blob/master/benchmark/revcom_biogo?raw=true) ),
a tool written in Golang (compiled with Go 1.6.3) using [biogo](https://github.com/biogo/biogo)
(Version [7ebd71b](https://github.com/biogo/biogo/commit/7ebd71bd9afc52cdab7a7128467ae1a936b68958))
package,
was also used for comparison of FASTA file parsing performance.
*Note that some softwares (fasta_utilities and biogo) have different converting rules of computing complement sequence on ambiguous bases, there fore the results are different from others.*
[Commands](https://github.com/shenwei356/seqkit/blob/master/benchmark/run_benchmark_01_revcom.sh)
### Test 2. Extracting sequences by ID list
[Commands](https://github.com/shenwei356/seqkit/blob/master/benchmark/run_benchmark_02_exctact_by_id_list.sh)
### Test 3. Sampling by number
*Note that different softwares have different sampling strategies, the peak memory depends on size of sampled sequences and the results may not be the same.*
[Commands](https://github.com/shenwei356/seqkit/blob/master/benchmark/run_benchmark_03_sampling.sh)
### Test 4. Removing duplicates by sequence content
[Commands](https://github.com/shenwei356/seqkit/blob/master/benchmark/run_benchmark_04_remove_duplicated_seqs_by_seq.sh)
### Test 5. Subsequence with BED file
[Commands](https://github.com/shenwei356/seqkit/blob/master/benchmark/run_benchmark_05_subseq_with_bed.sh)
## Results
seqkit version: v0.3.1.1
FASTA:

FASTQ:

### Test of multiple threads:
From the results, 2 threads/CPU is enough, so the default threads of seqkit is 2.

### Tests on different file sizes
Files are generated by replicating Human genome chr1 for N times.

<div id="disqus_thread"></div>
<script>
/**
* RECOMMENDED CONFIGURATION VARIABLES: EDIT AND UNCOMMENT THE SECTION BELOW TO INSERT DYNAMIC VALUES FROM YOUR PLATFORM OR CMS.
* LEARN WHY DEFINING THESE VARIABLES IS IMPORTANT: https://disqus.com/admin/universalcode/#configuration-variables*/
/*
var disqus_config = function () {
this.page.url = PAGE_URL; // Replace PAGE_URL with your page's canonical URL variable
this.page.identifier = PAGE_IDENTIFIER; // Replace PAGE_IDENTIFIER with your page's unique identifier variable
};
*/
(function() { // DON'T EDIT BELOW THIS LINE
var d = document, s = d.createElement('script');
s.src = '//seqkit.disqus.com/embed.js';
s.setAttribute('data-timestamp', +new Date());
(d.head || d.body).appendChild(s);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
|