File: README.md

package info (click to toggle)
pyfastx 2.2.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid, trixie
  • size: 1,336 kB
  • sloc: ansic: 4,820; python: 1,817; sh: 505; perl: 66; makefile: 31
file content (157 lines) | stat: -rw-r--r-- 7,465 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
# Benchmark

## Test tools

All tools support for random access were used to perform benchmark. Python 3.6.6 was used to test tools that was developed as Python package.

| Tools    | Version  | Language  | URL                                         |
|----------|----------|-----------|---------------------------------------------|
| pyfastx  | v0.7.0   | C, Python | <http://github.com/lmdu/pyfastx>            |
| samtools | v1.9     | C         | <http://www.htslib.org/>                    |
| pysam    | v0.15.3  | C, Python | <https://github.com/pysam-developers/pysam> |
| seqkit   | v0.13.2  | Go        | <https://bioinf.shenwei.me/seqkit>          |
| bioperl  | v1.7.7   | Perl      | <https://bioperl.org>                       |
| pyfasta  | v0.5.2   | Python    | <https://github.com/brentp/pyfasta>         |
| biopython| v1.74    | Python    | <https://biopython.org>                     |
| pyfaidx  | v0.5.8   | Python    | <https://github.com/mdshw5/pyfaidx>         |

## Test data

We downloaded 15 genome FASTA files from NCBI FTP server with different size and sequence counts. These genomes were listed in table below. 

| Species name             | Common name      | Assembly Accession | Genome size (bp) | Sequence counts |
|--------------------------|------------------|--------------------|------------------|-----------------|
| Ambystoma mexicanum      | Axolotl          | GCA_002915635.1    | 32,393,605,577   | 125,724         |
| Pinus lambertiana        | Sugar pine       | GCA_001447015.2    | 27,602,651,501   | 4,253,096       |
| Picea glauca             | White spruce     | GCA_000411955.5    | 24,633,074,982   | 3,033,321       |
| Triticum aestivum        | Bread wheat      | GCA_900519105.1    | 14,547,261,565   | 22              |
| Triticum dicoccoides     | Wild wheat       | GCA_900184675.1    | 10,494,955,245   | 149,145         |
| Rana catesbeiana         | Bullfrog         | GCA_002284835.2    | 6,250,335,504    | 1,544,634       |
| Locusta migratoria       | Migratory locust | GCA_000516895.1    | 5,759,798,599    | 1,397,492       |
| Pisum sativum            | Pea              | GCA_003013575.1    | 4,275,931,177    | 5,449,423       |
| Homo sapiens             | Human            | GCF_000001405.38   | 3,257,302,968    | 593             |
| Zea mays                 | Maize            | GCF_000005005.2    | 2,134,513,431    | 266             |
| Gallus gallus            | Chicken          | GCF_000002315.5    | 1,065,348,650    | 463             |
| Sorghum bicolor          | Sorghum          | GCF_000003195.3    | 708,876,072      | 868             |
| Cicer arietinum          | Chickpea         | GCF_000331145.1    | 530,893,862      | 7,127           |
| Arabidopsis thaliana     | Thale cress      | GCF_000001735.4    | 119,300,826      | 6               |
| Saccharomyces cerevisiae | Yeast            | GCF_000146045.2    | 12,071,326       | 16              |

Download these FASTA files, and place in a directory (e.g. data/fastas), uncompress the file and retain the gzip compressed file.

We downloaded 5 FASTQ files from [NGDC database](https://bigd.big.ac.cn/). These FASTQ files were listed in table below.

| Accession | Total bases \(bp\) | Read counts  | Plain file size \(B\) | Plain file size \(GB\) | Gzip file size \(B\) | Gzip file size \(GB\) |
|-----------|--------------------|--------------|-----------------------|------------------------|----------------------|-----------------------|
| CRD000339 | 39,221,236,800     | 392,212,368  | 102,558,579,168       | 95\.52                 | 30,316,624,332       | 28\.23                |
| CRD000389 | 20,655,260,429     | 204,507,529  | 49,907,114,035        | 46\.48                 | 17,716,170,588       | 16\.50                |
| CRD000371 | 13,930,424,000     | 139,304,240  | 36,426,113,582        | 33\.92                 | 10,972,470,560       | 10\.22                |
| CRD000411 | 5,515,159,742      | 54,605,542   | 13,325,681,150        | 12\.41                 | 4,204,994,577        | 3\.92                 |
| CRD000365 | 1,597,691,427      | 15,818,727   | 3,860,387,755         | 3\.60                  | 1,211,374,918        | 1\.13                 |

Download these FASTQ files, and place in a directory (e.g. data/fastqs), uncompress the file and retain the gzip compressed file.

## Test method

We used linux command ``/usr/bin/time -f "%e %M"`` to estimate the running time and peak memory of building index, random access and sequence iteration for each tool. Each test was performed 3 times. The average value was calculated as final results.

## Perform test

### Building index

building index for fastas

```sh
./benchmark_fasta_build_index.sh 3 data/fastas/*.fa > benchmark_result_fasta_build_index.tsv
python3 make_fasta_ggplot2_matrix.py benchmark_result_fasta_build_index.tsv > matrix_fasta_build_index.tsv
```
building index for fastqs

```sh
./benchmark_fastq_build_index.sh 3 data/fastqs/*.fq > benchmark_result_fastq_build_index.tsv
python3 make_fastq_ggplot2_matrix.py benchmark_result_fastq_build_index.tsv > matrix_fastq_build_index.tsv
```

### Random access

randomly get 30% of sequence names from fastas with random seed of 123

```sh
python3 get_random_sequence_names_from_fasta.py 123 0.3 data/fastas/*.fa
```
random access to sequences from fastas

```sh
./benchmark_fasta_random_access.sh 3 data/fastas/*.fa > benchmark_result_fasta_random_access.tsv
python3 make_fasta_ggplot2_matrix.py benchmark_result_fasta_random_access.tsv > matrix_fasta_random_access.tsv
```

randomly generate one thousand subsequence region with length of 1 Kb and random seed 123

```sh
python3 generate_random_interval.py 123 data/fastas/*.fa
```

random access to subsequences from fastas

```sh
./benchmark_fasta_extract_subsequences.sh 3 data/fastas/*.fa > benchmark_result_fasta_extract_subsequences.tsv
python3 make_fasta_ggplot2_matrix.py benchmark_result_fasta_extract_subsequences.tsv > matrix_fasta_extract_subsequences.tsv
```

randomly get ten thousand names of reads from fastqs with random seed of 123

```sh
python3 get_random_read_names_from_fastq.py 123 data/fastqs/*.fq
```

random access to reads from fastqs

```sh
./benchmark_fastq_random_access.sh 3 data/fastqs/*.fq > benchmark_result_fastq_random_access.tsv
python3 make_fastq_ggplot2_matrix.py benchmark_result_fastq_random_access.tsv > matrix_fastq_random_access.tsv
```

### Sequence iteration

Iterating over sequences from fastas

```sh
./benchmark_fasta_sequence_iterate.sh 3 data/fastas/*.fa > benchmark_result_fasta_sequence_iterate.tsv
python3 make_fasta_ggplot2_matrix.py benchmark_result_fasta_sequence_iterate.tsv > matrix_fasta_sequence_iterate.tsv
```

Iterating over reads from fastqs

```sh
./benchmark_fastq_sequence_iterate.sh 3 data/fastqs/*.fq > benchmark_result_fastq_sequence_iterate.tsv
python3 make_fastq_ggplot2_matrix.py benchmark_result_fastq_sequence_iterate.tsv > matrix_fastq_sequence_iterate.tsv
```

### Plot

Recommended you use RStudio to plot, place the matrix file to the same folder of Rscripts. Prior to plotting, you should install ggplot2 and patchwork.

Select the work directory where matrix and rscripts located in

```r
setwd(dir.choose())
```

Plot for building index

```r
source('Figure_for_build_index.R')
```

Plot for random access

```r
source('Figure_for_random_access.R')
```

Plot for sequence iteration

```r
source('Figure_for_sequence_iteration.R')
```