# [IDseq](https://idseq.net/) · [![GitHub license](https://img.shields.io/badge/license-MIT-brightgreen.svg)](https://github.com/chanzuckerberg/idseq-web/blob/master/LICENSE) ![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)

![logo](https://assets.idseq.net/Logo_Black.png)

#### Infectious Disease Sequencing Platform
IDseq is an unbiased global software platform that helps scientists identify pathogens in metagenomic sequencing data.

- **Discover** - Identify the pathogen landscape
- **Detect** - Monitor and review potential outbreaks
- **Decipher** - Find potential infecting organisms in large datasets

A collaborative open project of [Chan Zuckerberg Initiative](https://www.chanzuckerberg.com/) and [Chan Zuckerberg Biohub](https://czbiohub.org).

Check out our repositories:
- [idseq-web](https://github.com/chanzuckerberg/idseq-web) - Frontend portal
- [idseq-workflows](https://github.com/chanzuckerberg/idseq-workflows) - Bioinformatics workflows
- [idseq-cli](https://github.com/chanzuckerberg/idseq-cli) - Command line upload interface
- [idseq-bench](https://github.com/chanzuckerberg/idseq-bench) - Pipeline benchmarking tools (here)

# idseq-bench
Benchmark generator for the [IDseq Portal](https://idseq.net).

So far just a thin wrapper around [InSilicoSeq](https://insilicoseq.readthedocs.io/en/latest/).

## setup
```
pip3 install git+https://github.com/chanzuckerberg/idseq-bench.git --upgrade
```

## running
```
idseq-bench-generate config_file.yaml
```

This produces gzipped fastq files along with the config files used to generate them. You can upload the fastq files to the [IDseq Portal](https://idseq.net) via [idseq-cli](https://github.com/chanzuckerberg/idseq-cli).

## help
```
idseq-bench-generate -h
```


## selecting organisms and chromosomes
Create a YAML file in the following format:
```
# A readable name for the benchmark
description: List of relevant genomes to use on standard benchmarks
# Number of reads per organism
reads_per_organism: 10000
# The sequencer model to emulate (determines the error model used by iss)
# Possible values: novaseq, miseq, hiseq
# It will generate one benchmark per specified model
models:
  - hiseq
abundance: uniform
genomes:
  - category: fungi
    organism: aspergillus_fumigatus
    lineage:
      - level: subspecies
        tax_id: 330879
      - level: species
        tax_id: 746128
      - level: genus
        tax_id: 5052
      - level: family
        tax_id: 1131492
    versioned_accession_ids:
      - NC_007194.1
      - NC_007195.1
      - NC_007196.1
      - NC_007197.1
      - NC_007198.1
      - NC_007199.1
      - NC_007200.1
      - NC_007201.1
    genome_assembly_url: https://www.ncbi.nlm.nih.gov/genome/18?genome_assembly_id=22576
```

See more examples in the examples folder.
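
To sanity-check a config before generating reads, the file can be read with any YAML library. Below is a minimal sketch using PyYAML (an assumption; it is not the loader idseq-bench uses internally) that summarizes what such a config would generate:
```
# Minimal sketch (not the idseq-bench loader): read a benchmark config in
# the format above and summarize what it would generate.
import yaml  # assumes PyYAML is installed

with open("config_file.yaml") as f:
    config = yaml.safe_load(f)

print(config["description"])
print("reads per organism:", config["reads_per_organism"])
print("error models:", ", ".join(config["models"]))

for genome in config["genomes"]:
    species = next(e["tax_id"] for e in genome["lineage"] if e["level"] == "species")
    accessions = genome["versioned_accession_ids"]
    print(f'{genome["category"]}/{genome["organism"]}'
          f' (species tax_id {species}): {len(accessions)} accession(s)')
```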

## tweaking InSilicoSeq options
You can select different sets of [error models](https://insilicoseq.readthedocs.io/en/latest/iss/model.html).

The generated filenames will include the package version used to create them.

## interpreting the output
Each output file name reflects the parameters used to generate it, like so:
```
norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v0.1.0__[R1, R2].fastq.gz
  -- number of organisms: 6
  -- number of accessions: 27
  -- distribution: uniform per organism
  -- error model: hiseq
  -- package version: 0.1.0
```
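
If you need those parameters programmatically, the naming scheme is regular enough to parse. A small, hypothetical helper (not part of idseq-bench) might look like:
```
# Hypothetical helper: split a generated filename back into the parameters
# encoded in it (naming scheme as shown above).
import re

NAME_RE = re.compile(
    r"norg_(?P<norg>\d+)__nacc_(?P<nacc>\d+)__(?P<distribution>.+?)"
    r"__(?P<model>\w+)_reads__v(?P<version>[\d.]+)__(?P<pair>R[12])\.fastq\.gz"
)

m = NAME_RE.match(
    "norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v0.1.0__R1.fastq.gz"
)
print(m.groupdict())
# {'norg': '6', 'nacc': '27', 'distribution': 'uniform_weight_per_organism',
#  'model': 'hiseq', 'version': '0.1.0', 'pair': 'R1'}
```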

We generate a summary file for each pair of fastqs, indicating the read count per organism
and the average coverage of the organism's genome.  Each read pair counts as 2 reads / 300 bases,
matching InSilicoSeq and IDseq conventions.
```
READS  COVERAGE    LINEAGE                                          GENOME
----------------------------------------------------------------------------------------------------------------------
16656    215.3x    benchmark_lineage_0_37124_11019_11018            viruses__chikungunya__37124
16594      0.1x    benchmark_lineage_330879_746128_5052_1131492     fungi__aspergillus_fumigatus__330879
16564    352.1x    benchmark_lineage_0_463676_12059_12058           viruses__rhinovirus_c__463676
16074      0.5x    benchmark_lineage_1125630_573_570_543            bacteria__klebsiella_pneumoniae_HS11286__1125630
15078      0.8x    benchmark_lineage_93061_1280_1279_90964          bacteria__staphylococcus_aureus__93061
14894      0.1x    benchmark_lineage_36329_5833_5820_1639119        protista__plasmodium_falciuparum__36329
```
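
As a rough consistency check on the COVERAGE column: under that 2 reads / 300 bases convention (150 bases per read), coverage is simply total bases over genome length. A minimal sketch with an assumed genome length:
```
# Back-of-the-envelope coverage check (assumption: 150 bases per read,
# since a pair counts as 2 reads / 300 bases).
def approx_coverage(read_count, genome_length_bp, bases_per_read=150):
    return read_count * bases_per_read / genome_length_bp

# ~16,656 reads over a ~11.8 kb genome comes out around 210x, in the same
# ballpark as the chikungunya row above.
print(f"{approx_coverage(16656, 11_800):.1f}x")
```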
We alter the read IDs generated by ISS to satisfy the input requirements of tools like [STAR](https://github.com/alexdobin/STAR).
This requires stripping the `_1` and `_2` pair indicators from all read IDs, so that both reads in a pair share the exact same
read ID.  Each read ID then gets a serial number and a tag identifying the taxonomic lineage of the organism
the read was sourced from, like so:
```
@NC_016845.1_503__benchmark_lineage_1125630_573_570_543__s0000001169
```
This is helpful for tracking reads through complex bioinformatics pipelines and for
scoring results.  We assume the pipelines do not cheat by inspecting those tags.
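
A tag in that shape is easy to recover downstream when scoring. Here is a minimal, hypothetical parser (not code from this package) that pulls the source accession, lineage tax IDs, and serial number back out of a read ID:
```
# Hypothetical parser: recover the benchmark_lineage tag and serial number
# from a rewritten read ID of the form shown above.
import re

READ_ID_RE = re.compile(
    r"@(?P<source>.+?)__benchmark_lineage_(?P<lineage>[\d_]+)__s(?P<serial>\d+)"
)

read_id = "@NC_016845.1_503__benchmark_lineage_1125630_573_570_543__s0000001169"
m = READ_ID_RE.match(read_id)
subspecies, species, genus, family = (int(t) for t in m["lineage"].split("_"))
print(m["source"], species, genus, family, int(m["serial"]))
# NC_016845.1_503 573 570 543 1169
```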

An even more detailed summary, including all ISS options, is generated in JSON format.

## For IDseq developers: automated testing of IDseq Portal

Just upload an output folder to `s3://idseq-bench/<next-number>` and add
an entry for it to `s3://idseq-bench/config.json` specifying the frequency and the environments in which that test should run.
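
One way to upload such a folder is with the AWS SDK; here is a minimal sketch using boto3 (an assumption, any S3 upload method works), with `<next-number>` left as a placeholder:
```
# Minimal sketch: upload a generated benchmark folder to s3://idseq-bench/<next-number>.
# Assumes boto3 and AWS credentials are configured.
import os
import boto3

s3 = boto3.client("s3")
output_dir = "path/to/generated_benchmark"  # hypothetical local output folder
benchmark_prefix = "<next-number>"          # the next unused benchmark number

for root, _dirs, files in os.walk(output_dir):
    for name in files:
        local_path = os.path.join(root, name)
        key = f"{benchmark_prefix}/{os.path.relpath(local_path, output_dir)}"
        s3.upload_file(local_path, "idseq-bench", key)
```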

## scoring an IDseq Portal Run

After a benchmark sample has completed running through the IDseq Portal, the QC pass rate and recall per benchmark organism can be scored by running, e.g.,
```
idseq-bench-score <project_id> <sample_id> <pipeline_version:major.minor>
```
which produces JSON-formatted output like so:
```
{
  "per_rank": {
    "family": {
      "NT": {
        "543": {
          "total_reads": 10000,
          "post_qc_reads": 8476,
          "recall_per_read": {
            "count": 8461,
            "value": 0.9982302973100519
          }
        },
        ...
        "accuracy": {
          "count": 80137,
          "value": 0.8820803522289489
        },
        "total_simulated_taxa": 12,
        "total_correctly_identified_taxa": 11,
        "total_identified_taxa": 539,
        "recall": 0.9166666666666666,
        "precision": 0.02040816326530612,
        "f1-score": 0.03992740471869328
        "aupr": 0.9751017478206347,
        "l1_norm": 0.8389712437238702,
        "l2_norm": 0.07556827265305112
      },
      "NR": {
        "543": {
          "total_reads": 10000,
          "post_qc_reads": 8476,
          "recall_per_read": {
            "count": 7951,
            "value": 0.9380604058518169
          }
        },
        ...
      },
      "concordance": {
        "11018": {
          "count": 16048,
          "value": 1.9154929577464788
        },
        ...
      },
    },
    "genus": {
      "NT": {
        "570": {
          ...
```
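
Since the scores are plain JSON, they are easy to post-process. A minimal sketch, assuming the output was redirected to a file named `scores.json` (a hypothetical name), that prints the headline recall and precision per rank and alignment database:
```
# Minimal sketch: summarize idseq-bench-score output that was saved to
# scores.json (hypothetical filename), per rank and database (NT/NR).
import json

with open("scores.json") as f:
    scores = json.load(f)

for rank, databases in scores["per_rank"].items():
    for db in ("NT", "NR"):
        metrics = databases.get(db, {})
        if "recall" in metrics and "precision" in metrics:
            print(f'{rank:>8} {db}: recall={metrics["recall"]:.3f} '
                  f'precision={metrics["precision"]:.3f}')
```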

### Local files

For users who lack direct access to S3, scoring also works on a local download of the sample results.  However, the locally downloaded files must be organized into versioned subfolders that mirror the structure used in S3.  Use the option `-p <local_path>` or `--local-path <local_path>` to score from the local folder instead.

### Comparison to ground truth

Users can also compare any sample against a provided ground truth file. This should be a TSV file with the following fields (no header row):
```
<taxon_id>	<absolute_abundance>	<relative_abundance>	<rank>	<taxon_name>
```

e.g.:

```
366648	100000.00000	0.01746	species	Xanthomonas fuscans
1685	100000.00000	0.01746	species	Bifidobacterium breve
486	100000.00000	0.01746	species	Neisseria lactamica
2751	100000.00000	0.01746	species	Carnobacterium maltaromaticum
28123	100000.00000	0.01746	species	Porphyromonas asaccharolytica
118562	100000.00000	0.01746	species	Arthrospira platensis
...
```
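
A ground truth file in this shape can be written with any TSV tooling; for example, a minimal sketch with purely illustrative values (taken from the rows above):
```
# Minimal sketch: write a ground truth TSV in the five-column format above
# (tab-separated, no header row). Values are purely illustrative.
import csv

truth = [
    # (taxon_id, absolute_abundance, relative_abundance, rank, taxon_name)
    (366648, 100000.0, 0.01746, "species", "Xanthomonas fuscans"),
    (1685, 100000.0, 0.01746, "species", "Bifidobacterium breve"),
]

with open("truth_file_1.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(truth)
```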

To compare against a ground truth, run the scoring script with the following options:

```
idseq-bench-score <project_id> <sample_id> <pipeline_version:major.minor> -t <truth_file_1.tsv> <truth_file_2.tsv> ...
```

## help
```
idseq-bench-score -h
```