# QCumber
Quality control, quality trimming, adapter removal and sequence content check of NGS data.

>Version: 2.1.1 <br>
>Contact: BI-Support@rki.de <br>
>Documentation updated: 24.07.2017



## Installation:

Install the latest stable version via the Bioconda channel. It is assumed that the following channels are activated:

* bioconda
* r
* ostrokach
* conda-forge


```sh
conda install qcumber
```
and update with
```sh
conda update qcumber
```

Further prerequisite tools are pdflatex and texlive-latex-extra for PDF reports.

## Introduction
QCumber is a pipeline for quality control, trimming and sequence content check of NGS data. It includes parameter optimization of trimming and visualization of the output as an interactive HTML report.
Note that mapping and read classification yield only preliminary results, and that paired-end data are treated as single-end.

The workflow used in the pipeline is visualized in the following chart:

![Workflow](workflow.png "Workflow image")

QCumber needs miniconda3 to build the pipeline and pdflatex to write sample reports. The following tools are used:

| Tool name | Version | Pubmed ID |
|-----------|---------|-----------|
| [snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) | 3.12.0 ||
| [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)    | 0.11.5  ||
| [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) | 0.36 | [24695404](https://www.ncbi.nlm.nih.gov/pubmed/24695404)|
| [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)   | 2.2.9   | [22388286](https://www.ncbi.nlm.nih.gov/pubmed/22388286) |
| [Kraken](http://ccb.jhu.edu/software/kraken/) | 0.10.5 | [24580807](https://www.ncbi.nlm.nih.gov/pubmed/24580807) |

## Tutorial

Calling the pipeline with the option `--help` provides a help message listing all options:

```sh
QCumber-2 --help
```

A small example dataset is provided in the data folder in /pipelines/datasets/benchmarking/qc_map_var/ (not published yet). The following example uses this dataset to demonstrate a basic run of the pipeline.

A basic pipeline run is as follows:
```sh
QCumber-2 --read1 /pipelines/datasets/benchmarking/qc_map_var/data/Sample1_S1_L001_R1_001.fastq.gz --read2 /pipelines/datasets/benchmarking/qc_map_var/data/Sample1_S1_L001_R2_001.fastq.gz \
--reference /pipelines/datasets/benchmarking/qc_map_var/references/reference.fasta --output qcumber_output
```

Input data can be given with `-1` and `-2` for single files or with `--input` for a project folder. QCumber automatically detects read pairs if the Illumina filenames match the pattern `<samplename>_<lane>_<R1|R2>_<number>`.
For the preliminary mapping step, a reference in fasta format must be given; otherwise QCumber skips the mapping step. If `--output` is not defined, the results folder **QCResults** is written to the working directory.

The following usage is for batch analysis and parameter optimization adjusted to mapping as downstream analysis (`--trimBetter`, see chapter 'Functions' for more details):

```sh
QCumber-2 --input /pipelines/datasets/benchmarking/qc_map_var/ --adapter NexteraPE-PE --trimBetter mapping --output qcumber_batch_output
```

If you only need a subset of the files in your folder, you can also use a wildcard pattern in `--input`. This example selects all files starting with Sample1:
```sh
QCumber-2 --input /pipelines/datasets/benchmarking/qc_map_var/data/Sample1*
```

QCumber-2 outputs **<output\_folder>/config.yaml** for each run, which can be used to rerun the analysis or to define default parameters.

```sh
QCumber-2 --config config.yaml
```

Additional parameters given on the command line override the values in the config file, so **config.yaml** can also serve as a set of default parameters. Its structure is simple: all input parameters are listed in the format `<parameter_name> : <parameter>`. For instance:

 trimBetter: mapping <br>
 threads: 10 <br>
 save_mapping: true  <br>


```sh
QCumber-2 --config config.yaml --trimBetter assembly --output results_folder
```
In this case, the trimBetter parameters are optimized for assembly, i.e. trimming is more aggressive than for mapping (see section 'Functions' for further details).
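
The precedence described here (command-line values win over **config.yaml** values) can be illustrated with a minimal Python sketch. The function `merge_parameters` and the dictionaries are hypothetical and only illustrate the rule; they are not taken from the QCumber source code.

```python
def merge_parameters(config_values, cli_values):
    """Return the effective parameters: command-line values override config values."""
    effective = dict(config_values)
    # Only options that were actually given on the command line (not None)
    # replace the corresponding entries from the config file.
    effective.update({key: value for key, value in cli_values.items() if value is not None})
    return effective


# Values as they might be read from config.yaml (taken from the example above).
config_values = {"trimBetter": "mapping", "threads": 10, "save_mapping": True}
# Values from the command line; threads was not given, so it is None here.
cli_values = {"trimBetter": "assembly", "output": "results_folder", "threads": None}

effective = merge_parameters(config_values, cli_values)
print(effective)
```

In this sketch `trimBetter` ends up as `assembly`, while `threads` and `save_mapping` keep their config values.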

## Functions


#### Get information from Illumina Sequence Analysis Viewer
> Long option: `--sav <folder>` <br>
> Short option: `-w <folder>`

This option requires that the provided folder contain:

* CompletedJobInfo.xml
* GenerateFASTQRunStatistics.xml
* RunCompletionStatus.xml
* RunInfo.xml
* RunParameters.xml
* InterOp/ControlMetricsOut.bin
* InterOp/CorrectedIntMetricsOut.bin
* InterOp/ErrorMetricsOut.bin
* InterOp/ExtractionMetricsOut.bin
* InterOp/IndexMetricsOut.bin
* InterOp/QMetricsOut.bin
* InterOp/TileMetricsOut.bin

QCumber takes the information from these files and converts it into a human-readable table. Furthermore, plots are generated equivalent to the SAV section "Data by Cycle" for FWHM, intensity and %base, as well as to the section "Data by Lane" for prephasing, phasing and cluster density. Both tables and plots can be found in **QCResults/batch_report.html** under the section "Sequencer Information". Additionally, a PDF report for SAV, **QCResults/SAV.pdf**, is generated.
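
Before starting a run with `--sav`, it can help to check that the folder is complete. The following Python sketch is a hypothetical pre-flight check, not part of QCumber; the helper name `missing_sav_files` is an assumption, and the file list is copied from the list above.

```python
import tempfile
from pathlib import Path

# Files required by --sav, as listed in this section.
REQUIRED_SAV_FILES = [
    "CompletedJobInfo.xml",
    "GenerateFASTQRunStatistics.xml",
    "RunCompletionStatus.xml",
    "RunInfo.xml",
    "RunParameters.xml",
    "InterOp/ControlMetricsOut.bin",
    "InterOp/CorrectedIntMetricsOut.bin",
    "InterOp/ErrorMetricsOut.bin",
    "InterOp/ExtractionMetricsOut.bin",
    "InterOp/IndexMetricsOut.bin",
    "InterOp/QMetricsOut.bin",
    "InterOp/TileMetricsOut.bin",
]


def missing_sav_files(folder):
    """Return the required files (relative paths) that are absent from folder."""
    root = Path(folder)
    return [rel for rel in REQUIRED_SAV_FILES if not (root / rel).is_file()]


# Demonstration on a temporary folder that contains only RunInfo.xml.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "RunInfo.xml").touch()
    missing = missing_sav_files(tmp)
    print(f"{len(missing)} of {len(REQUIRED_SAV_FILES)} files missing")
```

An empty list returned by `missing_sav_files` means that all files named in this section are present.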

#### Input
> Long option: `--input <folder>` <br>
> Short option: `-i <folder>`

Input sample folder. Illumina files should be gzipped fastq files whose names end with `_<lane>_<R1|R2>_<number>`, e.g. Sample_12_345_R1_001.fastq.gz, so that QCumber can find the matching read pairs. If the names do not match this pattern, all files are treated as single-end data. This is always the case for IonTorrent data, where the input files are in bam format.
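
How such a filename pattern can be used to pair files is sketched below in Python. The regular expression `PAIR_PATTERN` and the function `group_read_pairs` are assumptions made for illustration; the actual matching logic in QCumber may differ, for example in how it handles a folder in which only some files match.

```python
import re
from collections import defaultdict

# Hypothetical regex for names ending in _<lane>_<R1|R2>_<number>.fastq.gz
PAIR_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<lane>[^_]+)_(?P<read>R[12])_(?P<number>\d+)\.fastq\.gz$"
)


def group_read_pairs(filenames):
    """Split filenames into a list of (R1, R2) pairs and a list of single-end files."""
    groups = defaultdict(dict)
    single = []
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match is None:
            single.append(name)  # name does not follow the pattern
            continue
        key = (match["sample"], match["lane"], match["number"])
        groups[key][match["read"]] = name
    paired = []
    for key in sorted(groups):
        reads = groups[key]
        if "R1" in reads and "R2" in reads:
            paired.append((reads["R1"], reads["R2"]))
        else:
            single.extend(reads.values())  # mate is missing
    return paired, single


files = [
    "Sample1_S1_L001_R1_001.fastq.gz",
    "Sample1_S1_L001_R2_001.fastq.gz",
    "unpaired_reads.fastq.gz",
]
paired, single = group_read_pairs(files)
print(paired)
print(single)
```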

#### Read1
> Long option: `--read1 <filename>` <br>
> Short option: `-1 <filename>`

Filename for forward reads or one single end file. This is expected to be a .fastq.gz file.

#### Read2
> Long option: `--read2 <filename>` <br>
> Short option: `-2 <filename>`

Filename for reverse read file. This option does not check for file pattern. This is expected to be a .fastq.gz file.

#### Sequence technology
> Long option: `--technology <string>` <br>
> Short option: `-T <string>` <br>
> Options: {Illumina, IonTorrent}

If not set, QCumber determines the technology automatically by searching for fastq and bam files: the technology is set to IonTorrent if all files are bam files, and to Illumina otherwise.
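
The rule can be written down in a few lines of Python. This is a sketch of the rule as described here, not the QCumber implementation; the function name `detect_technology` and the choice to report Illumina for an empty file list are assumptions.

```python
def detect_technology(filenames):
    """Return 'IonTorrent' if every input file is a bam file, else 'Illumina'."""
    # Assumption: an empty file list falls back to Illumina.
    if filenames and all(name.lower().endswith(".bam") for name in filenames):
        return "IonTorrent"
    return "Illumina"


print(detect_technology(["run1.bam", "run2.bam"]))
print(detect_technology(["Sample1_S1_L001_R1_001.fastq.gz", "run1.bam"]))
```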

#### Optimize trimming parameter
> Long option: `--trimBetter <string>` <br>
> Options: {assembly, mapping, default}

Optimizes the trimming parameters using 'Per base sequence content' from FastQC. This option is not recommended for amplicons. After quality trimming, it removes all positions at the beginning and end of the reads that show an uneven distribution of bases (as is characteristic for Nextera). The `--trimBetter_threshold` value in the presets below sets by how much the most abundant base at a position may exceed the least abundant base: if `--trimBetter_threshold` is set to 0.15, the abundance of the most abundant base in a cycle may be at most 1.15 times that of the least abundant base; otherwise the cycle is trimmed.
The option *assembly* trims more aggressively than *mapping*, i.e. it tolerates even smaller fluctuations in 'Per base sequence content'.
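
The threshold rule can be made concrete with a short Python sketch. It assumes that the per-cycle base percentages are already available as dictionaries; the functions `is_even` and `keep_range` are hypothetical and only illustrate the rule stated above, not the way QCumber reads the FastQC output.

```python
def is_even(base_fractions, threshold):
    """True if the most abundant base is at most (1 + threshold) times the least abundant."""
    highest = max(base_fractions.values())
    lowest = min(base_fractions.values())
    return highest <= (1 + threshold) * lowest


def keep_range(per_cycle_fractions, threshold):
    """Return (start, end) of the cycles kept after dropping uneven cycles at both ends."""
    start = 0
    end = len(per_cycle_fractions)
    while start < end and not is_even(per_cycle_fractions[start], threshold):
        start += 1
    while end > start and not is_even(per_cycle_fractions[end - 1], threshold):
        end -= 1
    return start, end


cycles = [
    {"A": 40.0, "C": 10.0, "G": 20.0, "T": 30.0},  # uneven: 40 > 1.15 * 10
    {"A": 26.0, "C": 24.0, "G": 25.0, "T": 25.0},  # even: 26 <= 1.15 * 24
    {"A": 25.0, "C": 25.0, "G": 25.0, "T": 25.0},  # even
    {"A": 35.0, "C": 15.0, "G": 25.0, "T": 25.0},  # uneven: 35 > 1.15 * 15
]
print(keep_range(cycles, 0.15))
```

With a threshold of 0.15 the first and last cycle are dropped in this example. A stricter threshold such as 0.05 would also drop the second cycle, which mirrors why a lower threshold trims more aggressively.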

The parameters are written in config/parameter.txt and vary with trimBetter type and sequencing platform:

* default: `--trimOption 'SLIDINGWINDOW:4:20' --trimBetter_threshold 0.15`
* Illumina - Assembly: `--trimOption 'SLIDINGWINDOW:4:25' --trimBetter_threshold 0.1`
* Illumina - Mapping: `--trimOption 'SLIDINGWINDOW:4:15' --trimBetter_threshold 0.15`
* IonTorrent - Assembly: `--trimOption 'SLIDINGWINDOW:4:15' --trimBetter_threshold 0.2`
* IonTorrent - Mapping: `--trimOption 'SLIDINGWINDOW:4:15' --trimBetter_threshold 0.25`


#### Trimbetter threshold
> Long option: `--trimBetter_threshold <float>` <br>
> Short option: `-b <float>`

This option requires `--trimBetter` to be set. It overrides the threshold for the maximal allowed fluctuation of the base content, i.e. the threshold values of *assembly*, *mapping* and *default* are replaced by the given value.
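
How the presets from the previous section and this override interact can be sketched as a lookup table in Python. The values are copied from the preset list above; the function `resolve_trimming` and the table layout are hypothetical and do not reflect the format of config/parameter.txt.

```python
# Presets copied from the list in the section "Optimize trimming parameter".
PRESETS = {
    ("Illumina", "assembly"): ("SLIDINGWINDOW:4:25", 0.1),
    ("Illumina", "mapping"): ("SLIDINGWINDOW:4:15", 0.15),
    ("IonTorrent", "assembly"): ("SLIDINGWINDOW:4:15", 0.2),
    ("IonTorrent", "mapping"): ("SLIDINGWINDOW:4:15", 0.25),
}
DEFAULT_PRESET = ("SLIDINGWINDOW:4:20", 0.15)


def resolve_trimming(technology, trim_better, threshold_override=None):
    """Return (trimOption, threshold) for a technology and trimBetter type."""
    trim_option, threshold = PRESETS.get((technology, trim_better), DEFAULT_PRESET)
    if threshold_override is not None:
        # An explicit --trimBetter_threshold replaces the preset threshold.
        threshold = threshold_override
    return trim_option, threshold


print(resolve_trimming("Illumina", "assembly"))
print(resolve_trimming("IonTorrent", "mapping", threshold_override=0.3))
```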

#### Minimal read length
> Long option: `--minlen <int>` <br>
> Short option: `-m <int>` <br>
> Default: 50

Minlen parameter for Trimmomatic. Drops reads shorter than minlen.

#### Only Trim Adapters
> Long option: `--only_trimm_adapters` <br>
> Short option: `-A`

Only removes adapters and invalidates any additional Trimmomatic/trimBetter parameters.

#### Additional Trimmomatic parameters
> Long option: `--trimOption <string>` <br>
> Short option: `-O <string>`

#### Illuminaclip
> Long option: `--illuminaclip <int:int:int>` <br>
> Short option: `-L <int:int:int>` <br>
> Default: 2:30:10

Values passed to Trimmomatic's ILLUMINACLIP step: `<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>`.

#### Adapter removal
> Long option: `--adapter <string>` <br>
> Short option: `-a <string>` <br>
> Options: {TruSeq2-PE, TruSeq2-SE, TruSeq3-PE, TruSeq3-SE, TruSeq3-PE-2, NexteraPE-PE} <br>
> Default: all

Adapter sequence for Trimmomatic. Suggested adapter sequences are provided for TruSeq2 (as used in GAII machines) and TruSeq3 (as used by HiSeq and MiSeq machines), for both single-end and paired-end mode (check Trimmomatic manual). If not set, all adapters are used for trimming.


#### Reference
> Long option: `--reference <fasta-file>` <br>
> Short option: `-r <fasta-file>`

Map reads against reference. Reference needs to be in fasta-format.

#### Bowtie2 index
> Long option: `--index <bt2-index>` <br>
> Short option: `-I <bt2-index>`

Bowtie2 index if available. Otherwise, set --reference for mapping.

#### Save mapping
> Long option: `--save_mapping` <br>
> Short option: `-S` <br>
> Default: False

Saves the mapping file in sam format. By default, only mapping statistics are saved.

#### Kraken DB
> Long option: `--kraken_db <db>` <br>
> Short option: `-d <db>`

Path to the Kraken database. The folder has to contain database.kdb.

#### Kraken (un)classified read output
> Long option: `--kraken_classified_out` <br>

Kraken (un)classified-out option. If set, both the `--classified-out` and `--unclassified-out` options of Kraken are set. Default: False.

#### Nokraken
> Long option: `--nokraken` <br>
> Short option: `-K`

Skip Kraken classification.

#### Notrimming
> Long option: `--notrimming` <br>
> Short option: `-Q`

Skip trimming step.

#### Config
> Long option: `--config <yaml-file>` <br>
> Short option: `-c <yaml-file>` <br>
> Default: config/config.txt in the installation directory of QCumber-2 if it exists

Config file to (re-)run the pipeline. Additional parameters on the command line override the arguments in the config file.


#### Threads
> Long option: `--threads <int>` <br>
> Short option: `-t <int>` <br>
> Default: 4

Number of threads.

#### Output
> Long option: `--output <folder>` <br>
> Short option: `-o <folder>`

Output folder. If not set, the results folder **QCResults** is written to the working directory.


#### Rename
> Long option: `--rename RENAME` <br>
> Short option: `-R RENAME`

Tab-separated file with two columns: `<old sample name> <new sample name>`. QCumber replaces the old filename with the new one. If it does not find unique replacements, it will skip renaming for this sample.
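
The renaming rule, including the skip on non-unique matches, can be sketched in Python as follows. The functions `load_rename_table` and `rename_sample` are hypothetical, and matching the old name as a substring of the filename is an assumption; QCumber may match names differently.

```python
import csv
import io


def load_rename_table(tsv_text):
    """Parse two tab-separated columns into a list of (old, new) name pairs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [(row[0], row[1]) for row in reader if len(row) == 2]


def rename_sample(filename, table):
    """Replace the old sample name if exactly one table entry matches, else keep the name."""
    hits = [(old, new) for old, new in table if old in filename]
    if len(hits) != 1:
        return filename  # no match or ambiguous match: skip renaming
    old, new = hits[0]
    return filename.replace(old, new, 1)


table = load_rename_table("Sample1_S1\tpatient_A\nSample2_S2\tpatient_B\n")
print(rename_sample("Sample1_S1_L001_R1_001.fastq.gz", table))

# Two entries match the same file, so the name is left unchanged.
ambiguous = load_rename_table("Sample\tx\nSample1\ty\n")
print(rename_sample("Sample1_S1_L001_R1_001.fastq.gz", ambiguous))
```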

#### Additional snakemake commands
All snakemake parameters (excluding `--cores`) can be passed to QCumber. For example, `--notemp` keeps all temporary files of the analysis, and `--forceall` forces the pipeline to rerun all analysis steps even if the output already exists.


## Output

By default, the pipeline generates the following files in the output folder:

* **QCResults**
    * < PDF report per sample >
    * **batch_report.html** *(HTML report for entire project; it integrates kraken.html, so if you move this file, make sure to move kraken.html to the same folder)*
    * **kraken.html**
    * **FastQC**
        * **Raw**
            * < output folder(s) from FastQC >
        * **Trimmed**
            * < output folder(s) from FastQC >
    * **Trimmed**
        * < trimmed reads (.fastq.gz) >
    * **Mapping**
        * < sam files >
    * **Classification**
        * < Kraken plots >
        * < textfile of classified reads (.translated) >
        * **kraken_batch_result.csv** (table of classified species [%] )
* config.yaml


# License

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License, version 3
as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public
License along with this program.  If not, see
http://www.gnu.org/licenses/.