1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203
|
# Using MultiQC in pipelines
MultiQC has been designed to be placed at the end of bioinformatics workflows
and works well when it's a final step in a pipeline. Here you can find a few
tips about integration with different workflow tools.
I'll use FastQC as an example input in all of these examples as it's a common
use case. However, the concepts should apply for any of the tools that MultiQC supports.
> Remember that you can use [Custom Content](https://multiqc.info/docs/#custom-content)
> feature to easily collect pipeline-specific metadata (software version numbers,
> pipeline run-time data, links to documentation) in to a format that can be inserted
> in to your report.
If you know exactly which modules will be used by MultiQC, you can use the
`-m`/`--modules` flag to specify just these. This will speed up MultiQC a little.
This will probably only make a noticeable impact if your pipeline has thousands
of input files for MultiQC.
In the following, we show examples for [Nextflow](#nextflow) and [Snakemake](#snakemake) integration.
## Nextflow
An example mini-pipeline for [nextflow](https://www.nextflow.io/) which runs FastQC
and MultiQC is below.
See the [nf-core](https://nf-co.re/) pipelines for lots of examples of full nextflow
pipelines that use MultiQC.
```groovy
#!/usr/bin/env nextflow
params.reads = "data/*{1,2}.fastq.gz"
Channel.fromFilePairs( params.reads ).into { read_files_fastqc }
process fastqc {
input:
set val(name), file(reads) from read_files_fastqc
output:
file "*_fastqc.{zip,html}" into fastqc_results
script:
"""
fastqc -q $reads
"""
}
process multiqc {
input:
file ('fastqc/*') from fastqc_results.collect().ifEmpty([])
output:
file "multiqc_report.html" into multiqc_report
file "multiqc_data"
script:
"""
multiqc .
"""
}
```
You will need to specify channels for all files that MultiQC needs to run, so that nextflow
stages them correctly in the MultiQC work directory.
Note that `.collect()` is needed to make MultiQC run once for all upstream outputs.
The `.ifEmpty([])` add on isn't really needed here, but is helpful in larger pipelines where
some processes may be optional. Without this, if any channels are not processed then MultiQC
won't run.
### Clashing input filenames
If a nextflow process tries to stage more than one input file with an identical filename,
it will throw an error. Putting inputs into their own subfolder (`file ('fastqc/*')`) in
the above examples is not really needed, but reduces the chance of input filename clashes.
If you're using a tool that gives the same filename to each file that MultiQC uses, you'll
need to tell nextflow to rename the inputs to prevent clashes.
For example, [StringTie](https://ccb.jhu.edu/software/stringtie/) prints statistics to
STDOUT that MultiQC uses to generate reports. We can easily collect this in nextflow by
using the `.command.log` file that nextflow saves as an output file from the process
(`file ".command.log" into stringtie_log`). However, now every sample has the same filename
for MultiQC.
We get around this by using [dynamic input file names](https://www.nextflow.io/docs/latest/process.html#dynamic-input-file-names)
with nextflow:
```groovy
file ('stringtie/stringtie_log*') from stringtie_log.collect().ifEmpty([])
```
This `file` pattern renames each stringtie log file to `stringtie_log1`,
`stringtie_log2` and so on, ensuring that we avoid any filename clashes.
Note that MultiQC finds output from some tools based on their filename, so use with caution
(you may need to define some custom [module search patterns](https://multiqc.info/docs/#module-search-patterns)).
### Custom run name
When you launch nextflow, you can use the `-name` command line flag (single hyphen) to give
a name to that specific pipeline run. You can configure your pipeline to pass this on to
MultiQC. It can then be used as the report title and filename.
Here's an example snippet for this:
```groovy
custom_runName = params.name
if( !(workflow.runName ==~ /[a-z]+_[a-z]+/) ){
custom_runName = workflow.runName
}
process multiqc {
input:
file ('fastqc/*') from ch_fastqc_results_for_multiqc.collect().ifEmpty([])
output:
file "*_report.html" into ch_multiqc_report
file "*_data"
script:
rtitle = custom_runName ? "--title \"$custom_runName\"" : ''
rfilename = custom_runName ? "--filename " + custom_runName.replaceAll('\\W','_').replaceAll('_+','_') + "_multiqc_report" : ''
"""
multiqc $rtitle $rfilename .
"""
}
```
Note that we use `params.name` as a placeholder, this gives the benefit that both `-name`
and the common typo `--name` work for this case.
There isn't an easy way to know if a custom value for `-name` has been given to nextflow,
but all default names are two lowercase words with a single underscore so if the name
matches this pattern then we ignore it.
### MultiQC config file
It can be nice to use a config file for MultiQC to add in some static content to reports
about your pipeline (eg. `report_comment`).
Remember that even this config file should also be in a nextflow channel,
so that nextflow correctly stages it (especially important when running on the cloud).
This snippet works with a `params` variable again, so that pipeline users can replace
the config file with one of their own if they wish.
```groovy
params.multiqc_config = "$baseDir/assets/multiqc_config.yaml"
Channel.fromPath(params.multiqc_config, checkIfExists: true).set { ch_config_for_multiqc }
process multiqc {
input:
file multiqc_config from ch_config_for_multiqc
file ('fastqc/*') from fastqc_results.collect().ifEmpty([])
output:
file "multiqc_report.html" into multiqc_report
file "multiqc_data"
script:
"""
multiqc --config $multiqc_config .
"""
}
```
## Snakemake
The best-practice for using MultiQC in [Snakemake](https://snakemake.readthedocs.io/) pipelines is to use a predefined [Snakemake wrapper](https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/multiqc.html). For example, see the following mini-pipeline:
```yaml
configfile: "config.yaml"
rule all:
input: "multiqc_report.html"
rule fastqc:
input: "data/{sample}.fastq.gz"
output: html="fastqc/{sample}.html",
zip="fastqc/{sample}.zip"
wrapper: "0.31.1/bio/fastqc"
rule multiqc:
input: expand("fastqc/{sample}.html", sample=config["samples"])
output: "multiqc_report.html"
wrapper: "0.31.1/bio/multiqc"
```
Snakemake wrappers not only deliver predefined and unit tested code for generating the requested output with the respective tool, but also define the required software stack in terms of conda packages.
You can run the above workflow as follows:
```bash
snakemake --use-conda
```
This first installs all the required tools into isolated conda environments, and then executes all necessary steps to create the target that is given in the top rule. In other words, it will execute FastQC twice (creating `fastqc/1.html` and `fastqc/2.html`) and MultiQC once, creating `multiqc_report.html`.
Of course, using conda is optional, but it greatly increases reproducibility.
Snakemake is not limited to wrappers (although its [wrapper repository](https://snakemake-wrappers.readthedocs.io) provides many in the field of bioinformatics), but also supports direct execution of [shell commands](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#rules) and integration of [custom scripts](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#external-scripts) (e.g., for plotting).
If you prefer to use MultiQC without a snakemake wrapper, you can see a minimal example on GitHub: [jakevc/snakemake_multiqc](https://github.com/jakevc/snakemake_multiqc). This has an example script and some test data for you to play with.
|