---
name: FastQC
url: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
description: >
  FastQC is a quality control tool for high throughput sequence data,
  written by Simon Andrews at the Babraham Institute in Cambridge.
---
The FastQC module parses results generated by
[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/),
a quality control tool for high throughput sequence data written
by Simon Andrews at the Babraham Institute.
FastQC generates an HTML report, which is what most people use when
they run the program. However, it also helpfully generates a file
called `fastqc_data.txt` which is relatively easy to parse.
A typical run will produce the following files:
```
mysample_fastqc.html
mysample_fastqc/
  Icons/
  Images/
  fastqc.fo
  fastqc_data.txt
  fastqc_report.html
  summary.txt
```
Sometimes the directory is zipped, leaving just `mysample_fastqc.zip`.
The FastQC MultiQC module looks for files called `fastqc_data.txt`
or ending in `_fastqc.zip`. If zip files are found, they are
read in memory and the `fastqc_data.txt` file inside is parsed.
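
As a rough illustration of reading a zipped report in memory, the sketch below uses Python's standard `zipfile` module to pull `fastqc_data.txt` out of an archive without unpacking it to disk. The function name `read_fastqc_data_from_zip`, the helper `build_demo_zip` and the demo archive contents are hypothetical and are not taken from the MultiQC source.

```python
import io
import zipfile


def read_fastqc_data_from_zip(zip_bytes: bytes) -> str:
    """Return the text of the first fastqc_data.txt found in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            # The file normally sits inside a "<sample>_fastqc/" directory
            if member.endswith("fastqc_data.txt"):
                return archive.read(member).decode("utf-8", errors="replace")
    raise FileNotFoundError("no fastqc_data.txt in archive")


def build_demo_zip() -> bytes:
    """Build a tiny in-memory archive with made-up contents for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "mysample_fastqc/fastqc_data.txt",
            "##FastQC\t0.0.0\nFilename\tmysample.fastq.gz\n",
        )
        archive.writestr("mysample_fastqc/summary.txt", "placeholder\n")
    return buffer.getvalue()


demo_text = read_fastqc_data_from_zip(build_demo_zip())
print(demo_text.splitlines()[1])
```

Matching on the file name suffix rather than a fixed path keeps the sketch independent of the sample name used for the directory inside the archive.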
:::note
The directory and zip file are often both present. To speed
up MultiQC execution, zip files will be skipped if the file name suggests
that they will share a sample name with data that has already been parsed.
:::
You can customise the patterns used for finding these files in your
MultiQC config (see [Module search patterns](#module-search-patterns)).
The code below shows the default file patterns:
```yaml
sp:
  fastqc/data:
    fn: "fastqc_data.txt"
  fastqc/zip:
    fn: "*_fastqc.zip"
```
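
To get a feel for which files the default patterns would pick up, the following sketch checks file names against them using Python's standard `fnmatch` module. It assumes the `fn` values behave like shell-style globs; the `patterns` dictionary and the `matching_keys` helper are illustrative names only.

```python
from fnmatch import fnmatch

# Default patterns copied from the config above
patterns = {
    "fastqc/data": "fastqc_data.txt",
    "fastqc/zip": "*_fastqc.zip",
}


def matching_keys(filename: str) -> list:
    """Return the search pattern keys whose glob matches the file name."""
    return [key for key, pattern in patterns.items() if fnmatch(filename, pattern)]


for candidate in ["fastqc_data.txt", "mysample_fastqc.zip", "mysample_fastqc.html"]:
    print(candidate, matching_keys(candidate))
```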
:::note
Sample names are discovered by parsing the line beginning
`Filename` in `fastqc_data.txt`, _not_ based on the FastQC report names.
:::
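
A minimal sketch of this lookup is shown below. It assumes that `fastqc_data.txt` is tab-separated and that the `Filename` line holds the name in its second field; the function name `sample_name_from_fastqc_data` and the example text are hypothetical, and any further cleaning of the name is not shown.

```python
from typing import Optional


def sample_name_from_fastqc_data(text: str) -> Optional[str]:
    """Return the value of the first line beginning with 'Filename', if any."""
    for line in text.splitlines():
        if line.startswith("Filename"):
            parts = line.split("\t", 1)
            if len(parts) == 2:
                return parts[1].strip()
    return None


# Made-up excerpt in the assumed layout
example = (
    "##FastQC\t0.0.0\n"
    ">>Basic Statistics\tpass\n"
    "#Measure\tValue\n"
    "Filename\tmysample.fastq.gz\n"
)
name = sample_name_from_fastqc_data(example)
print(name)
```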
### Theoretical GC Content
It is possible to plot a dashed line showing the theoretical GC content for a
reference genome. MultiQC comes with genome and transcriptome guides for Human
and Mouse. You can use these in your reports by adding the following MultiQC
config keys (see [Configuring MultiQC](http://multiqc.info/docs/#configuring-multiqc)):
```yaml
fastqc_config:
  fastqc_theoretical_gc: "hg38_genome"
```
Only one theoretical distribution can be plotted.
The following guides are available: _(txome = transcriptome)_
- `hg38_genome`
- `hg38_txome`
- `mm10_genome`
- `mm10_txome`
Alternatively, a custom theoretical guide can be used in reports. To do this,
create a file with `fastqc_theoretical_gc` in the filename and place it with your
analysis files. It should be tab delimited with the following format (column 1 = %GC,
column 2 = % of genome):
```bash
# FastQC theoretical GC content curve: YOUR REFERENCE NAME
0 0.005311768
1 0.004108502
2 0.004060371
3 0.005066476
[...]
```
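
If you want to check a custom file before using it, a small parser along these lines may help. It is only a sketch under the format described above: `#` lines are treated as comments, the reference name is assumed to follow the first colon in the header, and fields are split on any whitespace. The function name `parse_theoretical_gc` is hypothetical.

```python
from typing import List, Optional, Tuple


def parse_theoretical_gc(text: str) -> Tuple[Optional[str], List[Tuple[float, float]]]:
    """Parse a theoretical GC file into (reference name, [(%GC, % of genome)])."""
    name = None
    points = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Assumed header layout: "# <description>: <reference name>"
            if name is None and ":" in line:
                name = line.split(":", 1)[1].strip()
            continue
        fields = line.split()
        if len(fields) != 2:
            continue  # skip anything that is not a two-column row
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # skip rows with non-numeric values
    return name, points


example = (
    "# FastQC theoretical GC content curve: YOUR REFERENCE NAME\n"
    "0\t0.005311768\n"
    "1\t0.004108502\n"
    "2\t0.004060371\n"
    "3\t0.005066476\n"
)
ref_name, curve = parse_theoretical_gc(example)
print(ref_name, len(curve))
```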
You can generate these files using an R package called
[fastqcTheoreticalGC](https://github.com/mikelove/fastqcTheoreticalGC)
written by [Mike Love](https://github.com/mikelove).
Please see the [package readme](https://github.com/mikelove/fastqcTheoreticalGC)
for more details.
Result files from this package are searched for with the following search pattern
(can be customised as described above):
```yaml
sp:
  fastqc/theoretical_gc:
    fn: "*fastqc_theoretical_gc*"
```
If you want to always use a specific custom file for MultiQC reports without having to
add it to the analysis directory, add the full file path to the same MultiQC config
variable described above:
```yaml
fastqc_config:
  fastqc_theoretical_gc: "/path/to/your/custom_fastqc_theoretical_gc.txt"
```
### Overrepresented sequences
The overrepresented sequences table shows the most common sequences found,
ranked by the number of samples in which they are reported as overrepresented. By default, the
table shows the top 20 sequences. This can be customised in the config:
```yaml
fastqc_config:
  top_overrepresented_sequences: 50
```
You can also choose to rank the top sequences by the total number of reads
rather than by number of samples:
```yaml
fastqc_config:
  top_overrepresented_sequences_by: "total"
```
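
The difference between the two rankings can be illustrated with a short sketch. This is not the module's implementation: the function `rank_overrepresented`, its `by` and `top_n` parameters, and the sample data are all invented for the example.

```python
from collections import defaultdict
from typing import Dict, List


def rank_overrepresented(
    per_sample: Dict[str, Dict[str, int]], by: str = "samples", top_n: int = 20
) -> List[str]:
    """Rank sequences by how many samples report them, or by total read count."""
    n_samples = defaultdict(int)
    n_reads = defaultdict(int)
    for sequences in per_sample.values():
        for seq, count in sequences.items():
            n_samples[seq] += 1
            n_reads[seq] += count
    score = n_reads if by == "total" else n_samples
    # Highest score first; ties broken alphabetically for stable output
    ranked = sorted(score, key=lambda seq: (-score[seq], seq))
    return ranked[:top_n]


# Invented data: sample name -> {sequence: read count}
per_sample = {
    "s1": {"AAAA": 10, "CCCC": 500},
    "s2": {"AAAA": 20},
    "s3": {"AAAA": 5},
}
by_samples = rank_overrepresented(per_sample, by="samples")
by_total = rank_overrepresented(per_sample, by="total")
print(by_samples, by_total)
```

In this toy data, `AAAA` appears in three samples but with few reads, while `CCCC` appears in one sample with many reads, so the two rankings put them in opposite order.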
### Changing the order of sections
You can customise the order in which the different module sections appear
in the report.
See [the docs](https://multiqc.info/docs/#order-of-module-and-module-subsection-output) for more information.
For example, to show the _Status Checks_ section at the top, use the following config:
```yaml
report_section_order:
  fastqc_status_checks:
    order: -1000
```