File: preseq.md

package info (click to toggle)
multiqc 1.9%2Bdfsg-3
links: PTS, VCS
area: main
in suites: bullseye
size: 27,568 kB
sloc: python: 28,657; javascript: 3,233; sh: 74; makefile: 24
file content (69 lines) | stat: -rw-r--r-- 2,830 bytes
---
Name: Preseq
URL: http://smithlabresearch.org/software/preseq/
Description: >
    Preseq estimates the complexity of a library, showing how many additional
    unique reads are sequenced for increasing total read count.
---

The Preseq module parses results generated by
[Preseq](http://smithlabresearch.org/software/preseq/),
a tool that estimates the complexity of a library, showing how many additional
unique reads are sequenced for increasing total read count.

When `preseq lc_extrap` is run with the default parameters, the extrapolation points 
reach 10 billion molecules making the plot difficult to interpret in most scenarios. 
It also includes a lot of data in the reports, which can unnecessarily inflate report 
file sizes. To avoid this, MultiQC trims back the x axis until each dataset 
shows 80% of its maximum y-value (unique molecules).

To disable this feature and show all of the data, add the following to your
[MultiQC configuration](http://multiqc.info/docs/#configuring-multiqc):
```yaml
preseq:
    notrim: true
```

#### Using coverage instead of read counts

Preseq reports its numbers as "Molecule counts". This isn't always very intuitive, 
and it's often easier to talk about sequencing depth in terms of coverage.
You can plot the estimated coverage instead by specifying the reference genome or target size, 
and the read length in your [MultiQC configuration](http://multiqc.info/docs/#configuring-multiqc):
```yaml
preseq:
    genome_size: 3049315783
    read_length: 300
```
These parameters make the script take every molecule count and divide it by 
(genome_size / read_length).

MultiQC comes with effective genome size presets for Human and Mouse, so you can 
provide the genome build name instead, like this: `genome_size: hg38_genome`. The
following values are supported: `hg19_genome`, `hg38_genome`, `mm10_genome`.

When the genome and read sizes are provided, MultiQC will plot the molecule counts 
on the X axis ("total" data) and coverages on the Y axis ("unique" data). 
However, you can customize what to plot on each axis (counts or coverage), e.g.:
```yaml
preseq:
    x_axis: counts
    y_axis: coverage
```

#### Plotting externally calculated read counts

To mark on the plot the read counts calculated externally from BAM or fastq files, 
create a file with `preseq_real_counts` in the filename and place it with your analysis files. 
It should be space or tab delimited with 2 or 3 columns (column 1 = preseq file name, 
column 2 = real read count, optional column 3 = real unique read count). For example:
```
Sample_1.preseq.txt 3638261 3638011
Sample_2.preseq.txt 1592394 1592133
[...]
```

You can generate a line for such a file using samtools:
```bash
echo "Sample_1.preseq.txt "$(samtools view -c -F 4 Sample_1.bam)" "$(samtools view -c -F 1028 Sample_1.bam)
```