1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
|
---
Name: Preseq
URL: http://smithlabresearch.org/software/preseq/
Description: >
Preseq estimates the complexity of a library, showing how many additional
unique reads are sequenced for increasing total read count.
---
The Preseq module parses results generated by
[Preseq](http://smithlabresearch.org/software/preseq/),
a tool that estimates the complexity of a library, showing how many additional
unique reads are sequenced for increasing total read count.
When `preseq lc_extrap` is run with the default parameters, the extrapolation points
reach 10 billion molecules making the plot difficult to interpret in most scenarios.
It also includes a lot of data in the reports, which can unnecessarily inflate report
file sizes. To avoid this, MultiQC trims back the x axis until each dataset
shows 80% of its maximum y-value (unique molecules).
To disable this feature and show all of the data, add the following to your
[MultiQC configuration](http://multiqc.info/docs/#configuring-multiqc):
```yaml
preseq:
notrim: true
```
#### Using coverage instead of read counts
Preseq reports its numbers as "Molecule counts". This isn't always very intuitive,
and it's often easier to talk about sequencing depth in terms of coverage.
You can plot the estimated coverage instead by specifying the reference genome or target size,
and the read length in your [MultiQC configuration](http://multiqc.info/docs/#configuring-multiqc):
```yaml
preseq:
genome_size: 3049315783
read_length: 300
```
These parameters make the script take every molecule count and divide it by
(genome_size / read_length).
MultiQC comes with effective genome size presets for Human and Mouse, so you can
provide the genome build name instead, like this: `genome_size: hg38_genome`. The
following values are supported: `hg19_genome`, `hg38_genome`, `mm10_genome`.
When the genome and read sizes are provided, MultiQC will plot the molecule counts
on the X axis ("total" data) and coverages on the Y axis ("unique" data).
However, you can customize what to plot on each axis (counts or coverage), e.g.:
```yaml
preseq:
x_axis: counts
y_axis: coverage
```
#### Plotting externally calculated read counts
To mark on the plot the read counts calculated externally from BAM or fastq files,
create a file with `preseq_real_counts` in the filename and place it with your analysis files.
It should be space or tab delimited with 2 or 3 columns (column 1 = preseq file name,
column 2 = real read count, optional column 3 = real unique read count). For example:
```
Sample_1.preseq.txt 3638261 3638011
Sample_2.preseq.txt 1592394 1592133
[...]
```
You can generate a line for such a file using samtools:
```bash
echo "Sample_1.preseq.txt "$(samtools view -c -F 4 Sample_1.bam)" "$(samtools view -c -F 1028 Sample_1.bam)
```
|