File: cookbook.md

package info (click to toggle)
minimap2 2.17%2Bdfsg-12
  • links: PTS, VCS
  • area: main
  • in suites: bullseye, sid
  • size: 1,204 kB
  • sloc: ansic: 8,653; javascript: 2,301; makefile: 130; python: 91; sh: 42; perl: 29
file content (243 lines) | stat: -rw-r--r-- 10,412 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
## Table of Contents

- [Introduction & Installation](#intro)
- [Mapping Genomic Reads](#map-reads)
  * [Mapping long reads](#map-pb)
  * [Mapping Illumina paired-end reads](#map-sr)
  * [Evaluating mapping accuracy with simulated reads (for developers)](#mapeval)
- [Mapping Long RNA-seq Reads](#map-rna)
  * [Mapping Nanopore 2D cDNA reads](#map-ont-cdna-2d)
  * [Mapping Nanopore direct-RNA reads](#map-direct-rna)
  * [Mapping PacBio Iso-seq reads](#map-iso-seq)
- [Full-Genome Alignment](#genome-aln)
  * [Intra-species assembly alignment](#asm-to-ref)
  * [Cross-species full-genome alignment](#x-species)
  * [Eyeballing alignment](#view-aln)
  * [Calling variants from assembly-to-reference alignment](#asm-var)
  * [Constructing self-homology map](#hom-map)
  * [Lift Over (for developers)](#liftover)
- [Read Overlap](#read-overlap)
  * [Long-read overlap](#long-read-overlap)
  * [Evaluating overlap sensitivity (for developers)](#ov-eval)

## <a name="intro"></a>Introduction & Installation

This cookbook walks you through a variety of applications of minimap2 and its
companion script `paftools.js`. All data here are freely available from the
minimap2 release page at version tag [v2.10][v2.10]. Some examples only work
with v2.10 or later.

To acquire the data used in this cookbook and to install minimap2 and paftools,
please follow the command lines below:
```sh
# install minimap2 executables
curl -L https://github.com/lh3/minimap2/releases/download/v2.17/minimap2-2.17_x64-linux.tar.bz2 | tar jxf -
cp minimap2-2.17_x64-linux/{minimap2,k8,paftools.js} .  # copy executables
export PATH="$PATH:"`pwd`                               # put the current directory on PATH
# download example datasets
curl -L https://github.com/lh3/minimap2/releases/download/v2.10/cookbook-data.tgz | tar zxf -
```

## <a name="map-reads"></a>Mapping Genomic Reads

### <a name="map-pb"></a>Mapping long reads
```sh
minimap2 -ax map-pb -t4 ecoli_ref.fa ecoli_p6_25x_canu.fa > mapped.sam
```
Alternatively, you can create a minimap2 index first and then map:
```sh
minimap2 -x map-pb -d ecoli-pb.mmi ecoli_ref.fa                      # create an index
minimap2 -ax map-pb ecoli-pb.mmi ecoli_p6_25x_canu.fa > mapped.sam
```
This will save you a couple of minutes when you map against the human genome.
**HOWEVER**, key algorithm parameters such as the k-mer length and window
size can't be changed after indexing. Minimap2 will give you a warning if
parameters used in a pre-built index doesn't match parameters on the command
line. **Please always make sure you are using an intended pre-built index.**

### <a name="map-sr"></a>Mapping Illumina paired-end reads:
```sh
minimap2 -ax sr -t4 ecoli_ref.fa ecoli_mason_1.fq ecoli_mason_2.fq > mapped-sr.sam
```

### <a name="mapeval"></a>Evaluating mapping accuracy with simulated reads (for developers)
```sh
minimap2 -ax sr ecoli_ref.fa ecoli_mason_1.fq ecoli_mason_2.fq | paftools.js mapeval -
```
The output is:
```
Q       60      19712   0       0.000000000     19712
Q       0       282     219     0.010953286     19994
U       6
```
where a `U`-line gives the number of unmapped reads (for SAM input only); a
`Q`-line gives:

1. Mapping quality (mapQ) threshold
2. Number of mapped reads between this threshold and the previous mapQ threshold.
3. Number of wrong mappings in the same mapQ interval
4. Accumulative mapping error rate
5. Accumulative number of mappings

For `paftools.js mapeval` to work, you need to encode the true read positions
in read names in the right format. For [PBSIM][pbsim] and [mason2][mason2], we
provide scripts to generate the right format. Simulated reads in this cookbook
were created with the following command lines:
```sh
# in PBSIM source code directory:
src/pbsim ../ecoli_ref.fa --depth 1 --sample-fastq sample/sample.fastq
paftools.js pbsim2fq ../ecoli_ref.fa.fai sd_0001.maf > ../ecoli_pbsim.fa

# mason2 simulation
mason_simulator --illumina-prob-mismatch-scale 2.5 -ir ecoli_ref.fa -n 10000 -o tmp-l.fq -or tmp-r.fq -oa tmp.sam
paftools.js mason2fq tmp.sam | seqtk seq -1 > ecoli_mason_1.fq
paftools.js mason2fq tmp.sam | seqtk seq -2 > ecoli_mason_2.fq
```



## <a name="map-rna"></a>Mapping Long RNA-seq Reads

### <a name="map-ont-cdna-2d"></a>Mapping Nanopore 2D cDNA reads
```sh
minimap2 -ax splice SIRV_E2.fa SIRV_ont-cdna.fa > aln.sam
```
You can compare the alignment to the true annotations with:
```sh
paftools.js junceval SIRV_E2C.gtf aln.sam
```
It gives the percentage of introns found in the annotation. For SIRV data, it
is possible to achieve higher junction accuracy with
```sh
minimap2 -ax splice --splice-flank=no SIRV_E2.fa SIRV_ont-cdna.fa | paftools.js junceval SIRV_E2C.gtf
```
This is because minimap2 models one additional evolutionarily conserved base
around a canonical junction, but SIRV doesn't honor this signal. Option
`--splice-flank=no` asks minimap2 no to model this additional base.

In the output a tag `ts:A:+` indicates that the read strand is the same as the
transcript strand; `ts:A:-` indicates the read strand is opposite to the
transcript strand. This tag is inferred from the GT-AG signal and is thus only
available to spliced reads.

### <a name="map-direct-rna"></a>Mapping Nanopore direct-RNA reads
```sh
minimap2 -ax splice -k14 -uf SIRV_E2.fa SIRV_ont-drna.fa > aln.sam
```
Direct-RNA reads are noisier, so we use a shorter k-mer for improved
sensitivity. Here, option `-uf` forces minimap2 to map reads to the forward
transcript strand only because direct-RNA reads are stranded. Again, applying
`--splice-flank=no` helps junction accuracy for SIRV data.

### <a name="map-iso-seq"></a>Mapping PacBio Iso-seq reads
```sh
minimap2 -ax splice -uf -C5 SIRV_E2.fa SIRV_iso-seq.fq > aln.sam
```
Option `-C5` reduces the penalty on non-canonical splicing sites. It helps
to align such sites correctly for data with low error rate such as Iso-seq
reads and traditional cDNAs. On this example, minimap2 makes one junction
error. Applying `--splice-flank=no` fixes this alignment error.

Note that the command line above is optimized for the final Iso-seq reads.
PacBio's Iso-seq pipeline produces intermediate sequences at varying quality.
For example, some intermediate reads are not stranded. For these reads, option
`-uf` will lead to more errors. Please revise the minimap2 command line
accordingly.



## <a name="genome-aln"></a>Full-Genome Alignment

### <a name="asm-to-ref"></a>Intra-species assembly alignment
```sh
# option "--cs" is recommended as paftools.js may need it
minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa > ecoli_canu.paf
```
Here `ecoli_canu.fa` is the Canu assembly of `ecoli_p6_25x_canu.fa`. This
command line outputs alignments in the [PAF format][paf]. Use `-a` instead of
`-c` to get output in the SAM format.

### <a name="x-species"></a>Cross-species full-genome alignment
```sh
minimap2 -cx asm20 --cs ecoli_ref.fa ecoli_O104:H4.fa > ecoli_O104:H4.paf
sort -k6,6 -k8,8n ecoli_O104:H4.paf | paftools.js call -f ecoli_ref.fa -L10000 -l1000 - > out.vcf
```
Minimap2 has three presets for full-genome alignment: "asm5" for sequence
divergence below 1%, "asm10" for divergence around a couple of percent and
"asm20" for divergence not more than 10%. In theory, with the right setting,
minimap2 should work for sequence pairs with sequence divergence up to ~15%,
but this has not been carefully evaluated.

### <a name="view-aln"></a>Eyeballing alignment
```sh
# option "--cs" required; minimap2-r741 or higher required for the "asm20" preset
minimap2 -cx asm20 --cs ecoli_ref.fa ecoli_O104:H4.fa | paftools.js view - | less -S
```
This prints the alignment in a BLAST-like format.

### <a name="asm-var"></a>Calling variants from assembly-to-reference alignment
```sh
# don't forget the "--cs" option; otherwise it doesn't work
minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa \
  | sort -k6,6 -k8,8n \
  | paftools.js call -f ecoli_ref.fa - > out.vcf
```
Without option `-f`, `paftools.js call` outputs in a custom format. In this
format, lines starting with `R` give the regions covered by one contig only.
This information is not available in the VCF output.

### <a name="hom-map"></a>Constructing self-homology map
```sh
minimap2 -DP -k19 -w19 -m200 ecoli_ref.fa ecoli_ref.fa > out.paf
```
Option `-D` asks minimap2 to ignore anchors from perfect self match and `-P`
outputs all chains. For large nomes, we don't recommend to perform base-level
alignment (with `-c`, `-a` or `--cs`) when `-P` is applied. This is because
base-alignment is slow and occasionally gives wrong alignments close to the
diagonal of a dotter plot. For E. coli, though, base-alignment is still fast.

### <a name="liftover"></a>Lift over (for developers)
```sh
minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa > ecoli_canu.paf
echo -e 'tig00000001\t200000\t300000' | paftools.js liftover ecoli_canu.paf -
```
This lifts over a region on query sequences to one or multiple regions on
reference sequences. Note that this paftools.js command may not be efficient
enough to lift millions of regions.



## <a name="read-overlap"></a>Read Overlap

### <a name="long-read-overlap"></a>Long read overlap
```sh
# For pacbio reads:
minimap2 -x ava-pb ecoli_p6_25x_canu.fa ecoli_p6_25x_canu.fa > overlap.paf
# For Nanopore reads (ava-ont also works with PacBio but not as good):
minimap2 -x ava-ont -r 10000 ecoli_p6_25x_canu.fa ecoli_p6_25x_canu.fa > overlap.paf
# If you have miniasm installed:
miniasm -f ecoli_p6_25x_canu.fa overlap.paf > asm.gfa
```
Here we explicitly applied `-r 10000`. We are considering to set this as the
default for the `ava-ont` mode as this seems to improve the contiguity for
nanopore read assembly (Loman, personal communication).

*Minimap2 doesn't work well with short-read overlap.*

### <a name="ov-eval"></a>Evaluating overlap sensitivity (for developers)

```sh
# read to reference mapping
minimap2 -cx map-pb ecoli_ref.fa ecoli_p6_25x_canu.fa > to-ref.paf
# evaluate overlap sensitivity
sort -k6,6 -k8,8n to-ref.paf | paftools.js ov-eval - overlap.paf
```
You can see that for PacBio reads, minimap2 achieves higher overlap sensitivity
with `-x ava-pb` (99% vs 93% with `-x ava-ont`).



[pbsim]: https://github.com/pfaucon/PBSIM-PacBio-Simulator
[mason2]: https://github.com/seqan/seqan/tree/master/apps/mason2
[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
[v2.10]: https://github.com/lh3/minimap2/releases/tag/v2.10