File: vcfwave.1

package info (click to toggle)
libvcflib 1.0.12%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 70,520 kB
  • sloc: cpp: 39,837; python: 532; perl: 474; ansic: 317; ruby: 295; sh: 254; lisp: 148; makefile: 123; javascript: 94
file content (214 lines) | stat: -rw-r--r-- 10,562 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
.\" Automatically generated by Pandoc 2.19.2
.\"
.\" Define V font for inline verbatim, using C font in formats
.\" that render this, and otherwise B font.
.ie "\f[CB]x\f[]"x" \{\
. ftr V B
. ftr VI BI
. ftr VB B
. ftr VBI BI
.\}
.el \{\
. ftr V CR
. ftr VI CI
. ftr VB CB
. ftr VBI CBI
.\}
.TH "VCFWAVE" "1" "" "vcfwave (vcflib)" "vcfwave (VCF transformation)"
.hy
.SH NAME
.PP
vcfwave - reduces complex alleles by pairwise alignment with BiWFA
.SH SYNOPSIS
.PP
\f[B]vcfwave\f[R]
.SH DESCRIPTION
.PP
\f[B]vcfwave\f[R] reduces complex alleles into simpler primitive
representation using pairwise alignment with BiWFA.
.PP
Often variant callers are not perfect.
\f[B]vcfwave\f[R] with its companion tool \f[B]vcfcreatemulti\f[R] can
take an existing VCF file that contains multiple complex overlapping and
even nested alleles and, like Humpty Dumpty, can take them apart and put
them together again in a more sane VCF output.
Thereby getting rid of false positives and often greatly simplifying the
output.
We created these tools for the output from long-read pangenome
genotypers - with 10K base pair realignments - and is used in the Human
Pangenome Reference Consortium analyses (HPRC).
.PP
\f[B]vcfwave\f[R] realigns reference and alternate alleles with the
recently introduced super fast bi-wavefront aligner (WFA).
\f[B]vcfwave\f[R] parses out the original \[ga]primitive\[cq] alleles
into multiple VCF records and \f[B]vcfcreatemulti\f[R] puts them
together again.
These tools can handle insertions, deletions, inversions and nested
sequences.
In both tools information is tracked on original positions and genotypes
are handled.
New records have IDs that reference the source record ID.
Deletion alleles will result in haploid (missing allele) genotypes
overlapping the deleted region.
.PP
A typical workflow will call \f[B]vcfwave\f[R] to realign all ALT
alleles against the reference and spit out simplified VCF records.
Next use a tool such as \f[V]bcftools norm -m-\f[R] to normalise the VCF
records and split out multiple ALT alleles into separate VCF records.
Finally use \f[B]vcfcreatemulti\f[R] to create multi-allele VCF records
again.
.PP
PERFORMANCE:
.PP
Unlike traditional aligners that run in quadratic time, the recently
introduced wavefront aligner WFA runs in time O(ns+s\[ha]2),
proportional to the sequence length n and the alignment score s, using
O(s\[ha]2) memory (or O(s) using the ultralow/BiWFA mode).
Therefore WFA does not choke on longer alignments.
.PP
Speed-wise vcfwave can still be faster.
See also the performance docs for some metrics and discussion.
.PP
READING:
.PP
See also the \f[I]humpty dumpty\f[R] companion tool vcfcreatemulti.
.SS Options
.TP
-h, \[en]help
shows help message and exits.
.PP
See more below.
.SH EXIT VALUES
.TP
\f[B]0\f[R]
Success
.TP
\f[B]not 0\f[R]
Failure
.SH EXAMPLES
.PP
Current command line options:
.IP
.nf
\f[C]

>>> head(\[dq]vcfwave -h\[dq],26)
>
usage: vcfwave [options] [file]
>
Realign reference and alternate alleles with WFA, parsing out the
\[aq]primitive\[aq] alleles into multiple VCF records. New records have IDs that
reference the source record ID.  Genotypes/samples are handled
correctly. Deletions generate haploid/missing genotypes at overlapping
sites.
>
options:
    -p, --wf-params PARAMS  use the given BiWFA params (default: 0,19,39,3,81,1)
                            format=match,mismatch,gap1-open,gap1-ext,gap2-open,gap2-ext
    -f, --tag-parsed FLAG   Annotate decomposed records with the source record position
                            (default: ORIGIN).
    -L, --max-length LEN    Do not manipulate records in which either the ALT or
                            REF is longer than LEN (default: unlimited).
    -K, --inv-kmer K        Length of k-mer to use for inversion detection sketching (default: 17).
    -I, --inv-min LEN       Minimum allele length to consider for inverted alignment (default: 64).
    -t, --threads N         Use this many threads for variant decomposition (default is 1).
                            For most datasets threading may actually slow vcfwave down.
    --quiet                 Do not display progress bar.
    -d, --debug             Debug mode.
>
Note the -k,--keep-info switch is no longer in use and ignored.
>
Type: transformation

\f[R]
.fi
.PP
vcfwave picks complex regions and simplifies nested alignments.
For example:
.IP
.nf
\f[C]

>>> sh(\[dq]grep 10158243 ../samples/10158243.vcf\[dq])
grch38#chr4     10158243        >3655>3662      ACCCCCACCCCCACC ACC,AC,ACCCCCACCCCCAC,ACCCCCACC,ACA     60      .       AC=64,3,2,3,1;AF=0.719101,0.0337079,0.0224719,0.0337079,0.011236;AN=89;AT=>3655>3656>3657>3658>3659>3660>3662,>3655>3656>3660>3662,>3655>3660>3662,>3655>3656>3657>3658>3660>3662,>3655>3656>3657>3660>3662,>3655>3656>3661>3662;NS=45;LV=0     GT      0|0     1|1     1|1     1|0     5|1     0|4     0|1     0|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     4|3     1|1     1|1     1|1     1|0     1|0     1|0     1|0     1|1     1|1     1|4     1|1     1|1     3|0     1|0     1|1     0|1     1|1     1|1     2|1     1|2     1|1     1|1     0|1     1|1     1|1     1|0     1|2     1|1     0
\f[R]
.fi
.PP
This aligns and adjusts the genotypes accordingly splitting into
multiple records, one for each unique allele found in the alignments:
.IP
.nf
\f[C]

>>> sh(\[dq]../build/vcfwave -L 1000 ../samples/10158243.vcf|grep -v \[ha]\[rs]#\[dq])
grch38#chr4     10158244        >3655>3662_1    CCCCCACCCCCAC   C       60      .       AC=1;AF=0.011236;AN=89;AT=>3655>3656>3657>3660>3662;NS=45;LV=0;ORIGIN=grch38#chr4:10158243;LEN=12;TYPE=del        GT      0|0     0|0     0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0
grch38#chr4     10158244        >3655>3662_2    CCCCCACCCCCACC  C       60      .       AC=3;AF=0.033708;AN=89;AT=>3655>3656>3660>3662;NS=45;LV=0;ORIGIN=grch38#chr4:10158243;LEN=13;TYPE=del     GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     1|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0
grch38#chr4     10158245        >3655>3662_3    CCCCACCCCCACC   C       60      .       AC=64;AF=0.719101;AN=89;AT=>3655>3656>3657>3658>3659>3660>3662;NS=45;LV=0;ORIGIN=grch38#chr4:10158243;LEN=12;TYPE=del     GT      0|0     1|1     1|1     1|0     0|1     0|0     0|1     0|1     1|1     1|1     1|1     1|1     1|1     1|1     1|1     0|0     1|1     1|1     1|1     1|0     1|0     1|0     1|0     1|1     1|1     1|0     1|1     1|1     0|0     1|0     1|1     0|1     1|1     1|1     0|1     1|0     1|1     1|1     0|1     1|1     1|1     1|0     1|0     1|1     0
grch38#chr4     10158251        >3655>3662_4    CCCCACC C       60      .       AC=3;AF=0.033708;AN=89;AT=>3655>3656>3657>3658>3660>3662;NS=45;LV=0;ORIGIN=grch38#chr4:10158243;LEN=6;TYPE=del    GT      0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0
grch38#chr4     10158256        >3655>3662_5    CC      C       60      .       AC=2;AF=0.022472;AN=89;AT=>3655>3660>3662;NS=45;LV=0;ORIGIN=grch38#chr4:10158243;LEN=1;TYPE=del   GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|1     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     1|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0
grch38#chr4     10158257        >3655>3662_6    C       A       60      .       AC=1;AF=0.011236;AN=89;AT=>3655>3656>3657>3660>3662;NS=45;LV=0;ORIGIN=grch38#chr4:10158243;LEN=1;TYPE=snp GT      0|0     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     .|.     0

\f[R]
.fi
.SH ./vcfwave -L 10000 ../samples/grch38#chr8_36353854-36453166.vcf > ../test/data/regression/vcfwave_4.vcf
.RS
.RS
.RS
.PP
run_stdout(\[lq]vcfwave -L 10000
\&../samples/grch38#chr8_36353854-36453166.vcf\[rq], ext=\[lq]vcf\[rq])
output in vcfwave_4.vcf
.RE
.RE
.RE
.SH ./vcfwave -L 10000 ../samples/grch38#chr4_10083863-10181258.vcf > ../test/data/regression/vcfwave_5.vcf
.RS
.RS
.RS
.PP
run_stdout(\[lq]vcfwave -L 10000
\&../samples/grch38#chr4_10083863-10181258.vcf\[rq], ext=\[lq]vcf\[rq])
output in vcfwave_5.vcf
.RE
.RE
.RE
.IP
.nf
\f[C]

## Inversions

We can also handle inversions.
This test case includes one that was introduced by building a variation graph with an inversion and then decomposing it into a VCF with \[ga]vg deconstruct\[ga] and finally \[dq]popping\[dq] the inversion variant with [\[ga]vcfbub\[ga]](https://github.com/pangenome/vcfbub).

From
\f[R]
.fi
.PP
a 281 >1>9
AGCCGGGGCAGAAAGTTCTTCCTTGAATGTGGTCATCTGCATTTCAGCTCAGGAATCCTGCAAAAGACAG
CTGTCTTTTGCAGGATTCCTGTGCTGAAATGCAGATGACCGCATTCAAGGAAGAACTATCTGCCCCGGCT
60 .
AC=1;AF=1;AN=1;AT=>1>2>3>4>5>6>7>8>9,>1<8>10<6>11<4>12<2>9;NS=1;LV=0 GT
1
.IP
.nf
\f[C]

To

\[ga]\[ga]\[ga]python
>>> sh(\[dq]../build/vcfwave ../samples/inversion.vcf|grep INV\[dq])
##INFO=<ID=INV,Number=0,Type=Flag,Description=\[dq]Inversion detected\[dq]>
a       281     >1>9    AGCCGGGGCAGAAAGTTCTTCCTTGAATGTGGTCATCTGCATTTCAGCTCAGGAATCCTGCAAAAGACAG  CTGTCTTTTGCAGGATTCCTGTGCTGAAATGCAGATGACCGCATTCAAGGAAGAACTATCTGCCCCGGCT  60      .       AC=1;AF=1.000000;AN=1;AT=>1>2>3>4>5>6>7>8>9;NS=1;LV=0;LEN=70;INV=YES;TYPE=mnp  GT      1
\f[R]
.fi
.PP
Note the \[ga]INV=YES\[cq] info.
.SH LICENSE
.PP
Copyright 2022-2024 (C) Erik Garrison, Pjotr Prins and vcflib
contributors.
MIT licensed.
.SH AUTHORS
Erik Garrison, Pjotr Prins and other vcflib contributors.