.TH VCFTOOLS "1" "July 2011" "vcftools 0.1.5" "User Commands"
.SH NAME
vcftools \- analyse VCF files
.SH SYNOPSIS
.B vcftools \fR[\fIOPTIONS\fR] 
.SH DESCRIPTION
The vcftools program is run from the command line. The interface is 
inspired by PLINK, and so should be largely familiar to users of that 
package. Commands take the following form:

  vcftools \-\-vcf file1.vcf \-\-chr 20 \-\-freq

The above command tells vcftools to read in the file file1.vcf, extract 
sites on chromosome 20, and calculate the allele frequency at each site. 
The resulting allele frequency estimates are stored in the output file, 
out.freq. As in the above example, output from vcftools is mainly sent to 
output files, as opposed to being shown on the screen.

Note that some commands may only be available in the latest version of 
vcftools. To obtain the latest version, use SVN to check out the 
latest code, as described on the home page.

Also note that polyploid genotypes are not currently supported.

.SS Basic Options
.TP
\fB\-\-vcf\fR <filename>
This option defines the VCF file to be processed. The file must be 
decompressed prior to use with vcftools. vcftools expects files in VCF 
format v4.0, as described in the VCF specification.
.TP
\fB\-\-gzvcf\fR <filename>
This option can be used in place of the \-\-vcf option to read compressed 
(gzipped) VCF files directly. Note that this option can be quite slow when 
used with large files.
.TP
\fB\-\-out\fR <prefix>
This option defines the output filename prefix for all files generated by 
vcftools. For example, if <prefix> is set to output_filename, then all 
output files will be of the form output_filename.*** . If this option is 
omitted, all output files will have the prefix 'out.'.

.SS Site Filter Options

.TP
\fB\-\-chr\fR <chromosome>
Only process sites with a chromosome identifier matching <chromosome>.
.TP
\fB\-\-from\-bp\fR <integer>
.TP
\fB\-\-to\-bp\fR <integer>
These options define the physical range of sites that will be processed. 
Sites outside of this range will be excluded. These options can only be 
used in conjunction with \-\-chr.
.TP
\fB\-\-snp\fR <string>
Include SNP(s) with matching ID. This command can be used multiple times 
in order to include more than one SNP.
.TP
\fB\-\-snps\fR <filename>
Include a list of SNPs given in a file. The file should contain a list of 
SNP IDs, with one ID per line.
.TP
\fB\-\-exclude\fR <filename>
Exclude a list of SNPs given in a file. The file should contain a list of 
SNP IDs, with one ID per line.
.TP
\fB\-\-positions\fR <filename>
Include a set of sites on the basis of a list of positions. Each line of 
the input file should contain a (tab-separated) chromosome and position. 
The file should have a header line. Sites not included in the list are 
excluded.
.TP
\fB\-\-bed\fR <filename>
.TP
\fB\-\-exclude\-bed\fR <filename>
Include or exclude a set of sites on the basis of a BED file. Only the 
first three columns (chrom, chromStart and chromEnd) are required. The 
BED file should have a header line.
.TP
\fB\-\-remove\-filtered\-all\fR
.TP
\fB\-\-remove\-filtered\fR <string>
.TP
\fB\-\-keep\-filtered\fR <string>
These options are used to filter sites on the basis of their FILTER flag. 
The first option removes all sites with a FILTER flag. The second option 
can be used to exclude sites with a specific filter flag. The third option 
can be used to select sites on the basis of specific filter flags. 
The second and third options can be used multiple times to specify multiple 
FILTERs. The \-\-keep\-filtered option is applied before 
the \-\-remove\-filtered 
option.
.TP
\fB\-\-minQ\fR <float>
Include only sites with Quality above this threshold.
.TP
\fB\-\-min\-meanDP\fR <float>
.TP
\fB\-\-max\-meanDP\fR <float>
Include sites with mean Depth within the thresholds defined by these options.
.TP
\fB\-\-maf\fR <float>
.TP
\fB\-\-max\-maf\fR <float>
Include only sites with Minor Allele Frequency within the specified range.
.TP
\fB\-\-non\-ref\-af\fR <float>
.TP
\fB\-\-max\-non\-ref\-af\fR <float>
Include only sites with Non-Reference Allele Frequency within the specified 
range.
.TP
\fB\-\-hwe\fR <float>
Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as 
defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value 
below the threshold defined by this option are taken to be out of HWE, 
and therefore excluded.
.TP
\fB\-\-geno\fR <float>
Exclude sites on the basis of the proportion of missing data (defined to 
be between 0 and 1).
.TP
\fB\-\-min\-alleles\fR <int>
.TP
\fB\-\-max\-alleles\fR <int>
Include only sites with a number of alleles within the specified range. 
For example, to include only bi\-allelic sites, one could use:

      vcftools \-\-vcf file1.vcf \-\-min\-alleles 2 \-\-max\-alleles 2

.TP
\fB\-\-mask\fR <filename>
.TP
\fB\-\-invert\-mask\fR <filename>
.TP
\fB\-\-mask\-min\fR <int>
Include sites on the basis of a FASTA-like file. The provided file contains 
a sequence of integer digits (between 0 and 9) for each position on a 
chromosome that specify if a site at that position should be filtered or not. 
An example mask file would look like:

      >1
      0000011111222...

In this example, sites in the VCF file located within the first 5 bases of 
the start of chromosome 1 would be kept, whereas sites at position 6 onwards 
would be filtered out. The threshold integer that determines if sites are 
filtered or not is set using the \-\-mask\-min option, which defaults to 0. 
The chromosomes contained in the mask file must be sorted in the same order 
as the VCF file. The \-\-mask option is used to specify the mask file to be 
used, whereas the \-\-invert\-mask option can be used to specify a mask file 
that will be inverted before being applied.
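
As a rough illustration of the thresholding rule described above (this is 
not the vcftools source code), the following Python sketch keeps a site when 
the mask digit at its position does not exceed the threshold. The function 
name site_passes_mask and the handling of positions beyond the end of the 
mask are assumptions made for this example.

.nf
```python
def site_passes_mask(mask_seq, pos, mask_min=0):
    """Return True if the 1-based position pos is kept by the mask.

    mask_seq is the digit string for one chromosome (the text after the
    FASTA-like header line). A site is kept when its digit is not greater
    than mask_min, which matches the example in the text.
    Assumption: positions past the end of the mask are filtered out.
    """
    if pos < 1 or pos > len(mask_seq):
        return False
    return int(mask_seq[pos - 1]) <= mask_min


mask = "0000011111222"
kept = [p for p in range(1, len(mask) + 1) if site_passes_mask(mask, p)]
print(kept)
```
.fi

With the default threshold only the first five positions pass. Raising 
mask_min to 1 would also keep the positions marked with the digit 1.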

.SS Individual Filters

.TP
\fB\-\-indv\fR <string>
Specify an individual to be kept in the analysis. This option can be used 
multiple times to specify multiple individuals.
.TP
\fB\-\-keep\fR <filename>
Provide a file containing a list of individuals to include in subsequent 
analysis. Each individual ID (as defined in the VCF header line) should be 
included on a separate line.
.TP
\fB\-\-remove\-indv\fR <string>
Specify an individual to be removed from the analysis. This option can be 
used multiple times to specify multiple individuals. If the \-\-indv option 
is also specified, then the \-\-indv option is executed before 
the \-\-remove\-indv option.
.TP
\fB\-\-remove\fR <filename>
Provide a file containing a list of individuals to exclude in subsequent 
analysis. Each individual ID (as defined in the VCF header line) should be 
included on a separate line. If both the \-\-keep and the \-\-remove options 
are used, then the \-\-keep option is executed before the \-\-remove option.
.TP
\fB\-\-min\-indv\-meanDP\fR <float>
.TP
\fB\-\-max\-indv\-meanDP\fR <float>
Calculate the mean coverage on a per-individual basis. Only individuals with 
coverage within the range specified by these options are included in 
subsequent analyses.
.TP
\fB\-\-mind\fR <float>
Specify the minimum call rate threshold for each individual.
.TP
\fB\-\-phased\fR
First excludes all individuals having all genotypes unphased, and 
subsequently excludes all sites with unphased genotypes. The remaining data 
therefore consists of phased data only.

.SS Genotype Filters
.TP
\fB\-\-remove\-filtered\-geno\-all\fR
.TP
\fB\-\-remove\-filtered\-geno\fR <string>
The first option removes all genotypes with a FILTER flag. The second option 
can be used to exclude genotypes with a specific filter flag.
.TP
\fB\-\-minGQ\fR <float>
Exclude all genotypes with a quality below the threshold specified by 
this option (GQ).
.TP
\fB\-\-minDP\fR <float>
Exclude all genotypes with a sequencing depth below that specified by 
this option (DP).

.SS Output Statistics
.TP
\fB\-\-freq\fR
.TP
\fB\-\-counts\fR
.TP
\fB\-\-freq2\fR
.TP
\fB\-\-counts2\fR
Output per\-site frequency information. The \-\-freq option outputs the allele 
frequency in a file with the suffix '.frq'. The \-\-counts option outputs a 
similar file with the suffix '.frq.count', that contains the raw allele 
counts at each site.
The \-\-freq2 and \-\-counts2 options are used to suppress allele information in 
the output file. In this case, the order of the freqs/counts depends on the
numbering in the VCF file.
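
To make the relationship between counts and frequencies concrete, here is a 
minimal Python sketch that tallies alleles at one site from VCF GT strings. 
It is an illustration only: the function names are hypothetical, and 
skipping missing alleles is an assumption of the example rather than a 
statement about how vcftools treats them.

.nf
```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles at one site from GT strings such as 0/1 or 1|1.

    Missing alleles (".") are skipped. Keys are allele indices as
    numbered in the VCF record (0 = reference, 1 = first alternate).
    """
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[int(allele)] += 1
    return counts


def allele_freqs(genotypes):
    """Convert allele counts into frequencies over non-missing alleles."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in sorted(counts.items())}


site = ["0/0", "0/1", "1/1", "./.", "0|1"]
print(dict(allele_counts(site)))
print(allele_freqs(site))
```
.fi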
.TP
\fB\-\-depth\fR
Generates a file containing the mean depth per individual. This file has 
the suffix '.idepth'.
.TP
\fB\-\-site\-depth\fR
.TP
\fB\-\-site\-mean\-depth\fR
Generates a file containing the depth per site. The \-\-site\-depth option 
outputs the depth for each site summed across individuals. This file has 
the suffix '.ldepth'. Likewise, the \-\-site\-mean\-depth outputs the mean 
depth for each site, and the output file has the suffix '.ldepth.mean'.
.TP
\fB\-\-geno\-depth\fR
Generates a (possibly very large) file containing the depth for each 
genotype in the VCF file. Missing entries are given the value \-1. The 
file has the suffix '.gdepth'.
.TP
\fB\-\-site\-quality\fR
Generates a file containing the per\-site SNP quality, as found in the QUAL 
column of the VCF file. This file has the suffix '.lqual'.
.TP
\fB\-\-het\fR
Calculates a measure of heterozygosity on a per\-individual basis. 
Specifically, the inbreeding coefficient, F, is estimated for each 
individual using a method of moments. The resulting file has the suffix '.het'.
.TP
\fB\-\-hardy\fR
Reports a p\-value for each site from a Hardy\-Weinberg Equilibrium test 
(as defined by Wigginton, Cutler and Abecasis (2005)). The resulting file 
(with suffix '.hwe') also contains the Observed numbers of Homozygotes and 
Heterozygotes and the corresponding Expected numbers under HWE. 
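
The expected numbers mentioned above follow from the textbook 
Hardy\-Weinberg proportions. The Python sketch below computes them for a 
bi\-allelic site. The function name hwe_expected is hypothetical, and the 
exact\-test p\-value itself is not reproduced here.

.nf
```python
def hwe_expected(obs_hom1, obs_het, obs_hom2):
    """Expected genotype counts under HWE for a bi-allelic site.

    Returns (exp_hom1, exp_het, exp_hom2) given the observed counts.
    """
    n = obs_hom1 + obs_het + obs_hom2
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * obs_hom1 + obs_het) / (2 * n)  # frequency of allele 1
    q = 1.0 - p
    return (n * p * p, 2 * n * p * q, n * q * q)


print(hwe_expected(30, 40, 30))
```
.fi

A site with fewer heterozygotes than expected, as in this example, would 
tend to receive a lower p\-value from the exact test.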
.TP
\fB\-\-missing\fR
Generates two files reporting the missingness on a per\-individual and 
per\-site basis. The two files have suffixes '.imiss' and '.lmiss' 
respectively.
.TP
\fB\-\-hap\-r2\fR
.TP
\fB\-\-geno\-r2\fR
.TP
\fB\-\-ld\-window\fR <int>
.TP
\fB\-\-ld\-window\-bp\fR <int>
.TP
\fB\-\-min\-r2\fR <float>
These options are used to report Linkage Disequilibrium (LD) statistics 
as summarised by the r2 statistic. The \-\-hap\-r2 option informs vcftools 
to output a file reporting the r2 statistic using phased haplotypes. This 
is the traditional measure of LD often reported in the population genetics 
literature. If phased haplotypes are unavailable then the \-\-geno\-r2 option 
may be used, which calculates the squared correlation coefficient between 
genotypes encoded as 0, 1 and 2 to represent the number of non-reference 
alleles in each individual. This is the same as the LD measure reported 
by PLINK. The haplotype version outputs a file with the suffix '.hap.ld', 
whereas the genotype version outputs a file with the suffix '.geno.ld'. 
The haplotype version implies the option \-\-phased.

The \-\-ld\-window option defines the maximum SNP separation for the 
calculation of LD. Likewise, the \-\-ld\-window\-bp option can be used to 
define the maximum physical separation of SNPs included in the LD 
calculation. Finally, the \-\-min\-r2 sets a minimum value for r2 below 
which the LD statistic is not reported.
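
The genotype\-based measure described above can be sketched in a few lines 
of Python. This is an illustration of the squared correlation between 
0/1/2 genotype codes, not the vcftools implementation. The function name 
geno_r2 and the skipping of missing values are assumptions of the example.

.nf
```python
def geno_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors.

    g1 and g2 hold 0, 1 or 2 (non-reference allele counts) for the same
    individuals at two sites. Pairs where either value is None (missing)
    are skipped; this handling is an assumption of the sketch.
    Returns None when either site shows no variation.
    """
    pairs = [(a, b) for a, b in zip(g1, g2)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return None
    mean1 = sum(a for a, _ in pairs) / n
    mean2 = sum(b for _, b in pairs) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in pairs)
    var1 = sum((a - mean1) ** 2 for a, _ in pairs)
    var2 = sum((b - mean2) ** 2 for _, b in pairs)
    if var1 == 0 or var2 == 0:
        return None
    return (cov * cov) / (var1 * var2)


print(geno_r2([0, 1, 2, 0], [0, 1, 2, 0]))
print(geno_r2([0, 0, 2, 2], [0, 2, 0, 2]))
```
.fi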
.TP
\fB\-\-SNPdensity\fR <int>
Calculates the number and density of SNPs in bins of size defined by this 
option. The resulting output file has the suffix '.snpden'.
.TP
\fB\-\-TsTv\fR <int>
Calculates the Transition / Transversion ratio in bins of size defined by 
this option. The resulting output file has the suffix '.TsTv'. A summary 
is also supplied in a file with the suffix '.TsTv.summary'.
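
For readers unfamiliar with the ratio, transitions are the A<\->G and 
C<\->T substitutions and every other single\-base substitution is a 
transversion. The Python sketch below counts both for a list of 
(ref, alt) pairs. The function name is hypothetical and the binning by 
position is left out.

.nf
```python
BASES = set("ACGT")
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Return (transitions, transversions, ratio) for (ref, alt) pairs.

    Only single-base A/C/G/T substitutions are counted; anything else
    (indels, identical alleles) is ignored. The ratio is None when
    there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    ratio = ts / tv if tv else None
    return (ts, tv, ratio)


print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("A", "AT")]))
```
.fi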
.TP
\fB\-\-FILTER\-summary\fR
Generates a summary of the number of SNPs and Ts/Tv ratio for each FILTER 
category. The output file has the suffix '.FILTER.summary'.
.TP
\fB\-\-filtered\-sites\fR
Creates two files listing sites that have been kept or removed after 
filtering. The first file, with suffix '.kept.sites', lists sites kept 
by vcftools after filters have been applied. The second file, with the 
suffix '.removed.sites', lists sites removed by the applied filters.
.TP
\fB\-\-singletons\fR
This option will generate a file detailing the location of singletons, and 
the individual they occur in. The file reports both true singletons, and 
private doubletons (i.e. SNPs where the minor allele only occurs in a 
single individual and that individual is homozygous for that allele). 
The output file has the suffix '.singletons'.
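
The distinction between the two cases can be sketched as follows. This 
Python example is illustrative only: the function name and the labels 
"singleton" and "doubleton" are made up for the example and are not 
taken from the vcftools output format.

.nf
```python
def classify_private_allele(minor_counts):
    """Classify a site from per-individual minor allele counts.

    minor_counts maps individual ID -> copies of the minor allele
    carried (0, 1 or 2). Returns (kind, individual), where kind is
    "singleton" (one copy in one heterozygous individual), "doubleton"
    (two copies in one homozygous individual), or None otherwise.
    """
    carriers = {indv: n for indv, n in minor_counts.items() if n > 0}
    if len(carriers) != 1:
        return (None, None)
    indv, copies = next(iter(carriers.items()))
    if copies == 1:
        return ("singleton", indv)
    if copies == 2:
        return ("doubleton", indv)
    return (None, None)


print(classify_private_allele({"NA1": 0, "NA2": 1, "NA3": 0}))
print(classify_private_allele({"NA1": 2, "NA2": 0, "NA3": 0}))
```
.fi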
.TP
\fB\-\-site\-pi\fR
.TP
\fB\-\-window\-pi\fR <int>
These options are used to estimate levels of nucleotide diversity. The first 
option does this on a per\-site basis, and the output file has the 
suffix '.sites.pi'. The second option calculates the nucleotide diversity in 
windows, with the window size defined in the option argument. Output for 
this option has the suffix '.windowed.pi'. The windowed version requires 
phased data, and hence use of this option implies the \-\-phased option.
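
This page does not spell out the formula used. As a point of reference, 
the standard per\-site definition of nucleotide diversity is the 
proportion of chromosome pairs that differ at the site. The Python sketch 
below implements that definition. The function name site_pi is 
hypothetical, and the result is not guaranteed to match the '.sites.pi' 
output exactly.

.nf
```python
def site_pi(allele_counts):
    """Proportion of chromosome pairs that differ at one site.

    allele_counts lists how many chromosomes carry each allele,
    e.g. [6, 2] for six reference and two alternate copies.
    Returns 0.0 when fewer than two chromosomes are available.
    """
    n = sum(allele_counts)
    if n < 2:
        return 0.0
    total_pairs = n * (n - 1) // 2
    same_pairs = sum(c * (c - 1) // 2 for c in allele_counts)
    return (total_pairs - same_pairs) / total_pairs


print(site_pi([6, 2]))
print(site_pi([4, 0]))
```
.fi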

.SS Output in Other Formats
.TP
\fB\-\-012\fR
This option outputs the genotypes as a large matrix. Three files are 
produced. The first, with suffix '.012', contains the genotypes of each 
individual on a separate line. Genotypes are represented as 0, 1 and 2, 
where the number represents the number of non-reference alleles. Missing 
genotypes are represented by \-1. The second file, with suffix '.012.indv' 
details the individuals included in the main file. The third file, with 
suffix '.012.pos' details the site locations included in the main file.
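
The encoding itself is simple enough to sketch in Python. The function 
name encode_012 is hypothetical, and returning \-1 when any allele of a 
genotype is missing is an assumption of this example.

.nf
```python
def encode_012(gt):
    """Encode a VCF GT string as the number of non-reference alleles.

    Returns -1 if any allele is missing (an assumption of this sketch;
    the text only says that missing genotypes are written as -1).
    """
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return -1
    return sum(1 for a in alleles if a != "0")


row = [encode_012(gt) for gt in ["0/0", "0|1", "1/1", "./."]]
print(row)
```
.fi

Each such row would correspond to one individual, with one value per site.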
.TP
\fB\-\-IMPUTE\fR
This option outputs phased haplotypes in IMPUTE reference\-panel format. As 
IMPUTE requires phased data, using this option also implies \-\-phased. 
Unphased individuals and genotypes are therefore excluded. Only bi\-allelic 
sites are included in the output. Using this option generates three files. 
The IMPUTE haplotype file has the suffix '.impute.hap', and the IMPUTE 
legend file has the suffix '.impute.hap.legend'. The third file, with 
suffix '.impute.hap.indv', details the individuals included in the 
haplotype file, although this file is not needed by IMPUTE.
.TP
\fB\-\-ldhat\fR
.TP
\fB\-\-ldhat\-geno\fR
These options output data in LDhat format. Use of these options also 
requires the \-\-chr option to be used. The \-\-ldhat option outputs phased 
data only, and therefore also implies \-\-phased, leading to unphased 
individuals and genotypes being excluded. Alternatively, the \-\-ldhat\-geno 
option treats all of the data as unphased, and therefore outputs LDhat 
files in genotype/unphased format. In either case, two files are generated 
with the suffixes '.ldhat.sites' and '.ldhat.locs', which correspond to the 
LDhat 'sites' and 'locs' input files respectively.
.TP
\fB\-\-BEAGLE\-GL\fR
This option outputs genotype likelihood information for input into the 
BEAGLE program. This option requires the VCF file to contain the FORMAT 
GL tag, which can generally be output by SNP callers such as the GATK. 
Use of this option requires a chromosome to be specified via the
\-\-chr option. The resulting output file (with the suffix '.BEAGLE.GL') 
contains genotype likelihoods for biallelic sites, and is suitable for 
input into BEAGLE via the 'like=' argument.
.TP
\fB\-\-plink\fR
This option outputs the genotype data in PLINK PED format. Two files are 
generated, with suffixes '.ped' and '.map'. Note that only bi\-allelic loci 
will be output. Further details of these files can be found in the PLINK 
documentation.

Note: This option can be very slow on large datasets. Using the \-\-chr option 
to divide up the dataset is advised.
.TP
\fB\-\-plink\-tped\fR
The \-\-plink option above can be extremely slow on large datasets. An 
alternative that might be considerably quicker is to output in the 
PLINK transposed format. This can be achieved using the \-\-plink\-tped 
option, which produces two files with suffixes '.tped' and '.tfam'.
.TP
\fB\-\-recode\fR
The \-\-recode option is used to generate a VCF file from the input VCF file 
having applied the options specified by the user. The output file has the 
suffix '.recode.vcf'.

By default, the INFO fields are removed from the output file, as the INFO 
values may be invalidated by the recoding (e.g. the total depth may need to 
be recalculated if individuals are removed). This default functionality can 
be overridden by using the \-\-keep\-INFO <string> option, where <string> 
defines the INFO key to keep in the output file. The \-\-keep\-INFO flag can 
be used multiple times. Alternatively, the option \-\-keep\-INFO\-all can be 
used to retain all INFO fields.

.SS Miscellaneous
.TP
\fB\-\-extract\-FORMAT\-info\fR <string>
Extract information from the genotype fields in the VCF file relating to a 
specified FORMAT identifier. For example, using the 
option '\-\-extract\-FORMAT\-info GT' would extract all of the GT 
(i.e. Genotype) entries. The resulting output file has the 
suffix '.<FORMAT_ID>.FORMAT'.
.TP
\fB\-\-get\-INFO\fR <string>
This option is used to extract information from the INFO field in the VCF 
file. The <string> argument specifies the INFO tag to be extracted, and the 
option can be used multiple times in order to extract multiple INFO entries. 
The resulting file, with suffix '.INFO', contains the required INFO 
information in a tab\-separated table. For example, to extract the NS and 
DB flags, one would use the command:

      vcftools \-\-vcf file1.vcf \-\-get\-INFO NS \-\-get\-INFO DB

.SS VCF File Comparison Options

The file comparison options are currently in a state of flux and likely buggy. 
If you find a bug, please report it. Note that genotype\-level filters are not 
supported in these options.

.TP
\fB\-\-diff\fR <filename>
.TP
\fB\-\-gzdiff\fR <filename>
Select a VCF file for comparison with the file specified by the \-\-vcf option. 
Outputs two files describing the sites and individuals common / unique to 
each file. These files have the suffixes '.diff.sites_in_files' 
and '.diff.indv_in_files' respectively. The \-\-gzdiff version can be used to 
read compressed VCF files.
.TP
\fB\-\-diff\-site\-discordance\fR
Used in conjunction with the \-\-diff option to calculate discordance on a 
site by site basis. The resulting output file has the suffix '.diff.sites'.
.TP
\fB\-\-diff\-indv\-discordance\fR
Used in conjunction with the \-\-diff option to calculate discordance on a 
per-individual basis. The resulting output file has the suffix '.diff.indv'.
.TP
\fB\-\-diff\-discordance\-matrix\fR
Used in conjunction with the \-\-diff option to calculate a discordance matrix. 
This option only works with bi\-allelic loci with matching alleles that are 
present in both files. The resulting output file has the 
suffix '.diff.discordance.matrix'.
.TP
\fB\-\-diff\-switch\-error\fR
Used in conjunction with the \-\-diff option to calculate phasing errors 
(specifically 'switch errors'). This option generates two output files 
describing switch errors found between sites, and the average switch error 
per individual. These two files have the suffixes '.diff.switch'
and '.diff.indv.switch' respectively.

.SS Options still in development

The following options are yet to be finalised, are likely to contain bugs, 
and are likely to change in the future.
.TP
\fB\-\-fst\fR <filename>
.TP
\fB\-\-gzfst\fR <filename>
Calculate FST for a pair of VCF files, with the second file being specified 
by this option. FST is currently calculated using the formula described in 
the supplementary material of the Phase I HapMap paper. Currently, only 
pairwise FST calculations are supported, although this will likely change 
in the future. The \-\-gzfst option can be used to read compressed VCF files.

.TP
\fB\-\-LROH\fR
Identify Long Runs of Homozygosity.
.TP
\fB\-\-relatedness\fR
Output Individual Relatedness Statistics.