1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93
|
Modifying Existing VCFs
=======================
`cyvcf2` is optimized for fast reading and extraction from existing files.
However, it also offers some means of modifying existing VCFs.
Modifying the INFO field
------------------------
.. this is the same example as on the index.rst page
Here, we show an example of how to modify the INFO field at a locus to annotate
variants with the genes that they overlap.
.. code-block:: python
from cyvcf2 import VCF, Writer
vcf = VCF(VCF_PATH)
# adjust the header to contain the new field
# the keys 'ID', 'Description', 'Type', and 'Number' are required.
vcf.add_info_to_header({'ID': 'gene', 'Description': 'overlapping gene',
'Type':'Character', 'Number': '1'})
# create a new vcf Writer using the input vcf as a template.
fname = "out.vcf"
w = Writer(fname, vcf)
for v in vcf:
# The get_gene_intersections function is not shown.
# This could be any operation to find intersections
# or any manipulation required by the user.
genes = get_gene_intersections(v)
if genes is not None:
v.INFO["gene"] = ",".join(genes)
w.write_record(v)
w.close(); vcf.close()
Modifying genotypes and the FORMAT field
----------------------------------------
Here is an example where we filter some calls from records in a VCF. This demonstrates
how to add to the FORMAT field and how to modify genotypes at a locus.
.. code-block:: python
from cyvcf2 import VCF, Writer
vcf = VCF(VCF_PATH)
# adjust the header to contain the new field
# the keys 'ID', 'Description', 'Type', and 'Number' are required.
vcf.add_format_to_header({
'ID': 'FILTER_CODE',
'Description': 'Numeric code for filtering reason',
'Type': 'Integer',
'Number': '1'
})
# create a new vcf Writer using the input vcf as a template.
fname = "out.vcf"
w = Writer(fname, vcf)
for v in vcf:
# The filter_samples function is not shown.
# This could be any manipulation required by the user.
# Since we specified the FILTER_CODE format field is an Integer
# with Number=1, reasons must be an n x 1 numpy array of integers
# where n is the number of samples.
indicies, reasons = filter_samples(v)
if indicies:
# add the reasons array to the format dictionary at this locus
v.set_format('FILTER_CODE', reasons)
for index in indicies:
# overwrite the genotypes of each filtered locus to be nocalls
# Note: until the reassignment to v.genotypes below, this
# leaves the v Variant object in an inconsistent state
v.genotypes[index] = [-1]*v.ploidy + [False]
# it is necessary to reassign the genotypes field
# so that the v Variant object reprocess it and future calls
# to functions like v.genotype.array() properly reflect
# the changed genotypes
v.genotypes = v.genotypes
w.write_record(v)
w.close(); vcf.close()
.. _Limitations with writing:
Known Limitations with Writing VCFs
-----------------------------------
* `cyvcf2` currently does not support writing VCFs encoded with UTF-8 with non-ASCII characters in the contents of string-typed FORMAT fields.
* `cyvcf2` currently does not support writing string type format fields with number>1.
|