File: FILTERS.rst

package info (click to toggle)
python-pyvcf 0.6.8%2Bgit20170215.476169c-11
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 1,872 kB
  • sloc: python: 2,921; makefile: 125
file content (158 lines) | stat: -rw-r--r-- 6,807 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
Filtering VCF files
===================

The filter script: vcf_filter.py
--------------------------------

Filtering a VCF file based on some properties of interest is a common enough 
operation that PyVCF offers an extensible script.  ``vcf_filter.py`` does 
the work of reading input, updating the metadata and filtering the records.


Existing Filters
----------------

.. autoclass:: vcf.filters.SiteQuality

.. autoclass:: vcf.filters.VariantGenotypeQuality

.. autoclass:: vcf.filters.ErrorBiasFilter

.. autoclass:: vcf.filters.DepthPerSample

.. autoclass:: vcf.filters.AvgDepthPerSample

.. autoclass:: vcf.filters.SnpOnly




Adding a filter
---------------

You can reuse this work by providing a filter class, rather than writing your own filter.
For example, lets say I want to filter each site based on the quality of the site.
I can create a class like this::
   
    import vcf.filters
    class SiteQuality(vcf.filters.Base):
        'Filter sites by quality'

        name = 'sq'

        @classmethod
        def customize_parser(self, parser):
            parser.add_argument('--site-quality', type=int, default=30,
                    help='Filter sites below this quality')

        def __init__(self, args):
            self.threshold = args.site_quality

        def __call__(self, record):
            if record.QUAL < self.threshold:
                return record.QUAL


This class subclasses ``vcf.filters.Base`` which provides the interface for VCF filters.
The docstring  and ``name`` are metadata about the parser.  The docstring provides
the help for the script, and the first line is included in the FILTER metadata when 
applied to a file.

The ``customize_parser`` method allows you to add arguments to the script.
We use the ``__init__`` method to grab the argument of interest from the parser.
Finally, the ``__call__`` method processes each record and returns a value if the 
filter failed.  The base class uses the ``name`` and ``threshold`` to create
the filter ID in the VCF file.

To make vcf_filter.py aware of the filter, you can either use the local script option
or declare an entry point.  To use a local script, simply call vcf_filter::

    $ vcf_filter.py --local-script my_filters.py ...

To use an entry point, you need to declare a ``vcf.filters`` entry point in your ``setup``::

    setup(
        ...
        entry_points = {
            'vcf.filters': [
                'site_quality = module.path:SiteQuality',
            ]
        }
    )

Either way, when you call vcf_filter.py, you should see your filter in the list of available filters::

    usage: vcf_filter.py [-h] [--no-short-circuit] [--no-filtered] 
                  [--output OUTPUT] [--local-script LOCAL_SCRIPT]
                  input filter [filter_args] [filter [filter_args]] ...
                

    Filter a VCF file

    positional arguments:
      input                 File to process (use - for STDIN) (default: None)

    optional arguments:
      -h, --help            Show this help message and exit. (default: False)
      --no-short-circuit    Do not stop filter processing on a site if any filter
                            is triggered (default: False)
      --output OUTPUT       Filename to output [STDOUT] (default: <open file
                            '<stdout>', mode 'w' at 0x1002841e0>)
      --no-filtered         Output only sites passing the filters (default: False)
      --local-script LOCAL_SCRIPT
                            Python file in current working directory with the
                            filter classes (default: None)

    sq:
      Filter sites by quality

      --site-quality SITE_QUALITY
                            Filter sites below this quality (default: 30)

The filter base class: vcf.filters.Base
---------------------------------------

.. autoclass:: vcf.filters.Base
   :members:



Utilities
=========

.. automodule:: vcf.utils

Simultaneously iterate two or more files
----------------------------------------

.. autofunction:: vcf.utils.walk_together

Trim common suffix
--------------------
.. autofunction:: vcf.utils.trim_common_suffix


vcf_melt
--------

This script converts a VCF file from wide format (many calls per row) 
to a long format (one call per row).  This is useful if you want to grep per sample
or for really quick import into, say, a spreadsheet::

    $ vcf_melt < vcf/test/gatk.vcf 
    SAMPLE	AD	DP	GQ	GT	PL	FILTER	CHROM	POS	REF	ALT	ID	info.AC	info.AF	info.AN	info.BaseQRankSum	info.DB	info.DP	info.DS	info.Dels	info.FS	info.HRun	info.HaplotypeScore	info.InbreedingCoeff	info.MQ	info.MQ0	info.MQRankSum	info.QD	info.ReadPosRankSum
    BLANK	6,0	6	18.04	0/0	0,18,211	.	chr22	42522392	G	[A]	rs28371738	2	0.143	14	0.375	True	1506	True	0.0	0.0	0	123.5516		253.92	0	0.685	5.9	0.59
    NA12878	138,107	250	99.0	0/1	1961,0,3049	.	chr22	42522392	G	[A]	rs28371738	2	0.143	14	0.375	True	1506	True	0.0	0.0	0	123.5516		253.92	0	0.685	5.9	0.59
    NA12891	169,77	250	99.0	0/1	1038,0,3533	.	chr22	42522392	G	[A]	rs28371738	2	0.143	14	0.375	True	1506	True	0.0	0.0	0	123.5516		253.92	0	0.685	5.9	0.59
    NA12892	249,0	250	99.0	0/0	0,600,5732	.	chr22	42522392	G	[A]	rs28371738	2	0.143	14	0.375	True	1506	True	0.0	0.0	0	123.5516		253.92	0	0.685	5.9	0.59
    NA19238	248,1	250	99.0	0/0	0,627,6191	.	chr22	42522392	G	[A]	rs28371738	2	0.143	14	0.375	True	1506	True	0.0	0.0	0	123.5516		253.92	0	0.685	5.9	0.59
    NA19239	250,0	250	99.0	0/0	0,615,5899	.	chr22	42522392	G	[A]	rs28371738	2	0.143	14	0.375	True	1506	True	0.0	0.0	0	123.5516		253.92	0	0.685	5.9	0.59
    NA19240	250,0	250	99.0	0/0	0,579,5674	.	chr22	42522392	G	[A]	rs28371738	2	0.143	14	0.375	True	1506	True	0.0	0.0	0	123.5516		253.92	0	0.685	5.9	0.59
    BLANK	13,4	17	62.64	0/1	63,0,296	.	chr22	42522613	G	[C]	rs1135840	6	0.429	14	16.289	True	1518	True	0.03	0.0	0	142.5716		242.46	0	2.01	9.16	-1.731
    NA12878	118,127	246	99.0	0/1	2396,0,1719	.	chr22	42522613	G	[C]	rs1135840	6	0.429	14	16.289	True	1518	True	0.03	0.0	0	142.5716		242.46	0	2.01	9.16	-1.731
    NA12891	241,0	244	99.0	0/0	0,459,4476	.	chr22	42522613	G	[C]	rs1135840	6	0.429	14	16.289	True	1518	True	0.03	0.0	0	142.5716		242.46	0	2.01	9.16	-1.731
    NA12892	161,85	246	99.0	0/1	1489,0,2353	.	chr22	42522613	G	[C]	rs1135840	6	0.429	14	16.289	True	1518	True	0.03	0.0	0	142.5716		242.46	0	2.01	9.16	-1.731
    NA19238	110,132	242	99.0	0/1	2561,0,1488	.	chr22	42522613	G	[C]	rs1135840	6	0.429	14	16.289	True	1518	True	0.03	0.0	0	142.5716		242.46	0	2.01	9.16	-1.731
    NA19239	106,135	242	99.0	0/1	2613,0,1389	.	chr22	42522613	G	[C]	rs1135840	6	0.429	14	16.289	True	1518	True	0.03	0.0	0	142.5716		242.46	0	2.01	9.16	-1.731
    NA19240	116,126	243	99.0	0/1	2489,0,1537	.	chr22	42522613	G	[C]	rs1135840	6	0.429	14	16.289	True	1518	True	0.03	0.0	0	142.5716		242.46	0	2.01	9.16	-1.731