File: usage.rst

package info (click to toggle)
python-pysam 0.15.4%2Bds-3
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 27,992 kB
  • sloc: ansic: 140,738; python: 7,881; sh: 265; makefile: 223; perl: 41
file content (422 lines) | stat: -rw-r--r-- 14,643 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
.. _Usage: 

=========================================
Working with BAM/CRAM/SAM-formatted files
=========================================

Opening a file
==============

To begin with, import the pysam module and open a
:class:`pysam.AlignmentFile`::

   import pysam
   samfile = pysam.AlignmentFile("ex1.bam", "rb")

The above command opens the file :file:`ex1.bam` for reading.
The ``b`` qualifier indicates that this is a :term:`BAM` file. 
To open a :term:`SAM` file, type::

   import pysam
   samfile = pysam.AlignmentFile("ex1.sam", "r")

:term:`CRAM` files are identified by a ``c`` qualifier::

   import pysam
   samfile = pysam.AlignmentFile("ex1.cram", "rc")

Fetching reads mapped to a :term:`region`
=========================================

Reads are obtained through a call to the
:meth:`pysam.AlignmentFile.fetch` method which returns an iterator.
Each call to the iterator will returns a :class:`pysam.AlignedSegment`
object::

   iter = samfile.fetch("seq1", 10, 20)
   for x in iter:
       print (str(x))

:meth:`pysam.AlignmentFile.fetch` returns all reads overlapping a
region sorted by the first aligned base in the :term:`reference`
sequence.  Note that it will also return reads that are only partially
overlapping with the :term:`region`. Thus the reads returned might
span a region that is larger than the one queried.

Using the pileup-engine
=======================

In contrast to :term:`fetching`, the :term:`pileup` engine returns for
each base in the :term:`reference` sequence the reads that map to that
particular position. In the typical view of reads stacking vertically
on top of the reference sequence similar to a multiple alignment,
:term:`fetching` iterates over the rows of this implied multiple
alignment while a :term:`pileup` iterates over the :term:`columns`.

Calling :meth:`~pysam.AlignmentFile.pileup` will return an iterator
over each :term:`column` (reference base) of a specified
:term:`region`. Each call to the iterator returns an object of the
type :class:`pysam.PileupColumn` that provides access to all the
reads aligned to that particular reference position as well as
some additional information::

   iter = samfile.pileup('seq1', 10, 20)
   for x in iter:
      print (str(x))
 

Creating BAM/CRAM/SAM files from scratch
========================================

The following example shows how a new :term:`BAM` file is constructed
from scratch.  The important part here is that the
:class:`pysam.AlignmentFile` class needs to receive the sequence
identifiers. These can be given either as a dictionary in a header
structure, as lists of names and sizes, or from a template file.
Here, we use a header dictionary::

   header = { 'HD': {'VN': '1.0'},
               'SQ': [{'LN': 1575, 'SN': 'chr1'}, 
                      {'LN': 1584, 'SN': 'chr2'}] }

   with pysam.AlignmentFile(tmpfilename, "wb", header=header) as outf:
       a = pysam.AlignedSegment()
       a.query_name = "read_28833_29006_6945"
       a.query_sequence="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
       a.flag = 99
       a.reference_id = 0
       a.reference_start = 32
       a.mapping_quality = 20
       a.cigar = ((0,10), (2,1), (0,25))
       a.next_reference_id = 0
       a.next_reference_start=199
       a.template_length=167
       a.query_qualities = pysam.qualitystring_to_array("<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<")
       a.tags = (("NM", 1),
		 ("RG", "L1"))
       outf.write(a)

Using streams
=============

Pysam does not support reading and writing from true python file
objects, but it does support reading and writing from stdin and
stdout. The following example reads from stdin and writes to stdout::

   infile = pysam.AlignmentFile("-", "r")
   outfile = pysam.AlignmentFile("-", "w", template=infile)
   for s in infile:
       outfile.write(s)

It will also work with :term:`BAM` files. The following script
converts a :term:`BAM` formatted file on stdin to a :term:`SAM`
formatted file on stdout::

   infile = pysam.AlignmentFile("-", "rb")
   outfile = pysam.AlignmentFile("-", "w", template=infile)
   for s in infile:
       outfile.write(s)

Note that the file open mode needs to changed from ``r`` to ``rb``.

=====================================
Using samtools commands within python
=====================================

Commands available in :term:`csamtools` are available as simple
function calls. Command line options are provided as arguments. For
example::

   pysam.sort("-o", "output.bam", "ex1.bam")

corresponds to the command line::

   samtools sort -o output.bam ex1.bam

Or for example::

   pysam.sort("-m", "1000000", "-o", "output.bam", "ex1.bam")

In order to get usage information, try::

   print(pysam.sort.usage())

Argument errors raise a :class:`pysam.SamtoolsError`::

   pysam.sort()

   Traceback (most recent call last):
   File "x.py", line 12, in <module>
     pysam.sort()
   File "/build/lib.linux-x86_64-2.6/pysam/__init__.py", line 37, in __call__
     if retval: raise SamtoolsError( "\n".join( stderr ) )
   pysam.SamtoolsError: 'Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix>\n'

Messages from :term:`csamtools` on stderr are captured and are
available using the :meth:`getMessages` method::

   pysam.sort.getMessage()

Note that only the output from the last invocation of a command is
stored.

In order for pysam to make the output of samtools commands accessible
the stdout stream needs to be redirected. This is the default
behaviour, but can cause problems in environments such as the ipython
notebook. A solution is to pass the ``catch_stdout`` keyword
argument::

   pysam.sort(catch_stdout=False)

Note that this means that output from commands which produce output on
stdout will not be available. The only solution is to run samtools
commands through subprocess.

================================
Working with tabix-indexed files
================================

To open a tabular file that has been indexed with tabix_, use
:class:`~pysam.TabixFile`::

    import pysam
    tbx = pysam.TabixFile("example.bed.gz")

Similar to :class:`~pysam.AlignmentFile.fetch`, intervals within a
region can be retrieved by calling :meth:`~pysam.TabixFile.fetch()`::

    for row in tbx.fetch("chr1", 1000, 2000):
         print (str(row))

This will return a tuple-like data structure in which columns can
be retrieved by numeric index::

    for row in tbx.fetch("chr1", 1000, 2000):
         print ("chromosome is", row[0])

By providing a parser to :class:`~pysam.AlignmentFile.fetch`
or :class:`~pysam.TabixFile`, the data will we presented in parsed
form::

    for row in tbx.fetch("chr1", 1000, 2000, parser=pysam.asTuple()):
         print ("chromosome is", row.contig)
         print ("first field (chrom)=", row[0])

Pre-built parsers are available for :term:`bed`
(:class:`~pysam.asBed`) formatted files and :term:`gtf`
(:class:`~pysam.asGTF`) formatted files. Thus, additional fields
become available through named access, for example::

    for row in tbx.fetch("chr1", 1000, 2000, parser=pysam.asBed()):
         print ("name is", row.name)


.. Currently inactivated as pileup deprecated
.. Using the samtools SNP caller
.. -----------------------------

.. There are two ways to access the samtools SNP caller. The :class:`pysam.IteratorSNPCalls`
.. is appropriate when calling many consecutive SNPs, while :class:`pysam.SNPCaller` is
.. best when calling SNPs at non-consecutive genomic positions. Each snp caller returns objects of
.. type :class:`pysam.SNPCall`.

.. To use :class:`pysam.IteratorSNPCalls`, associate it with a :class:`pysam.IteratorColumn`::

..     samfile = pysam.AlignmentFile( "ex1.bam", "rb")  
..     fastafile = pysam.Fastafile( "ex1.fa" )
..     pileup_iter = samfile.pileup( stepper = "samtools", fastafile = fastafile )
..     sncpall_iter = pysam.IteratorSNPCalls(pileup_iter)
..     for call in snpcall_iter:
..         print str(call)

.. Usage of :class:`pysam.SNPCaller` is similar::

..     samfile = pysam.AlignmentFile( "ex1.bam", "rb")  
..     fastafile = pysam.Fastafile( "ex1.fa" )
..     pileup_iter = samfile.pileup( stepper = "samtools", fastafile = fastafile )
..     snpcaller = pysam.SNPCaller.call(pileup_iter)
..     print snpcaller( "chr1", 100 )

.. Note the use of the option *stepper* to control which reads are included in the 
.. in the :term:`pileup`. The ``samtools`` stepper implements the same read selection
.. and processing as in the samtools pileup command.

.. Calling indels works along the same lines, using the :class:`pysam.IteratorIndelCalls`
.. and :class:`pysam.IteratorIndelCaller`.


====================================
Working with VCF/BCF formatted files
====================================

To iterate through a VCF/BCF formatted file use
:class:`~pysam.VariantFile`::

   from pysam import VariantFile

   bcf_in = VariantFile("test.bcf")  # auto-detect input format
   bcf_out = VariantFile('-', 'w', header=bcf_in.header)
   
   for rec in bcf_in.fetch('chr1', 100000, 200000):
       bcf_out.write(rec)

:meth:`_pysam.VariantFile.fetch()` iterates over
:class:`~pysam.VariantRecord` objects which provides access to
simple variant attributes such as :class:`~pysam.VariantRecord.contig`,
:class:`~pysam.VariantRecord.pos`, :class:`~pysam.VariantRecord.ref`::

   for rec in bcf_in.fetch():
       print (rec.pos)

but also to complex attributes such as the contents to the
:term:`info`, :term:`format` and :term:`genotype` columns. These
complex attributes are views on the underlying htslib data structures
and provide dictionary-like access to the data::

   for rec in bcf_in.fetch():
       print (rec.info)
       print (rec.info.keys())
       print (rec.info["DP"])

The :py:attr:`~pysam.VariantFile.header` attribute
(:class:`~pysam.VariantHeader`) provides access information
stored in the :term:`vcf` header. The complete header can be printed::

   >>> print (bcf_in.header)
   ##fileformat=VCFv4.2
   ##FILTER=<ID=PASS,Description="All filters passed">
   ##fileDate=20090805
   ##source=myImputationProgramV3.1
   ##reference=1000GenomesPilot-NCBI36
   ##phasing=partial
   ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples
   With Data">
   ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
   ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
   ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
   ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build
   129">
   ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
   ##FILTER=<ID=q10,Description="Quality below 10">
   ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
   ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
   ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
   ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
   ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
   ##contig=<ID=M>
   ##contig=<ID=17>
   ##contig=<ID=20>
   ##bcftools_viewVersion=1.3+htslib-1.3
   ##bcftools_viewCommand=view -O b -o example_vcf42.bcf
   example_vcf42.vcf.gz
   #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO   FORMAT    NA00001 NA00002 NA0000
  
Individual contents such as contigs, info fields, samples, formats can
be retrieved as attributes from :py:attr:`~pysam.VariantFile.header`::

   >>> print (bcf_in.header.contigs)
   <pysam.cbcf.VariantHeaderContigs object at 0xf250f8>

To convert these views to native python types, iterate through the views::

   >>> print list((bcf_in.header.contigs))
   ['M', '17', '20']
   >>> print list((bcf_in.header.filters))
   ['PASS', 'q10', 's50']
   >>> print list((bcf_in.header.info))
   ['NS', 'DP', 'AF', 'AA', 'DB', 'H2']
   >>> print list((bcf_in.header.samples))
   ['NA00001', 'NA00002', 'NA00003']

Alternatively, it is possible to iterate through all records in the
header returning objects of type :py:class:`~pysam.VariantHeaderRecord`:: ::

   >>> for x in bcf_in.header.records:
   >>>    print (x)
   >>>    print (x.type, x.key)
   GENERIC fileformat
   FILTER FILTER
   GENERIC fileDate
   GENERIC source
   GENERIC reference
   GENERIC phasing
   INFO INFO
   INFO INFO
   INFO INFO
   INFO INFO
   INFO INFO
   INFO INFO
   FILTER FILTER
   FILTER FILTER
   FORMAT FORMAT
   FORMAT FORMAT
   FORMAT FORMAT
   FORMAT FORMAT
   CONTIG contig
   CONTIG contig
   CONTIG contig
   GENERIC bcftools_viewVersion
   GENERIC bcftools_viewCommand

===============
Extending pysam
===============

Using pyximport_, it is (relatively) straight-forward to access pysam
internals and the underlying samtools library. An example is provided
in the :file:`tests` directory. The example emulates the samtools
flagstat command and consists of three files:

1. The main script :file:`pysam_flagstat.py`. The important lines in
   this script are::

      import pyximport
      pyximport.install()
      import _pysam_flagstat

      ...
   
      flag_counts = _pysam_flagstat.count(pysam_in)

   The first part imports, sets up pyximport_ and imports the cython
   module :file:`_pysam_flagstat`.  The second part calls the
   ``count`` method in :file:`_pysam_flagstat`.
 
2. The cython implementation :file:`_pysam_flagstat.pyx`. This script
   imports the pysam API via::

      from pysam.libcalignmentfile cimport AlignmentFile, AlignedSegment

   This statement imports, amongst others, :class:`AlignedSegment`
   into the namespace. Speed can be gained from declaring
   variables. For example, to efficiently iterate over a file, an
   :class:`AlignedSegment` object is declared::

      # loop over samfile
      cdef AlignedSegment read
      for read in samfile:
          ...

3. A :file:`pyxbld` providing pyximport_ with build information.
   Required are the locations of the samtools and pysam header
   libraries of a source installation of pysam plus the
   :file:`csamtools.so` shared library. For example::

     def make_ext(modname, pyxfilename):
	 from distutils.extension import Extension
	 import pysam
	 return Extension(name=modname,
               sources=[pyxfilename],
               extra_link_args=pysam.get_libraries(),
	       include_dirs=pysam.get_include(),
	       define_macros=pysam.get_defines())

If the script :file:`pysam_flagstat.py` is called the first time,
pyximport_ will compile the cython_ extension
:file:`_pysam_flagstat.pyx` and make it available to the
script. Compilation requires a working compiler and cython_
installation.  Each time :file:`_pysam_flagstat.pyx` is modified, a
new compilation will take place.

pyximport_ comes with cython_.