1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422
|
.. _Usage:
=========================================
Working with BAM/CRAM/SAM-formatted files
=========================================
Opening a file
==============
To begin with, import the pysam module and open a
:class:`pysam.AlignmentFile`::
import pysam
samfile = pysam.AlignmentFile("ex1.bam", "rb")
The above command opens the file :file:`ex1.bam` for reading.
The ``b`` qualifier indicates that this is a :term:`BAM` file.
To open a :term:`SAM` file, type::
import pysam
samfile = pysam.AlignmentFile("ex1.sam", "r")
:term:`CRAM` files are identified by a ``c`` qualifier::
import pysam
samfile = pysam.AlignmentFile("ex1.cram", "rc")
Fetching reads mapped to a :term:`region`
=========================================
Reads are obtained through a call to the
:meth:`pysam.AlignmentFile.fetch` method which returns an iterator.
Each call to the iterator will returns a :class:`pysam.AlignedSegment`
object::
iter = samfile.fetch("seq1", 10, 20)
for x in iter:
print (str(x))
:meth:`pysam.AlignmentFile.fetch` returns all reads overlapping a
region sorted by the first aligned base in the :term:`reference`
sequence. Note that it will also return reads that are only partially
overlapping with the :term:`region`. Thus the reads returned might
span a region that is larger than the one queried.
Using the pileup-engine
=======================
In contrast to :term:`fetching`, the :term:`pileup` engine returns for
each base in the :term:`reference` sequence the reads that map to that
particular position. In the typical view of reads stacking vertically
on top of the reference sequence similar to a multiple alignment,
:term:`fetching` iterates over the rows of this implied multiple
alignment while a :term:`pileup` iterates over the :term:`columns`.
Calling :meth:`~pysam.AlignmentFile.pileup` will return an iterator
over each :term:`column` (reference base) of a specified
:term:`region`. Each call to the iterator returns an object of the
type :class:`pysam.PileupColumn` that provides access to all the
reads aligned to that particular reference position as well as
some additional information::
iter = samfile.pileup('seq1', 10, 20)
for x in iter:
print (str(x))
Creating BAM/CRAM/SAM files from scratch
========================================
The following example shows how a new :term:`BAM` file is constructed
from scratch. The important part here is that the
:class:`pysam.AlignmentFile` class needs to receive the sequence
identifiers. These can be given either as a dictionary in a header
structure, as lists of names and sizes, or from a template file.
Here, we use a header dictionary::
header = { 'HD': {'VN': '1.0'},
'SQ': [{'LN': 1575, 'SN': 'chr1'},
{'LN': 1584, 'SN': 'chr2'}] }
with pysam.AlignmentFile(tmpfilename, "wb", header=header) as outf:
a = pysam.AlignedSegment()
a.query_name = "read_28833_29006_6945"
a.query_sequence="AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
a.flag = 99
a.reference_id = 0
a.reference_start = 32
a.mapping_quality = 20
a.cigar = ((0,10), (2,1), (0,25))
a.next_reference_id = 0
a.next_reference_start=199
a.template_length=167
a.query_qualities = pysam.qualitystring_to_array("<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<")
a.tags = (("NM", 1),
("RG", "L1"))
outf.write(a)
Using streams
=============
Pysam does not support reading and writing from true python file
objects, but it does support reading and writing from stdin and
stdout. The following example reads from stdin and writes to stdout::
infile = pysam.AlignmentFile("-", "r")
outfile = pysam.AlignmentFile("-", "w", template=infile)
for s in infile:
outfile.write(s)
It will also work with :term:`BAM` files. The following script
converts a :term:`BAM` formatted file on stdin to a :term:`SAM`
formatted file on stdout::
infile = pysam.AlignmentFile("-", "rb")
outfile = pysam.AlignmentFile("-", "w", template=infile)
for s in infile:
outfile.write(s)
Note that the file open mode needs to changed from ``r`` to ``rb``.
=====================================
Using samtools commands within python
=====================================
Commands available in :term:`csamtools` are available as simple
function calls. Command line options are provided as arguments. For
example::
pysam.sort("-o", "output.bam", "ex1.bam")
corresponds to the command line::
samtools sort -o output.bam ex1.bam
Or for example::
pysam.sort("-m", "1000000", "-o", "output.bam", "ex1.bam")
In order to get usage information, try::
print(pysam.sort.usage())
Argument errors raise a :class:`pysam.SamtoolsError`::
pysam.sort()
Traceback (most recent call last):
File "x.py", line 12, in <module>
pysam.sort()
File "/build/lib.linux-x86_64-2.6/pysam/__init__.py", line 37, in __call__
if retval: raise SamtoolsError( "\n".join( stderr ) )
pysam.SamtoolsError: 'Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix>\n'
Messages from :term:`csamtools` on stderr are captured and are
available using the :meth:`getMessages` method::
pysam.sort.getMessage()
Note that only the output from the last invocation of a command is
stored.
In order for pysam to make the output of samtools commands accessible
the stdout stream needs to be redirected. This is the default
behaviour, but can cause problems in environments such as the ipython
notebook. A solution is to pass the ``catch_stdout`` keyword
argument::
pysam.sort(catch_stdout=False)
Note that this means that output from commands which produce output on
stdout will not be available. The only solution is to run samtools
commands through subprocess.
================================
Working with tabix-indexed files
================================
To open a tabular file that has been indexed with tabix_, use
:class:`~pysam.TabixFile`::
import pysam
tbx = pysam.TabixFile("example.bed.gz")
Similar to :class:`~pysam.AlignmentFile.fetch`, intervals within a
region can be retrieved by calling :meth:`~pysam.TabixFile.fetch()`::
for row in tbx.fetch("chr1", 1000, 2000):
print (str(row))
This will return a tuple-like data structure in which columns can
be retrieved by numeric index::
for row in tbx.fetch("chr1", 1000, 2000):
print ("chromosome is", row[0])
By providing a parser to :class:`~pysam.AlignmentFile.fetch`
or :class:`~pysam.TabixFile`, the data will we presented in parsed
form::
for row in tbx.fetch("chr1", 1000, 2000, parser=pysam.asTuple()):
print ("chromosome is", row.contig)
print ("first field (chrom)=", row[0])
Pre-built parsers are available for :term:`bed`
(:class:`~pysam.asBed`) formatted files and :term:`gtf`
(:class:`~pysam.asGTF`) formatted files. Thus, additional fields
become available through named access, for example::
for row in tbx.fetch("chr1", 1000, 2000, parser=pysam.asBed()):
print ("name is", row.name)
.. Currently inactivated as pileup deprecated
.. Using the samtools SNP caller
.. -----------------------------
.. There are two ways to access the samtools SNP caller. The :class:`pysam.IteratorSNPCalls`
.. is appropriate when calling many consecutive SNPs, while :class:`pysam.SNPCaller` is
.. best when calling SNPs at non-consecutive genomic positions. Each snp caller returns objects of
.. type :class:`pysam.SNPCall`.
.. To use :class:`pysam.IteratorSNPCalls`, associate it with a :class:`pysam.IteratorColumn`::
.. samfile = pysam.AlignmentFile( "ex1.bam", "rb")
.. fastafile = pysam.Fastafile( "ex1.fa" )
.. pileup_iter = samfile.pileup( stepper = "samtools", fastafile = fastafile )
.. sncpall_iter = pysam.IteratorSNPCalls(pileup_iter)
.. for call in snpcall_iter:
.. print str(call)
.. Usage of :class:`pysam.SNPCaller` is similar::
.. samfile = pysam.AlignmentFile( "ex1.bam", "rb")
.. fastafile = pysam.Fastafile( "ex1.fa" )
.. pileup_iter = samfile.pileup( stepper = "samtools", fastafile = fastafile )
.. snpcaller = pysam.SNPCaller.call(pileup_iter)
.. print snpcaller( "chr1", 100 )
.. Note the use of the option *stepper* to control which reads are included in the
.. in the :term:`pileup`. The ``samtools`` stepper implements the same read selection
.. and processing as in the samtools pileup command.
.. Calling indels works along the same lines, using the :class:`pysam.IteratorIndelCalls`
.. and :class:`pysam.IteratorIndelCaller`.
====================================
Working with VCF/BCF formatted files
====================================
To iterate through a VCF/BCF formatted file use
:class:`~pysam.VariantFile`::
from pysam import VariantFile
bcf_in = VariantFile("test.bcf") # auto-detect input format
bcf_out = VariantFile('-', 'w', header=bcf_in.header)
for rec in bcf_in.fetch('chr1', 100000, 200000):
bcf_out.write(rec)
:meth:`_pysam.VariantFile.fetch()` iterates over
:class:`~pysam.VariantRecord` objects which provides access to
simple variant attributes such as :class:`~pysam.VariantRecord.contig`,
:class:`~pysam.VariantRecord.pos`, :class:`~pysam.VariantRecord.ref`::
for rec in bcf_in.fetch():
print (rec.pos)
but also to complex attributes such as the contents to the
:term:`info`, :term:`format` and :term:`genotype` columns. These
complex attributes are views on the underlying htslib data structures
and provide dictionary-like access to the data::
for rec in bcf_in.fetch():
print (rec.info)
print (rec.info.keys())
print (rec.info["DP"])
The :py:attr:`~pysam.VariantFile.header` attribute
(:class:`~pysam.VariantHeader`) provides access information
stored in the :term:`vcf` header. The complete header can be printed::
>>> print (bcf_in.header)
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples
With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build
129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##contig=<ID=M>
##contig=<ID=17>
##contig=<ID=20>
##bcftools_viewVersion=1.3+htslib-1.3
##bcftools_viewCommand=view -O b -o example_vcf42.bcf
example_vcf42.vcf.gz
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA0000
Individual contents such as contigs, info fields, samples, formats can
be retrieved as attributes from :py:attr:`~pysam.VariantFile.header`::
>>> print (bcf_in.header.contigs)
<pysam.cbcf.VariantHeaderContigs object at 0xf250f8>
To convert these views to native python types, iterate through the views::
>>> print list((bcf_in.header.contigs))
['M', '17', '20']
>>> print list((bcf_in.header.filters))
['PASS', 'q10', 's50']
>>> print list((bcf_in.header.info))
['NS', 'DP', 'AF', 'AA', 'DB', 'H2']
>>> print list((bcf_in.header.samples))
['NA00001', 'NA00002', 'NA00003']
Alternatively, it is possible to iterate through all records in the
header returning objects of type :py:class:`~pysam.VariantHeaderRecord`:: ::
>>> for x in bcf_in.header.records:
>>> print (x)
>>> print (x.type, x.key)
GENERIC fileformat
FILTER FILTER
GENERIC fileDate
GENERIC source
GENERIC reference
GENERIC phasing
INFO INFO
INFO INFO
INFO INFO
INFO INFO
INFO INFO
INFO INFO
FILTER FILTER
FILTER FILTER
FORMAT FORMAT
FORMAT FORMAT
FORMAT FORMAT
FORMAT FORMAT
CONTIG contig
CONTIG contig
CONTIG contig
GENERIC bcftools_viewVersion
GENERIC bcftools_viewCommand
===============
Extending pysam
===============
Using pyximport_, it is (relatively) straight-forward to access pysam
internals and the underlying samtools library. An example is provided
in the :file:`tests` directory. The example emulates the samtools
flagstat command and consists of three files:
1. The main script :file:`pysam_flagstat.py`. The important lines in
this script are::
import pyximport
pyximport.install()
import _pysam_flagstat
...
flag_counts = _pysam_flagstat.count(pysam_in)
The first part imports, sets up pyximport_ and imports the cython
module :file:`_pysam_flagstat`. The second part calls the
``count`` method in :file:`_pysam_flagstat`.
2. The cython implementation :file:`_pysam_flagstat.pyx`. This script
imports the pysam API via::
from pysam.libcalignmentfile cimport AlignmentFile, AlignedSegment
This statement imports, amongst others, :class:`AlignedSegment`
into the namespace. Speed can be gained from declaring
variables. For example, to efficiently iterate over a file, an
:class:`AlignedSegment` object is declared::
# loop over samfile
cdef AlignedSegment read
for read in samfile:
...
3. A :file:`pyxbld` providing pyximport_ with build information.
Required are the locations of the samtools and pysam header
libraries of a source installation of pysam plus the
:file:`csamtools.so` shared library. For example::
def make_ext(modname, pyxfilename):
from distutils.extension import Extension
import pysam
return Extension(name=modname,
sources=[pyxfilename],
extra_link_args=pysam.get_libraries(),
include_dirs=pysam.get_include(),
define_macros=pysam.get_defines())
If the script :file:`pysam_flagstat.py` is called the first time,
pyximport_ will compile the cython_ extension
:file:`_pysam_flagstat.pyx` and make it available to the
script. Compilation requires a working compiler and cython_
installation. Each time :file:`_pysam_flagstat.pyx` is modified, a
new compilation will take place.
pyximport_ comes with cython_.
|