'\" t
.TH samtools-mpileup 1 "12 September 2024" "samtools-1.21" "Bioinformatics tools"
.SH NAME
samtools mpileup \- produces "pileup" textual format from an alignment
.\"
.\" Copyright (C) 2008-2011, 2013-2024 Genome Research Ltd.
.\" Portions copyright (C) 2010, 2011 Broad Institute.
.\"
.\" Author: Heng Li <lh3@sanger.ac.uk>
.\" Author: Joshua C. Randall <jcrandall@alum.mit.edu>
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.
.\" For code blocks and examples (cf groff's Ultrix-specific man macros)
.de EX
. in +\\$1
. nf
. ft CR
..
.de EE
. ft
. fi
. in
..
.
.SH SYNOPSIS
.PP
samtools mpileup
.RB [ -EB ]
.RB [ -C
.IR capQcoef ]
.RB [ -r
.IR reg ]
.RB [ -f
.IR in.fa ]
.RB [ -l
.IR list ]
.RB [ -Q
.IR minBaseQ ]
.RB [ -q
.IR minMapQ ]
.I in.bam
.RI [ in2.bam
.RI [ ... ]]
.SH DESCRIPTION
.PP
Generate text pileup output for one or more BAM files.
Each input file produces a separate group of pileup columns in the output.
.PP
There are two independent ways to restrict the locations processed in the
input file: \fB-r\fR \fIregion\fR and \fB-l\fR \fIfile\fR. The
former uses (and requires) an index to do random access, while the
latter streams through the file contents and keeps only the specified
regions, requiring no index. The two may be used in conjunction. For
example, a BED file containing locations of genes in chromosome 20
could be specified using \fB-r 20 -l chr20.bed\fR, meaning that the
index is used to find chromosome 20 and then it is filtered for the
regions listed in the BED file.
.PP
Unmapped reads are not considered and are always discarded.
By default secondary alignments, QC failures and duplicate reads will
be omitted, along with low quality bases and some reads in high depth
regions. See the \fB--ff\fR, \fB-Q\fR and \fB-d\fR options for
changing this.
.SS Pileup Format
Pileup format consists of TAB-separated lines, with each line representing
the pileup of reads at a single genomic position.
Several columns contain numeric quality values encoded as individual ASCII
characters.
Each character can range from \(lq!\(rq to \(lq~\(rq and is decoded by
taking its ASCII value and subtracting 33; e.g., \(lqA\(rq encodes the
numeric value 32.
The first three columns give the position and reference:
.IP \(ci 2
Chromosome name.
.IP \(ci 2
1-based position on the chromosome.
.IP \(ci 2
Reference base at this position (this will be \(lqN\(rq on all lines
if \fB-f\fR/\fB--fasta-ref\fR has not been used).
.PP
The remaining columns show the pileup data, and are repeated for each
input BAM file specified:
.IP \(ci 2
Number of reads covering this position.
.IP \(ci 2
Read bases.
This encodes information on matches, mismatches, indels, strand,
mapping quality, and starts and ends of reads.
For each read covering the position, this column contains:
.RS
.IP \(bu 2
If this is the first position covered by the read, a \(lq^\(rq character
followed by the alignment's mapping quality encoded as an ASCII character.
.IP \(bu 2
A single character indicating the read base and the strand to which the read
has been mapped:
.TS
tab(@);
c c c
- - -
ceb ceb l .
Forward@Reverse@Meaning
\&.\fR dot@,\fR comma@Base matches the reference base
ACGTN@acgtn@Base is a mismatch to the reference base
>@<@Reference skip (due to CIGAR \(lqN\(rq)
*@*\fR/\fB#@Deletion of the reference base (CIGAR \(lqD\(rq)
.TE
Deleted bases are shown as \(lq*\(rq on both strands
unless \fB--reverse-del\fR is used, in which case they are shown as \(lq#\(rq
on the reverse strand.
.IP \(bu 2
If there is an insertion after this read base, text matching
\(lq\\+[0-9]+[ACGTNacgtn*#]+\(rq: a \(lq+\(rq character followed by an integer
giving the length of the insertion and then the inserted sequence.
Pads are shown as \(lq*\(rq unless \fB--reverse-del\fR is used,
in which case pads on the reverse strand will be shown as \(lq#\(rq.
.IP \(bu 2
If there is a deletion after this read base, text matching
\(lq-[0-9]+[ACGTNacgtn]+\(rq: a \(lq-\(rq character followed by the deleted
reference bases represented similarly. (Subsequent pileup lines will
contain \(lq*\(rq for this read indicating the deleted bases.)
.IP \(bu 2
If this is the last position covered by the read, a \(lq$\(rq character.
.RE
.IP \(ci 2
Base qualities, encoded as ASCII characters.
.IP \(ci 2
Alignment mapping qualities, encoded as ASCII characters.
(Column only present when \fB-s\fR/\fB--output-MQ\fR is used.)
.IP \(ci 2
Comma-separated 1-based positions within the alignments, in the
orientation shown in the input file. E.g., 5 indicates
that it is the fifth base of the corresponding read that is mapped to this
genomic position.
(Column only present when \fB-O\fR/\fB--output-BP\fR is used.)
.IP \(ci 2
Additional comma-separated read field columns,
as selected via \fB--output-extra\fR.
The fields selected appear in the same order as in SAM:
.BR QNAME ,
.BR FLAG ,
.BR RNAME ,
.BR POS ,
.B MAPQ
(displayed numerically),
.BR RNEXT ,
.BR PNEXT ,
followed by
.BR RLEN
for unclipped read length.
.IP \(ci 2
Comma-separated 1-based positions within the alignments, in 5' to 3'
orientation. E.g., 5 indicates that it is the fifth base of the
corresponding read as produced by the sequencing instrument, that is
mapped to this genomic position. (Column only present when \fB--output-BP-5\fR is used.)
.IP \(ci 2
Additional read tag field columns, as selected via \fB--output-extra\fR.
These columns are formatted as determined by \fB--output-sep\fR and
\fB--output-empty\fR (comma-separated by default), and appear in the
same order as the tags are given in \fB--output-extra\fR.
Any output column that would be empty, for example because a requested
tag is absent or the filtered sequence depth is zero, is reported as "*".
This ensures a consistent number of columns across all reported positions.
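.PP
As an illustration of the column layout described above, the following
Python sketch splits a read-bases column into one record per read and
decodes an ASCII-encoded quality column. It is not part of samtools: the
function names \fBparse_read_bases\fR and \fBdecode_qualities\fR are
hypothetical, and the sketch assumes default output, that is, without
\fB--output-mods\fR, \fB--no-output-ins\fR, \fB--no-output-del\fR or
\fB--no-output-ends\fR.
.EX 2
```python
def decode_qualities(column):
    """Decode ASCII-encoded qualities: ASCII value minus 33 per character."""
    return [ord(ch) - 33 for ch in column]


def parse_read_bases(column):
    """Split a read-bases column into one dict per read.

    Assumes the default markup order: optional "^" plus mapping quality,
    one base character, optional insertion and/or deletion, optional "$".
    """
    reads = []
    i, n = 0, len(column)
    while i < n:
        rec = {"mapq": None, "base": None, "ins": None, "del": None, "last": False}
        if column[i] == "^":
            # The character after "^" is the mapping quality, even if it
            # happens to be "+", "-" or "$".
            rec["mapq"] = ord(column[i + 1]) - 33
            i += 2
        rec["base"] = column[i]
        i += 1
        while i < n and column[i] in "+-":
            kind = "ins" if column[i] == "+" else "del"
            j = i + 1
            while j < n and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            # Take exactly "length" characters; the next base follows directly.
            rec[kind] = column[j:j + length]
            i = j + length
        if i < n and column[i] == "$":
            rec["last"] = True
            i += 1
        reads.append(rec)
    return reads
```
.EE
The indel length is read from the integer and used to slice the sequence,
because a greedy regular expression cannot tell where the inserted or
deleted bases end and the next read base begins. A caller would check the
depth column first, since a zero-depth position reports "*" in this column.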
.SH OPTIONS
.TP 10
.B -6, --illumina1.3+
Assume the quality is in the Illumina 1.3+ encoding.
.TP
.B -A, --count-orphans
Do not skip anomalous read pairs in variant calling. Anomalous read
pairs are those marked in the FLAG field as paired in sequencing but
without the properly-paired flag set.
.TP
.BI -b,\ --bam-list \ FILE
List of input BAM files, one file per line [null]
.TP
.B -B, --no-BAQ
Disable base alignment quality (BAQ) computation.
See
.B BAQ
below.
.TP
.BI -C,\ --adjust-MQ \ INT
.RS
Coefficient for downgrading mapping quality for reads containing
excessive mismatches. Mismatches are counted as a proportion of the
number of aligned bases ("M", "X" or "=" CIGAR operations), along with
their quality, to derive an upper-bound of the mapping quality.
Original mapping qualities lower than this are left intact, while
higher ones are capped at the new adjusted score.
The exact formula is complex and likely tuned to specific instruments
and specific alignment tools, so this option is disabled by default
(indicated as having a zero value). Variables in the formulae and
their meaning are defined below.
.TS
tab(@);
lb lb
- -
lb l .
Variable@Meaning / formula
M@T{
The number of matching CIGAR bases (operation "M", "X" or "=").
T}
X@The number of substitutions with quality >= 13.
SubQ@T{
The summed quality of substitution bases included in X, capped at a maximum of quality 33 per mismatching base.
T}
ClipQ@T{
The summed quality of soft-clipped or hard-clipped bases. This has no minimum or maximum quality threshold per base. For hard-clipped bases the per-base quality is taken as 13.
T}
T@SubQ - 10 * log10(M^X / X!) + ClipQ/5
Cap@MAX(0, INT * sqrt((INT - T) / INT))
.TE
Some notes on the impact of this option:
.IP \(ci 2
As the number of mismatches increases, the mapping quality cap
reduces, eventually resulting in discarded alignments.
.IP \(ci 2
High quality mismatches reduce the cap faster than low quality
mismatches.
.IP \(ci 2
The starting INT value also acts as a hard cap on mapping quality,
even when zero mismatches are observed.
.IP \(ci 2
Indels have no impact on the mapping quality.
.PP
The intent of this option is to work around aligners that compute a
mapping quality using a local alignment without having any regard to
the degree of clipping required or consideration of potential
contamination or large scale insertions with respect to the reference.
A record may align uniquely and have no close second match, but having
a high number of mismatches may still imply that the reference is not
the correct site.
However we do not recommend use of this parameter unless you fully
understand the impact of it and have determined that it is appropriate
for your sequencing technology.
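.PP
To make the arithmetic in the table concrete, here is a small Python
sketch of the T and Cap formulae. It is an illustration only, not the
samtools implementation: the function name \fBadjusted_mapq_cap\fR is
hypothetical, and two behaviours are assumptions made so that the result
stays within the range the notes describe, namely clamping T at zero
(so that INT remains a hard cap) and returning zero when T exceeds INT.
.EX 2
```python
import math


def adjusted_mapq_cap(coef, m, x, sub_q, clip_q):
    """Evaluate Cap = MAX(0, INT * sqrt((INT - T) / INT)) for INT = coef.

    m, x, sub_q and clip_q correspond to M, X, SubQ and ClipQ in the table.
    Returns None when the option is disabled (coef of zero or less).
    """
    if coef <= 0:
        return None
    if x > 0:
        # log10(M^X / X!) evaluated in log space to avoid huge integers.
        log_term = x * math.log10(m) - math.lgamma(x + 1) / math.log(10)
    else:
        log_term = 0.0
    t = sub_q - 10 * log_term + clip_q / 5
    # Assumption: negative T is clamped so the cap never exceeds coef.
    t = max(t, 0.0)
    # Assumption: T above coef is treated as a cap of zero.
    if t > coef:
        return 0.0
    return max(0.0, coef * math.sqrt((coef - t) / coef))
```
.EE
With a coefficient of 50 and 100 aligned bases, this sketch leaves the cap
at 50 when there are no mismatches and lowers it as high quality
mismatches accumulate.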
.RE
.TP
.BI -d,\ --max-depth \ INT
At each position, read at most
.I INT
reads per input file. Setting this limit reduces the amount of memory and
time needed to process regions with very high coverage. Passing zero for this
option sets it to the highest possible value, effectively removing the depth
limit. [8000]
Note that up to release 1.8, samtools would enforce a minimum value for
this option. This no longer happens and the limit is set exactly as
specified.
.TP
.B -E, --redo-BAQ
Recalculate BAQ on the fly, ignoring any existing BQ tags.
See
.B BAQ
below.
.TP
.BI -f,\ --fasta-ref \ FILE
The
.BR faidx -indexed
reference file in the FASTA format. The file can be optionally compressed by
.BR bgzip .
[null]
Supplying a reference file will enable base alignment quality calculation
for all reads aligned to a reference in the file. See
.B BAQ
below.
.TP
.BI -G,\ --exclude-RG \ FILE
Exclude reads from read groups listed in FILE (one @RG-ID per line)
.TP
.BI -l,\ --positions \ FILE
BED or position list file containing a list of regions or sites where
pileup or BCF should be generated. Position list files contain two
columns (chromosome and position) and start counting from 1. BED
files contain at least 3 columns (chromosome, start and end position)
and are 0-based half-open.
.br
While it is possible to mix both position-list and BED coordinates in
the same file, this is strongly discouraged due to the differing
coordinate systems. [null]
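.IP
The difference between the two coordinate systems can be sketched in
Python as follows. The helper names \fBbed_to_one_based\fR and
\fBbed_covers\fR are hypothetical and only illustrate the conversion
between a 0-based half-open BED interval and the 1-based positions
reported in pileup output.
.EX 2
```python
def bed_to_one_based(start, end):
    """Convert a 0-based half-open BED interval to an inclusive 1-based range."""
    return start + 1, end


def bed_covers(pos, start, end):
    """True if 1-based position pos lies inside the BED interval [start, end)."""
    return start < pos <= end
```
.EE
.IP
For example, the BED line "chr20 99 200" covers the same sites as the
1-based positions 100 to 200 inclusive.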
.TP
.BI -q,\ --min-MQ \ INT
Minimum mapping quality for an alignment to be used [0]
.TP
.BI -Q,\ --min-BQ \ INT
Minimum base quality for a base to be considered. [13]
Note base-quality 0 is used as a filtering mechanism for overlap
removal which marks bases as having quality zero and lets the base
quality filter remove them. Hence using \fB--min-BQ 0\fR will make
the overlapping bases reappear, albeit with quality zero.
.TP
.BI -r,\ --region \ STR
Only generate pileup in region
.IR STR .
Requires the BAM files to be indexed.
If used in conjunction with \fB-l\fR then the intersection of the
two requests is considered.
[all sites]
.TP
.B -R,\ --ignore-RG
Ignore RG tags. Treat all reads in one BAM as one sample.
.TP
.BI --rf,\ --incl-flags \ STR|INT
Required flags: only include reads with any of the mask bits set [null].
Note this is implemented as a filter-out rule, rejecting reads that have
none of the mask bits set. Hence this does not override the
\fB--excl-flags\fR option.
.TP
.BI --ff,\ --excl-flags \ STR|INT
Filter flags: skip reads with any of the mask bits set. This defaults
to SECONDARY,QCFAIL,DUP. The option is not cumulative, so
specifying e.g. \fB--ff QCFAIL\fR will re-enable output of
secondary and duplicate alignments. Note this does not override the
\fB--incl-flags\fR option.
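.IP
The interaction of \fB--incl-flags\fR and \fB--excl-flags\fR described
above can be sketched in Python. This is a reading of the documented
behaviour rather than the samtools source: the names \fBflag_mask\fR and
\fBkeep_read\fR are hypothetical, and the bit values are the standard SAM
FLAG bits.
.EX 2
```python
FLAGS = {
    "PAIRED": 1, "PROPER_PAIR": 2, "UNMAP": 4, "MUNMAP": 8,
    "REVERSE": 16, "MREVERSE": 32, "READ1": 64, "READ2": 128,
    "SECONDARY": 256, "QCFAIL": 512, "DUP": 1024, "SUPPLEMENTARY": 2048,
}


def flag_mask(spec):
    """Turn an integer or a comma-separated list of flag names into a bit mask."""
    if isinstance(spec, int):
        return spec
    mask = 0
    for name in spec.split(","):
        mask |= FLAGS[name]
    return mask


# Default exclusion mask as documented for --excl-flags.
DEFAULT_EXCL = flag_mask("SECONDARY,QCFAIL,DUP")


def keep_read(flag, incl=0, excl=DEFAULT_EXCL):
    """Apply the documented filters to one read's FLAG value."""
    if flag & FLAGS["UNMAP"]:
        return False  # unmapped reads are always discarded
    if incl and not (flag & incl):
        return False  # none of the required bits is set
    if flag & excl:
        return False  # at least one excluded bit is set
    return True
```
.EE
.IP
Because both tests are rejection rules, a read must pass each of them;
neither option overrides the other.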
.TP
.B -x,\ --ignore-overlaps-removal, --disable-overlap-removal
Overlap detection and removal is enabled by default. This option
turns it off.
When enabled, wherever the two ends of a read pair overlap, one base is
selected at each overlapping position and the duplicate base is nullified
by setting its Phred score to zero. It will then be discarded by the
\fB--min-BQ\fR option unless this is zero.
The quality value of the retained base will be the sum of the two base
qualities if the bases agree, or 0.8 times the higher of the two
qualities if they disagree, in which case the retained nucleotide is
the higher-confidence call.
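.IP
The quality rule for overlapping bases can be sketched in Python as
follows. The function name \fBmerge_overlap\fR is hypothetical; rounding
the 0.8 factor down to an integer and preferring the first base when the
qualities are equal are assumptions of this sketch, and any upper limit
on the summed quality is not modelled.
.EX 2
```python
def merge_overlap(base1, qual1, base2, qual2):
    """Return (retained_base, retained_quality) for one overlapping position.

    The other copy of the base would have its quality set to zero.
    """
    if base1 == base2:
        # Agreement: the qualities are summed.
        return base1, qual1 + qual2
    # Disagreement: keep the more confident call at 0.8 times its quality.
    if qual1 >= qual2:
        return base1, qual1 * 4 // 5
    return base2, qual2 * 4 // 5
```
.EE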
.TP
.B -X
Include customized index files as part of the arguments. See the
.B EXAMPLES
section for sample usage.
.PP
.B Output Options:
.TP 10
.BI "-o, --output " FILE
Write pileup output to
.IR FILE ,
rather than the default of standard output.
.TP
.B -O, --output-BP
Output base positions on reads in orientation listed in the SAM file
(left to right).
.TP
.B --output-BP-5
Output base positions on reads in their original 5' to 3' orientation.
.TP
.B -s, --output-MQ
Output mapping qualities encoded as ASCII characters.
.TP
.B --output-QNAME
Output an extra column containing comma-separated read names.
Equivalent to \fB--output-extra QNAME\fR.
.TP
.BI "--output-extra" \ STR
Output extra columns containing comma-separated values of read fields or read
tags. The names of the selected fields have to be provided as they are
described in the SAM Specification (page 6) and will be output by the
mpileup command in the same order as in the document (i.e.
.BR QNAME ", " FLAG ", " RNAME ,...)
The names are case sensitive. Currently, only the following fields are
supported:
.IP
.B QNAME, FLAG, RNAME, POS, MAPQ, RNEXT, PNEXT, RLEN
.IP
Anything that is not on this list is treated as a potential tag, although only
two character tags are accepted. In the mpileup output, tag columns are
displayed in the order they were provided by the user in the command line.
Field and tag names have to be provided in a comma-separated string to the
mpileup command. Tags of type \fBB\fR (byte array) are not
supported. An absent or unsupported tag will be listed as "*".
E.g.
.IP
.B samtools mpileup --output-extra FLAG,QNAME,RG,NM in.bam
.IP
will display four extra columns in the mpileup output, the first being a list of
comma-separated read names, followed by a list of flag values, a list of RG tag
values and a list of NM tag values. Field values are always displayed before
tag values.
.TP
.BI "--output-sep" \ CHAR
Specify a different separator character for tag value lists, when those values
might contain one or more commas (\fB,\fR), which is the default list separator.
This option only affects columns for two-letter tags like NM; standard
fields like FLAG or QNAME will always be separated by commas.
.TP
.BI "--output-empty" \ CHAR
Specify a different 'no value' character for tag list entries corresponding to
reads that don't have a tag requested with the \fB--output-extra\fR option. The
default is \fB*\fR.
This option only applies to rows that have at least one read in the pileup,
and only to columns for two-letter tags.
Columns for empty rows will always be printed as \fB*\fR.
.TP
.B -M, --output-mods
Adds base modification markup into the sequence column. This uses the
\fBMm\fR and \fBMl\fR auxiliary tags (or their uppercase
equivalents). Any base in the sequence output may be followed by a
series of \fIstrand\fR \fIcode\fR \fIquality\fR strings enclosed
within square brackets where strand is "+" or "-", code is a single
character (such as "m" or "h") or a ChEBI numeric in parentheses, and
quality is an optional numeric quality value. For example a "C" base
with possible 5mC and 5hmC base modification may be reported as
"C[+m179+h40]".
Quality values are from 0 to 255 inclusive, representing a linear
scale of probability 0.0 to 1.0 in increments of 1/256. If quality
values are absent (no \fBMl\fR tag) these are omitted, giving an
example string of "C[+m+h]".
Note the base modifications may be identified on the reverse strand,
either due to the native ability for this detection by the sequencing
instrument or by the sequence subsequently being reverse
complemented. This can lead to modification codes, such as "m"
meaning 5mC, being shown for their complementary bases, such as
"G[-m50]".
When \fB--output-mods\fR is selected base modifications can appear on
any base in the sequence output, including during insertions. This
may make parsing the string more complex, so also see the
\fB--no-output-ins-mods\fR and \fB--no-output-ins\fR options to
simplify this process.
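.IP
As a sketch of how the bracketed markup might be parsed, the following
Python function takes the text between the square brackets and returns
one tuple per modification. The name \fBparse_mods\fR is hypothetical,
and the pattern follows only the syntax described above (strand, then a
single-letter code or a parenthesised ChEBI number, then an optional
quality).
.EX 2
```python
import re

# Strand, then a single-letter code or a parenthesised number, then an
# optional run of digits giving the quality.
MOD_RE = re.compile("([+-])([A-Za-z]|[(][0-9]+[)])([0-9]*)")


def parse_mods(markup):
    """Parse markup such as "+m179+h40" into (strand, code, quality) tuples.

    The quality is None when no value follows the code.
    """
    mods = []
    for strand, code, qual in MOD_RE.findall(markup):
        mods.append((strand, code, int(qual) if qual else None))
    return mods
```
.EE
.IP
For the example "C[+m179+h40]", the text passed to the function would
be "+m179+h40".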
.TP
.B --no-output-ins
Do not output the inserted bases in the sequence column. Usually this
is reported as "+\fIlength\fR \fIsequence\fR", but with this option
it becomes simply "+\fIlength\fR". For example an insertion of AGT
in a pileup column changes from "CCC+3AGTGCC" to "CCC+3GCC".
Specifying this option twice also removes the "+\fIlength\fR"
portion, changing the example above to "CCCGCC".
The purpose of this change is to simplify parsing using basic regular
expressions, which traditionally cannot perform counting operations.
It is particularly beneficial when used in conjunction with
\fB--output-mods\fR as the syntax of the inserted sequence is adjusted
to also report possible base modifications, but see also
\fB--no-output-ins-mods\fR as an alternative.
.TP
.B --no-output-ins-mods
Outputs the inserted bases in the sequence, but excluding any base
modifications. This only affects output when \fB--output-mods\fR is
also used.
.TP
.B --no-output-del
Do not output deleted reference bases in the sequence column.
Normally this is reported as "-\fIlength\fR \fIsequence\fR", but with this option
it becomes simply "-\fIlength\fR". For example a deletion of 3
unknown bases (due to no reference being specified) would normally be
seen in a column as e.g. "CCC-3NNNGCC", but will be reported as
"CCC-3GCC" with this option.
Specifying this option twice also removes the "-\fIlength\fR"
portion, changing the example above to "CCCGCC".
The purpose of this change is to simplify parsing using basic regular
expressions, which traditionally cannot perform counting operations.
See also \fB--no-output-ins\fR.
.TP
.B --no-output-ends
Removes the \(lq^\(rq (with mapping quality) and \(lq$\(rq markup from
the sequence column.
.TP
.B --reverse-del
Mark the deletions on the reverse strand with the character
.BR # ,
instead of the usual
.BR * .
.TP
.B -a
Output all positions, including those with zero depth.
.TP
.B -a -a, -aa
Output absolutely all positions, including unused reference sequences.
Note that when used in conjunction with a BED file the -a option may
sometimes operate as if -aa was specified if the reference sequence
has coverage outside of the region specified in the BED file.
.PP
.B BAQ (Base Alignment Quality)
.PP
BAQ is the Phred-scaled probability of a read base being misaligned.
It greatly helps to reduce false SNPs caused by misalignments.
BAQ is calculated using the probabilistic realignment method described
in the paper \*(lqImproving SNP discovery by base alignment quality\*(rq,
Heng Li, Bioinformatics, Volume 27, Issue 8
<https://doi.org/10.1093/bioinformatics/btr076>.
BAQ is applied to modify quality values before the \fB-Q\fR filtering
happens and before the choice of which sequence to retain when
removing overlaps.
BAQ is turned on when a reference file is supplied using the
.B -f
option. To disable it, use the
.B -B
option.
It is possible to store precalculated BAQ values in a SAM BQ:Z tag.
Samtools mpileup will use the precalculated values if it finds them.
The
.B -E
option can be used to make it ignore the contents of the BQ:Z tag and
force it to recalculate the BAQ scores by making a new alignment.
.PP
.SH EXAMPLES
Using range:
With implicit index files in1.bam.<ext> and in2.sam.gz.<ext>,
.EX 2
samtools mpileup in1.bam in2.sam.gz -r chr10:100000-200000
.EE
With explicit index files,
.EX 2
samtools mpileup in1.bam in2.sam.gz idx/in1.csi idx/in2.csi -X -r chr10:100000-200000
.EE
With fofn being a file of input file names, and implicit index files present with inputs,
.EX 2
samtools mpileup -b fofn -r chr10:100000-200000
.EE
Using flags:
To get reads with flags READ2 or REVERSE and not having any of SECONDARY,QCFAIL,DUP,
.EX 2
samtools mpileup --rf READ2,REVERSE in.sam
.EE
or
.EX 2
samtools mpileup --rf 144 in.sam
.EE
To get reads with flag SECONDARY,
.EX 2
samtools mpileup --rf SECONDARY --ff QCFAIL,DUP in.sam
.EE
Using all possible alignments:
To show all possible alignments, either of the two equivalent commands below may be used,
.EX 2
samtools mpileup --count-orphans --no-BAQ --max-depth 0 --fasta-ref ref_file.fa \\
--min-BQ 0 --excl-flags 0 --disable-overlap-removal in.sam
samtools mpileup -A -B -d 0 -f ref_file.fa -Q 0 --ff 0 -x in.sam
.EE
.SH AUTHOR
.PP
Written by Heng Li from the Sanger Institute.
.SH SEE ALSO
.IR samtools (1),
.IR samtools-depth (1),
.IR samtools-sort (1),
.IR bcftools (1)
.PP
Samtools website: <http://www.htslib.org/>