\name{findMateAlignment}
\alias{findMateAlignment}
\alias{findMateAlignment2}
\alias{makeGAlignmentPairs}
\alias{getDumpedAlignments}
\alias{countDumpedAlignments}
\alias{flushDumpedAlignments}
% Old stuff:
\alias{makeGappedAlignmentPairs}
\title{Pairing the elements of a GAlignments object}
\description{
Utilities for pairing the elements of a \link{GAlignments} object.
NOTE: Until BioC 2.13, \code{findMateAlignment} was the workhorse used
by \code{\link{readGAlignmentPairsFromBam}} for pairing the records loaded
from a BAM file containing aligned paired-end reads.
Starting with BioC 2.14, \code{\link{readGAlignmentPairsFromBam}} relies
on \code{\link{readGAlignmentsListFromBam}} which itself relies on
\code{\link[Rsamtools]{scanBam}(BamFile(asMates=TRUE), ...)} for the
pairing.
}
\usage{
findMateAlignment(x)
makeGAlignmentPairs(x, use.names=FALSE, use.mcols=FALSE)
## Related low-level utilities:
getDumpedAlignments()
countDumpedAlignments()
flushDumpedAlignments()
}
\arguments{
\item{x}{
A named \link{GAlignments} object with metadata columns \code{flag},
\code{mrnm}, and \code{mpos}. Typically obtained by loading aligned
paired-end reads from a BAM file with:
\preformatted{
param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
x <- readGAlignmentsFromBam(..., use.names=TRUE, param=param)
}
}
\item{use.names}{
Whether the names on the input object should be propagated to the
returned object or not.
}
\item{use.mcols}{
Names of the metadata columns to propagate to the returned
\link{GAlignmentPairs} object.
}
}
\details{
\subsection{Pairing algorithm used by findMateAlignment}{
\code{findMateAlignment} is the workhorse used by \code{makeGAlignmentPairs}
for pairing the records loaded from a BAM file containing aligned paired-end
reads.
It implements the following pairing algorithm:
\itemize{
\item First, only records with flag bit 0x1 (multiple segments) set to 1,
flag bit 0x4 (segment unmapped) set to 0, and flag bit 0x8 (next
segment in the template unmapped) set to 0, are candidates for
pairing (see the SAM Spec for a description of flag bits and fields).
\code{findMateAlignment} will ignore any other record. That is,
records that correspond to single-end reads, or records that
correspond to paired-end reads where one or both ends are unmapped,
are discarded.
\item Then the algorithm looks at the following fields and flag bits:
\itemize{
\item (A) QNAME
\item (B) RNAME, RNEXT
\item (C) POS, PNEXT
\item (D) Flag bits 0x10 (segment aligned to minus strand)
and 0x20 (next segment aligned to minus strand)
\item (E) Flag bits 0x40 (first segment in template) and 0x80 (last
segment in template)
\item (F) Flag bit 0x2 (proper pair)
\item (G) Flag bit 0x100 (secondary alignment)
}
Two records rec1 and rec2 are considered mates if and only if all the
following conditions are satisfied:
\itemize{
\item (A) QNAME(rec1) == QNAME(rec2)
\item (B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1)
\item (C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1)
\item (D) Flag bit 0x20 of rec1 == Flag bit 0x10 of rec2 and
Flag bit 0x20 of rec2 == Flag bit 0x10 of rec1
\item (E) rec1 corresponds to the first segment in the template and
rec2 corresponds to the last segment in the template, OR,
rec2 corresponds to the first segment in the template and
rec1 corresponds to the last segment in the template
\item (F) rec1 and rec2 have same flag bit 0x2
\item (G) rec1 and rec2 have same flag bit 0x100
}
}
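The conditions above can be sketched in base R as follows. This is only an
illustration on a toy data frame, not the package's actual implementation
(a quadratic scan like this one would not scale to millions of records).
The column names and the helpers \code{has_bit}, \code{is_candidate},
\code{are_mates}, and \code{find_mates} are hypothetical:
\preformatted{
## Toy records; columns mirror the SAM fields QNAME, RNAME, POS, RNEXT,
## PNEXT, and FLAG (hypothetical names and values).
recs <- data.frame(
    qname = c("r1", "r1", "r2"),
    rname = c("chr1", "chr1", "chr1"),
    pos   = c(100L, 250L, 400L),
    rnext = c("chr1", "chr1", "chr1"),
    pnext = c(250L, 100L, 900L),
    flag  = c(99L, 147L, 73L),
    stringsAsFactors = FALSE
)

## TRUE where flag bit 'bit' is set.
has_bit <- function(flag, bit) bitwAnd(flag, bit) != 0L

## Candidates: bit 0x1 set, bits 0x4 and 0x8 unset.
is_candidate <- function(r) {
    has_bit(r$flag, 0x1L) & !has_bit(r$flag, 0x4L) & !has_bit(r$flag, 0x8L)
}

## Conditions (A) to (G) for records i and j of 'r'.
are_mates <- function(r, i, j) {
    f1 <- r$flag[i]
    f2 <- r$flag[j]
    r$qname[i] == r$qname[j] &&                                     # (A)
    r$rnext[i] == r$rname[j] && r$rnext[j] == r$rname[i] &&         # (B)
    r$pnext[i] == r$pos[j] && r$pnext[j] == r$pos[i] &&             # (C)
    has_bit(f1, 0x20L) == has_bit(f2, 0x10L) &&
    has_bit(f2, 0x20L) == has_bit(f1, 0x10L) &&                     # (D)
    ((has_bit(f1, 0x40L) && has_bit(f2, 0x80L)) ||
     (has_bit(f2, 0x40L) && has_bit(f1, 0x80L))) &&                 # (E)
    has_bit(f1, 0x2L) == has_bit(f2, 0x2L) &&                       # (F)
    has_bit(f1, 0x100L) == has_bit(f2, 0x100L)                      # (G)
}

## For each candidate, keep the mate only if exactly one record qualifies.
find_mates <- function(r) {
    mate <- rep(NA_integer_, nrow(r))
    cand <- which(is_candidate(r))
    for (i in cand) {
        hits <- Filter(function(j) j != i && are_mates(r, i, j), cand)
        if (length(hits) == 1L) mate[i] <- hits
    }
    mate
}

find_mates(recs)  # 2 1 NA: the third record has flag bit 0x8 set
}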
}
\subsection{Timing and memory requirement of the pairing algorithm}{
The estimated timings and memory requirements on a modern Linux system are
(those numbers may vary depending on your hardware and OS):
\preformatted{
nb of alignments | time         | required memory
-----------------+--------------+----------------
8 million        | 28 sec       | 1.4 GB
16 million       | 58 sec       | 2.8 GB
32 million       | 2 min        | 5.6 GB
64 million       | 4 min 30 sec | 11.2 GB
}
This is for a \link{GAlignments} object coming from a file with an
"average nb of records per unique QNAME" of 2.04. A value of 2 (which means
the file contains only primary reads) is optimal for the pairing algorithm.
A greater value, say > 3, will significantly degrade its performance.
An easy way to avoid this degradation is to load only primary alignments
by setting the \code{isNotPrimaryRead} flag to \code{FALSE} in
\code{ScanBamParam()}.
See examples in \code{?\link{readGAlignmentPairsFromBam}} for how to do this.
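As a rough check before pairing, the "average nb of records per unique
QNAME" can be computed from the names of the object. The sketch below uses a
hypothetical character vector \code{qnames} in place of \code{names(x)}:
\preformatted{
## Hypothetical read names standing in for names(x).
qnames <- c("r1", "r1", "r2", "r2", "r2", "r3", "r3")

## Average number of records per unique QNAME (7 records, 3 QNAMEs here).
avg <- length(qnames) / length(unique(qnames))
avg

## On a real GAlignments object 'x' loaded with use.names=TRUE, the same
## quantity would be:
##   length(x) / length(unique(names(x)))
}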
}
\subsection{Ambiguous pairing}{
The above algorithm will find almost all pairs unambiguously, even when
the same pair of reads maps to several places in the genome. Note
that, when a given pair maps to a single place in the genome, looking
at (A) is enough to pair the 2 corresponding records. The additional
conditions (B), (C), (D), (E), (F), and (G), are only here to help in
the situation where more than 2 records share the same QNAME. And that
works most of the time. Unfortunately, there are still situations where
this is not enough to solve the pairing problem unambiguously.
For example, here are 4 records (loaded in a GAlignments object of
length 4) that cannot be paired with the above algorithm:
\preformatted{
GAlignments with 4 alignments and 2 metadata columns:
                    seqnames strand       cigar    qwidth     start       end
                       <Rle>  <Rle> <character> <integer> <integer> <integer>
  SRR031714.2658602    chr2R      +  21M384N16M        37   6983850   6984270
  SRR031714.2658602    chr2R      +  21M384N16M        37   6983850   6984270
  SRR031714.2658602    chr2R      -  13M372N24M        37   6983858   6984266
  SRR031714.2658602    chr2R      -  13M378N24M        37   6983858   6984272
                        width     njunc |     mrnm      mpos
                    <integer> <integer> | <factor> <integer>
  SRR031714.2658602       421         1 |    chr2R   6983858
  SRR031714.2658602       421         1 |    chr2R   6983858
  SRR031714.2658602       409         1 |    chr2R   6983850
  SRR031714.2658602       415         1 |    chr2R   6983850
}
Note that the BAM fields show up in the following columns:
\itemize{
\item QNAME: the names of the GAlignments object (unnamed col)
\item RNAME: the seqnames col
\item POS: the start col
\item RNEXT: the mrnm col
\item PNEXT: the mpos col
}
As you can see, the aligner has aligned the same pair to the same
location twice! The only difference between the 2 aligned pairs is in
the CIGAR i.e. one end of the pair is aligned twice to the same location
with exactly the same CIGAR while the other end of the pair is aligned
twice to the same location but with slightly different CIGARs.
Now showing the corresponding flag bits:
\preformatted{
     isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand
[1,]        1            1               0               0             0
[2,]        1            1               0               0             0
[3,]        1            1               0               0             1
[4,]        1            1               0               0             1
     isMateMinusStrand isFirstMateRead isSecondMateRead isNotPrimaryRead
[1,]                 1               0                1                0
[2,]                 1               0                1                0
[3,]                 0               1                0                0
[4,]                 0               1                0                0
     isNotPassingQualityControls isDuplicate
[1,]                           0           0
[2,]                           0           0
[3,]                           0           0
[4,]                           0           0
}
As you can see, rec(1) and rec(2) are both second mates, and rec(3) and
rec(4) are both first mates. But looking at (A), (B), (C), (D), (E), (F),
and (G), the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they
could be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to
disambiguate!
\code{findMateAlignment} therefore ignores (with a warning) those alignments
with ambiguous pairing, and dumps them in a place from which they can be
retrieved later (i.e. after \code{findMateAlignment} has returned) for
further examination (see "Dumped alignments" subsection below for the details).
In other words, alignments that cannot be paired unambiguously are not paired
at all. Concretely, this means that \code{\link{readGAlignmentPairs}} is
guaranteed to return a \link{GAlignmentPairs} object
where every pair was formed in an unambiguous way. Note that, in practice,
this approach seems to leave few records aside because ambiguous pairing
events seem to be rare.
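The ambiguity in the 4 records shown above can be reproduced with a small
base R sketch. It is an illustration only, not the package's code: the
vectors and the helpers \code{bit} and \code{compatible} are hypothetical,
and the flags 163 and 83 are derived from the flag bits listed above.
Conditions (A) and (B) are omitted because QNAME, RNAME, and RNEXT are
identical across the 4 records:
\preformatted{
## Fields of the 4 records that matter for conditions (C) to (G).
## 163 = 0x1 + 0x2 + 0x20 + 0x80 and 83 = 0x1 + 0x2 + 0x10 + 0x40.
pos   <- c(6983850L, 6983850L, 6983858L, 6983858L)
pnext <- c(6983858L, 6983858L, 6983850L, 6983850L)
flag  <- c(163L, 163L, 83L, 83L)

## TRUE if flag bit 'b' is set in 'f'.
bit <- function(f, b) bitwAnd(f, b) != 0L

## TRUE if records i and j satisfy conditions (C) to (G).
compatible <- function(i, j) {
    i != j &&
    pnext[i] == pos[j] && pnext[j] == pos[i] &&                     # (C)
    bit(flag[i], 0x20L) == bit(flag[j], 0x10L) &&
    bit(flag[j], 0x20L) == bit(flag[i], 0x10L) &&                   # (D)
    ((bit(flag[i], 0x40L) && bit(flag[j], 0x80L)) ||
     (bit(flag[j], 0x40L) && bit(flag[i], 0x80L))) &&               # (E)
    bit(flag[i], 0x2L) == bit(flag[j], 0x2L) &&                     # (F)
    bit(flag[i], 0x100L) == bit(flag[j], 0x100L)                    # (G)
}

## Number of possible mates for each of the 4 records.
n_candidates <- sapply(1:4, function(i) {
    sum(sapply(1:4, function(j) compatible(i, j)))
})
n_candidates  # 2 2 2 2: every record has two possible mates
}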
}
\subsection{Dumped alignments}{
Alignments with ambiguous pairing are dumped in a place ("the dump
environment") from which they can be retrieved with
\code{getDumpedAlignments()} after \code{findMateAlignment} has returned.
Two additional utilities are provided for manipulation of the dumped
alignments: \code{countDumpedAlignments} for counting them (a fast equivalent
to \code{length(getDumpedAlignments())}), and \code{flushDumpedAlignments} to
flush "the dump environment". Note that "the dump environment" is
automatically flushed at the beginning of a call to \code{findMateAlignment}.
}
}
\value{
For \code{findMateAlignment}: An integer vector of the same length as
\code{x}, containing only positive or NA values, where the i-th element
is interpreted as follows:
\itemize{
\item An NA value means that no mate or more than 1 mate was found for
\code{x[i]}.
\item A non-NA value j gives the index in \code{x} of \code{x[i]}'s mate.
}
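As a minimal sketch of how such a vector can be read, the base R code below
lists each pair once from a hypothetical \code{mate} vector (the values are
made up for illustration, and this is not how \code{makeGAlignmentPairs}
builds its result):
\preformatted{
## Hypothetical return value for an object 'x' of length 5:
## x[1] <-> x[2], x[4] <-> x[5], and no unique mate for x[3].
mate <- c(2L, 1L, NA, 5L, 4L)

## List each pair once by keeping only the smaller index of the two.
first_idx <- which(!is.na(mate) & seq_along(mate) < mate)
pairs <- cbind(i = first_idx, j = mate[first_idx])
pairs

## Indices of the records left unpaired.
which(is.na(mate))
}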
For \code{makeGAlignmentPairs}: A \link{GAlignmentPairs} object where the
pairs are formed internally by calling \code{findMateAlignment} on \code{x}.
For \code{getDumpedAlignments}: \code{NULL} or a \link{GAlignments} object
containing the dumped alignments. See "Dumped alignments" subsection in
the "Details" section above for the details.
For \code{countDumpedAlignments}: The number of dumped alignments.
Nothing for \code{flushDumpedAlignments}.
}
\author{H. Pages}
\seealso{
\itemize{
\item \link{GAlignments} and \link{GAlignmentPairs} objects.
\item \code{\link{readGAlignmentsFromBam}} and
\code{\link{readGAlignmentPairsFromBam}}.
}
}
\examples{
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
x <- readGAlignmentsFromBam(bamfile, use.names=TRUE, param=param)
mate <- findMateAlignment(x)
head(mate)
table(is.na(mate))
galp0 <- makeGAlignmentPairs(x)
galp <- makeGAlignmentPairs(x, use.names=TRUE, use.mcols="flag")
galp
colnames(mcols(galp))
colnames(mcols(first(galp)))
colnames(mcols(last(galp)))
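
## A hedged sketch of the dump utilities described in the "Dumped
## alignments" subsection. Alignments with ambiguous pairing (if any) were
## dumped by the last call to findMateAlignment(), here the one made
## internally by makeGAlignmentPairs(). The variable name 'dumped' is
## only illustrative.
countDumpedAlignments()
dumped <- getDumpedAlignments()  # NULL if nothing was dumped
dumped
flushDumpedAlignments()          # empties the dump environment
countDumpedAlignments()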
}
\keyword{manip}