\name{foverlaps}
\alias{foverlaps}
\title{Fast overlap joins}
\description{
A \emph{fast} binary-search based \emph{overlap join} of two \code{data.table}s.
This is very much inspired by the \code{findOverlaps} function from the Bioconductor
package \code{IRanges} (see link below under \code{See Also}).

Usually, \code{x} is a very large \code{data.table} with small interval ranges, and
\code{y} is a much smaller \emph{keyed} \code{data.table} with relatively larger
interval spans. For a usage example in \code{genomics}, see the examples section.

NOTE: This is still under development, meaning it is stable, but some features
are yet to be implemented. Also, some arguments and/or the function name itself
could be changed.
}
\usage{
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
by.y = key(y), maxgap = 0L, minoverlap = 1L,
type = c("any", "within", "start", "end", "equal"),
mult = c("all", "first", "last"),
nomatch = getOption("datatable.nomatch", NA),
which = FALSE, verbose = getOption("datatable.verbose"))
}
\arguments{
\item{x, y}{ \code{data.table}s. \code{y} needs to be keyed, but not necessarily
\code{x}. See examples. }
\item{by.x, by.y}{A vector of column names (or numbers) on which to compute the
overlap joins. The last two columns in both \code{by.x} and \code{by.y} should each
correspond to the \code{start} and \code{end} interval columns in \code{x} and
\code{y} respectively. The \code{start} column should always be <= the \code{end}
column. \code{by.x} defaults to \code{key(x)} if \code{x} is keyed, else to
\code{key(y)}. \code{by.y} defaults to \code{key(y)}. }
\item{maxgap}{ A non-negative integer value (>= 0). Default is 0 (no gap). For
intervals \code{[a,b]} and \code{[c,d]}, where \code{a<=b} and \code{c<=d}, the
two intervals do not overlap when \code{c > b} or \code{d < a}. If the gap
between two such intervals is \code{<= maxgap}, they are nevertheless
considered as overlapping. Note: This is not yet implemented.}
\item{minoverlap}{ A positive integer value (> 0). Default is 1. For
intervals \code{[a,b]} and \code{[c,d]}, where \code{a<=b} and \code{c<=d}, the
two intervals overlap when \code{c<=b} and \code{d>=a}. They are considered to be
overlapping only if the length of that overlap is \code{>= minoverlap}.
Note: This is not yet implemented.}
\item{type}{ Default value is \code{any}. Allowed values are \code{any},
\code{within}, \code{start}, \code{end} and \code{equal}.
The types shown here are identical in functionality to those of the function
\code{findOverlaps} in the Bioconductor package \code{IRanges}. Let \code{[a,b]}
and \code{[c,d]} be intervals in \code{x} and \code{y} with \code{a<=b} and
\code{c<=d}. For \code{type="start"}, the intervals overlap iff \code{a == c}.
For \code{type="end"}, the intervals overlap iff \code{b == d}. For
\code{type="within"}, the intervals overlap iff \code{a>=c and b<=d}. For
\code{type="equal"}, the intervals overlap iff \code{a==c and b==d}. For
\code{type="any"}, as long as \code{c<=b and d>=a}, they overlap. In addition
to these requirements, they also have to satisfy the \code{minoverlap} argument
as explained above.
NB: \code{maxgap} argument, when > 0, is to be interpreted according to the type
of the overlap. This will be updated once \code{maxgap} is implemented.}
\item{mult}{ When multiple rows in \code{y} match a row in \code{x},
\code{mult=.} controls which values are returned: \code{"all"} (default),
\code{"first"} or \code{"last"}.}
\item{nomatch}{ When a row (with interval say, \code{[a,b]}) in \code{x} has no
match in \code{y}, \code{nomatch=NA} (default) means \code{NA} is returned for
\code{y}'s non-\code{by.y} columns for that row of \code{x}. \code{nomatch=NULL}
(or \code{0} for backward compatibility) means no rows will be returned for that
row of \code{x}. Use \code{options(datatable.nomatch=NULL)} to change the default
value (used when \code{nomatch} is not supplied).}
\item{which}{ When \code{TRUE} and \code{mult="all"}, returns a two-column
\code{data.table} with the first column corresponding to \code{x}'s row number
and the second corresponding to \code{y}'s. With \code{nomatch=NA}, rows of
\code{x} with no match return \code{NA} for \code{y}; with \code{nomatch=NULL},
those rows are skipped. If \code{mult="first"} or \code{mult="last"}, a vector of
length equal to the number of rows in \code{x} is returned, with no-match entries
filled with \code{NA} or \code{0} according to the \code{nomatch} argument.
Default is \code{FALSE}, which returns a join with the rows in \code{y}.}
\item{verbose}{ \code{TRUE} turns on status and information messages to the
console. Turn this on by default using \code{options(datatable.verbose=TRUE)}.
The quantity and types of verbosity may be expanded in future.}
}
\details{
Very briefly, \code{foverlaps()} collapses the two-column interval in \code{y}
into a single column of \emph{unique} values to generate a \code{lookup} table, and
then performs the join depending on the type of \code{overlap}, using the
already available \code{binary search} feature of \code{data.table}. The time
(and space) required to generate the \code{lookup} is therefore proportional
to the number of unique values present in the interval columns of \code{y}
when combined together.
Overlap joins take advantage of the fact that \code{y} is sorted to speed up
finding overlaps. Therefore \code{y} has to be keyed (see \code{?setkey})
prior to running \code{foverlaps()}. A key on \code{x} is not necessary,
although it \emph{might} speed things up further. The columns in the \code{by.x}
argument should correspond to the columns specified in \code{by.y}. The last
two columns should be the \emph{interval} columns in both \code{by.x} and
\code{by.y}. The first interval column in \code{by.x} should always be <= the
second interval column in \code{by.x}, and likewise for \code{by.y}. The
\code{\link{storage.mode}} of the interval columns must be either \code{double}
or \code{integer}. It therefore works with \code{bit64::integer64} type as well.
The \code{lookup} generation step could be quite time consuming if the number
of unique values in \code{y} is very large (e.g., on the order of tens of millions).
There might be improvements possible by constructing the lookup using RLE, which is
a pending feature request. However, most scenarios will not have too many unique
values for \code{y}.
}
\value{
A new \code{data.table} by joining over the interval columns (along with other
additional identifier columns) specified in \code{by.x} and \code{by.y}.
NB: When \code{which=TRUE}: \code{a)} \code{mult="first"} or \code{mult="last"}
returns a \code{vector} of matching row numbers in \code{y}, and \code{b)}
\code{mult="all"} returns a \code{data.table} with two columns, the first
containing row numbers of \code{x} and the second the corresponding
row numbers of \code{y}.
\code{nomatch} (\code{NA} or \code{NULL}) also influences whether non-matching
rows are returned or not, as explained above.
}
\examples{
require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
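## A hedged sketch, not how foverlaps() works internally: a brute-force check
## of the type="any" rule (c <= b and d >= a) on the 'x' and 'y' defined above.
## 'idx' and 'brute' are illustrative names only.
idx = foverlaps(x, y, type="any", which=TRUE, nomatch=NULL)
brute = CJ(xid = seq_len(nrow(x)), yid = seq_len(nrow(y)))   # all (x row, y row) pairs
brute = brute[y$start[yid] <= x$end[xid] & y$end[yid] >= x$start[xid]]
## compare the two sets of (xid, yid) pairs, ignoring row order
setequal(paste(idx$xid, idx$yid), paste(brute$xid, brute$yid))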
## with extra identifiers (ex: in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=NULL)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")
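## A small sketch of the 'which' and 'mult' interaction described under
## Arguments, using the 'x' and 'y' defined above: with which=TRUE,
## mult="first" or "last" give one y row number per row of x (a vector)
## instead of the two-column data.table returned for mult="all".
foverlaps(x, y, type="any", which=TRUE, mult="first")
foverlaps(x, y, type="any", which=TRUE, mult="last")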
## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
type="any", which=TRUE)
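## A hedged sketch of checking the preconditions listed under Details before
## calling foverlaps(), using the 'x' and 'y' defined above. These checks are
## illustrative only and are not part of foverlaps() itself.
haskey(y)                                 # y must be keyed
all(x$start <= x$end)                     # start <= end in x
all(y$start <= y$end)                     # start <= end in y
storage.mode(x$start)                     # should be "double" or "integer"
## rough proxy (an assumption) for the size of the lookup built from y:
## the number of unique values across y's two interval columns
length(unique(c(y$start, y$end)))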
}
\seealso{
\code{\link{data.table}},
\url{https://www.bioconductor.org/packages/release/bioc/html/IRanges.html},
\code{\link{setNumericRounding}}
}
\keyword{ data }