\name{foverlaps}
\alias{foverlaps}
\title{Fast overlap joins}
\description{
A \emph{fast} binary-search based \emph{overlap join} of two \code{data.table}s.
This is very much inspired by the \code{findOverlaps} function from the Bioconductor
package \code{IRanges} (see link below under \code{See Also}).

Usually, \code{x} is a very large \code{data.table} with small interval ranges, and
\code{y} is a much smaller \emph{keyed} \code{data.table} with relatively larger
interval spans. For an example of usage in genomics, see the examples section.

NOTE: This is still under development, meaning it is stable, but some features
are yet to be implemented. Also, some arguments and/or the function name itself
could be changed.
}

\usage{
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y),
    by.y = key(y), maxgap = 0L, minoverlap = 1L,
    type = c("any", "within", "start", "end", "equal"),
    mult = c("all", "first", "last"),
    nomatch = getOption("datatable.nomatch", NA),
    which = FALSE, verbose = getOption("datatable.verbose"))
}
\arguments{
\item{x, y}{ \code{data.table}s. \code{y} needs to be keyed, but not necessarily
\code{x}. See examples. }
\item{by.x, by.y}{A vector of column names (or numbers) on which to compute the
overlap joins. The last two columns in both \code{by.x} and \code{by.y} should
correspond to the \code{start} and \code{end} interval columns in \code{x} and
\code{y} respectively, and each \code{start} value should always be <= the
corresponding \code{end} value. \code{by.x} defaults to \code{key(x)} if
\code{x} is keyed, else to \code{key(y)}. \code{by.y} defaults to \code{key(y)}. }
\item{maxgap}{A non-negative integer value (>= 0). Default is 0 (no
gap). For intervals \code{[a,b]} and \code{[c,d]}, where \code{a<=b} and
\code{c<=d}, when \code{c > b} or \code{d < a}, the two intervals don't overlap.
If the gap between them is \code{<= maxgap}, they are nevertheless
considered to be overlapping. Note: This is not yet implemented.}
\item{minoverlap}{ A positive integer value (> 0). Default is 1. For
intervals \code{[a,b]} and \code{[c,d]}, where \code{a<=b} and \code{c<=d}, when
\code{c<=b} and \code{d>=a}, the two intervals overlap. They are considered to be
overlapping only if the length of the overlap between them is
\code{>= minoverlap}. Note: This is not yet implemented.}
\item{type}{ Default value is \code{any}. Allowed values are \code{any},
\code{within}, \code{start}, \code{end} and \code{equal}.

The types shown here are identical in functionality to those of the
\code{findOverlaps} function in the Bioconductor package \code{IRanges}. Let \code{[a,b]}
and \code{[c,d]} be intervals in \code{x} and \code{y} with \code{a<=b} and
\code{c<=d}. For \code{type="start"}, the intervals overlap iff \code{a == c}.
For \code{type="end"}, the intervals overlap iff \code{b == d}. For
\code{type="within"}, the intervals overlap iff \code{a>=c and b<=d}. For
\code{type="equal"}, the intervals overlap iff \code{a==c and b==d}. For
\code{type="any"}, as long as \code{c<=b and d>=a}, they overlap. In addition
to these requirements, they also have to satisfy the \code{minoverlap} argument
as explained above.

NB: \code{maxgap} argument, when > 0, is to be interpreted according to the type
of the overlap. This will be updated once \code{maxgap} is implemented.}

\item{mult}{ When multiple rows in \code{y} match a row in \code{x},
\code{mult} controls which of them are returned: \code{"all"} (default),
\code{"first"} or \code{"last"}.}
\item{nomatch}{ When a row (with interval say, \code{[a,b]}) in \code{x} has no
match in \code{y}, \code{nomatch=NA} (default) means \code{NA} is returned for
\code{y}'s non-\code{by.y} columns for that row of \code{x}. \code{nomatch=NULL}
(or \code{0} for backward compatibility) means no rows will be returned for that
row of \code{x}. Use \code{options(datatable.nomatch=NULL)} to change the default
value (used when \code{nomatch} is not supplied).}
\item{which}{ When \code{TRUE} and \code{mult="all"}, returns a two-column
\code{data.table} with the first column corresponding to \code{x}'s row number
and the second corresponding to \code{y}'s. When \code{nomatch=NA}, rows of
\code{x} with no match have \code{NA} for \code{y}; when \code{nomatch=NULL},
those rows are skipped. When \code{TRUE} and \code{mult="first"} or
\code{"last"}, returns a vector of
length equal to the number of rows in \code{x}, with no-match entries
filled with \code{NA} or \code{0} corresponding to the \code{nomatch} argument.
Default is \code{FALSE}, which returns a join with the rows in \code{y}.}
\item{verbose}{ \code{TRUE} turns on status and information messages to the
console. Turn this on by default using \code{options(datatable.verbose=TRUE)}.
The quantity and types of verbosity may be expanded in future.}
}
\details{
Very briefly, \code{foverlaps()} collapses the two-column interval in \code{y}
into a single column of \emph{unique} values to generate a \code{lookup} table, and
then performs the join depending on the type of \code{overlap}, using the
already available \code{binary search} feature of \code{data.table}. The time
(and space) required to generate the \code{lookup} is therefore proportional
to the number of unique values present in the interval columns of \code{y}
when combined together.

Overlap joins take advantage of the fact that \code{y} is sorted to speed up
finding overlaps. Therefore \code{y} has to be keyed (see \code{?setkey})
prior to running \code{foverlaps()}. A key on \code{x} is not necessary,
although it \emph{might} speed things up further. The columns in the \code{by.x}
argument should correspond to the columns specified in \code{by.y}. The last
two columns should be the \emph{interval} columns in both \code{by.x} and
\code{by.y}. The first interval column in \code{by.x} should always be <= the
second interval column in \code{by.x}, and likewise for \code{by.y}. The
\code{\link{storage.mode}} of the interval columns must be either \code{double}
or \code{integer}. It therefore works with \code{bit64::integer64} type as well.

The \code{lookup} generation step could be quite time consuming if the number
of unique values in \code{y} is very large (e.g. on the order of tens of millions).
There might be improvements possible by constructing the lookup using RLE, which is
a pending feature request. However, most scenarios will not have too many unique
values for \code{y}.
}
\value{
A new \code{data.table} by joining over the interval columns (along with other
additional identifier columns) specified in \code{by.x} and \code{by.y}.

NB: When \code{which=TRUE}: a) \code{mult="first"} or \code{"last"} returns a
\code{vector} of matching row numbers in \code{y}, and b)
\code{mult="all"} returns a \code{data.table} with two columns, the first
containing row numbers of \code{x} and the second the corresponding
row numbers of \code{y}.

\code{nomatch} (\code{NA}, or \code{NULL}/\code{0}) also determines whether
non-matching rows are returned, as explained above.
}

\examples{
require(data.table)
## simple example:
x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10)
y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3)
setkey(y, start, end)
foverlaps(x, y, type="any", which=TRUE) ## return overlap indices
foverlaps(x, y, type="any") ## return overlap join
foverlaps(x, y, type="any", mult="first") ## returns only first match
foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'
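
## Sketch: cross-check type="any" against the rule given under the 'type'
## argument (c <= b and d >= a) by brute force over all row pairs.
## 'qx', 'qy', 'ov', 'bf' and 'keep' are illustrative names only; they are
## not part of the package.
qx = data.table(start=c(1L, 6L, 12L), end=c(4L, 9L, 20L))
qy = data.table(start=c(3L, 8L), end=c(7L, 15L))
setkey(qy, start, end)
ov = foverlaps(qx, qy, type="any", which=TRUE, nomatch=NULL)
ov = ov[order(xid, yid)]
## every (x row, y row) pair, filtered with the overlap condition
bf = CJ(xid = seq_len(nrow(qx)), yid = seq_len(nrow(qy)))
keep = qy$start[bf$yid] <= qx$end[bf$xid] & qy$end[bf$yid] >= qx$start[bf$xid]
bf = bf[keep]
ov ## should list the same (xid, yid) pairs as 'bf'
bf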

## with extra identifiers (e.g. in genomics)
x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, type="any", which=TRUE)
foverlaps(x, y, type="any")
foverlaps(x, y, type="any", nomatch=NULL)
foverlaps(x, y, type="within", which=TRUE)
foverlaps(x, y, type="within")
foverlaps(x, y, type="start")
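
## Sketch: how 'which', 'mult' and 'nomatch' change the shape of the result.
## 'ix', 'iy' and the 'idx_*' names are illustrative only.
ix = data.table(start=c(1, 10, 40), end=c(5, 30, 45))
iy = data.table(start=c(8, 20, 25), end=c(12, 28, 35))
setkey(iy, start, end)
## two-column data.table (xid, yid); unmatched rows of 'ix' get NA in yid
idx_all = foverlaps(ix, iy, type="any", which=TRUE)
## same, but unmatched rows of 'ix' are dropped
idx_drop = foverlaps(ix, iy, type="any", which=TRUE, nomatch=NULL)
## vectors with one entry per row of 'ix'
idx_first = foverlaps(ix, iy, type="any", which=TRUE, mult="first")
idx_last = foverlaps(ix, iy, type="any", which=TRUE, mult="last")
idx_all
idx_drop
idx_first
idx_last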

## x and y have different column names - specify by.x
x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"),
               start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60))
y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1),
               end=c(4, 18, 55), geneid=letters[1:3])
setkey(y, chr, start, end)
foverlaps(x, y, by.x=c("seq", "start", "end"),
            type="any", which=TRUE)
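
## Sketch: looking up single positions in intervals. This assumes a position
## can be treated as a zero-width interval (start == end), which satisfies
## the start <= end requirement. 'genes', 'snps', 'pos' and 'hits' are
## hypothetical names used only for illustration.
genes = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15, 1),
                   end=c(4, 18, 55), geneid=letters[1:3])
setkey(genes, chr, start, end)
snps = data.table(chr=c("Chr1", "Chr1", "Chr2"), pos=c(3, 16, 60))
snps[, c("start", "end") := list(pos, pos)]
## keep only the positions that fall inside an interval of 'genes'
hits = foverlaps(snps, genes, by.x=c("chr", "start", "end"),
                 type="within", nomatch=NULL)
hits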
}
\seealso{
\code{\link{data.table}},
\url{https://www.bioconductor.org/packages/release/bioc/html/IRanges.html},
\code{\link{setNumericRounding}}
}
\keyword{ data }