1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377
|
---
title: "Implementing A DelayedArray Backend"
author:
- name: Hervé Pagès
affiliation: Fred Hutchinson Cancer Research Center, Seattle, WA
output:
BiocStyle::html_document
package: DelayedArray
vignette: |
%\VignetteIndexEntry{Implementing A DelayedArray Backend}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Introduction
The DelayedArray framework currently supports a small number of on-disk
backends: HDF5 (via the _HDF5Array_ package), GDS (via the _GDSArray_ package),
and VCF (via the _VCFArray_ package). This can be extended to support other
on-disk backends. In theory, it should be possible to implement a DelayedArray
backend for any file format that has the capability to store array data with
fast random access.
Let's assume that the ADS format (Array Data Store) is such format (this
is a made-up format for the purpose of this vignette only). Implementing a
DelayedArray backend for ADS files should typically be done in a dedicated
package (say _ADSArray_) that will depend on the _DelayedArray_ package.
The _ADSArray_ package will need to implement:
- A low-level class for representing a reference to an array located in
an ADS file. We'll refer to this class as "the seed class" and will
name it ADSArraySeed.
- Two high-level classes that derive from DelayedArray: ADSArray and
ADSMatrix. Only the latter is needed if the ADS format only supports
2-dimensional arrays.
- A "realization sink" class if you also want to support realization of
DelayedArray objects as ADSArray objects. This is not documented yet.
The rest of this document covers the above topics in greater details.
Some familiarity with writing R packages is assumed.
Don't hesitate to look at the source of the
[_HDF5Array_](https://github.com/Bioconductor/HDF5Array) package
for a real example of DelayedArray on-disk backend implementation.
# Implementing the seed class
## Class definition
A "seed object" should store at least the path or URL to the file. If the
file format allows storing more than one array per file, then the seed object
should also store any additional information needed to locate a particular
array in the file.
The definition of the seed class will look something like this:
setClass("ADSArraySeed",
contains="Array",
slots=c(
filepath="character",
...
... additional slots needed
... to locate the array in the file
...
)
)
The `filepath` slot should be a single string that contains the absolute
path to the ADS file so the object doesn't break when the user changes
the working directory (e.g. with `setwd()`).
Note that storing an open connection to the file should be avoided because
connections don't work properly in the context of a fork (e.g. when
processing the seed object in parallel) and tend to break when serializing
the object.
## Constructor
It is highly recommended to provide a "seed constructor" e.g.:
ADSArraySeed <- function(filepath, other args)
{
sanity checks
...
filepath <- file_path_as_absolute(filepath)
...
new("ADSArraySeed", filepath=filepath, other args)
}
Note that `file_path_as_absolute()` is defined in the _tools_ package
so it needs to be imported by adding the following to the NAMESPACE file
of the _ADSArray_ package:
importFrom(tools, file_path_as_absolute)
and adding _tools_ to the `Imports` field of the DESCRIPTION file of the
package.
## The seed contract
Seed objects are expected to comply with the "seed contract" i.e. to
support `dim()`, `dimnames()`, and `extract_array()`. This is normally
done by implementing methods for these generics, but, as we will
see below, explicitly defining `dim()` or `dimnames()` methods is
rarely needed.
### dim() and dimnames()
For example, the `dim()` method for ADSArraySeed objects could look
like this:
### An implementation that extracts the dimensions from the file
### each time the method is called.
setMethod("dim", "ADSArraySeed",
function(x)
{
- open the connection to the file
- on.exit(close the connection)
- extract the dimensions and return them in an integer vector
}
)
Note that the above `dim()` method consults the ADS file each time it's
called. However this can be avoided by adding a `dim` (and `dimnames`)
slot (of type `integer` for `dim`, of type `list` for `dimnames`) to
the ADSArraySeed class, and to populate it at construction time, so this
information is retrieved from the file only once. With this approach
`dim()` and `dimnames()` work out-of-the-box on ADSArraySeed objects
i.e. there is no need to define `dim()` and `dimnames()` methods for
these objects. This is because the `dim()` and `dimnames()` primitive
functions in base R return the content of these slots if present.
If the ADS format does not allow storage of the dimnames, then there
is no need to implement a `dimnames()` method or to add a `dimnames()`
slot to the ADSArraySeed class. Calling `dimnames(x)` then will simply
return `NULL` for any ADSArraySeed object `x`.
If the ADS format allows storage of the dimnames, make sure that `dimnames()`
always returns them in the _standard form_, that is:
- The dimnames must be returned as a `NULL` (if the dataset has no dimnames)
or as an ordinary list with one list element per dimension in the dataset.
- Each element in the returned list is either `NULL` or a character vector
of length the extend of the dataset along the corresponding dimension.
It is particularly important to make sure that the vectors in the list
returned by `dimnames()` are character vectors. Other types like factors
or integer vectors are not allowed and _will_ break downstream code.
### extract\_array()
`extract_array()` is a generic function defined in the _DelayedArray_
package:
library(DelayedArray)
?extract_array
It takes 2 arguments: `x` and `index`. `x` is the seed object
to extract array values from. `index` must be an unnamed list of
subscripts as positive integer vectors, one vector per seed dimension.
Empty and missing subscripts (represented by `integer(0)` and `NULL` list
elements, respectively) are allowed. The subscripts in `index` can contain
duplicated indices. They cannot contain NAs or non-positive values.
The `extract_array()` method must return an _ordinary array_ of the
appropriate type (i.e. `integer`, `double`, etc...). For example, if
`x` is an ADSArraySeed object representing an M x N on-disk matrix
of complex numbers, `extract_array(x, list(NULL, 2L))` must
return its 2nd column as an M x 1 _ordinary matrix_ of type `complex`.
Note that the `extract_array()` method needs to support empty and missing
subscripts e.g. `extract_array(x, list(NULL, integer(0)))` must return
an M x 0 matrix of type `complex` and
`extract_array(x, list(integer(0), integer(0)))` a 0 x 0 matrix of
type `complex`. This last edge case is important because the `type()`
and `show()` methods for DelayedArray objects rely on it to work.
More precisely, once the `extract_array()` method supports an `index`
with empty integer vectors, the following should work:
seed <- ADSArraySeed(...)
M <- DelayedArray(seed)
type(M)
show(M)
Finally note that subscripts are allowed to contain duplicated indices
so things like `extract_array(seed, list(c(1:3, 3:1), 2L))` need to be
supported.
## What to import?
Make sure the NAMESPACE file of the _ADSArray_ package contains at least
the following imports:
import(methods)
importFrom(tools, file_path_as_absolute)
import(BiocGenerics)
import(S4Vectors)
import(IRanges)
import(DelayedArray)
Unless you have a good reason for it, don't try to selectively import
things from the _methods_, _BiocGenerics_, _S4Vectors_, _IRanges_, and
_DelayedArray_ packages. This will only complicate maintenance of the
_ADSArray_ package in the long run and has no real benefits (contrary
to popular belief).
Add _methods_, _BiocGenerics_, and _DelayedArray_ to the `Depends` field
of the DESCRIPTION file of the package, and _tools_, _S4Vectors_, and
_IRanges_ to its `Imports` field.
## Testing
Make sure to export the ADSArraySeed class, its constructor, and the
`dim`, `dimnames`, and `extract_array` methods.
At this point, you should be able to wrap an ADSArraySeed object `seed`
in a DelayedArray object with `DelayedArray(seed)`, and this should return
a fully functional DelayedArray object.
# Implementing high-level classes ADSArray and ADSMatrix
These classes are not strictly needed but add a nice level of convenience.
## ADSArray class definition
An ADSArray or ADSMatrix object is a DelayedArray derivative that doesn't
carry delayed operations yet. As soon as the user will start operating on it,
it will be degraded to a DelayedArray _instance_.
The ADSArray and ADSMatrix classes should extend the DelayedArray and
DelayedMatrix classes, respectively, without adding any slot to them.
So just:
setClass("ADSArray",
contains="DelayedArray",
representation(seed="ADSArraySeed")
)
We'll define the ADSMatrix class later.
## The ADSArray() constructor
Add a `DelayedArray()` method for ADSArraySeed objects that does:
setMethod("DelayedArray", "ADSArraySeed",
function(seed) new_DelayedArray(seed, Class="ADSArray")
)
Now you should be able to construct an ADSArray object with:
DelayedArray(ADSArraySeed(...))
The `ADSArray` constructor should just do that:
ADSArray <- function(filepath, other args)
DelayedArray(ADSArraySeed(filepath, other args))
However, it's also nice to be able to pass an ADSArraySeed object to
this constructor (with `ADSArray(seed)`). This can easily be supported
with something like:
### Works directly on an ADSArraySeed object, in which case it must be
### called with a single argument.
ADSArray <- function(filepath, other args)
{
if (is(filepath, "ADSArraySeed")) {
if (!(missing(other arg1) && missing(other arg2) && ...))
stop(wmsg("ADSArray() must be called with a single argument ",
"when passed an ADSArraySeed object"))
seed <- filepath
} else {
seed <- ADSArraySeed(filepath, other args)
}
DelayedArray(seed)
}
## ADSMatrix class definition
setClass("ADSMatrix", contains=c("ADSArray", "DelayedMatrix"))
## Going from ADSArray to ADSMatrix
Define a `matrixClass()` method for ADSArray objects as follow:
setMethod("matrixClass", "ADSArray", function(x) "ADSMatrix")
`matrixClass()` is a generic function defined in the _DelayedArray_ package.
When passed an ADSArraySeed object, low-level constructor `new_DelayedArray`
(see below) will generally return an ADSArray _instance_, except when the
ADSArraySeed object is 2-dimensional, in which case it needs to return an
ADSMatrix _instance_. It will obtain the name of the class of the object to
return (`"ADSMatrix"` in this case) by calling `matrixClass`.
Also coercion from ADSArray to ADSMatrix needs to be supported with:
setAs("ADSArray", "ADSMatrix", function(from) new("ADSMatrix", from))
This coercion will make sure that the end user gets the following error
when trying to coerce an ADSArray object that is not 2-dimensional to
ADSMatrix:
as(x, "ADSMatrix")
# Error in validObject(.Object) : invalid class "ADSMatrix" object:
# 'x' must have exactly 2 dimensions
Without the above coercion method, `as(x, "ADSMatrix")` would silently
return an invalid ADSMatrix object.
## Going from ADSMatrix to ADSArray
The user should not be able to degrade an ADSMatrix object to an ADSArray
object so `as(x, "ADSArray", strict=TRUE)` should fail or be a no-op
when `x` is an ADSMatrix object. The easiest (and recommended) way to
achieve this is to define the following coercion method:
setAs("ADSMatrix", "ADSArray", function(from) from) # no-op
## Implementing optimized backend-specific methods
It is possible, and enouraged, to overwrite current DelayedArray
block-processed operations (e.g. `max`, `colSums`, `%*%`, etc...) with
optimized backend-specific methods. For example, let's imagine that ADS
files have the capability to store some precomputed stats about the dataset.
Then one could define a fast `max()` method for ADSArray objects with
something like:
setMethod("max", "ADSArraySeed",
function(x, na.rm=FALSE)
{
get the precomputed max from the file
}
)
setMethod("max", "ADSArray",
function(x, na.rm=FALSE) max(x@seed, na.rm=na.rm)
)
Note that delayed operations like setting dimnames on an ADSArray object
(with `dimnames(A) <- new_dimnames`) or transposing an ADSMatrix object
(with `M2 <- t(M)`) will degrade the object to a DelayedArray or DelayedMatrix
_instance_, causing `max(A)` and `max(M2)` to use the far less efficient
block-processed `max()` method defined for DelayedArray objects. There is
clearly room for improvement here and work will be done in the near future
to make the `max()` method (and other block-processed methods) for
DelayedArray objects try to take advantage of the backend-specific methods
whenever it can.
However in the meantime, backend authors should resist the temptation to
overwrite the `dimnames<-()` and `t()` methods for DelayedArray objects with
backend-specific methods that modify the seed. This would be a violation
of the "never touch the seed" principle which is central to the DelayedArray
framework. More precisely, no matter what delayed operations are performed
on a DelayedArray object, the seeds of the result should always be identical
to the original seeds (e.g. `seed(t(M))` should always be identical to
`seed(M)`).
## What to export?
Make sure to export the ADSArray and ADSMatrix classes, the `ADSArray`
constructor, the `coerce` methods, and any backend-specific method.
# Testing
Install the _ADSArray_ package and load it in a fresh R session:
library(ADSArray)
... coming soon ...
|