File: 02-Implementing_a_backend.Rmd

package info (click to toggle)
r-bioc-delayedarray 0.8.0%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 980 kB
  • sloc: ansic: 93; makefile: 2; sh: 1
file content (320 lines) | stat: -rw-r--r-- 11,503 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
---
title: "Implementing A DelayedArray Backend"
author:
- name: Hervé Pagès
  affiliation: Fred Hutchinson Cancer Research Center, Seattle, WA
output:
  BiocStyle::html_document
package: DelayedArray
vignette: |
  %\VignetteIndexEntry{Implementing A DelayedArray Backend}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


# Introduction

The DelayedArray framework currently supports the HDF5 on-disk backend
(via the _HDF5Array_ package) but can be extended to support other on-disk
backends, that is, to support other file formats. In theory, it should be
possible to implement a DelayedArray backend for any file format that has
the capability to store array data with fast random access.
Let's assume that the ADS format (Array Data Store) is such format (this
is a made-up format for the purpose of this vignette only). Implementing a
DelayedArray backend for ADS files should typically be done in a dedicated
package (say _ADSArray_) that will depend on the _DelayedArray_ package.

The _ADSArray_ package will need to implement:

- A low-level class for representing a reference to an array located in
  an ADS file. We'll refer to this class as "the seed class" and will
  name it ADSArraySeed.

- Two high-level classes that derive from DelayedArray: ADSArray and
  ADSMatrix. Only the latter is needed if the ADS format supports only
  2-dimensional arrays.

- A "realization sink" class if you also want to support realization of
  DelayedArray objects as ADSArray objects. This is not documented yet.

The rest of this document covers the above topics in greater details.
Some familiarity with writing R packages is assumed.
Don't hesitate to look at the source of the
[_HDF5Array_](https://github.com/Bioconductor/HDF5Array) package
for a real example of DelayedArray on-disk backend implementation.


# Implementing the seed class

## Class definition

A "seed object" should store at least the path or URL to the file. If the
file format allows storing more than one array per file, then the seed object
should also store any additional information needed to locate a particular
array in the file.

The definition of the seed class will look something like this:

    setClass("ADSArraySeed",
        contains="Array",
        slots=c(
            filepath="character",
            ...
            ... additional slots needed
            ... to locate the array in the file
            ...
        )
    )

The `filepath` slot should be a single string that contains the absolute
path to the ADS file so the object doesn't break when the user changes
the working directory (e.g. with `setwd()`).

Note that storing an open connection to the file should be avoided because
connections don't work properly in the context of a fork (e.g. when
processing the seed object in parallel) and tend to break when serializing
the object.

## Constructor

It is highly recommended to provide a "seed constructor" e.g.:

    ADSArraySeed <- function(filepath, other args)
    {
        sanity checks
        ...
        filepath <- file_path_as_absolute(filepath)
        ...
        new("ADSArraySeed", filepath=filepath, other args)
    }

Note that `file_path_as_absolute()` is defined in the _tools_ package
so it needs to be imported by adding the following to the NAMESPACE file
of the _ADSArray_ package:

    importFrom(tools, file_path_as_absolute)

and adding _tools_ to the `Imports` field of the DESCRIPTION file of the
package.

## The seed contract

Seed objects are expected to comply with the "seed contract" i.e. to
support `dim()`, `dimnames()`, and `extract_array()`. This is normally
done by implementing methods for these generics, but, as we will
see below, a method is rarely needed for `dim()` or `dimnames()`.

For example, the `dim` method for ADSArraySeed objects could look like
this:

    ### An implementation that extracts the dimensions from the file
    ### each time the method is called.
    setMethod("dim", "ADSArraySeed",
        function(x)
        {
            - open the connection to the file
            - on.exit(close the connection)
            - extract the dimensions and return them in an integer vector
        }
    )

Note that the above `dim` method consults the ADS file each time it's
called. However this can be avoided by adding a `dim` (and `dimnames`)
slot (of type `integer` for `dim`, of type `list` for `dimnames`) to
the ADSArraySeed class, and to populate it at construction time, so this
information is retrieved from the file only once. With this approach,
the `dim` and  `dimnames` methods are actually not needed, because, by
default, the `dim` and `dimnames` primitive functions return the content
of these slots if present.

If the ADS format does not allow storage of the dimnames, then there
is no need to implement a `dimnames` method or to add a `dimnames` slot
to the ADSArraySeed class.

`extract_array` is a generic function defined in the _DelayedArray_ package:

    library(DelayedArray)
    ?extract_array

It takes 2 arguments: `x` and `index`. `x` is the seed object
to extract array values from. `index` must be an unnamed list of
subscripts as positive integer vectors, one vector per seed dimension.
Empty and missing subscripts (represented by `integer(0)` and `NULL` list
elements, respectively) are allowed. The subscripts in `index` can contain
duplicated indices. They cannot contain NAs or non-positive values.

The `extract_array` method must return an *ordinary* array of the
appropriate type (i.e. `integer`, `double`, etc...). For example, if
`x` is an ADSArraySeed object representing an M x N on-disk matrix
of complex numbers, `extract_array(x, list(NULL, 2L))` must
return its 2nd column as an *ordinary* M x 1 matrix of type `complex`.

Note that the `extract_array` method needs to support empty and missing
subscripts e.g. `extract_array(x, list(NULL, integer(0)))` must return
an M x 0 matrix of type `complex` and
`extract_array(x, list(integer(0), integer(0)))` a 0 x 0 matrix of
type `complex`. This last edge case is important because the `type`
and `show` methods for DelayedArray objects rely on it to work.
More precisely, once the `extract_array` method supports an `index`
with empty integer vectors, the following should work:

    seed <- ADSArraySeed(...)
    M <- DelayedArray(seed)
    type(M)
    show(M)

Finally note that subscripts are allowed to contain duplicated indices
so things like `extract_array(seed, list(c(1:3, 3:1), 2L))` need to be
supported.

## What to import?

Make sure the NAMESPACE file of the _ADSArray_ package contains at least
the following imports:

    import(methods)
    importFrom(tools, file_path_as_absolute)
    
    import(BiocGenerics)
    import(S4Vectors)
    import(IRanges)
    import(DelayedArray)

Unless you have a good reason for it, don't try to selectively import
things from the _methods_, _BiocGenerics_, _S4Vectors_, _IRanges_, and
_DelayedArray_ packages. This will only complicate maintenance of the
_ADSArray_ package in the long run and has no real benefits (contrary
to popular belief).

Add _methods_, _BiocGenerics_, and _DelayedArray_ to the `Depends` field
of the DESCRIPTION file of the package, and _tools_, _S4Vectors_, and
_IRanges_ to its `Imports` field.

## Testing

Make sure to export the ADSArraySeed class, its constructor, and the
`dim`, `dimnames`, and `extract_array` methods.

At this point, you should be able to wrap an ADSArraySeed object `seed`
in a DelayedArray object with `DelayedArray(seed)`, and this should return
a fully functional DelayedArray object.


# Implementing high-level classes ADSArray and ADSMatrix

These classes are not strictly needed but add a nice level of convenience.

## Class definitions

An ADSArray or ADSMatrix object is a DelayedArray derivative that doesn't
carry delayed operations yet. As soon as the user will start operating on it,
it will be degraded to a DelayedArray *instance*.

The ADSArray and ADSMatrix classes should extend the DelayedArray and
DelayedMatrix classes, respectively, without adding any slot to them.
So just:

    setClass("ADSArray", contains="DelayedArray")

    setClass("ADSMatrix", contains=c("DelayedMatrix", "ADSArray"))

## Going from ADSArray to ADSMatrix

Define a `matrixClass` method for ADSArray objects as follow:

    setMethod("matrixClass", "ADSArray", function(x) "ADSMatrix")

`matrixClass` is a generic function defined in the _DelayedArray_ package.
When passed an ADSArraySeed object, low-level constructor `new_DelayedArray`
(see below) will generally return an ADSArray *instance*, except when the
ADSArraySeed object is 2-dimensional, in which case it needs to return an
ADSMatrix *instance*. It will obtain the name of the class of the object to
return (`"ADSMatrix"` in this case) by calling `matrixClass`.

Also coercion from ADSArray to ADSMatrix needs to be supported with:

    setAs("ADSArray", "ADSMatrix", function(from) new("ADSMatrix", from))

This coercion will make sure that the end-user gets the following error
when trying to coerce an ADSArray object that is not 2-dimensional to
ADSMatrix:

    as(x, "ADSMatrix")
    # Error in validObject(.Object) : invalid class "ADSMatrix" object:
    #     'x' must have exactly 2 dimensions

Without the above coercion method, `as(x, "ADSMatrix")` would silently
return an invalid ADSMatrix object.

## Going from ADSMatrix to ADSArray

The user should not be able to degrade an ADSMatrix object to an ADSArray
object so `as(x, "ADSArray", strict=TRUE)` should fail or be a no-op
when `x` is an ADSMatrix object. The easiest (and recommended) way to
achieve this is to define the following coercion method:

    setAs("ADSMatrix", "ADSArray", function(from) from)  # no-op

## Constructor

Add a `DelayedArray` method for ADSArraySeed objects that does:

    setMethod("DelayedArray", "ADSArraySeed",
        function(seed) new_DelayedArray(seed, Class="ADSArray")
    )

Now you should be able to construct an ADSArray object with:

    DelayedArray(ADSArraySeed(...))

The `ADSArray` constructor should just do that:

    ADSArray <- function(filepath, other args)
        DelayedArray(ADSArraySeed(filepath, other args))

However, it's also nice to be able to pass an ADSArraySeed object to
this constructor (with `ADSArray(seed)`). This can easily be supported
with something like:

    ### Works directly on an ADSArraySeed object, in which case it must be
    ### called with a single argument.
    ADSArray <- function(filepath, other args)
    {
        if (is(filepath, "ADSArraySeed")) {
            if (!(missing(other arg1) && missing(other arg2) && ...))
                stop(wmsg("ADSArray() must be called with a single argument ",
                          "when passed an ADSArraySeed object"))
            seed <- filepath
        } else {
            seed <- ADSArraySeed(filepath, other args)
        }
        DelayedArray(seed)
    }

## Validity method

It's also highly recommended to define a validity method for ADSArray objects:

    .validate_ADSArray <- function(x)
    {
        if (!is(x@seed, "ADSArraySeed"))
            return(wmsg("'x@seed' must be an ADSArraySeed object"))
        TRUE
    }

    setValidity2("ADSArray", .validate_ADSArray)

## What to export?

Make sure to export the ADSArray and ADSMatrix classes, the `ADSArray`
constructor, and the `coerce` methods.


# Testing

Install the _ADSArray_ package and load it in a fresh R session:

    library(ADSArray)
    ... coming soon ...