File: DelayedMatrixStatsOverview.Rmd

package info (click to toggle)
r-bioc-delayedmatrixstats 1.28.1%2Bds-1
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 620 kB
  • sloc: makefile: 2
file content (341 lines) | stat: -rw-r--r-- 13,815 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: "Overview of DelayedMatrixStats"
author: "Peter Hickey"
date: "Modified: 06 Feb 2021 Compiled: `r format(Sys.Date(), '%d %b %Y')`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Overview of DelayedMatrixStats}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE, setup}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", collapse = TRUE,
                      message = FALSE)
library(BiocStyle)
```

# Overview

`r Biocpkg("DelayedMatrixStats")` ports the `r CRANpkg("matrixStats")` API 
to work with *DelayedMatrix* objects from the `r Biocpkg("DelayedArray")` 
package. It provides high-performing functions operating on rows and columns of 
*DelayedMatrix* objects, including all subclasses such as *RleArray* (from the 
`r Biocpkg("DelayedArray")` package) and *HDF5Array* (from the 
`r Biocpkg("HDF5Array")`) as well as supporting all types of *seeds*, such as 
*matrix* (from the *base* package) and *Matrix* (from the `r CRANpkg("Matrix")` 
package).

# How can DelayedMatrixStats help me?

The `r Biocpkg("DelayedArray")` package allows developers to store array-like 
data using in-memory or on-disk representations (e.g., in HDF5 files) and 
provides a common and familiar array-like interface for interacting with these 
data.

The `r Biocpkg("DelayedMatrixStats")` package is designed to make life easier 
for Bioconductor developers wanting to use `r Biocpkg("DelayedArray")` by 
providing a rich set of column-wise and row-wise summary functions. 

We briefly demonstrate and explain these two features using a simple example.
We'll simulate some (unrealistic) RNA-seq read counts data from 10,000 genes 
and 20 samples and store it on disk as a *HDF5Array*:

```{r data_sim, message = FALSE}
library(DelayedArray)

x <- do.call(cbind, lapply(1:20, function(j) {
  rpois(n = 10000, lambda = sample(20:40, 10000, replace = TRUE))
}))
colnames(x) <- paste0("S", 1:20)
x <- realize(x, "HDF5Array")
x
```

Suppose you wish to compute the standard deviation of the read counts for each 
gene. 

You might think to use `apply()` like in the following:

```{r apply}
system.time(row_sds <- apply(x, 1, sd))
head(row_sds)
```

This works, but takes quite a while.

Or perhaps you already know that the `r CRANpkg("matrixStats")` package 
provides a `rowSds()` function:

```{r matrixStats, error = TRUE}
matrixStats::rowSds(x)
```

Unfortunately (and perhaps unsurprisingly) this doesn't work. 
`r CRANpkg("matrixStats")` is designed for use on in-memory *matrix* objects. 
Well, why don't we just first realize our data in-memory and then use 
`r CRANpkg("matrixStats")`

```{r realization}
system.time(row_sds <- matrixStats::rowSds(as.matrix(x)))
head(row_sds)
```

This works and is many times faster than the `apply()`-based approach! However, 
it rather defeats the purpose of using a *HDF5Array* for storing the 
data since we have to bring all the data into memory at once to compute the 
result. 

Instead, we can use `DelayedMatrixStats::rowSds()`, which has the speed 
benefits of `matrixStats::rowSds()`[^speed] but without having to load the 
entire data into memory at once[^block_size]:

[^speed]: In fact, it currently uses `matrixStats::rowSds()` under the hood.
[^block_size]: In this case, it loads blocks of data row-by-row. The amount of 
data loaded into memory at any one time is controlled by the 
*default block size* global setting; see `?DelayedArray::getAutoBlockSize`
for details. Notably, if the data are small enough (and the default block size
is large enough) then all the data is loaded as a single block, but this
approach  generalizes and still works when the data are too large to be
loaded into memory in one block.

```{r DelayedMatrixStats}
library(DelayedMatrixStats)

system.time(row_sds <- rowSds(x))
head(row_sds)
```

Finally, by using `r Biocpkg("DelayedMatrixStats")` we can use the same code, 
(`colMedians(x)`) regardless of whether the input is an ordinary *matrix* or a 
*DelayedMatrix*. This is useful for packages wishing to support both types of 
objects, e.g., packages wanting to retain backward compatibility or during a 
transition period from *matrix*-based to *DelayeMatrix*-based objects.

# Supported methods

The initial release of `r Biocpkg("DelayedMatrixStats")` supports the complete 
column-wise and row-wise API `r CRANpkg("matrixStats")` API[^api]. Please 
see the `r CRANpkg("matrixStats")` vignette 
([available online](https://cran.r-project.org/package=matrixStats/vignettes/matrixStats-methods.html)) 
for a summary these methods. The following table documents the API coverage and 
availability of ['seed-aware' methods](#seed_aware_methods) in the current 
version of `r Biocpkg("DelayedMatrixStats")`, where:

- ✔ = Implemented in `r Biocpkg("DelayedMatrixStats")`
- ☑️ = Implemented in [**DelayedArray**](http://bioconductor.org/packages/DelayedArray/) or [**sparseMatrixStats**](http://bioconductor.org/packages/sparseMatrixStats/)
- ❌ = Not yet implemented

[^api]: Some of the API is covered via inheritance to functionality in `r Biocpkg("DelayedArray")`


```{r API, echo = FALSE}
matrixStats <- sort(
  c("colsum", "rowsum", grep("^(col|row)", 
                             getNamespaceExports("matrixStats"), 
                             value = TRUE)))
sparseMatrixStats <- getNamespaceExports("sparseMatrixStats")
DelayedMatrixStats <- getNamespaceExports("DelayedMatrixStats")
DelayedArray <- getNamespaceExports("DelayedArray")

api_df <- data.frame(
  Method = paste0("`", matrixStats, "()`"),
  `Block processing` = ifelse(
    matrixStats %in% DelayedMatrixStats,
    "✔",
    ifelse(matrixStats %in% c(DelayedArray, sparseMatrixStats), "☑️", "❌")),
  `_base::matrix_ optimized` = 
    ifelse(sapply(matrixStats, existsMethod, signature = "matrix_OR_array_OR_table_OR_numeric"), 
           "✔", 
           "❌"),
  `_Matrix::dgCMatrix_ optimized` = 
    ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "dgCMatrix"), 
           "✔", 
           "❌"),
  `_Matrix::lgCMatrix_ optimized` = 
    ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "lgCMatrix"), 
           "✔", 
           "❌"),
  `_DelayedArray::RleArray_ (_SolidRleArraySeed_) optimized` = 
    ifelse(sapply(matrixStats, existsMethod, signature = "SolidRleArraySeed"),
           "✔", 
           "❌"),
  `_DelayedArray::RleArray_  (_ChunkedRleArraySeed_) optimized` = 
    ifelse(sapply(matrixStats, existsMethod, signature = "ChunkedRleArraySeed"),
           "✔", 
           "❌"),
  `_HDF5Array::HDF5Matrix_ optimized` = 
    ifelse(sapply(matrixStats, existsMethod, signature = "HDF5ArraySeed"),
           "✔", 
           "❌"),
  `_base::data.frame_ optimized` = 
    ifelse(sapply(matrixStats, existsMethod, signature = "data.frame"),
           "✔", 
           "❌"),
  `_S4Vectors::DataFrame_ optimized` =
    ifelse(sapply(matrixStats, existsMethod, signature = "DataFrame"),
           "✔", 
           "❌"), 
  check.names = FALSE)
knitr::kable(api_df, row.names = FALSE)
```

# 'Seed-aware' methods {#seed_aware_methods}

As well as offering a familiar API, `r Biocpkg("DelayedMatrixStats")` provides 
'seed-aware' methods that are optimized for specific types of *DelayedMatrix* 
objects. 

To illustrate this idea, we will compare two ways of computing the column sums 
of a *DelayedMatrix* object:

1. The 'block-processing' strategy. This was developed in the `r Biocpkg("DelayedArray")` package and is available for all methods in the `r Biocpkg("DelayedMatrixStats")` through the `force_block_processing` argument
2. The 'seed-aware' strategy. This is implemented in the `r Biocpkg("DelayedMatrixStats")` and is optimized for both speed and memory but only for *DelayedMatrix* objects with certain types of *seed*.

We will demonstrate this by computing the column sums matrices with 20,000 rows 
and 600 columns where the data have different structure and are stored in 
*DelayedMatrix* objects with different types of seed:

- Dense data with values in $(0, 1)$ using an ordinary _matrix_ as the seed
- Sparse data with values in $[0, 1)$, where $60\%$ are zeros, using a _dgCMatrix_, a sparse matrix representation from the `r CRANpkg("Matrix")` package, as the seed
- Dense data in ${0, 1, \ldots, 100}$, where there are multiple runs of identical values, using a _RleArraySeed_ from the `r Biocpkg("DelayedArray")` package as the seed

We use the `r CRANpkg("microbenchmark")` package to measure running time and 
the `r CRANpkg("profmem")` package to measure the total memory allocations of 
each method. 

In each case, the 'seed-aware' method is many times faster and allocates 
substantially lower total memory.

```{r benchmarking, message = FALSE, echo = TRUE, error = TRUE}
library(DelayedMatrixStats)
library(sparseMatrixStats)
library(microbenchmark)
library(profmem)

set.seed(666)

# -----------------------------------------------------------------------------
# Dense with values in (0, 1)
# Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed
#

# Generate some data
dense_matrix <- matrix(runif(20000 * 600), 
                       nrow = 20000,
                       ncol = 600)

# Benchmark
dm_matrix <- DelayedArray(dense_matrix)
class(seed(dm_matrix))
dm_matrix
microbenchmark(
  block_processing = colSums2(dm_matrix, force_block_processing = TRUE),
  seed_aware = colSums2(dm_matrix),
  times = 10)
total(profmem(colSums2(dm_matrix, force_block_processing = TRUE)))
total(profmem(colSums2(dm_matrix)))

# -----------------------------------------------------------------------------
# Sparse (60% zero) with values in (0, 1)
# Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed
#

# Generate some data
sparse_matrix <- dense_matrix
zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix))
sparse_matrix[zero_idx] <- 0

# Benchmark
dm_dgCMatrix <- DelayedArray(Matrix(sparse_matrix, sparse = TRUE))
class(seed(dm_dgCMatrix))
dm_dgCMatrix
microbenchmark(
  block_processing = colSums2(dm_dgCMatrix, force_block_processing = TRUE),
  seed_aware = colSums2(dm_dgCMatrix),
  times = 10)
total(profmem(colSums2(dm_dgCMatrix, force_block_processing = TRUE)))
total(profmem(colSums2(dm_dgCMatrix)))

# -----------------------------------------------------------------------------
# Dense with values in {0, 100} featuring runs of identical values
# Fast, memory-efficient column sums of DelayedMatrix with Rle-based seed
#

# Generate some data
runs <- rep(sample(100, 500000, replace = TRUE), rpois(500000, 100))
runs <- runs[seq_len(20000 * 600)]
runs_matrix <- matrix(runs, 
                      nrow = 20000,
                      ncol = 600)

# Benchmark
dm_rle <- RleArray(Rle(runs),
                   dim = c(20000, 600))
class(seed(dm_rle))
dm_rle
microbenchmark(
  block_processing = colSums2(dm_rle, force_block_processing = TRUE),
  seed_aware = colSums2(dm_rle),
  times = 10)
total(profmem(colSums2(dm_rle, force_block_processing = TRUE)))
total(profmem(colSums2(dm_rle)))
```

The development of 'seed-aware' methods is ongoing work (see the [Roadmap](#roadmap)), and for now only a few methods and seed-types have a 
'seed-aware' method.

An extensive set of benchmarks is under development at [http://peterhickey.org/BenchmarkingDelayedMatrixStats/](http://peterhickey.org/BenchmarkingDelayedMatrixStats/). 

# Delayed operations

A key feature of a _DelayedArray_ is the ability to register 'delayed 
operations'. For example, let's compute `sin(dm_matrix)`:

```{r sin}
system.time(sin_dm_matrix <- sin(dm_matrix))
```

This instantaneous because the operation is not actually performed, rather 
it is registered and only performed when the object is _realized_. All methods 
in `r Biocpkg("DelayedMatrixStats")` will correctly realise these delayed 
operations before computing the final result. For example, let's compute  
`colSums2(sin_dm_matrix)` and compare check we get the correct answer:

```{r colSums2_sin}
all.equal(colSums2(sin_dm_matrix), colSums(sin(as.matrix(dm_matrix))))
```

# Roadmap {#roadmap}

The initial version of `r Biocpkg("DelayedMatrixStats")` provides complete 
coverage of the `r CRANpkg("matrixStats")` column-wise and row-wise API[^api], 
allowing package developers to use these functions with _DelayedMatrix_ objects 
as well as with ordinary _matrix_ objects. This should simplify package 
development and assist authors to support to their software for large datasets 
stored in disk-backed data structures such as _HDF5Array_. Such large datasets 
are increasingly common with the rise of single-cell genomics.

Future releases of `r Biocpkg("DelayedMatrixStats")` will improve the 
performance of these methods, specifically by developing additional 'seed-aware' 
methods. The plan is to prioritise commonly used methods (e.g.,  
`colMeans2()`/`rowMeans2()`, `colSums2()`/`rowSums2()`, etc.) and the 
development of 'seed-aware' methods for the _HDF5Matrix_ class. To do so, we 
will leverage the `r Biocpkg("beachmat")` package. Proof-of-concept code 
has shown that this can greatly increase the performance when analysing such 
disk-backed data.

Importantly, all package developers using methods from 
`r Biocpkg("DelayedMatrixStats")` will immediately gain from performance 
improvements to these low-level routines. By using 
`r Biocpkg("DelayedMatrixStats")`, package developers will be able to focus on 
higher level programming tasks and address important scientific questions and 
technological challenges in high-throughput biology.

# Session info

```{r}
sessionInfo()
```