1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341
|
---
title: "Overview of DelayedMatrixStats"
author: "Peter Hickey"
date: "Modified: 06 Feb 2021 Compiled: `r format(Sys.Date(), '%d %b %Y')`"
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Overview of DelayedMatrixStats}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r, include = FALSE, setup}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", collapse = TRUE,
message = FALSE)
library(BiocStyle)
```
# Overview
`r Biocpkg("DelayedMatrixStats")` ports the `r CRANpkg("matrixStats")` API
to work with *DelayedMatrix* objects from the `r Biocpkg("DelayedArray")`
package. It provides high-performing functions operating on rows and columns of
*DelayedMatrix* objects, including all subclasses such as *RleArray* (from the
`r Biocpkg("DelayedArray")` package) and *HDF5Array* (from the
`r Biocpkg("HDF5Array")`) as well as supporting all types of *seeds*, such as
*matrix* (from the *base* package) and *Matrix* (from the `r CRANpkg("Matrix")`
package).
# How can DelayedMatrixStats help me?
The `r Biocpkg("DelayedArray")` package allows developers to store array-like
data using in-memory or on-disk representations (e.g., in HDF5 files) and
provides a common and familiar array-like interface for interacting with these
data.
The `r Biocpkg("DelayedMatrixStats")` package is designed to make life easier
for Bioconductor developers wanting to use `r Biocpkg("DelayedArray")` by
providing a rich set of column-wise and row-wise summary functions.
We briefly demonstrate and explain these two features using a simple example.
We'll simulate some (unrealistic) RNA-seq read counts data from 10,000 genes
and 20 samples and store it on disk as a *HDF5Array*:
```{r data_sim, message = FALSE}
library(DelayedArray)
x <- do.call(cbind, lapply(1:20, function(j) {
rpois(n = 10000, lambda = sample(20:40, 10000, replace = TRUE))
}))
colnames(x) <- paste0("S", 1:20)
x <- realize(x, "HDF5Array")
x
```
Suppose you wish to compute the standard deviation of the read counts for each
gene.
You might think to use `apply()` like in the following:
```{r apply}
system.time(row_sds <- apply(x, 1, sd))
head(row_sds)
```
This works, but takes quite a while.
Or perhaps you already know that the `r CRANpkg("matrixStats")` package
provides a `rowSds()` function:
```{r matrixStats, error = TRUE}
matrixStats::rowSds(x)
```
Unfortunately (and perhaps unsurprisingly) this doesn't work.
`r CRANpkg("matrixStats")` is designed for use on in-memory *matrix* objects.
Well, why don't we just first realize our data in-memory and then use
`r CRANpkg("matrixStats")`
```{r realization}
system.time(row_sds <- matrixStats::rowSds(as.matrix(x)))
head(row_sds)
```
This works and is many times faster than the `apply()`-based approach! However,
it rather defeats the purpose of using a *HDF5Array* for storing the
data since we have to bring all the data into memory at once to compute the
result.
Instead, we can use `DelayedMatrixStats::rowSds()`, which has the speed
benefits of `matrixStats::rowSds()`[^speed] but without having to load the
entire data into memory at once[^block_size]:
[^speed]: In fact, it currently uses `matrixStats::rowSds()` under the hood.
[^block_size]: In this case, it loads blocks of data row-by-row. The amount of
data loaded into memory at any one time is controlled by the
*default block size* global setting; see `?DelayedArray::getAutoBlockSize`
for details. Notably, if the data are small enough (and the default block size
is large enough) then all the data is loaded as a single block, but this
approach generalizes and still works when the data are too large to be
loaded into memory in one block.
```{r DelayedMatrixStats}
library(DelayedMatrixStats)
system.time(row_sds <- rowSds(x))
head(row_sds)
```
Finally, by using `r Biocpkg("DelayedMatrixStats")` we can use the same code,
(`colMedians(x)`) regardless of whether the input is an ordinary *matrix* or a
*DelayedMatrix*. This is useful for packages wishing to support both types of
objects, e.g., packages wanting to retain backward compatibility or during a
transition period from *matrix*-based to *DelayeMatrix*-based objects.
# Supported methods
The initial release of `r Biocpkg("DelayedMatrixStats")` supports the complete
column-wise and row-wise API `r CRANpkg("matrixStats")` API[^api]. Please
see the `r CRANpkg("matrixStats")` vignette
([available online](https://cran.r-project.org/package=matrixStats/vignettes/matrixStats-methods.html))
for a summary these methods. The following table documents the API coverage and
availability of ['seed-aware' methods](#seed_aware_methods) in the current
version of `r Biocpkg("DelayedMatrixStats")`, where:
- ✔ = Implemented in `r Biocpkg("DelayedMatrixStats")`
- ☑️ = Implemented in [**DelayedArray**](http://bioconductor.org/packages/DelayedArray/) or [**sparseMatrixStats**](http://bioconductor.org/packages/sparseMatrixStats/)
- ❌ = Not yet implemented
[^api]: Some of the API is covered via inheritance to functionality in `r Biocpkg("DelayedArray")`
```{r API, echo = FALSE}
matrixStats <- sort(
c("colsum", "rowsum", grep("^(col|row)",
getNamespaceExports("matrixStats"),
value = TRUE)))
sparseMatrixStats <- getNamespaceExports("sparseMatrixStats")
DelayedMatrixStats <- getNamespaceExports("DelayedMatrixStats")
DelayedArray <- getNamespaceExports("DelayedArray")
api_df <- data.frame(
Method = paste0("`", matrixStats, "()`"),
`Block processing` = ifelse(
matrixStats %in% DelayedMatrixStats,
"✔",
ifelse(matrixStats %in% c(DelayedArray, sparseMatrixStats), "☑️", "❌")),
`_base::matrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "matrix_OR_array_OR_table_OR_numeric"),
"✔",
"❌"),
`_Matrix::dgCMatrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "dgCMatrix"),
"✔",
"❌"),
`_Matrix::lgCMatrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "xgCMatrix") | sapply(matrixStats, existsMethod, signature = "lgCMatrix"),
"✔",
"❌"),
`_DelayedArray::RleArray_ (_SolidRleArraySeed_) optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "SolidRleArraySeed"),
"✔",
"❌"),
`_DelayedArray::RleArray_ (_ChunkedRleArraySeed_) optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "ChunkedRleArraySeed"),
"✔",
"❌"),
`_HDF5Array::HDF5Matrix_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "HDF5ArraySeed"),
"✔",
"❌"),
`_base::data.frame_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "data.frame"),
"✔",
"❌"),
`_S4Vectors::DataFrame_ optimized` =
ifelse(sapply(matrixStats, existsMethod, signature = "DataFrame"),
"✔",
"❌"),
check.names = FALSE)
knitr::kable(api_df, row.names = FALSE)
```
# 'Seed-aware' methods {#seed_aware_methods}
As well as offering a familiar API, `r Biocpkg("DelayedMatrixStats")` provides
'seed-aware' methods that are optimized for specific types of *DelayedMatrix*
objects.
To illustrate this idea, we will compare two ways of computing the column sums
of a *DelayedMatrix* object:
1. The 'block-processing' strategy. This was developed in the `r Biocpkg("DelayedArray")` package and is available for all methods in the `r Biocpkg("DelayedMatrixStats")` through the `force_block_processing` argument
2. The 'seed-aware' strategy. This is implemented in the `r Biocpkg("DelayedMatrixStats")` and is optimized for both speed and memory but only for *DelayedMatrix* objects with certain types of *seed*.
We will demonstrate this by computing the column sums matrices with 20,000 rows
and 600 columns where the data have different structure and are stored in
*DelayedMatrix* objects with different types of seed:
- Dense data with values in $(0, 1)$ using an ordinary _matrix_ as the seed
- Sparse data with values in $[0, 1)$, where $60\%$ are zeros, using a _dgCMatrix_, a sparse matrix representation from the `r CRANpkg("Matrix")` package, as the seed
- Dense data in ${0, 1, \ldots, 100}$, where there are multiple runs of identical values, using a _RleArraySeed_ from the `r Biocpkg("DelayedArray")` package as the seed
We use the `r CRANpkg("microbenchmark")` package to measure running time and
the `r CRANpkg("profmem")` package to measure the total memory allocations of
each method.
In each case, the 'seed-aware' method is many times faster and allocates
substantially lower total memory.
```{r benchmarking, message = FALSE, echo = TRUE, error = TRUE}
library(DelayedMatrixStats)
library(sparseMatrixStats)
library(microbenchmark)
library(profmem)
set.seed(666)
# -----------------------------------------------------------------------------
# Dense with values in (0, 1)
# Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed
#
# Generate some data
dense_matrix <- matrix(runif(20000 * 600),
nrow = 20000,
ncol = 600)
# Benchmark
dm_matrix <- DelayedArray(dense_matrix)
class(seed(dm_matrix))
dm_matrix
microbenchmark(
block_processing = colSums2(dm_matrix, force_block_processing = TRUE),
seed_aware = colSums2(dm_matrix),
times = 10)
total(profmem(colSums2(dm_matrix, force_block_processing = TRUE)))
total(profmem(colSums2(dm_matrix)))
# -----------------------------------------------------------------------------
# Sparse (60% zero) with values in (0, 1)
# Fast, memory-efficient column sums of DelayedMatrix with ordinary matrix seed
#
# Generate some data
sparse_matrix <- dense_matrix
zero_idx <- sample(length(sparse_matrix), 0.6 * length(sparse_matrix))
sparse_matrix[zero_idx] <- 0
# Benchmark
dm_dgCMatrix <- DelayedArray(Matrix(sparse_matrix, sparse = TRUE))
class(seed(dm_dgCMatrix))
dm_dgCMatrix
microbenchmark(
block_processing = colSums2(dm_dgCMatrix, force_block_processing = TRUE),
seed_aware = colSums2(dm_dgCMatrix),
times = 10)
total(profmem(colSums2(dm_dgCMatrix, force_block_processing = TRUE)))
total(profmem(colSums2(dm_dgCMatrix)))
# -----------------------------------------------------------------------------
# Dense with values in {0, 100} featuring runs of identical values
# Fast, memory-efficient column sums of DelayedMatrix with Rle-based seed
#
# Generate some data
runs <- rep(sample(100, 500000, replace = TRUE), rpois(500000, 100))
runs <- runs[seq_len(20000 * 600)]
runs_matrix <- matrix(runs,
nrow = 20000,
ncol = 600)
# Benchmark
dm_rle <- RleArray(Rle(runs),
dim = c(20000, 600))
class(seed(dm_rle))
dm_rle
microbenchmark(
block_processing = colSums2(dm_rle, force_block_processing = TRUE),
seed_aware = colSums2(dm_rle),
times = 10)
total(profmem(colSums2(dm_rle, force_block_processing = TRUE)))
total(profmem(colSums2(dm_rle)))
```
The development of 'seed-aware' methods is ongoing work (see the [Roadmap](#roadmap)), and for now only a few methods and seed-types have a
'seed-aware' method.
An extensive set of benchmarks is under development at [http://peterhickey.org/BenchmarkingDelayedMatrixStats/](http://peterhickey.org/BenchmarkingDelayedMatrixStats/).
# Delayed operations
A key feature of a _DelayedArray_ is the ability to register 'delayed
operations'. For example, let's compute `sin(dm_matrix)`:
```{r sin}
system.time(sin_dm_matrix <- sin(dm_matrix))
```
This instantaneous because the operation is not actually performed, rather
it is registered and only performed when the object is _realized_. All methods
in `r Biocpkg("DelayedMatrixStats")` will correctly realise these delayed
operations before computing the final result. For example, let's compute
`colSums2(sin_dm_matrix)` and compare check we get the correct answer:
```{r colSums2_sin}
all.equal(colSums2(sin_dm_matrix), colSums(sin(as.matrix(dm_matrix))))
```
# Roadmap {#roadmap}
The initial version of `r Biocpkg("DelayedMatrixStats")` provides complete
coverage of the `r CRANpkg("matrixStats")` column-wise and row-wise API[^api],
allowing package developers to use these functions with _DelayedMatrix_ objects
as well as with ordinary _matrix_ objects. This should simplify package
development and assist authors to support to their software for large datasets
stored in disk-backed data structures such as _HDF5Array_. Such large datasets
are increasingly common with the rise of single-cell genomics.
Future releases of `r Biocpkg("DelayedMatrixStats")` will improve the
performance of these methods, specifically by developing additional 'seed-aware'
methods. The plan is to prioritise commonly used methods (e.g.,
`colMeans2()`/`rowMeans2()`, `colSums2()`/`rowSums2()`, etc.) and the
development of 'seed-aware' methods for the _HDF5Matrix_ class. To do so, we
will leverage the `r Biocpkg("beachmat")` package. Proof-of-concept code
has shown that this can greatly increase the performance when analysing such
disk-backed data.
Importantly, all package developers using methods from
`r Biocpkg("DelayedMatrixStats")` will immediately gain from performance
improvements to these low-level routines. By using
`r Biocpkg("DelayedMatrixStats")`, package developers will be able to focus on
higher level programming tasks and address important scientific questions and
technological challenges in high-throughput biology.
# Session info
```{r}
sessionInfo()
```
|