File: ScaledMatrix.Rmd

package info (click to toggle)
r-bioc-scaledmatrix 1.6.0%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm
  • size: 172 kB
  • sloc: makefile: 2
file content (156 lines) | stat: -rw-r--r-- 5,244 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
title: Using the `ScaledMatrix` class
author: 
- name: Aaron Lun
  email: infinite.monkeys.with.keyboards@gmail.com 
date: "Revised: 12 December 2020"
output:
  BiocStyle::html_document:
    toc_float: true 
package: ResidualMatrix
vignette: >
  %\VignetteIndexEntry{Using the ScaledMatrix}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}    
---

```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```

# Overview

The `ScaledMatrix` provides yet another method of running `scale()` on a matrix.
In other words, these three operations are equivalent:

```{r}
mat <- matrix(rnorm(10000), ncol=10)

smat1 <- scale(mat)
head(smat1)

library(DelayedArray)
smat2 <- scale(DelayedArray(mat))
head(smat2)

library(ScaledMatrix)
smat3 <- ScaledMatrix(mat, center=TRUE, scale=TRUE)
head(smat3)
```

The biggest difference lies in how they behave in downstream matrix operations.

- `smat1` is an ordinary matrix, with the scaled and centered values fully realized in memory.
Nothing too unusual here.
- `smat2` is a `DelayedMatrix` and undergoes block processing whereby chunks are realized and operated on, one at a time.
This sacrifices speed for greater memory efficiency by avoiding a copy of the entire matrix. 
In particular, it preserves the structure of the original `mat`, e.g., from a sparse or file-backed representation. 
- `smat3` is a `ScaledMatrix` that refactors certain operations so that they can be applied to the original `mat` without any scaling or centering.
This takes advantage of the original data structure to speed up matrix multiplication and row/column sums,
albeit at the cost of numerical precision.

# Matrix multiplication

Given an original matrix $\mathbf{X}$ with $n$ columns, a vector of column centers $\mathbf{c}$ and a vector of column scaling values $\mathbf{s}$, 
our scaled matrix can be written as:

$$
\mathbf{Y} = (\mathbf{X} - \mathbf{c} \cdot \mathbf{1}_n^T) \mathbf{S}
$$

where $\mathbf{S} = \text{diag}(s_1^{-1}, ..., s_n^{-1})$.
If we wanted to right-multiply it with another matrix $\mathbf{A}$, we would have:

$$
\mathbf{YA} = \mathbf{X}\mathbf{S}\mathbf{A} - \mathbf{c} \cdot \mathbf{1}_n^T \mathbf{S}\mathbf{A}
$$

The right-most expression is simply the outer product of $\mathbf{c}$ with the column sums of $\mathbf{SA}$.
More important is the fact that we can use the matrix multiplication operator for $\mathbf{X}$ with $\mathbf{SA}$,
as this allows us to use highly efficient algorithms for certain data representations, e.g., sparse matrices.

```{r}
library(Matrix)
mat <- rsparsematrix(20000, 10000, density=0.01)
smat <- ScaledMatrix(mat, center=TRUE, scale=TRUE)

blob <- matrix(runif(ncol(mat) * 5), ncol=5)
system.time(out <- smat %*% blob)

# The slower way with block processing.
da <- scale(DelayedArray(mat))
system.time(out2 <- da %*% blob)
```

The same logic applies for left-multiplication and cross-products.
This allows us to easily speed up high-level operations involving matrix multiplication by just switching to a `ScaledMatrix`,
e.g., in approximate PCA algorithms from the `r Biocpkg("BiocSingular")` package.

```{r}
library(BiocSingular)
set.seed(1000)
system.time(pcs <- runSVD(smat, k=10, BSPARAM=IrlbaParam()))
```

# Other utilities

Row and column sums are special cases of matrix multiplication and can be computed quickly:

```{r}
system.time(rowSums(smat))
system.time(rowSums(da))
```

Subsetting, transposition and renaming of the dimensions are all supported without loss of the `ScaledMatrix` representation:

```{r}
smat[,1:5]
t(smat)
rownames(smat) <- paste0("GENE_", 1:20000)
smat
```

Other operations will cause the `ScaledMatrix` to collapse to the general `DelayedMatrix` representation, after which point block processing will be used.

```{r}
smat + 1
```

# Caveats 

For most part, the implementation of the multiplication assumes that the $\mathbf{A}$ matrix and the matrix product are small compared to $\mathbf{X}$.
It is also possible to multiply two `ScaledMatrix`es together if the underlying matrices have efficient operators for their product.
However, if this is not the case, the `ScaledMatrix` offers little benefit for increased overhead.

It is also worth noting that this speed-up is not entirely free. 
The expression above involves subtracting two matrix with potentially large values, which runs the risk of catastrophic cancellation.
The example below demonstrates how `ScaledMatrix` is more susceptible to loss of precision than a normal `DelayedArray`:

```{r}
set.seed(1000)
mat <- matrix(rnorm(1000000), ncol=100000) 
big.mat <- mat + 1e12

# The 'correct' value, unaffected by numerical precision.
ref <- rowMeans(scale(mat))
head(ref)

# The value from scale'ing a DelayedArray.
library(DelayedArray)
smat2 <- scale(DelayedArray(big.mat))
head(rowMeans(smat2))

# The value from a ScaledMatrix.
library(ScaledMatrix)
smat3 <- ScaledMatrix(big.mat, center=TRUE, scale=TRUE)
head(rowMeans(smat3))
```

In most practical applications, though, this does not seem to be a major concern, 
especially as most values (e.g., log-normalized expression matrices) lie close to zero anyway.

# Session information {-}

```{r}
sessionInfo()
```