File: sparseMatrixStats.Rmd

package info (click to toggle)
r-bioc-sparsematrixstats 1.2.1%2Bdfsg-4
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 1,052 kB
  • sloc: cpp: 1,332; makefile: 2
file content (165 lines) | stat: -rw-r--r-- 4,348 bytes parent folder | download | duplicates (6)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
title: "sparseMatrixStats"
author: Constantin Ahlmann-Eltze
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{sparseMatrixStats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Installation

You can install the release version of `r BiocStyle::Biocpkg("sparseMatrixStats")` from BioConductor:

``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sparseMatrixStats")
```

# Introduction

The sparseMatrixStats package implements a number of summary functions for sparse matrices from the `r  BiocStyle::CRANpkg("Matrix")` package.

Let us load the package and create a synthetic sparse matrix.

```{r}
library(sparseMatrixStats)
# Matrix defines the sparse Matrix class
# dgCMatrix that we will use
library(Matrix)
# For reproducibility
set.seed(1)
```

Create a synthetic table with customers, items, and how often they bought that item.

```{r}
customer_ids <- seq_len(100)
item_ids <-  seq_len(30)
n_transactions <- 1000
customer <- sample(customer_ids, size = n_transactions, replace = TRUE,
                    prob = runif(100))
item <- sample(item_ids, size = n_transactions, replace = TRUE,
               prob = runif(30))

tmp <- table(paste0(customer, "-", item))
tmp2 <- strsplit(names(tmp), "-")
purchase_table <- data.frame(
  customer = as.numeric(sapply(tmp2, function(x) x[1])),
  item = as.numeric(sapply(tmp2, function(x) x[2])),
  n = as.numeric(tmp)
)

head(purchase_table, n = 10)
```

Let us turn the table into a matrix to simplify the analysis:

```{r}
purchase_matrix <- sparseMatrix(purchase_table$customer, purchase_table$item, 
                x = purchase_table$n,
                dims = c(100, 30),
                dimnames = list(customer = paste0("Customer_", customer_ids),
                                item = paste0("Item_", item_ids)))
purchase_matrix[1:10, 1:15]
```

We can see that some customers did not buy anything, where as
some bought a lot.

`sparseMatrixStats` can help us to identify interesting patterns in this data:


```{r}
# How often was each item bough in total?
colSums2(purchase_matrix)

# What is the range of number of items each 
# customer bought?
head(rowRanges(purchase_matrix))

# What is the variance in the number of items
# each customer bought?
head(rowVars(purchase_matrix))

# How many items did a customer not buy at all, one time, 2 times,
# or exactly 4 times?
head(rowTabulates(purchase_matrix, values = c(0, 1, 2, 4)))
```


## Alternative Matrix Creation

In the previous section, I demonstrated how to create a sparse matrix from scratch using the `sparseMatrix()` function. 
However, often you already have an existing matrix and want to convert it to a sparse representation.

```{r}
mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
```

The *sparseMatrixStats* package is a derivative of the `r  BiocStyle::CRANpkg("matrixStats")` package and implements it's API for 
sparse matrices. For example, to calculate the variance for each column of `mat` you can do

```{r}
apply(mat, 2, var)
```

However, this is quite inefficient and *matrixStats* provides the direct function

```{r}
matrixStats::colVars(mat)
```

Now for sparse matrices, you can also just call 

```{r}
sparseMatrixStats::colVars(sparse_mat)
```

# Benchmark

If you have a large matrix with many exact zeros, working on the sparse representation can considerably speed up the computations.

I generate a dataset with 10,000 rows and 50 columns that is 99% empty

```{r}
big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")
```

I use the bench package to benchmark the performance difference:

```{r}
bench::mark(
  sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
  matrixStats=matrixStats::colVars(big_mat),
  apply=apply(big_mat, 2, var)
)
```

As you can see `sparseMatrixStats` is ca. 50 times fast than `matrixStats`, which in turn is 7 times faster than the `apply()` version.


# Session Info

```{r}
sessionInfo()
```