---
title: "sparseMatrixStats"
author: Constantin Ahlmann-Eltze
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{sparseMatrixStats}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# Installation

You can install the release version of `r BiocStyle::Biocpkg("sparseMatrixStats")` from Bioconductor:

``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("sparseMatrixStats")
```

# Introduction

The sparseMatrixStats package implements a number of summary functions for sparse matrices from the `r  BiocStyle::CRANpkg("Matrix")` package.

Let us load the package and create a synthetic sparse matrix.

```{r}
library(sparseMatrixStats)
# Matrix defines the sparse Matrix class
# dgCMatrix that we will use
library(Matrix)
# For reproducibility
set.seed(1)
```

Create a synthetic table of customers, items, and how often each customer bought each item.

```{r}
customer_ids <- seq_len(100)
item_ids <-  seq_len(30)
n_transactions <- 1000
customer <- sample(customer_ids, size = n_transactions, replace = TRUE,
                    prob = runif(100))
item <- sample(item_ids, size = n_transactions, replace = TRUE,
               prob = runif(30))

tmp <- table(paste0(customer, "-", item))
tmp2 <- strsplit(names(tmp), "-")
purchase_table <- data.frame(
  customer = as.numeric(sapply(tmp2, function(x) x[1])),
  item = as.numeric(sapply(tmp2, function(x) x[2])),
  n = as.numeric(tmp)
)

head(purchase_table, n = 10)
```

Let us turn the table into a matrix to simplify the analysis:

```{r}
purchase_matrix <- sparseMatrix(purchase_table$customer, purchase_table$item, 
                x = purchase_table$n,
                dims = c(100, 30),
                dimnames = list(customer = paste0("Customer_", customer_ids),
                                item = paste0("Item_", item_ids)))
purchase_matrix[1:10, 1:15]
```

We can see that some customers did not buy anything, whereas
some bought a lot.
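
As an aside, the count table is not strictly required to build such a matrix. The following is a minimal sketch, assuming the documented behaviour of `sparseMatrix()` that the `x` values of repeated `(i, j)` pairs are summed; the small `customer` and `item` vectors are made up for illustration and are not the data used above.

```r
library(Matrix)

# Five hypothetical transactions: customer 1 buys item 2 three times
customer <- c(1, 1, 2, 3, 1)
item     <- c(2, 2, 1, 3, 2)

# One entry of 1 per transaction; duplicated (i, j) pairs are added up
m <- sparseMatrix(i = customer, j = item,
                  x = rep(1, length(customer)),
                  dims = c(3, 3))

m[1, 2]  # purchases of item 2 by customer 1
sum(m)   # total number of transactions
```

Whether this shortcut is preferable depends on whether the intermediate table is needed for anything else.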

`sparseMatrixStats` can help us to identify interesting patterns in this data:


```{r}
# How often was each item bought in total?
colSums2(purchase_matrix)

# What is the range of number of items each 
# customer bought?
head(rowRanges(purchase_matrix))

# What is the variance in the number of items
# each customer bought?
head(rowVars(purchase_matrix))

# How many items did each customer not buy at all, buy once,
# buy twice, or buy exactly 4 times?
head(rowTabulates(purchase_matrix, values = c(0, 1, 2, 4)))
```
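
To make the meaning of these summaries easier to check by eye, here is a small sketch on a hand-made 3 x 4 matrix. The matrix `small` and its values are assumptions chosen for illustration; the results are compared against base R computations on the dense copy.

```r
library(Matrix)
library(sparseMatrixStats)

# 3 customers x 4 items; customer 2 bought nothing
small <- sparseMatrix(i = c(1, 1, 3, 3), j = c(1, 2, 2, 4),
                      x = c(2, 1, 3, 1), dims = c(3, 4))
dense_small <- as.matrix(small)

# Column totals agree with base R on the dense copy
all.equal(colSums2(small), colSums(dense_small),
          check.attributes = FALSE)

# Per-row minimum and maximum agree with apply() + range()
all.equal(rowRanges(small), t(apply(dense_small, 1, range)),
          check.attributes = FALSE)

# One row per customer, one column per requested value
tab <- rowTabulates(small, values = c(0, 1, 2))
tab
```

The first column of `tab` counts the items each customer never bought, which should match `rowSums(dense_small == 0)`.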


## Alternative Matrix Creation

In the previous section, I demonstrated how to create a sparse matrix from scratch using the `sparseMatrix()` function. 
However, often you already have an existing matrix and want to convert it to a sparse representation.

```{r}
mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
```
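
It can help to look at what a `dgCMatrix` actually stores. The sketch below uses a hand-made 3 x 3 matrix (an assumption for illustration) and prints the three slots of the compressed column format: zero-based row indices `i`, column pointers `p`, and the non-zero values `x`.

```r
library(Matrix)

# matrix() fills column by column:
# column 1 = (5, 0, 2), column 2 = (0, 0, 0), column 3 = (0, 7, 0)
dense <- matrix(c(5, 0, 2,
                  0, 0, 0,
                  0, 7, 0), nrow = 3)
sp <- as(dense, "dgCMatrix")

sp@x  # non-zero values, column by column: 5 2 7
sp@i  # zero-based row index of each value: 0 2 1
sp@p  # where each column starts in x: 0 2 2 3
```

The repeated `2` in `sp@p` shows that the second column contains no stored entries at all, so only three values are kept for nine cells.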

The *sparseMatrixStats* package is a derivative of the `r  BiocStyle::CRANpkg("matrixStats")` package and implements its API for 
sparse matrices. For example, to calculate the variance for each column of `mat` you can do

```{r}
apply(mat, 2, var)
```

However, this is quite inefficient, and *matrixStats* provides a direct function:

```{r}
matrixStats::colVars(mat)
```

Now, for sparse matrices, you can also just call

```{r}
sparseMatrixStats::colVars(sparse_mat)
```

# Benchmark

If you have a large matrix with many exact zeros, working on the sparse representation can considerably speed up the computations.
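
The intuition behind the speed-up can be sketched in plain R: a column variance can be computed from the stored non-zero values alone, because every zero contributes the same term (the squared mean). The helper `col_vars_from_slots()` below is hypothetical and written only for illustration; it is not the package's actual implementation.

```r
library(Matrix)

# Hypothetical helper: column variances from the dgCMatrix slots only
col_vars_from_slots <- function(m) {
  n <- nrow(m)
  vapply(seq_len(ncol(m)), function(j) {
    # Entries of column j sit at positions p[j] + 1, ..., p[j + 1] of x
    start <- m@p[j]
    end <- m@p[j + 1]
    vals <- m@x[seq_len(end - start) + start]
    mu <- sum(vals) / n
    # Each of the (n - length(vals)) zeros contributes (0 - mu)^2 = mu^2
    (sum((vals - mu)^2) + (n - length(vals)) * mu^2) / (n - 1)
  }, numeric(1))
}

set.seed(1)
dense <- matrix(0, nrow = 20, ncol = 4)
dense[sample(length(dense), 6)] <- rnorm(6)
sparse <- as(dense, "dgCMatrix")

# Agrees with the dense reference computation
all.equal(col_vars_from_slots(sparse), apply(dense, 2, var))
```

The work per column is proportional to the number of stored entries rather than the number of rows, which is where the savings come from when most entries are zero.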

I generate a dataset with 10,000 rows and 50 columns that is 99% empty:

```{r}
big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")
```
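
The stated sparsity can be verified with `nnzero()` from *Matrix*. This sketch rebuilds the same kind of matrix so that it runs on its own; the seed is an assumption added for reproducibility.

```r
library(Matrix)

set.seed(1)
big_mat <- matrix(0, nrow = 1e4, ncol = 50)
# 5000 distinct cells receive a random normal value
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
big_sparse_mat <- as(big_mat, "dgCMatrix")

# Fraction of cells that hold a non-zero value
frac_nonzero <- nnzero(big_sparse_mat) / prod(dim(big_sparse_mat))
frac_nonzero
```

With 5,000 filled cells out of 500,000, the fraction should come out at about 0.01, that is, roughly 99% of the cells are empty.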

I use the bench package to benchmark the performance difference:

```{r}
bench::mark(
  sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
  matrixStats=matrixStats::colVars(big_mat),
  apply=apply(big_mat, 2, var)
)
```

As you can see, `sparseMatrixStats` is ca. 50 times faster than `matrixStats`, which in turn is 7 times faster than the `apply()` version.


# Session Info

```{r}
sessionInfo()
```