---
title: "sparseMatrixStats"
author: Constantin Ahlmann-Eltze
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{sparseMatrixStats}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
# Installation
You can install the release version of `r BiocStyle::Biocpkg("sparseMatrixStats")` from Bioconductor:
``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("sparseMatrixStats")
```
# Introduction
The sparseMatrixStats package implements a number of summary functions for sparse matrices from the `r BiocStyle::CRANpkg("Matrix")` package.
Let us load the package and create a synthetic sparse matrix.
```{r}
library(sparseMatrixStats)
# Matrix defines the sparse Matrix class
# dgCMatrix that we will use
library(Matrix)
# For reproducibility
set.seed(1)
```
We create a synthetic table of customers, items, and how often each customer bought each item.
```{r}
customer_ids <- seq_len(100)
item_ids <- seq_len(30)
n_transactions <- 1000
customer <- sample(customer_ids, size = n_transactions, replace = TRUE,
                   prob = runif(100))
item <- sample(item_ids, size = n_transactions, replace = TRUE,
               prob = runif(30))
# Count how often each customer-item combination occurs
tmp <- table(paste0(customer, "-", item))
tmp2 <- strsplit(names(tmp), "-")
purchase_table <- data.frame(
  customer = as.numeric(sapply(tmp2, function(x) x[1])),
  item = as.numeric(sapply(tmp2, function(x) x[2])),
  n = as.numeric(tmp)
)
head(purchase_table, n = 10)
```
Let us turn the table into a matrix to simplify the analysis:
```{r}
purchase_matrix <- sparseMatrix(purchase_table$customer, purchase_table$item,
                                x = purchase_table$n,
                                dims = c(100, 30),
                                dimnames = list(customer = paste0("Customer_", customer_ids),
                                                item = paste0("Item_", item_ids)))
purchase_matrix[1:10, 1:15]
```
We can see that some customers did not buy anything, whereas
some bought a lot.
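
As an aside, `sparseMatrix()` adds up the `x` values of duplicated `(i, j)` pairs, so the counting step can also be left to the constructor. The following is a minimal sketch on hypothetical toy data; the `toy_*` names are illustrative and not part of the analysis above.

```{r}
library(Matrix)

# Hypothetical toy transactions: one (customer, item) pair per purchase
toy_customer <- c(1, 1, 2, 3, 3, 3)
toy_item     <- c(1, 1, 2, 1, 2, 2)

# Duplicated (i, j) pairs have their x values summed, so each
# entry ends up as the number of purchases for that pair
toy_purchases <- sparseMatrix(i = toy_customer, j = toy_item,
                              x = rep(1, length(toy_customer)),
                              dims = c(3, 2))
toy_purchases
```

This avoids pasting and splitting the identifiers, at the cost of relying on the constructor's handling of duplicates.
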
`sparseMatrixStats` can help us to identify interesting patterns in this data:
```{r}
# How often was each item bought in total?
colSums2(purchase_matrix)
# What is the range of number of items each
# customer bought?
head(rowRanges(purchase_matrix))
# What is the variance in the number of items
# each customer bought?
head(rowVars(purchase_matrix))
# How many items did a customer not buy at all, one time, 2 times,
# or exactly 4 times?
head(rowTabulates(purchase_matrix, values = c(0, 1, 2, 4)))
```
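
To make clearer what `rowRanges()` and `rowTabulates()` return, the sketch below compares them with a plain base R computation on a small, hypothetical matrix (`toy_sparse` and the other `toy_*` and helper names are made up for this illustration). Both comparisons are expected to agree.

```{r}
library(Matrix)
library(sparseMatrixStats)

# Hypothetical 3 x 3 matrix with four non-zero entries
toy_sparse <- sparseMatrix(i = c(1, 2, 2, 3), j = c(1, 1, 3, 2),
                           x = c(2, 1, 2, 1), dims = c(3, 3))
toy_dense <- as.matrix(toy_sparse)

# Summaries computed on the sparse representation
sparse_range <- rowRanges(toy_sparse)
sparse_tab <- rowTabulates(toy_sparse, values = c(0, 1, 2))

# Reference results computed on the dense copy:
# one row per matrix row, one column per statistic / value
dense_range <- t(apply(toy_dense, 1, range))
dense_tab <- sapply(c(0, 1, 2), function(v) rowSums(toy_dense == v))

all.equal(unname(sparse_range), unname(dense_range))
all.equal(unname(sparse_tab), unname(dense_tab))
```
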
## Alternative Matrix Creation
In the previous section, I demonstrated how to create a sparse matrix from scratch using the `sparseMatrix()` function.
However, often you already have an existing matrix and want to convert it to a sparse representation.
```{r}
mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
```
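
It can help to look at what the sparse representation actually stores. The sketch below uses a hypothetical 3 x 2 matrix (`toy_mat`) and coerces it to the virtual class `"CsparseMatrix"`, which for a non-symmetric matrix of doubles yields a `dgCMatrix`. Only the non-zero values and their positions are kept.

```{r}
library(Matrix)

# Hypothetical dense matrix with two non-zero entries
toy_mat <- matrix(0, nrow = 3, ncol = 2)
toy_mat[2, 1] <- 5
toy_mat[3, 2] <- 7

# Coerce to the compressed sparse column format
toy_csc <- as(toy_mat, "CsparseMatrix")
class(toy_csc)

toy_csc@x  # non-zero values, stored column by column
toy_csc@i  # zero-based row index of each non-zero value
toy_csc@p  # zero-based offsets marking where each column starts in x
```
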
The *sparseMatrixStats* package is a derivative of the `r BiocStyle::CRANpkg("matrixStats")` package and implements its API for
sparse matrices. For example, to calculate the variance of each column of `mat`, you can use `apply()`:
```{r}
apply(mat, 2, var)
```
However, this is quite inefficient, and *matrixStats* provides a dedicated function:
```{r}
matrixStats::colVars(mat)
```
For sparse matrices, you can now simply call:
```{r}
sparseMatrixStats::colVars(sparse_mat)
```
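
As a sanity check, the three approaches should give the same result. The following is a small sketch on a hypothetical matrix (`toy_dense_mat`), including an all-zero column; names are stripped before comparing because the functions may differ in how they label the output.

```{r}
library(Matrix)

# Hypothetical 4 x 3 matrix; the last column is entirely zero
toy_dense_mat <- matrix(c(0, 0, 3, 0,
                          1, 0, 0, 2,
                          0, 0, 0, 0), nrow = 4, ncol = 3)
toy_sparse_mat <- as(toy_dense_mat, "CsparseMatrix")

var_apply  <- apply(toy_dense_mat, 2, var)
var_dense  <- matrixStats::colVars(toy_dense_mat)
var_sparse <- sparseMatrixStats::colVars(toy_sparse_mat)

all.equal(var_apply, unname(var_dense))
all.equal(var_apply, unname(var_sparse))
```
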
# Benchmark
If you have a large matrix with many exact zeros, working on the sparse representation can considerably speed up the computations.
I generate a dataset with 10,000 rows and 50 columns that is 99% empty:
```{r}
big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")
```
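
The fraction of empty entries can be checked with `Matrix::nnzero()`, and `object.size()` gives a rough idea of the memory saved. The sketch below uses a smaller, deterministic stand-in (`toy_big`, a hypothetical name) rather than the random matrix above, so that it does not touch the random number stream.

```{r}
library(Matrix)

# Hypothetical 1000 x 50 matrix where every 100th entry is non-zero
toy_big <- matrix(0, nrow = 1000, ncol = 50)
toy_big[seq(1, length(toy_big), by = 100)] <- 1
toy_big_sparse <- as(toy_big, "CsparseMatrix")

# Fraction of entries that are exactly zero
frac_zero <- 1 - nnzero(toy_big_sparse) / prod(dim(toy_big_sparse))
frac_zero

# Approximate memory used by the dense and the sparse representation
object.size(toy_big)
object.size(toy_big_sparse)
```
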
I use the bench package to benchmark the performance difference:
```{r}
bench::mark(
sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
matrixStats=matrixStats::colVars(big_mat),
apply=apply(big_mat, 2, var)
)
```
As you can see, `sparseMatrixStats` is ca. 50 times faster than `matrixStats`, which in turn is about 7 times faster than the `apply()` version.
# Session Info
```{r}
sessionInfo()
```