File: medianSizeFactors.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/medianSizeFactors.R
\name{medianSizeFactors}
\alias{medianSizeFactors}
\alias{medianSizeFactors,ANY-method}
\alias{medianSizeFactors,SummarizedExperiment-method}
\alias{computeMedianFactors}
\title{Compute median-based size factors}
\usage{
medianSizeFactors(x, ...)

\S4method{medianSizeFactors}{ANY}(x, subset.row = NULL, reference = NULL, subset_row = NULL)

\S4method{medianSizeFactors}{SummarizedExperiment}(x, ..., assay.type = "counts", exprs_values = NULL)

computeMedianFactors(x, ...)
}
\arguments{
\item{x}{For \code{medianSizeFactors}, a numeric matrix of counts with one row per feature and one column per cell.
Alternatively, a \linkS4class{SummarizedExperiment} or \linkS4class{SingleCellExperiment} containing such counts.

For \code{computeMedianFactors}, only a \linkS4class{SingleCellExperiment} is accepted.}

\item{...}{For the \code{medianSizeFactors} generic, arguments to pass to specific methods.
For the SummarizedExperiment method, further arguments to pass to the ANY method.

For \code{computeMedianFactors}, further arguments to pass to \code{medianSizeFactors}.}

\item{subset.row}{A vector specifying the subset of rows of \code{x} from which the size factors should be computed.}

\item{reference}{A numeric vector of length equal to \code{nrow(x)}, containing the reference expression profile.
Defaults to \code{\link{rowMeans}(x)}.}

\item{subset_row, exprs_values}{Soft-deprecated equivalents of \code{subset.row} and \code{assay.type}, respectively.}

\item{assay.type}{String or integer scalar indicating the assay of \code{x} containing the counts.}
}
\value{
For \code{medianSizeFactors}, a numeric vector of size factors is returned for all methods.

For \code{computeMedianFactors}, \code{x} is returned containing the size factors in \code{\link{sizeFactors}(x)}.
}
\description{
Define per-cell size factors by taking the median of ratios to a reference expression profile (a la \pkg{DESeq}).
}
\details{
This function implements a modified version of the \pkg{DESeq2} size factor calculation.
For each cell, the size factor is proportional to the median of the ratios of that cell's counts to \code{reference}.
The assumption is that most genes are not differentially expressed (DE) between the cell and the reference, such that the median captures any systematic increase due to technical biases.

The modification stems from the fact that we use the arithmetic mean instead of the geometric mean to compute the default \code{reference},
as the former is more robust to the many zeros in single-cell RNA sequencing data.
We also ignore all genes with values of zero in \code{reference}, as such genes usually yield undefined ratios (zero divided by zero) when \code{reference} is itself computed from \code{x}.
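
As a rough illustration of the calculation described above, the following sketch computes median-based factors by hand in base R.
The toy matrix \code{mat} and the variable names are made up for this example, and the final rescaling to a mean of one is an assumption of the sketch;
it is not the package's actual implementation.
\preformatted{# Toy count matrix: 4 genes (rows) by 3 cells (columns); values are made up.
mat <- matrix(c(10, 20, 30,
                 0,  0,  0,
                 5, 10, 15,
                 8, 16, 24), nrow=4, byrow=TRUE)

ref <- rowMeans(mat)                         # default reference: arithmetic mean per gene
keep <- ref > 0                              # ignore genes with a zero reference
ratios <- mat[keep,,drop=FALSE] / ref[keep]  # ratio of each count to its gene's reference
sf <- apply(ratios, 2, median)               # median ratio for each cell
sf <- sf / mean(sf)                          # rescaling assumed here for illustration
sf                                           # gives 0.5, 1.0 and 1.5 for this toy matrix
}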
}
\section{Caveats}{

For typical scRNA-seq datasets, the median-based approach tends to perform poorly, for various reasons:
\itemize{
\item The high number of zeroes in the count matrix means that the median ratio for each cell is often zero.
If this method must be used, we recommend subsetting to only the highest-abundance genes to avoid problems with zeroes.
(Of course, the smaller the subset, the more sensitive the results are to noise or violations of the non-DE majority assumption.)
\item The default reference effectively requires a non-DE majority of genes between \emph{any} pair of cells in the dataset.
This is a strong assumption for heterogeneous populations containing many cell types;
most genes are likely to exhibit DE between at least one pair of cell types. 
}
For these reasons, the simpler \code{\link{librarySizeFactors}} is usually preferred: it is no more accurate, but it is at least guaranteed to return a positive size factor for any cell with non-zero counts.

One valid application of this method lies in the normalization of antibody-derived tag counts for quantifying surface proteins.
These counts are usually large enough to avoid zeroes yet are also susceptible to strong composition biases that preclude the use of \code{\link{librarySizeFactors}}.
In such cases, we would also set \code{reference} to some estimate of the ambient profile.
This assumes that most proteins are not expressed in each cell; thus, counts for most tags for any given cell can be attributed to background contamination that should not be DE between cells.
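
A minimal sketch of this usage is shown below, passing a custom \code{reference} to the matrix method.
The simulated tag counts in \code{adt} and the flat \code{ambient} vector are hypothetical stand-ins chosen only for illustration;
a real analysis would supply its own estimate of the ambient profile.
\preformatted{library(scuttle)
set.seed(100)

# Simulated tag counts: 20 tags (rows) by 50 cells (columns), for illustration only.
adt <- matrix(rpois(1000, lambda=50), nrow=20)

# Hypothetical ambient profile with one value per tag.
ambient <- rep(50, nrow(adt))

# Median of ratios to the ambient profile instead of the default row means.
sf <- medianSizeFactors(adt, reference=ambient)
summary(sf)
}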
}

\examples{
example_sce <- mockSCE()
summary(medianSizeFactors(example_sce))
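
# Restricting the calculation to the most abundant genes, as suggested in the Caveats.
# The cutoff of 100 genes is an arbitrary choice made for illustration only.
ave <- rowMeans(counts(example_sce))
top <- order(ave, decreasing=TRUE)[seq_len(100)]
summary(medianSizeFactors(example_sce, subset.row=top))

# Storing the size factors in the SingleCellExperiment for later use.
example_sce <- computeMedianFactors(example_sce)
summary(sizeFactors(example_sce))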
}
\seealso{
\code{\link{normalizeCounts}} and \code{\link{logNormCounts}}, where these size factors can be used.

\code{\link{librarySizeFactors}} and \code{\link{geometricSizeFactors}}
for other simple methods for computing size factors.
}
\author{
Aaron Lun
}