% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/correctGroupSummary.R
\name{correctGroupSummary}
\alias{correctGroupSummary}
\title{Correct group-level summaries}
\usage{
correctGroupSummary(
  x,
  group,
  block,
  transform = c("raw", "log", "logit"),
  offset = NULL,
  weights = NULL,
  subset.row = NULL
)
}
\arguments{
\item{x}{A numeric matrix containing summary statistics for each gene (row) and combination of group and block (column),
computed by functions such as \code{\link{summarizeAssayByGroup}} - see Examples.}

\item{group}{A factor or vector specifying the group identity for each column of \code{x}, usually clusters or cell types.}

\item{block}{A factor or vector specifying the blocking level for each column of \code{x}, e.g., batch of origin.}

\item{transform}{String indicating the transformation to apply to the values before fitting, i.e., the scale on which differences between groups are computed for the batch adjustment.}

\item{offset}{Numeric scalar specifying the offset to use when \code{transform="log"} (default 1) or \code{transform="logit"} (default 0.01).}

\item{weights}{A numeric vector containing the weight of each combination, e.g., due to differences in the number of cells used to compute each summary.
If \code{NULL}, all combinations have equal weight.}

\item{subset.row}{Logical, integer or character vector specifying the rows in \code{x} to use to compute statistics.}
}
\value{
A numeric matrix with number of columns equal to the number of unique levels in \code{group}.
Each column corresponds to a group and contains the averaged statistic across batches.
Each row corresponds to a gene in \code{x} (or to a gene specified by \code{subset.row}, if not \code{NULL}).
}
\description{
Correct the summary statistic for each group for unwanted variation by fitting a linear model and extracting the coefficients.
}
\details{
Here, we consider group-level summary statistics such as the average expression of all cells or the proportion with detectable expression.
These are easy to interpret and helpful for any visualizations that operate on individual groups, e.g., heatmaps.

However, in the presence of unwanted factors of variation (e.g., batch effects), some adjustment is required to ensure these group-level statistics are comparable.
We cannot directly average group-level statistics across batches as some groups may not exist in particular batches, e.g., due to the presence of unique cell types in different samples.
A direct average would be biased by variable contributions of the batch effect for each group.

To overcome this, we use groups that are present in multiple batches to correct for the batch effect.
(That is, any level of \code{group} that occurs for multiple levels of \code{block}.)
For each gene, we fit a linear model to the (transformed) values containing both the group and block factors.
We then report the coefficient for each group as the batch-adjusted average for that group;
this is possible as the fitted model has no intercept.
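
As a sketch of the model described above, let \eqn{y_{gb}}{y[g,b]} be the (transformed) summary statistic for a single gene in group \eqn{g} and block \eqn{b}.
The notation is illustrative only and is not taken from the package source:

\deqn{y_{gb} = \mu_g + \beta_b + \varepsilon_{gb}}{y[g,b] = mu[g] + beta[b] + e[g,b]}

Here, \eqn{\mu_g}{mu[g]} is the coefficient reported as the batch-adjusted average for group \eqn{g}, \eqn{\beta_b}{beta[b]} is the effect of block \eqn{b}, and \eqn{\varepsilon_{gb}}{e[g,b]} is the residual.
As there is no separate intercept, each \eqn{\mu_g}{mu[g]} is directly interpretable as a group-level value.
One constraint on the block effects is required for the model to be identifiable; the particular constraint is not stated in this documentation.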

The default of \code{transform="raw"} will not transform the values, and is generally suitable for log-expression values.
Setting \code{transform="log"} will perform a log-transformation after adding \code{offset} (default of 1), and is suitable for normalized counts.
Setting \code{transform="logit"} will perform a logit transformation after adding \code{offset} (default of 0.01) to the numerator and twice \code{offset} to the denominator, which shrinks the values towards 0.5;
this is suitable for proportional data such as the proportion of detected cells.

After the model is fitted to the transformed values, the reverse transformation is applied to the coefficients to obtain a corrected summary statistic on the original scale.
For \code{transform="log"}, any negative values are coerced to zero,
while for \code{transform="logit"}, any values outside of [0, 1] are coerced to the closest boundary.
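
One reading of these transformations can be written out as follows, where \eqn{v} is an input value, \eqn{a} is the offset, \eqn{t} is the transformed value used in the model and \eqn{c} is a fitted group coefficient.
The notation is illustrative, and the exact form of the logit shrinkage is an assumption based on the description above rather than a statement of the package's implementation.

For \code{transform="log"}:

\deqn{t = \log(v + a), \qquad \hat{v} = \max(0, e^{c} - a)}{t = log(v + a), v' = max(0, exp(c) - a)}

For \code{transform="logit"}:

\deqn{p = \frac{v + a}{1 + 2a}, \qquad t = \log\frac{p}{1 - p}, \qquad \hat{v} = \min\left(1, \max\left(0, \frac{1 + 2a}{1 + e^{-c}} - a\right)\right)}{p = (v + a)/(1 + 2a), t = log(p/(1 - p)), v' = min(1, max(0, (1 + 2a)/(1 + exp(-c)) - a))}

Under this reading, a value of 0.5 is left unchanged by the shrinkage, and the reverse transformation can yield values slightly below zero (or above 1 for the logit), which is why the boundaries are enforced.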
}
\examples{
y <- matrix(rnorm(10000), ncol=1000)
group <- sample(10, ncol(y), replace=TRUE)
block <- sample(5, ncol(y), replace=TRUE)

summaries <- summarizeAssayByGroup(y, DataFrame(group=group, block=block), 
    statistics=c("mean", "prop.detected"))

# Computing batch-aware averages:
averaged <- correctGroupSummary(assay(summaries, "mean"), 
    group=summaries$group, block=summaries$block)

num <- correctGroupSummary(assay(summaries, "prop.detected"),
    group=summaries$group, block=summaries$block, transform="logit") 
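
# Weighting each combination by its number of cells. This is one possible
# choice of weights, and it assumes that summarizeAssayByGroup() stored the
# cell counts in the 'ncells' column of the column data (its default).
weighted <- correctGroupSummary(assay(summaries, "mean"),
    group=summaries$group, block=summaries$block,
    weights=summaries$ncells)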

}
\seealso{
\code{\link{summarizeAssayByGroup}}, to generate the group-level summaries for this function.

\code{regressBatches} from the \pkg{batchelor} package, to remove the batch effect from per-cell expression values.
}
\author{
Aaron Lun
}