File: pairwiseTTests.Rd

package info (click to toggle)
r-bioc-scran 1.26.2%2Bdfsg-1
links: PTS, VCS
area: main
in suites: bookworm
size: 1,692 kB
sloc: cpp: 733; makefile: 2
file content (221 lines) | stat: -rw-r--r-- 11,894 bytes
parent folder | download | duplicates (2)
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/pairwiseTTests.R
\name{pairwiseTTests}
\alias{pairwiseTTests}
\alias{pairwiseTTests,ANY-method}
\alias{pairwiseTTests,SummarizedExperiment-method}
\alias{pairwiseTTests,SingleCellExperiment-method}
\title{Perform pairwise t-tests}
\usage{
pairwiseTTests(x, ...)

\S4method{pairwiseTTests}{ANY}(
  x,
  groups,
  block = NULL,
  design = NULL,
  restrict = NULL,
  exclude = NULL,
  direction = c("any", "up", "down"),
  lfc = 0,
  std.lfc = FALSE,
  log.p = FALSE,
  gene.names = NULL,
  subset.row = NULL,
  BPPARAM = SerialParam()
)

\S4method{pairwiseTTests}{SummarizedExperiment}(x, ..., assay.type = "logcounts")

\S4method{pairwiseTTests}{SingleCellExperiment}(x, groups = colLabels(x, onAbsence = "error"), ...)
}
\arguments{
\item{x}{A numeric matrix-like object of normalized log-expression values, 
where each column corresponds to a cell and each row corresponds to an endogenous gene.

Alternatively, a \linkS4class{SummarizedExperiment} or \linkS4class{SingleCellExperiment} object containing such a matrix.}

\item{...}{For the generic, further arguments to pass to specific methods.

For the SummarizedExperiment method, further arguments to pass to the ANY method.

For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method.}

\item{groups}{A vector of length equal to \code{ncol(x)}, specifying the group assignment for each cell.
If \code{x} is a SingleCellExperiment, this is automatically derived from \code{\link{colLabels}}.}

\item{block}{A factor specifying the blocking level for each cell.}

\item{design}{A numeric matrix containing blocking terms for uninteresting factors.
Note that these factors should not be confounded with \code{groups}.}

\item{restrict}{A vector specifying the levels of \code{groups} for which to perform pairwise comparisons.}

\item{exclude}{A vector specifying the levels of \code{groups} for which \emph{not} to perform pairwise comparisons.}

\item{direction}{A string specifying the direction of log-fold changes to be considered in the alternative hypothesis.}

\item{lfc}{A positive numeric scalar specifying the log-fold change threshold to be tested against.}

\item{std.lfc}{A logical scalar indicating whether log-fold changes should be standardized.}

\item{log.p}{A logical scalar indicating if log-transformed p-values/FDRs should be returned.}

\item{gene.names}{Deprecated, use \code{row.data} in \code{\link{findMarkers}} instead to add custom annotation.}

\item{subset.row}{See \code{?"\link{scran-gene-selection}"}.}

\item{BPPARAM}{A \linkS4class{BiocParallelParam} object indicating whether and how parallelization should be performed across genes.}

\item{assay.type}{A string specifying which assay values to use, usually \code{"logcounts"}.}
}
\value{
A list is returned containing \code{statistics} and \code{pairs}.

The \code{statistics} element is itself a list of \linkS4class{DataFrame}s.
Each DataFrame contains the statistics for a comparison between a pair of groups,
including the log-fold changes, p-values and false discovery rates.

The \code{pairs} element is a DataFrame where each row corresponds to an entry of \code{statistics}.
This contains the fields \code{first} and \code{second}, 
specifying the two groups under comparison in the corresponding DataFrame in \code{statistics}.

In each DataFrame in \code{statistics}, the log-fold change represents the change in the \code{first} group compared to the \code{second} group.
}
\description{
Perform pairwise Welch t-tests between groups of cells, possibly after blocking on uninteresting factors of variation.
}
\details{
This function performs t-tests to identify differentially expressed genes (DEGs) between pairs of groups of cells.
A typical aim is to use the DEGs to determine cluster identity based on expression of marker genes with known biological activity.
A list of tables is returned where each table contains per-gene statistics for a comparison between one pair of groups.
This can be examined directly or used as input to \code{\link{combineMarkers}} for marker gene detection.

We use t-tests as they are simple, fast and perform reasonably well for single-cell data (Soneson and Robinson, 2018).
However, if one of the groups contains fewer than two cells, no p-value will be reported for comparisons involving that group.
A warning will also be raised about insufficient degrees of freedom (d.f.) in such cases.

When \code{log.p=TRUE}, the log-transformed p-values and FDRs are reported using the natural base.
This is useful in cases with many cells such that reporting the p-values directly would lead to double-precision underflow.

If \code{restrict} is specified, comparisons are only performed between pairs of groups in \code{restrict}.
This can be used to focus on DEGs distinguishing between a subset of the groups (e.g., closely related cell subtypes).

If \code{exclude} is specified, comparisons are not performed between groups in \code{exclude}.
Similarly, if any entries of \code{groups} are \code{NA}, the corresponding cells are are ignored.
}
\section{Direction and magnitude of the log-fold change}{

Log-fold changes are reported as differences in the values of \code{x}.
Thus, all log-fold changes have the same base as whatever was used to perform the log-transformation in \code{x}.
If \code{\link{logNormCounts}} was used, this would be base 2.

If \code{direction="any"}, two-sided tests will be performed for each pairwise comparisons between groups.
Otherwise, one-sided tests in the specified direction will be used instead.
This can be used to focus on genes that are upregulated in each group of interest, which is often easier to interpret when assigning cell type to a cluster.

To interpret the setting of \code{direction}, consider the DataFrame for group X, in which we are comparing to another group Y.
If \code{direction="up"}, genes will only be significant in this DataFrame if they are upregulated in group X compared to Y.
If \code{direction="down"}, genes will only be significant if they are downregulated in group X compared to Y.

The magnitude of the log-fold changes can also be tested by setting \code{lfc}.
By default, \code{lfc=0} meaning that we will reject the null upon detecting any differential expression.
If this is set to some other positive value, the null hypothesis will change depending on \code{direction}:
\itemize{
\item If \code{direction="any"}, the null hypothesis is that the true log-fold change lies inside \code{[-lfc, lfc]}.
To be conservative, we perform one-sided tests against the boundaries of this interval, and combine them to obtain a two-sided p-value.
\item If \code{direction="up"}, the null hypothesis is that the true log-fold change is less than or equal to \code{lfc}.
A one-sided p-value is computed against the boundary of this interval.
\item If \code{direction="down"}, the null hypothesis is that the true log-fold change is greater than or equal to \code{-lfc}.
A one-sided p-value is computed against the boundary of this interval.
}
This is similar to the approach used in \code{\link[limma:eBayes]{treat}} and allows users to focus on genes with strong log-fold changes.

If \code{std.lfc=TRUE}, the log-fold change for each gene is standardized by the variance.
When the Welch t-test is being used, this is equivalent to Cohen's d.
Standardized log-fold changes may be more appealing for visualization as it avoids large fold changes due to large variance.
The choice of \code{std.lfc} does not affect the calculation of the p-values.
}

\section{Blocking on uninteresting factors}{

If \code{block} is specified, Welch t-tests are performed between groups within each level of \code{block}.
For each pair of groups, the p-values for each gene across all levels of \code{block} are combined using Stouffer's weighted Z-score method.
The reported log-fold change for each gene is also a weighted average of log-fold changes across levels.

The weight for a particular level is defined as \eqn{(1/N_x + 1/N_y)^{-1}}, 
where \eqn{Nx} and \eqn{Ny} are the number of cells in groups X and Y, respectively, for that level. 
This is inversely proportional to the expected variance of the log-fold change, provided that all groups and blocking levels have the same variance.

% In theory, a better weighting scheme would be to use the estimated standard error of the log-fold change to compute the weight.
% This would be more responsive to differences in variance between blocking levels, focusing on levels with low variance and high power.
% However, this is not safe in practice as genes with many zeroes can have very low standard errors, dominating the results inappropriately.

When comparing two groups, blocking levels are ignored if no p-value was reported, e.g., if there were insufficient cells for a group in a particular level. 
This includes levels that contain fewer than two cells for either group, as this cannot yield a p-value from the Welch t-test.
If all levels are ignored in this manner, the entire comparison will only contain \code{NA} p-values and a warning will be emitted.
}

\section{Regressing out unwanted factors}{

If \code{design} is specified, a linear model is instead fitted to the expression profile for each gene.
This linear model will include the \code{groups} as well as any blocking factors in \code{design}.
A t-test is then performed to identify DEGs between pairs of groups, using the values of the relevant coefficients and the gene-wise residual variance.

Note that \code{design} must be full rank when combined with the \code{groups} terms, 
i.e., there should not be any confounding variables.
We make an exception for the common situation where \code{design} contains an \code{"(Intercept)"} column,
which is automatically detected and removed (emitting a warning along the way).

We recommend using \code{block} instead of \code{design} for uninteresting categorical factors of variation.
The former accommodates differences in the variance of expression in each group via Welch's t-test.
As a result, it is more robust to misspecification of the groups, as misspecified groups (and inflated variances) do not affect the inferences for other groups.
Use of \code{block} also avoids assuming additivity of effects between the blocking factors and the groups.

Nonetheless, use of \code{design} is unavoidable when blocking on real-valued covariates.
It is also useful for ensuring that log-fold changes/p-values are computed for comparisons between all pairs of groups
(assuming that \code{design} is not confounded with the groups).
This may not be the case with \code{block} if a pair of groups never co-occur in a single blocking level.
}

\examples{
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=3) 

# Vanilla application:
out <- pairwiseTTests(logcounts(sce), groups=kout$cluster)
out

# Directional with log-fold change threshold:
out <- pairwiseTTests(logcounts(sce), groups=kout$cluster, 
    direction="up", lfc=0.2)
out

}
\references{
Whitlock MC (2005). 
Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. 
\emph{J. Evol. Biol.} 18, 5:1368-73.

Soneson C and Robinson MD (2018). 
Bias, robustness and scalability in single-cell differential expression analysis. 
\emph{Nat. Methods}

Lun ATL (2018).
Comments on marker detection in \emph{scran}.
\url{https://ltla.github.io/SingleCellThoughts/software/marker_detection/comments.html}
}
\seealso{
\code{\link{t.test}}, on which this function is based.

\code{\link{combineMarkers}}, to combine pairwise comparisons into a single DataFrame per group.

\code{\link{getTopMarkers}}, to obtain the top markers from each pairwise comparison.
}
\author{
Aaron Lun
}