File: getClusteredPCs.Rd

package info (click to toggle)
r-bioc-scran 1.18.5%2Bdfsg-1
links: PTS, VCS
area: main
in suites: bullseye
size: 1,856 kB
sloc: cpp: 960; sh: 13; makefile: 2
file content (90 lines) | stat: -rw-r--r-- 3,842 bytes
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/getClusteredPCs.R
\name{getClusteredPCs}
\alias{getClusteredPCs}
\title{Use clusters to choose the number of PCs}
\usage{
getClusteredPCs(
  pcs,
  FUN = NULL,
  ...,
  min.rank = 5,
  max.rank = ncol(pcs),
  by = 1
)
}
\arguments{
\item{pcs}{A numeric matrix of PCs, where rows are cells and columns are dimensions representing successive PCs.}

\item{FUN}{A clustering function that takes a numeric matrix with rows as cells and
returns a vector containing a cluster label for each cell.}

\item{...}{Further arguments to pass to \code{FUN}.}

\item{min.rank}{Integer scalar specifying the minimum number of PCs to use.}

\item{max.rank}{Integer scalar specifying the maximum number of PCs to use.}

\item{by}{Integer scalar specifying what intervals should be tested between \code{min.rank} and \code{max.rank}.}
}
\value{
A \linkS4class{DataFrame} with one row per tested number of PCs.
This contains the fields:
\describe{
\item{\code{n.pcs}:}{Integer scalar specifying the number of PCs used.}
\item{\code{n.clusters}:}{Integer scalar specifying the number of clusters identified.}
\item{\code{clusters}:}{A \linkS4class{List} containing the cluster identities for this number of PCs.}
}
The metadata of the DataFrame contains \code{chosen}, 
an integer scalar specifying the \dQuote{ideal} number of PCs to use.
}
\description{
Cluster cells after using varying number of PCs,
and pick the number of PCs using a heuristic based on the number of clusters.
}
\details{
Assume that the data contains multiple subpopulations, each of which is separated from the others on a different axis.
For example, each subpopulation could be defined by a unique set of marker genes that drives separation on its own PC.
If we had \eqn{x} subpopulations, we would need at least \eqn{x-1} PCs to successfully distinguish all of them.
This motivates the choice of the number of PCs provided we know the number of subpopulations in the data.

In practice, we do not know the number of subpopulations so we use the number of clusters as a proxy instead.
We apply a clustering function \code{FUN} on the first \eqn{d} PCs,
and only consider the values of \eqn{d} that yield no more than \eqn{d+1} clusters.
If we see more clusters with fewer dimensions, 
we consider this to represent overclustering rather than distinct subpopulations,
as multiple subpopulations should not be distinguishable on the same axes (based on the assumption above).

We choose \eqn{d} that satisfies the constraint above and maximizes the number of clusters.
The idea is that more PCs should include more biological signal, allowing \code{FUN} to detect more distinct subpopulations;
until the point that the extra signal outweights the added noise at high dimensions,
such that resolution decreases and it becomes more difficult for \code{FUN} to distinguish between subpopulations.

Any \code{FUN} can be used that automatically chooses the number of clusters based on the data.
The default is a graph-based clustering method using \code{\link{buildSNNGraph}} and \code{\link{cluster_walktrap}},
where arguments in \code{...} are passed to the former.
Users should not supply \code{FUN} where the number of clusters is fixed in advance, 
(e.g., k-means, hierarchical clustering with known \code{k} in \code{\link{cutree}}).

The identities of the output clusters are returned at each step for comparison, e.g., using methods like \pkg{clustree}.
}
\examples{
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

sce <- scater::runPCA(sce)
output <- getClusteredPCs(reducedDim(sce))
output

metadata(output)$chosen

}
\seealso{
\code{\link{runPCA}}, to compute the PCs in the first place.

\code{\link{buildSNNGraph}}, for arguments to use in with default \code{FUN}.
}
\author{
Aaron Lun
}