1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141
|
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/cyclone.R
\name{cyclone}
\alias{cyclone}
\alias{cyclone,ANY-method}
\alias{cyclone,SummarizedExperiment-method}
\title{Cell cycle phase classification}
\usage{
cyclone(x, ...)
\S4method{cyclone}{ANY}(
x,
pairs,
gene.names = rownames(x),
iter = 1000,
min.iter = 100,
min.pairs = 50,
BPPARAM = SerialParam(),
verbose = FALSE,
subset.row = NULL
)
\S4method{cyclone}{SummarizedExperiment}(x, ..., assay.type = "counts")
}
\arguments{
\item{x}{A numeric matrix-like object of gene expression values where rows are genes and columns are cells.
Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.}
\item{...}{For the generic, additional arguments to pass to specific methods.
For the SummarizedExperiment method, additional arguments to pass to the ANY method.}
\item{pairs}{A list of data.frames produced by \code{\link{sandbag}}, containing pairs of marker genes.}
\item{gene.names}{A character vector of gene names, with one value per row in \code{x}.}
\item{iter}{An integer scalar specifying the number of iterations for random sampling to obtain a cycle score.}
\item{min.iter}{An integer scalar specifying the minimum number of iterations for score estimation.}
\item{min.pairs}{An integer scalar specifying the minimum number of pairs for cycle estimation.}
\item{BPPARAM}{A \linkS4class{BiocParallelParam} object to use for parallel processing across cells.}
\item{verbose}{A logical scalar specifying whether diagnostics should be printed to screen.}
\item{subset.row}{See \code{?"\link{scran-gene-selection}"}.}
\item{assay.type}{A string specifying which assay values to use, e.g., \code{"counts"} or \code{"logcounts"}.}
}
\value{
A list is returned containing:
\describe{
\item{\code{phases}:}{A character vector containing the predicted phase for each cell.}
\item{\code{scores}:}{A data frame containing the numeric phase scores for each phase and cell (i.e., each row is a cell).}
\item{\code{normalized.scores}:}{A data frame containing the row-normalized scores (i.e., where the row sum for each cell is equal to 1).}
}
}
\description{
Classify single cells into their cell cycle phases based on gene expression data.
}
\details{
This function implements the classification step of the pair-based prediction method described by Scialdone et al. (2015).
To illustrate, consider classification of cells into G1 phase.
Pairs of marker genes are identified with \code{\link{sandbag}}, where the expression of the first gene in the training data is greater than the second in G1 phase but less than the second in all other phases.
For each cell, \code{cyclone} calculates the proportion of all marker pairs where the expression of the first gene is greater than the second in the new data \code{x} (pairs with the same expression are ignored).
A high proportion suggests that the cell is likely to belong in G1 phase, as the expression ranking in the new data is consistent with that in the training data.
Proportions are not directly comparable between phases due to the use of different sets of gene pairs for each phase.
Instead, proportions are converted into scores (see below) that account for the size and precision of the proportion estimate.
The same process is repeated for all phases, using the corresponding set of marker pairs in \code{pairs}.
Cells with G1 or G2M scores above 0.5 are assigned to the G1 or G2M phases, respectively.
(If both are above 0.5, the higher score is used for assignment.)
Cells can be assigned to S phase based on the S score, but a more reliable approach is to define S phase cells as those with G1 and G2M scores below 0.5.
Pre-trained classifiers are provided for mouse and human datasets, see \code{?\link{sandbag}} for more details.
However, note that the classifier may not be accurate for data that are substantially different from those used in the training set, e.g., due to the use of a different protocol.
In such cases, users can construct a custom classifier from their own training data using the \code{\link{sandbag}} function.
This is usually necessary for other model organisms where pre-trained classifiers are not available.
Users should \emph{not} filter out low-abundance genes before applying \code{cyclone}.
Even if a gene is not expressed in any cell, it may still be useful for classification if it is phase-specific.
Its lack of expression relative to other genes will still yield informative pairs, and filtering them out would reduce power.
}
\section{Description of the score calculation}{
To make the proportions comparable between phases, a distribution of proportions is constructed by shuffling the expression values within each cell and recalculating the proportion.
The phase score is defined as the lower tail probability at the observed proportion.
High scores indicate that the proportion is greater than what is expected by chance if the expression of marker genes were independent
(i.e., with no cycle-induced correlations between marker pairs within each cell).
% The shuffling assumes that the marker genes are IID from the same distribution of expression values, such that there's no correlations.
% The question is then what distribution of expression values to use - see below.
% Training also should protect against non-cycle-based correlations, as such they should be present across all phases and not get included in the marker set.
By default, shuffling is performed \code{iter} times to obtain the distribution from which the score is estimated.
However, some iterations may not be used if there are fewer than \code{min.pairs} pairs with different expression, such that the proportion cannot be calculated precisely.
A score is only returned if the distribution is large enough for stable calculation of the tail probability, i.e., consists of results from at least \code{min.iter} iterations.
Note that the score calculation in \code{cyclone} is slightly different from that described originally by Scialdone et al.
The original code shuffles all expression values within each cell, while in this implementation, only the expression values of genes in the marker pairs are shuffled.
This modification aims to use the most relevant expression values to build the null score distribution.
% In theory, this shouldn't matter, as the score calculation depends on the ranking of each gene.
% That should be the same regardless of the distribution of expression values -- each set of rankings is equally likely, no matter what.
% In practice, the number of tied expression values will differ between different set of genes, e.g., due to abundance (low counts more likely to get ties).
% The most appropriate comparison would involve the same number of ties as that used to calculate the observed score.
% It doesn't make sense, for example, to shuffle in a whole bunch of non-expressed genes (lots of zeroes, ties) when the markers are always expressed.
}
\examples{
set.seed(1000)
library(scuttle)
sce <- mockSCE(ncells=200, ngenes=1000)
# Constructing a classifier:
is.G1 <- which(sce$Cell_Cycle \%in\% c("G1", "G0"))
is.S <- which(sce$Cell_Cycle=="S")
is.G2M <- which(sce$Cell_Cycle=="G2M")
out <- sandbag(sce, list(G1=is.G1, S=is.S, G2M=is.G2M))
# Classifying a new dataset:
test <- mockSCE(ncells=50)
assignments <- cyclone(test, out)
head(assignments$scores)
table(assignments$phases)
}
\references{
Scialdone A, Natarajana KN, Saraiva LR et al. (2015).
Computational assignment of cell-cycle stage from single-cell transcriptome data.
\emph{Methods} 85:54--61
}
\seealso{
\code{\link{sandbag}}, to generate the pairs from reference data.
}
\author{
Antonio Scialdone,
with modifications by Aaron Lun
}
|