File: scoreMarkers.Rd

package info (click to toggle)
r-bioc-scran 1.26.2%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm
  • size: 1,692 kB
  • sloc: cpp: 733; makefile: 2
file content (214 lines) | stat: -rw-r--r-- 14,057 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/scoreMarkers.R
\name{scoreMarkers}
\alias{scoreMarkers}
\alias{scoreMarkers,ANY-method}
\alias{scoreMarkers,SummarizedExperiment-method}
\alias{scoreMarkers,SingleCellExperiment-method}
\title{Score marker genes}
\usage{
scoreMarkers(x, ...)

\S4method{scoreMarkers}{ANY}(
  x,
  groups,
  block = NULL,
  pairings = NULL,
  lfc = 0,
  row.data = NULL,
  full.stats = FALSE,
  subset.row = NULL,
  BPPARAM = SerialParam()
)

\S4method{scoreMarkers}{SummarizedExperiment}(x, groups, ..., assay.type = "logcounts")

\S4method{scoreMarkers}{SingleCellExperiment}(x, groups = colLabels(x, onAbsence = "error"), ...)
}
\arguments{
\item{x}{A matrix-like object containing log-normalized expression values, with genes in rows and cells in columns.
Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix in its assays.}

\item{...}{For the generic, further arguments to pass to individual methods.

For the SummarizedExperiment method, further arguments to pass to the ANY method.

For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method.}

\item{groups}{A factor or vector containing the identity of the group for each cell in \code{x}.}

\item{block}{A factor or vector specifying the blocking level for each cell in \code{x}.}

\item{pairings}{A vector, list or matrix specifying how the comparisons are to be performed, see details.}

\item{lfc}{Numeric scalar specifying the log-fold change threshold to compute effect sizes against.}

\item{row.data}{A DataFrame with the same number and names of rows in \code{x}, containing extra information to insert into each DataFrame.}

\item{full.stats}{Logical scalar indicating whether the statistics from the pairwise comparisons should be directly returned.}

\item{subset.row}{See \code{?"\link{scran-gene-selection}"}.}

\item{BPPARAM}{A \linkS4class{BiocParallelParam} object specifying how the calculations should be parallelized.}

\item{assay.type}{String or integer scalar specifying the assay containing the log-expression matrix to use.}
}
\value{
A List of DataFrames containing marker scores for each gene in each group.
Each DataFrame corresponds to a group and each row corresponds to a gene in \code{x}.
See Details for information about the individual columns.
}
\description{
Compute various summary scores for potential marker genes to distinguish between groups of cells.
}
\details{
Compared to \code{\link{findMarkers}}, this function represents a simpler and more intuitive summary of the differences between the groups.
We do this by realizing that the p-values for these types of comparisons are largely meaningless;
individual cells are not meaningful units of experimental replication, while the groups themselves are defined from the data.
Thus, by discarding the p-values, we can simplify our marker selection by focusing only on the effect sizes between groups.

Here, the strategy is to perform pairwise comparisons between each pair of groups to obtain various effect sizes.
For each group \eqn{X}, we summarize the effect sizes across all pairwise comparisons involving that group, e.g., mean, min, max and so on.
This yields a DataFrame for each group where each column contains a different summarized effect and each row corresponds to a gene in \code{x}.
Reordering the rows by the summary of choice can yield a ranking of potential marker genes for downstream analyses.
}
\section{Choice of effect sizes}{

The \code{logFC.cohen} columns contain the standardized log-fold change, i.e., Cohen's d.
For each pairwise comparison, this is defined as the difference in the mean log-expression for each group scaled by the average standard deviation across the two groups.
(Technically, we should use the pooled variance; however, this introduces some unpleasant asymmetry depending on the variance of the larger group, so we take a simple average instead.)
Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups.
We can also interpret Cohen's d as the number of standard deviations between the two group means.

The \code{AUC} columns contain the area under the curve.
This is the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group.
The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test.
Values greater than 0.5 indicate that a gene is upregulated in the first group.

The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group.
The clearest example is that of two distributions that exhibit no overlap, where the AUC is the same regardless of the variance of each distribution.
This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a highly resolved ranking. 
The most appropriate choice of effect size is left at the user's discretion.

Finally, the \code{logFC.detected} columns contain the log-fold change in the proportion of cells with detected (i.e., non-zero) expression between groups.
This is specifically useful for detecting binary expression patterns, e.g., activation of an otherwise silent gene.
Note that the non-zero status of the data is not altered by normalization, so differences in library size will not be removed when computing this metric.
This effect is not necessarily problematic - users can interpret it as \emph{retaining} information about the total RNA content, analogous to spike-in normalization.
}

\section{Setting a log-fold change threshold}{

The default settings may yield highly ranked genes with large effect sizes but low log-fold changes if the variance is low (Cohen's d) or separation is clear (AUC).
Such genes may not be particularly interesting as the actual change in expression is modest.
Setting \code{lfc} allows us to focus on genes with large log-fold changes between groups,
by simply shifting the \dQuote{other} group's expression values by \code{lfc} before computing effect sizes.

When \code{lfc} is not zero, Cohen's d is generalized to the standardized difference between the observed log-fold change and \code{lfc}.
For example, if we had \code{lfc=2} and we obtained a Cohen's d of 3, this means that the observed log-fold change was 3 standard deviations above a value of 2.
A side effect is that we can only unambiguously interpret the direction of Cohen's d when it has the same sign as \code{lfc}.
Our above example represents upregulation, but if our Cohen's d was negative, this could either mean downregulation or simply that our observed log-fold change was less than \code{lfc}.

When \code{lfc} is not zero, the AUC is generalized to the probability of obtaining a random observation in one group that is greater than a random observation plus \code{lfc} in the other group.
For example, if we had \code{lfc=2} and we obtained an AUC of 0.8, this means that we would observe a difference of \code{lfc} or greater between the random observations.
Again, we can only unambiguously interpret the direction of the change when it is the same as the sign of the \code{lfc}.
In this case, an AUC above 0.5 with a positive \code{lfc} represents upregulation, but an AUC below 0.5 could mean either downregulation or a log-fold change less than \code{lfc}.

A non-zero setting of \code{lfc} has no effect on the log-fold change in the proportion of cells with detected expression.
}

\section{Computing effect size summaries}{

To simplify interpretation, we summarize the effect sizes across all pairwise comparisons into a few key metrics.
For each group \eqn{X}, we consider the effect sizes from all pairwise comparisons between \eqn{X} and other groups. 
We then compute the following values:
\itemize{
\item \code{mean.*}, the mean effect size across all pairwise comparisons involving \eqn{X}.
A large value (>0 for log-fold changes, >0.5 for the AUCs) indicates that the gene is upregulated in \eqn{X} compared to the average of the other groups.
A small value (<0 for the log-fold changes, <0.5 for the AUCs) indicates that the gene is downregulated in \eqn{X} instead.
\item \code{median.*}, the median effect size across all pairwise comparisons involving \eqn{X}.
A large value indicates that the gene is upregulated in \eqn{X} compared to most (>50\%) other groups.
A small value indicates that the gene is downregulated in \eqn{X} instead.
\item \code{min.*}, the minimum effect size across all pairwise comparisons involving \eqn{X}.
A large value indicates that the gene is upregulated in \eqn{X} compared to all other groups.
A small value indicates that the gene is downregulated in \eqn{X} compared to at least one other group.
\item \code{max.*}, the maximum effect size across all pairwise comparisons involving \eqn{X}.
A large value indicates that the gene is upregulated in \eqn{X} compared to at least one other group.
A small value indicates that the gene is downregulated in \eqn{X} compared to all other groups.
\item \code{rank.*}, the minimum rank (i.e., \dQuote{min-rank}) across all pairwise comparisons involving \eqn{X} - see \code{?\link{computeMinRank}} for details.
A small min-rank indicates that the gene is one of the top upregulated genes in at least one comparison to another group.
}
One set of these columns is added to the DataFrame for each effect size described above.
For example, the mean column for the AUC would be \code{mean.AUC}.
We can then reorder each group's DataFrame by our column of choice, depending on which summary and effect size we are interested in.
For example, if we ranked by decreasing \code{min.logFC.detected}, we would be aiming for marker genes that exhibit strong binary increases in expression in \eqn{X} compared to \emph{all} other groups.

If \code{full.stats=TRUE}, an extra \code{full.*} column is returned in the DataFrame.
This contains a nested DataFrame with number of columns equal to the number of other groups.
Each column contains the statistic from the comparison between \eqn{X} and the other group.

Keep in mind that the interpretations above also depend on the sign of \code{lfc}.
The concept of a \dQuote{large} summary statistic (>0 for Cohen's d, >0.5 for the AUCs) can only be interpreted as upregulation when \code{lfc >= 0}.
Similarly, the concept of a \dQuote{small} value (<0 for Cohen's d, <0.5 for the AUCs) cannot be interpreted as downregulation when \code{lfc <= 0}.
For example, if \code{lfc=1}, a positive \code{min.logFC.cohen} can still be interpreted as upregulation in \eqn{X} compared to all other groups,
but a negative \code{max.logFC.cohen} could not be interpreted as downregulation in \eqn{X} compared to all other groups.
}

\section{Computing other descriptive statistics}{
 
We report the mean log-expression of all cells in \eqn{X}, as well as the grand mean of mean log-expression values for all other groups.
This is purely descriptive; while it can be used to compute an overall log-fold change, ranking is best performed on one of the effect sizes described above.
We also report the proportion of cells with detectable expression in \eqn{X} and the mean proportion for all other groups.

When \code{block} is specified, the reported mean for each group is computed via \code{\link{correctGroupSummary}}. 
Briefly, this involves fitting a linear model to remove the effect of the blocking factor from the per-group mean log-expression.
The same is done for the detected proportion, except that the values are subjected to a logit transformation prior to the model fitting.
In both cases, each group/block combination is weighted by its number of cells in the model.
}

\section{Controlling the pairings}{

The \code{pairings} argument specifies the pairs of groups that should be compared.
This can be:
\itemize{
\item \code{NULL}, in which case comparisons are performed between all groups in \code{groups}.
\item A vector of the same type as \code{group}, specifying a subset of groups of interest.
We then perform all pairwise comparisons between groups in the subset.
\item A list of two vectors, each of the same type as \code{group} and specifying a subset of groups.
Comparisons are performed between one group from the first vector and another group from the second vector.
\item A matrix of two columns of the same type as \code{group}.
Each row is assumed to specify a pair of groups to be compared.
}

Effect sizes (and their summaries) are computed for only the pairwise comparisons specified by \code{pairings}.
Similarly, the \code{other.*} values in \eqn{X}'s DataFrame are computed using only the groups involved in pairwise comparisons with \eqn{X}.
The default of \code{pairings=NULL} ensures that all groups are used and effect sizes for all pairwise comparisons are reported;
however, this may not be the case for other values of \code{pairings}.

For list and matrix arguments, the first vector/column is treated as the first group in the effect size calculations.
Statistics for each comparison will only show up in the DataFrame for the first group, 
i.e., a comparison between \eqn{X} and \eqn{Y} will have a valid \code{full.AUC$Y} field in \eqn{X}'s DataFrame but not vice versa.
If both directions are desired in the output, both of the corresponding permutations should be explicitly specified in \code{pairings}.
}

\examples{
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay, only using k-means for convenience.
kout <- kmeans(t(logcounts(sce)), centers=4) 

out <- scoreMarkers(sce, groups=kout$cluster)
out

# Ranking by a metric of choice:
of.interest <- out[[1]]
of.interest[order(of.interest$mean.AUC, decreasing=TRUE),1:4]
of.interest[order(of.interest$rank.AUC),1:4]
of.interest[order(of.interest$median.logFC.cohen, decreasing=TRUE),1:4]
of.interest[order(of.interest$min.logFC.detected, decreasing=TRUE),1:4]

}
\author{
Aaron Lun
}