File: vst.Rd

package info (click to toggle)
r-cran-sctransform 0.4.1-1
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 600 kB
sloc: cpp: 323; sh: 13; makefile: 2
file content (170 lines) | stat: -rw-r--r-- 9,414 bytes
parent folder | download | duplicates (2)
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vst.R
\name{vst}
\alias{vst}
\title{Variance stabilizing transformation for UMI count data}
\usage{
vst(
  umi,
  cell_attr = NULL,
  latent_var = c("log_umi"),
  batch_var = NULL,
  latent_var_nonreg = NULL,
  n_genes = 2000,
  n_cells = NULL,
  method = "poisson",
  do_regularize = TRUE,
  theta_regularization = "od_factor",
  res_clip_range = c(-sqrt(ncol(umi)), sqrt(ncol(umi))),
  bin_size = 500,
  min_cells = 5,
  residual_type = "pearson",
  return_cell_attr = FALSE,
  return_gene_attr = TRUE,
  return_corrected_umi = FALSE,
  min_variance = -Inf,
  bw_adjust = 3,
  gmean_eps = 1,
  theta_estimation_fun = "theta.ml",
  theta_given = NULL,
  exclude_poisson = FALSE,
  use_geometric_mean = TRUE,
  use_geometric_mean_offset = FALSE,
  fix_intercept = FALSE,
  fix_slope = FALSE,
  scale_factor = NA,
  vst.flavor = NULL,
  verbosity = 2,
  verbose = NULL,
  show_progress = NULL
)
}
\arguments{
\item{umi}{A matrix of UMI counts with genes as rows and cells as columns}

\item{cell_attr}{A data frame containing the dependent variables; if omitted a data frame with umi and gene will be generated}

\item{latent_var}{The independent variables to regress out as a character vector; must match column names in cell_attr; default is c("log_umi")}

\item{batch_var}{The dependent variables indicating which batch a cell belongs to; no batch interaction terms used if omiited}

\item{latent_var_nonreg}{The non-regularized dependent variables to regress out as a character vector; must match column names in cell_attr; default is NULL}

\item{n_genes}{Number of genes to use when estimating parameters (default uses 2000 genes, set to NULL to use all genes)}

\item{n_cells}{Number of cells to use when estimating parameters (default uses all cells)}

\item{method}{Method to use for initial parameter estimation; one of 'poisson', 'qpoisson', 'nb_fast', 'nb', 'nb_theta_given', 'glmGamPoi', 'offset', 'offset_shared_theta_estimate', 'glmGamPoi_offset'; default is 'poisson'}

\item{do_regularize}{Boolean that, if set to FALSE, will bypass parameter regularization and use all genes in first step (ignoring n_genes); default is FALSE}

\item{theta_regularization}{Method to use to regularize theta; use 'log_theta' for the behavior prior to version 0.3; default is 'od_factor'}

\item{res_clip_range}{Numeric of length two specifying the min and max values the results will be clipped to; default is c(-sqrt(ncol(umi)), sqrt(ncol(umi)))}

\item{bin_size}{Number of genes to process simultaneously; this will determine how often the progress bars are updated and how much memory is being used; default is 500}

\item{min_cells}{Only use genes that have been detected in at least this many cells; default is 5}

\item{residual_type}{What type of residuals to return; can be 'pearson', 'deviance', or 'none'; default is 'pearson'}

\item{return_cell_attr}{Make cell attributes part of the output; default is FALSE}

\item{return_gene_attr}{Calculate gene attributes and make part of output; default is TRUE}

\item{return_corrected_umi}{If set to TRUE output will contain corrected UMI matrix; see \code{correct} function}

\item{min_variance}{Lower bound for the estimated variance for any gene in any cell when calculating pearson residual; one of 'umi_median', 'model_median', 'model_mean' or a
numeric. default is -Inf.  When set to 'umi_median' uses (median of non-zero UMIs / 5)^2 as the minimum variance so that a median UMI (often 1)
results in a maximum pearson residual of 5. When set to 'model_median' or 'model_mean' uses the mean/median of the model estimated mu per gene as the minimum_variance.#'}

\item{bw_adjust}{Kernel bandwidth adjustment factor used during regurlarization; factor will be applied to output of bw.SJ; default is 3}

\item{gmean_eps}{Small value added when calculating geometric mean of a gene to avoid log(0); default is 1}

\item{theta_estimation_fun}{Character string indicating which method to use to estimate theta (when method = poisson); default is 'theta.ml', but 'theta.mm' seems to be a good and fast alternative}

\item{theta_given}{If method is set to nb_theta_given, this should be a named numeric vector of fixed theta values for the genes; if method is offset, this should be a single value; default is NULL}

\item{exclude_poisson}{Exclude poisson genes (i.e. mu < 0.001 or mu > variance) from regularization; default is FALSE}

\item{use_geometric_mean}{Use geometric mean instead of arithmetic mean for all calculations ; default is TRUE}

\item{use_geometric_mean_offset}{Use geometric mean instead of arithmetic mean in the offset model; default is FALSE}

\item{fix_intercept}{Fix intercept as defined in the offset model; default is FALSE}

\item{fix_slope}{Fix slope to log(10) (equivalent to using library size as an offset); default is FALSE}

\item{scale_factor}{Replace all values of UMI in the regression model by this value instead of the median UMI; default is NA}

\item{vst.flavor}{When set to `v2` sets method = glmGamPoi_offset, n_cells=2000, and exclude_poisson = TRUE which causes the model to learn theta and intercept only besides excluding poisson genes from learning and regularization; default is NULL which uses the original sctransform model}

\item{verbosity}{An integer specifying whether to show only messages (1), messages and progress bars (2) or nothing (0) while the function is running; default is 2}

\item{verbose}{Deprecated; use verbosity instead}

\item{show_progress}{Deprecated; use verbosity instead}
}
\value{
A list with components
\item{y}{Matrix of transformed data, i.e. Pearson residuals, or deviance residuals; empty if \code{residual_type = 'none'}}
\item{umi_corrected}{Matrix of corrected UMI counts (optional)}
\item{model_str}{Character representation of the model formula}
\item{model_pars}{Matrix of estimated model parameters per gene (theta and regression coefficients)}
\item{model_pars_outliers}{Vector indicating whether a gene was considered to be an outlier}
\item{model_pars_fit}{Matrix of fitted / regularized model parameters}
\item{model_str_nonreg}{Character representation of model for non-regularized variables}
\item{model_pars_nonreg}{Model parameters for non-regularized variables}
\item{genes_log_gmean_step1}{log-geometric mean of genes used in initial step of parameter estimation}
\item{cells_step1}{Cells used in initial step of parameter estimation}
\item{arguments}{List of function call arguments}
\item{cell_attr}{Data frame of cell meta data (optional)}
\item{gene_attr}{Data frame with gene attributes such as mean, detection rate, etc. (optional)}
\item{times}{Time stamps at various points in the function}
}
\description{
Apply variance stabilizing transformation to UMI count data using a regularized Negative Binomial regression model.
This will remove unwanted effects from UMI data and return Pearson residuals.
Uses future_lapply; you can set the number of cores it will use to n with plan(strategy = "multicore", workers = n).
If n_genes is set, only a (somewhat-random) subset of genes is used for estimating the
initial model parameters. For details see \doi{10.1186/s13059-019-1874-1}.
}
\section{Details}{

In the first step of the algorithm, per-gene glm model parameters are learned. This step can be done
on a subset of genes and/or cells to speed things up.
If \code{method} is set to 'poisson', a poisson regression is done and
the negative binomial theta parameter is estimated using the response residuals in
\code{theta_estimation_fun}.
If \code{method} is set to 'qpoisson', coefficients and overdispersion (phi) are estimated by quasi
poisson regression and theta is estimated based on phi and the mean fitted value - this is currently
the fastest method with results very similar to 'glmGamPoi'
If \code{method} is set to 'nb_fast', coefficients and theta are estimated as in the
'poisson' method, but coefficients are then re-estimated using a proper negative binomial
model in a second call to glm with \code{family = MASS::negative.binomial(theta = theta)}.
If \code{method} is set to 'nb', coefficients and theta are estimated by a single call to
\code{MASS::glm.nb}.
If \code{method} is set to 'glmGamPoi', coefficients and theta are estimated by a single call to
\code{glmGamPoi::glm_gp}.

A special case is \code{method = 'offset'}. Here no regression parameters are learned, but
instead an offset model is assumed. The latent variable is set to log_umi and a fixed
slope of log(10) is used (offset). The intercept is given by log(gene_mean) - log(avg_cell_umi).
See Lause et al. \doi{10.1186/s13059-021-02451-7} for details.
Theta is set
to 100 by default, but can be changed using the \code{theta_given} parameter (single numeric value).
If the offset method is used, the following parameters are overwritten:
\code{cell_attr <- NULL, latent_var <- c('log_umi'), batch_var <- NULL, latent_var_nonreg <- NULL,
n_genes <- NULL, n_cells <- NULL, do_regularize <- FALSE}. Further, \code{method = 'offset_shared_theta_estimate'}
exists where the 250 most highly expressed genes with detection rate of at least 0.5 are used
to estimate a theta that is then shared across all genes. Thetas are estimated per individual gene
using 5000 randomly selected cells. The final theta used for all genes is then the average.
}

\examples{
\donttest{
vst_out <- vst(pbmc)
}

}