% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/cleanSizeFactors.R
\name{cleanSizeFactors}
\alias{cleanSizeFactors}
\title{Clean out non-positive size factors}
\usage{
cleanSizeFactors(
  size.factors,
  num.detected,
  control = nls.control(warnOnly = TRUE),
  iterations = 3,
  nmads = 3,
  ...
)
}
\arguments{
\item{size.factors}{A numeric vector containing size factors for all libraries.}

\item{num.detected}{A numeric vector of the same length as \code{size.factors}, containing the number of features detected in each library.}

\item{control}{Argument passed to \code{\link{nls}} to control the fitting; see \code{?\link{nls.control}} for details.}

\item{iterations}{Integer scalar specifying the number of robustness iterations.}

\item{nmads}{Numeric scalar specifying the multiple of MADs to use for the tricube bandwidth in robustness iterations.}

\item{...}{Further arguments to pass to \code{\link{nls}}.}
}
\value{
A numeric vector identical to \code{size.factors} but with all non-positive size factors replaced with fitted values from the curve.
}
\description{
Coerce non-positive size factors (occasionally generated by \code{\link{pooledSizeFactors}}) to positive values based on the number of detected features.
}
\details{
This function will first fit a non-linear curve of the form
\deqn{y = \frac{ax}{1 + bx}}{y = ax/(1 + bx)}
where \code{y} is \code{num.detected} and \code{x} is \code{size.factors} for all positive size factors.
This is a purely empirical expression, chosen because it passes through the origin, is linear near zero and asymptotes at large \code{x}.
The fitting is done robustly with iterations of tricube weighting to eliminate outliers.

We then consider the number of detected features for all samples with non-positive size factors.
This is treated as \code{y} and used to solve for \code{x} based on the curve fitted above.
The result is the \dQuote{cleaned} size factor, which must always be positive for \code{y < a/b}.
For \code{y > a/b}, there is no solution so the cleaned size factor is defined as the largest positive value in \code{size.factors}.
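Both cases follow from rearranging the fitted curve for \code{x}, assuming positive \code{a} and \code{b}:
\deqn{x = \frac{y}{a - by}}{x = y/(a - by)}
The denominator \code{a - by} is positive only when \code{y < a/b}, which is why a positive solution exists only in that range.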

Negative size factors can occasionally be generated by \code{\link{pooledSizeFactors}}; see the documentation there for more details.
By coercing them to positive values, we can proceed to normalization and downstream analyses.
Here, we use the number of detected features as this is more robust to differential expression that would cause biases in the library size.
Of course, it is not theoretically guaranteed to yield the correct size factor, but a rough guess is better than a negative value.
}
\examples{
set.seed(100)
counts <- matrix(rpois(20000, lambda=1), ncol=100)

library(scuttle)
sf <- librarySizeFactors(counts)
ngenes <- colSums(counts > 0)

# Adding negative size factor values to be cleaned.
out <- cleanSizeFactors(c(-1, -1, sf), c(100, 50, ngenes))
head(out)
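
# A minimal sketch of the inversion step described in Details, using
# hypothetical coefficients 'a.hyp' and 'b.hyp' (illustrative values only,
# not those estimated internally by cleanSizeFactors).
a.hyp <- 200
b.hyp <- 1
y.hyp <- c(100, 50) # numbers of detected features, both below a.hyp/b.hyp
y.hyp / (a.hyp - b.hyp * y.hyp) # implied positive size factors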

}
\seealso{
\code{\link{pooledSizeFactors}}, which can occasionally generate negative size factors.

\code{\link{nls}}, which performs the curve fitting.
}
\author{
Aaron Lun
}