% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/filter.R
\name{seqComplexity}
\alias{seqComplexity}
\title{Determine if input sequence(s) are low complexity.}
\usage{
seqComplexity(seqs, kmerSize = 2, window = NULL, by = 5, ...)
}
\arguments{
\item{seqs}{(Required). A \code{character} vector of A/C/G/T sequences, or
any object coercible by \code{\link{getSequences}}.}

\item{kmerSize}{(Optional). Default 2.
The size of the kmers (or "oligonucleotides" or "words") to use.}

\item{window}{(Optional). Default NULL.
The width in nucleotides of the moving window. If NULL, the whole sequence is used.}

\item{by}{(Optional). Default 5.
The step size in nucleotides between each moving window tested.}

\item{...}{(Optional). Ignored.}
}
\value{
\code{numeric}.
 A vector of minimum kmer complexities for each sequence.
}
\description{
This function calculates the kmer
complexity of input sequences. Complexity is quantified as the Shannon
richness of kmers, which can be thought of as the
effective number of kmers if they were all
at equal frequencies. If a window size is provided, the minimum Shannon
richness observed over a sliding window along the sequence is returned.
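
As a sketch of this definition, assuming the conventional meaning of an
effective number (the exponential of the Shannon entropy), and writing
\eqn{p_i} for the observed frequency of the i-th distinct kmer (the symbols
\eqn{D} and \eqn{p_i} are notation introduced here for illustration):
\deqn{D = \exp\left(-\sum_i p_i \log p_i\right)}{D = exp(-sum_i p_i log(p_i))}
Under this assumption, \eqn{D} equals the number of distinct kmers when all of
them are equally frequent, and approaches 1 when a single kmer dominates.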
}
\details{
This function can be used to identify potentially artefactual or undesirable
low-complexity sequences, or sequences with low-complexity regions, as are
sometimes observed in Illumina sequencing runs. When such artefactual
sequences are present, the Shannon kmer
richness values returned by this function will typically show a clear
bimodal signal.

Kmers with non-ACGT characters are ignored. Also note that no correction is
performed for sequence lengths. This is important when using longer kmer
lengths, where \code{4^kmerSize} approaches the length of the sequence, as shorter
sequences will then have a lower effective richness simply due to there
being too little sequence to sample all the possible kmers.
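
As a worked illustration of this length effect (the symbols \eqn{L} and
\eqn{k} are notation introduced here for the sequence or window length and
for \code{kmerSize}): a stretch of \eqn{L} nucleotides contains only
\eqn{L - k + 1} overlapping kmers, so an effective number of kmers cannot
exceed
\deqn{\min(4^k, L - k + 1)}{min(4^k, L - k + 1)}
For example, with \eqn{k = 5} there are \eqn{4^5 = 1024} possible kmers, but a
100-nucleotide sequence contains at most 96 distinct kmers.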
}
\examples{
sq.norm <- "TACGGAAGGTCCGGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCCGGAGATTAAGCGTGTTGTGA"
sq.lowc <- "TCCTTCTTCTCCTCTCTTTCTCCTTCTTTCTTTTTTTTCCCTTTCTCTTCTTCTTTTTCTTCCTTCCTTTTTTC"
sq.part <- "TTTTTCTTCTCCCCCTTCCCCTTTCCTTTTCTCCTTTTTTCCTTTAGTGCAGTTGAGGCAGGCGGAATTCGTGG"
sqs <- c(sq.norm, sq.lowc, sq.part)
seqComplexity(sqs)
seqComplexity(sqs, kmerSize=3, window=25)
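
# A base-R sketch of the whole-sequence calculation, for illustration only.
# Assumptions: Shannon richness is taken to be exp(Shannon entropy) of the
# observed kmer frequencies, and `shannonRichness` is a hypothetical helper
# that is not part of dada2. It may not match seqComplexity() exactly.
shannonRichness <- function(sq, k = 2) {
  # Start positions of every overlapping kmer of width k
  starts <- seq_len(nchar(sq) - k + 1)
  kmers <- substring(sq, starts, starts + k - 1)
  # Drop kmers that contain non-ACGT characters
  kmers <- kmers[grepl("^[ACGT]+$", kmers)]
  # Observed frequency of each distinct kmer
  p <- as.numeric(table(kmers)) / length(kmers)
  exp(-sum(p * log(p)))
}
# A homopolymer has a single distinct kmer, so its richness is 1
shannonRichness("TTTTTTTTTTTT")
# Applied to several sequences at once
sapply(c("ACGTTGCAAGTCCGATAGCT", "TCTCTCTCTCTCTCTCTCTC"), shannonRichness, USE.NAMES = FALSE)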

}
\seealso{
\code{\link{plotComplexity}}
 \code{\link[Biostrings]{oligonucleotideFrequency}}
}