File: dada.Rd

package info (click to toggle)
r-bioc-dada2 1.34.0%2Bdfsg-2
  • links: PTS, VCS
  • area: main
  • in suites: sid, trixie
  • size: 3,016 kB
  • sloc: cpp: 3,096; makefile: 5
file content (128 lines) | stat: -rw-r--r-- 6,458 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dada.R
\name{dada}
\alias{dada}
\title{High resolution sample inference from amplicon data.}
\usage{
dada(
  derep,
  err,
  errorEstimationFunction = loessErrfun,
  selfConsist = FALSE,
  pool = FALSE,
  priors = character(0),
  multithread = FALSE,
  verbose = TRUE,
  ...
)
}
\arguments{
\item{derep}{(Required). \code{character} or \code{\link{derep-class}}.
The file path(s) to the fastq file(s), or a directory containing fastq file(s) corresponding to the
the samples to be denoised. Compressed file formats such as .fastq.gz and .fastq.bz2 are supported.
A \code{\link{derep-class}} object (or list thereof) returned by \code{link{derepFastq}} can also be provided.
If multiple samples are provided, each will be denoised with a shared error model.}

\item{err}{(Required). 16xN numeric matrix, or an object coercible by \code{\link{getErrors}} 
such as the output of the \code{\link{learnErrors}} function.
  
The matrix of estimated rates for each possible nucleotide transition (from sample nucleotide to read nucleotide).
Rows correspond to the 16 possible transitions (t_ij) indexed such that 1:A->A, 2:A->C, ..., 16:T->T
Columns correspond to quality scores. Each entry must be between 0 and 1.
  
Typically there are 41 columns for the quality scores 0-40.
However, if USE_QUALS=FALSE, the matrix must have only one column.
  
If selfConsist = TRUE, \code{err} can be set to NULL and an initial error matrix will be estimated from the data
by assuming that all reads are errors away from one true sequence.}

\item{errorEstimationFunction}{(Optional). Function. Default \code{\link{loessErrfun}}.

 If USE_QUALS = TRUE, \code{errorEstimationFunction(dada()$trans_out)} is computed after sample inference, 
   and the return value is used as the new estimate of the err matrix in $err_out.
   
 If USE_QUALS = FALSE, this argument is ignored, and transition rates are estimated by maximum likelihood (t_ij = n_ij/n_i).}

\item{selfConsist}{(Optional). \code{logical(1)}. Default FALSE.

 If selfConsist = TRUE, the algorithm will alternate between sample inference and error rate estimation 
   until convergence. Error rate estimation is performed by \code{errorEstimationFunction}.
   
 If selfConsist=FALSE the algorithm performs one round of sample inference based on the provided \code{err} matrix.}

\item{pool}{(Optional). \code{logical(1)}. Default is FALSE.

 If pool = TRUE, the algorithm will pool together all samples prior to sample inference.
 If pool = FALSE, sample inference is performed on each sample individually.
 If pool = "pseudo", the algorithm will perform pseudo-pooling between individually processed samples.
 
 This argument has no effect if only 1 sample is provided, and \code{pool} does not affect
  error rates, which are always estimated from pooled observations across samples.}

\item{priors}{(Optional). \code{character}. Default is character(0), i.e. no prior sequences.
 
The priors argument provides a set of sequences for which there is prior information suggesting they may
 truly exist, i.e. are not errors. The abundance p-value of dereplicated sequences that exactly match one
 of the priors are calculated without conditioning on presence, allowing singletons to be detected,
 and are compared to a reduced threshold `OMEGA_P` when forming new partitions.}

\item{multithread}{(Optional). Default is FALSE.
If TRUE, multithreading is enabled and the number of available threads is automatically determined.   
If an integer is provided, the number of threads to use is set by passing the argument on to
\code{\link{setThreadOptions}}.}

\item{verbose}{(Optional). Default TRUE. 
 Print verbose text output. More fine-grained control is available by providing an integer argument.
\itemize{ 
 \item{0: Silence. No text output (same as FALSE). }
 \item{1: Basic text output (same as TRUE). }
 \item{2: Detailed text output, mostly intended for debugging. }
}}

\item{...}{(Optional). All dada_opts can be passed in as arguments to the dada() function.
See \code{\link{setDadaOpt}} for a full list and description of these options.}
}
\value{
A \code{\link{dada-class}} object or list of such objects if a list of dereps was provided.
}
\description{
The dada function takes as input dereplicated amplicon sequencing reads and returns the inferred composition
 of the sample (or samples). Put another way, dada removes all sequencing errors to reveal the members of the
 sequenced community.
 
If dada is run in selfConsist=TRUE mode, the algorithm will infer both the sample composition and
 the parameters of its error model from the data.
}
\details{
Briefly, \code{dada} implements a statistical test for the notion that a specific sequence was seen too many times
 to have been caused by amplicon errors from currently inferred sample sequences. Overly-abundant
 sequences are used as the seeds of new partitions of sequencing reads, and the final set of partitions
 is taken to represent the denoised composition of the sample. A more detailed explanation of the algorithm
 is found in two publications:

\itemize{ 
 \item{Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP (2016). DADA2: High resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581-3.}
 \item{Rosen MJ, Callahan BJ, Fisher DS, Holmes SP (2012). Denoising PCR-amplified metagenome data. BMC bioinformatics, 13(1), 283.}
}
 
\code{dada} depends on a parametric error model of substitutions. Thus the quality of its sample inference is affected
 by the accuracy of the estimated error rates. \code{selfConsist} mode allows these error rates to be inferred 
 from the data.
 
All comparisons between sequences performed by \code{dada} depend on pairwise alignments. This step is the most 
 computationally intensive part of the algorithm, and two alignment heuristics have been implemented for speed:
 A kmer-distance screen and banded Needleman-Wunsch alignmemt. See \code{\link{setDadaOpt}}.
}
\examples{
fn1 <- system.file("extdata", "sam1F.fastq.gz", package="dada2")
fn2 <- system.file("extdata", "sam2F.fastq.gz", package="dada2")
derep1 = derepFastq(fn1)
derep2 = derepFastq(fn2)
dada(fn1, err=tperr1)
dada(list(sam1=derep1, sam2=derep2), err=tperr1, selfConsist=TRUE)
dada(derep1, err=inflateErr(tperr1,3), BAND_SIZE=32, OMEGA_A=1e-20)

}
\seealso{
\code{\link{derepFastq}}, \code{\link{setDadaOpt}}
}