File: fastqPairedFilter.Rd

package info (click to toggle)
r-bioc-dada2 1.34.0%2Bdfsg-2
links: PTS, VCS
area: main
in suites: sid, trixie
size: 3,016 kB
sloc: cpp: 3,096; makefile: 5
file content (165 lines) | stat: -rw-r--r-- 7,503 bytes
parent folder | download | duplicates (3)
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/filter.R
\name{fastqPairedFilter}
\alias{fastqPairedFilter}
\title{Filters and trims paired forward and reverse fastq files.}
\usage{
fastqPairedFilter(
  fn,
  fout,
  maxN = c(0, 0),
  truncQ = c(2, 2),
  truncLen = c(0, 0),
  maxLen = c(Inf, Inf),
  minLen = c(20, 20),
  trimLeft = c(0, 0),
  trimRight = c(0, 0),
  minQ = c(0, 0),
  maxEE = c(Inf, Inf),
  rm.phix = c(TRUE, TRUE),
  rm.lowcomplex = c(0, 0),
  matchIDs = FALSE,
  orient.fwd = NULL,
  id.sep = "\\\\s",
  id.field = NULL,
  n = 1e+06,
  OMP = TRUE,
  qualityType = "Auto",
  compress = TRUE,
  verbose = FALSE,
  ...
)
}
\arguments{
\item{fn}{(Required). A \code{character(2)} naming the paths to the (forward,reverse) fastq files.}

\item{fout}{(Required). A \code{character(2)} naming the paths to the (forward,reverse) output files.
 Note that by default (\code{compress=TRUE}) the output fastq files are gzipped.

\strong{FILTERING AND TRIMMING ARGUMENTS}   

If a length 1 vector is provided, the same parameter value is used for the forward and reverse reads.
If a length 2 vector is provided, the first value is used for the forward reads, and the second 
  for the reverse reads.}

\item{maxN}{(Optional). Default 0.
After truncation, sequences with more than \code{maxN} Ns will be discarded. 
Note that \code{\link{dada}} currently does not allow Ns.}

\item{truncQ}{(Optional). Default 2.
Truncate reads at the first instance of a quality score less than or equal to \code{truncQ}.}

\item{truncLen}{(Optional). Default 0 (no truncation).
Truncate reads after \code{truncLen} bases. Reads shorter than this are discarded.}

\item{maxLen}{(Optional). Default Inf (no maximum).
Remove reads with length greater than maxLen. maxLen is enforced on the raw reads.}

\item{minLen}{(Optional). Default 20.
Remove reads with length less than minLen. minLen is enforced after all other trimming and truncation.}

\item{trimLeft}{(Optional). Default 0.
The number of nucleotides to remove from the start of each read. If both \code{truncLen} and 
\code{trimLeft} are provided, filtered reads will have length \code{truncLen-trimLeft}.}

\item{trimRight}{(Optional). Default 0.
The number of nucleotides to remove from the end of each read. If both \code{truncLen} and 
\code{trimRight} are provided, truncation will be performed after \code{trimRight} is enforced.}

\item{minQ}{(Optional). Default 0.
After truncation, reads contain a quality score below minQ will be discarded.}

\item{maxEE}{(Optional). Default \code{Inf} (no EE filtering).
After truncation, reads with higher than maxEE "expected errors" will be discarded.
Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10))}

\item{rm.phix}{(Optional). Default TRUE.
If TRUE, discard reads that match against the phiX genome, as determined by 
\code{\link{isPhiX}}.}

\item{rm.lowcomplex}{(Optional). Default 0.
If greater than 0, reads with an effective number of kmers less than this value will be removed.
The effective number of kmers is determined by \code{\link{seqComplexity}} using a Shannon information
approximation. The default kmer-size is 2, and therefore perfectly random sequences will approach an
effective kmer number of 16 = 4 (nucleotides) ^ 2 (kmer size).}

\item{matchIDs}{(Optional). Default FALSE.
Whether to enforce matching between the id-line sequence identifiers of the forward and reverse fastq files.
  If TRUE, only paired reads that share id fields (see below) are output.
  If FALSE, no read ID checking is done.
Note: \code{matchIDs=FALSE} essentially assumes matching order between forward and reverse reads. If that
  matched order is not present future processing steps may break (in particular \code{\link{mergePairs}}).}

\item{orient.fwd}{(Optional). Default NULL.
 A character string present at the start of valid reads. Only allows unambiguous nucleotides. 
 This string is compared to the start of the forward and reverse reads. 
 If it exactly matches the start of the forward read, the read is kept.
 If it exactly matches the start of the reverse read, the fwd/rev reads are swapped.
 Otherwise the read if filtered out.
 The primary use of this parameter is to unify the orientation of amplicon sequencing libraries that
 are a mixture of forward and reverse orientations, and that include the forward primer on the reads.

\strong{ID MATCHING ARGUMENTS}   

 The following optional arguments enforce matching between the sequence identification
 strings in the forward and reverse reads, and can automatically detect and match ID fields in 
 Illumina format, e.g: EAS139:136:FC706VJ:2:2104:15343:197393. ID matching is not required
 when using standard Illumina output fastq files.}

\item{id.sep}{(Optional). Default "\\s" (white-space).
The separator between fields in the id-line of the input fastq files. Passed to the \code{\link{strsplit}}.}

\item{id.field}{(Optional). Default NULL (automatic detection).
The field of the id-line containing the sequence identifier.
If NULL (the default) and matchIDs is TRUE, the function attempts to automatically detect
  the sequence identifier field under the assumption of Illumina formatted output.}

\item{n}{(Optional). The number of records (reads) to read in and filter at any one time.
This controls the peak memory requirement so that very large fastq files are supported.
Default is \code{1e6}, one-million reads. See \code{\link[ShortRead]{FastqStreamer}} for details.}

\item{OMP}{(Optional). Default TRUE.
Whether or not to use OMP multithreading when calling \code{\link[ShortRead]{FastqStreamer}}. 
Set this to FALSE if calling this function within a parallelized chunk of code 
(eg. within \code{\link[parallel]{mclapply}}).}

\item{qualityType}{(Optional). \code{character(1)}.
The quality encoding of the fastq file(s). "Auto" (the default) means to
attempt to auto-detect the encoding. This parameter is passed on to
\code{\link[ShortRead]{readFastq}}; see information there for details.}

\item{compress}{(Optional). Default TRUE.
Whether the output fastq files should be gzip compressed.}

\item{verbose}{(Optional). Default FALSE.
Whether to output status messages.}

\item{...}{(Optional). Arguments passed on to \code{\link{isPhiX}} or \code{\link{seqComplexity}}.}
}
\value{
\code{integer(2)}.
 The number of reads read in, and the number of reads that passed the filter and were output.
}
\description{
fastqPairedFilter filters pairs of input fastq files (can be compressed) based on several
user-definable criteria, and outputs those read pairs which pass the filter in \strong{both} directions
to two new fastq file (also can be compressed). Several functions
in the \code{ShortRead} package are leveraged to do this filtering. The filtered forward/reverse reads
remain identically ordered.
}
\examples{

testFastqF = system.file("extdata", "sam1F.fastq.gz", package="dada2")
testFastqR = system.file("extdata", "sam1R.fastq.gz", package="dada2")
filtFastqF <- tempfile(fileext=".fastq.gz")
filtFastqR <- tempfile(fileext=".fastq.gz")
fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), maxN=0, maxEE=2)
fastqPairedFilter(c(testFastqF, testFastqR), c(filtFastqF, filtFastqR), trimLeft=c(10, 20),
                    truncLen=c(240, 200), maxEE=2, rm.phix=TRUE, rm.lowcomplex=5, kmerSize=2)

}
\seealso{
\code{\link{fastqFilter}}
\code{\link[ShortRead]{FastqStreamer}}
\code{\link[ShortRead]{trimTails}}
}