File: filterAndTrim.Rd

package info (click to toggle)
r-bioc-dada2 1.34.0%2Bdfsg-2
links: PTS, VCS
area: main
in suites: sid, trixie
size: 3,016 kB
sloc: cpp: 3,096; makefile: 5
file content (191 lines) | stat: -rw-r--r-- 9,157 bytes
parent folder | download | duplicates (3)
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/filter.R
\name{filterAndTrim}
\alias{filterAndTrim}
\title{Filter and trim fastq file(s).}
\usage{
filterAndTrim(
  fwd,
  filt,
  rev = NULL,
  filt.rev = NULL,
  compress = TRUE,
  truncQ = 2,
  truncLen = 0,
  trimLeft = 0,
  trimRight = 0,
  maxLen = Inf,
  minLen = 20,
  maxN = 0,
  minQ = 0,
  maxEE = Inf,
  rm.phix = TRUE,
  rm.lowcomplex = 0,
  orient.fwd = NULL,
  matchIDs = FALSE,
  id.sep = "\\\\s",
  id.field = NULL,
  multithread = FALSE,
  n = 1e+05,
  OMP = TRUE,
  qualityType = "Auto",
  verbose = FALSE
)
}
\arguments{
\item{fwd}{(Required). \code{character}.
The file path(s) to the fastq file(s), or the directory containing the fastq file(s). 
Compressed file formats such as .fastq.gz and .fastq.bz2 are supported.}

\item{filt}{(Required). \code{character}.
The path(s) to the output filtered file(s) corresponding to the \code{fwd} input files, or a directory
that will contain those files.
If containing directory does not exist, it will be created.}

\item{rev}{(Optional). Default NULL.
The file path(s) to the reverse fastq file(s) from paired-end sequence data corresponding to those
provided to the \code{fwd} argument, or the directory containing those fastq file(s). 
Compressed file formats such as .fastq.gz and .fastq.bz2 are supported.
If NULL, the \code{fwd} files are processed as single-reads.}

\item{filt.rev}{(Optional). Default NULL, but required if \code{rev} is provided.
The path(s) to the output filtered file(s) corresponding to the \code{rev} input files, or a directory
that will contain those files.
If containing directory does not exist, it will be created.}

\item{compress}{(Optional). Default TRUE.
 If TRUE, the output fastq file(s) are gzipped.

\strong{FILTERING AND TRIMMING PARAMETERS ---------}   

\strong{Note:} When filtering paired reads...
If a length 1 vector is provided, the same parameter value is used for the forward and reverse reads.
If a length 2 vector is provided, the first value is used for the forward reads, and the second 
  for the reverse reads.}

\item{truncQ}{(Optional). Default 2.
Truncate reads at the first instance of a quality score less than or equal to \code{truncQ}.}

\item{truncLen}{(Optional). Default 0 (no truncation).
Truncate reads after \code{truncLen} bases. Reads shorter than this are discarded.}

\item{trimLeft}{(Optional). Default 0.
The number of nucleotides to remove from the start of each read. If both \code{truncLen} and 
\code{trimLeft} are provided, filtered reads will have length \code{truncLen-trimLeft}.}

\item{trimRight}{(Optional). Default 0.
The number of nucleotides to remove from the end of each read. If both \code{truncLen} and 
\code{trimRight} are provided, truncation will be performed after \code{trimRight} is enforced.}

\item{maxLen}{(Optional). Default Inf (no maximum).
Remove reads with length greater than maxLen. maxLen is enforced \strong{before} trimming and truncation.}

\item{minLen}{(Optional). Default 20.
Remove reads with length less than minLen. minLen is enforced \strong{after} trimming and truncation.}

\item{maxN}{(Optional). Default 0.
After truncation, sequences with more than \code{maxN} Ns will be discarded. 
Note that \code{\link{dada}} does not allow Ns.}

\item{minQ}{(Optional). Default 0.
After truncation, reads contain a quality score less than \code{minQ} will be discarded.}

\item{maxEE}{(Optional). Default \code{Inf} (no EE filtering).
After truncation, reads with higher than \code{maxEE} "expected errors" will be discarded.
Expected errors are calculated from the nominal definition of the quality score: EE = sum(10^(-Q/10))}

\item{rm.phix}{(Optional). Default TRUE.
If TRUE, discard reads that match against the phiX genome, as determined by \code{\link{isPhiX}}.}

\item{rm.lowcomplex}{(Optional). Default 0.
If greater than 0, reads with an effective number of kmers less than this value will be removed.
The effective number of kmers is determined by \code{\link{seqComplexity}} using a Shannon information
approximation. The default kmer-size is 2, and therefore perfectly random sequences will approach an
effective kmer number of 16 = 4 (nucleotides) ^ 2 (kmer size).}

\item{orient.fwd}{(Optional). Default NULL.
A character string present at the start of valid reads. Only allows unambiguous nucleotides. 
This string is compared to the start of each read, and the reverse complement of each read.
If it exactly matches the start of the read, the read is kept.
If it exactly matches the start of the reverse-complement read, the read is reverse-complemented and kept.
Otherwise the read if filtered out.
For paired reads, the string is compared to the start of the forward and reverse reads, and if it matches
the start of the reverse read the reaads are swapped and kept.
The primary use of this parameter is to unify the orientation of amplicon sequencing libraries that
are a mixture of forward and reverse orientations, and that include the forward primer on the reads.}

\item{matchIDs}{(Optional). Default FALSE. Paired-read filtering only.
Whether to enforce matching between the id-line sequence identifiers of the forward and reverse fastq files.
  If TRUE, only paired reads that share id fields (see below) are output.
  If FALSE, no read ID checking is done.
Note: \code{matchIDs=FALSE} essentially assumes matching order between forward and reverse reads. If that
  matched order is not present future processing steps may break (in particular \code{\link{mergePairs}}).}

\item{id.sep}{(Optional). Default "\\s" (white-space). Paired-read filtering only.
The separator between fields in the id-line of the input fastq files. Passed to the \code{\link{strsplit}}.}

\item{id.field}{(Optional). Default NULL (automatic detection). Paired-read filtering only.
The field of the id-line containing the sequence identifier.
If NULL (the default) and matchIDs is TRUE, the function attempts to automatically detect
  the sequence identifier field under the assumption of Illumina formatted output.}

\item{multithread}{(Optional). Default is FALSE.
 If TRUE, input files are filtered in parallel via \code{\link[parallel]{mclapply}}.
 If an integer is provided, it is passed to the \code{mc.cores} argument of \code{\link[parallel]{mclapply}}.
 Note that the parallelization here is by forking, and each process is loading another fastq file into
 memory. This option is ignored in Windows, as Windows does not support forking, with \code{mc.cores} set to 1.
If memory is an issue, execute in a clean environment and reduce the chunk size \code{n} and/or
 the number of threads.}

\item{n}{(Optional). Default \code{1e5}.
The number of records (reads) to read in and filter at any one time. 
This controls the peak memory requirement so that very large fastq files are supported. 
See \code{\link[ShortRead]{FastqStreamer}} for details.}

\item{OMP}{(Optional). Default TRUE.
Whether or not to use OMP multithreading when calling \code{\link[ShortRead]{FastqStreamer}}. 
Should be set to FALSE if calling this function within a parallelized chunk of code.
If \code{multithread=TRUE}, this argument will be coerced to FALSE.}

\item{qualityType}{(Optional). \code{character(1)}.
The quality encoding of the fastq file(s). "Auto" (the default) means to
attempt to auto-detect the encoding. This may fail for PacBio files with
uniformly high quality scores, in which case use "FastqQuality". This
parameter is passed on to \code{\link[ShortRead]{readFastq}}; see
information there for details.}

\item{verbose}{(Optional). Default FALSE.
Whether to output status messages.}
}
\value{
Integer matrix. Returned invisibly (i.e. only if assigned to something).
 Rows correspond to the input files, columns record the reads.in and reads.out after filtering.
}
\description{
Filters and trims an input fastq file(s) (can be compressed)
based on several user-definable criteria, and outputs fastq file(s)
(compressed by default) containing those trimmed reads which passed the filters. Corresponding
forward and reverse fastq file(s) can be provided as input, in which case filtering
is performed on the forward and reverse reads independently, and both reads must pass for
the read pair to be output.
}
\details{
\code{filterAndTrim} is a multithreaded convenience interface for the \code{\link{fastqFilter}}
and \code{\link{fastqPairedFilter}} filtering functions.
Note that error messages and tracking are not handled gracefully when using the multithreading
functionality. If errors arise, it is recommended to re-run without multithreading to
troubleshoot the issue.
}
\examples{
testFastqs = c(system.file("extdata", "sam1F.fastq.gz", package="dada2"), 
               system.file("extdata", "sam2F.fastq.gz", package="dada2"))
filtFastqs <- c(tempfile(fileext=".fastq.gz"), tempfile(fileext=".fastq.gz"))
filterAndTrim(testFastqs, filtFastqs, maxN=0, maxEE=2, verbose=TRUE)
filterAndTrim(testFastqs, filtFastqs, truncQ=2, truncLen=200, rm.phix=TRUE, rm.lowcomplex=8)

}
\seealso{
\code{\link{fastqFilter}}
 \code{\link{fastqPairedFilter}}
 \code{\link[ShortRead]{FastqStreamer}}
}