\clearpage
\section{filter}
\label{sec:filter_operation}

This operation works with input FASTQ/FASTA files and a database produced by \textsf{KMC}.
It removes from the input read set those reads which do not contain the specified number of $k$-mers present in the input \textsf{KMC} database.
Currently, read names are completely ignored by kmc\_tools (though this may change in the future).

Syntax: \\
kmc\_tools [global\_params] filter [filter\_params] $<$kmc\_input\_db$>$ [kmc\_input\_db\_params] $<$input\_read\_set$>$ [input\_read\_set\_params] $<$output\_read\_set$>$ [output\_read\_set\_params] \\

where:

\begin{itemize}
	\item \textsf{kmc\_input\_db} --- path to database generated by \textsf{KMC},
	\item \textsf{input\_read\_set} --- path to input set of reads,
	\item \textsf{output\_read\_set} --- path to output set of reads.
\end{itemize}

filter\_params are:
\begin{itemize}
	\item \textsf{-t} --- trim reads at the first invalid $k$-mer instead of removing them entirely.
\end{itemize}

For the $k$-mer database there are additional parameters:

\begin{itemize}
	\item \textsf{-ci$<$value$>$} --- exclude $k$-mers occurring fewer than $<$value$>$ times,
	\item \textsf{-cx$<$value$>$} --- exclude $k$-mers occurring more than $<$value$>$ times.
\end{itemize}

For the input set of reads there are additional parameters:

\begin{itemize}
	\item \textsf{-ci$<$value$>$} --- remove reads containing fewer $k$-mers than value (but if -t is set, the read is trimmed at the first $k$-mer with counter lower than value),
	\item \textsf{-cx$<$value$>$} --- remove reads containing more $k$-mers than value (but if -t is set, the read is trimmed at the first $k$-mer with counter higher than value),
	\item \textsf{-f$<$a/q$>$} --- input in FASTA format (-fa) or FASTQ format (-fq); default: FASTQ.
\end{itemize}
For the input set of reads, either an integer or a floating-point number can be given as \textsf{-ci$<$value$>$} and \textsf{-cx$<$value$>$}. Integer values define strict thresholds: only reads that contain at least $ci_{value}$ and at most $cx_{value}$ $k$-mers will be kept in the output read set.
Floating-point values for the \textsf{-ci$<$value$>$} and \textsf{-cx$<$value$>$} parameters define thresholds that depend on the read length; they should be in the range $[0.0; 1.0]$. Let $r$ be the length of a read. The read will be kept in the output read set only if it contains at least $\lfloor (r-k+1) * ci_{value} \rfloor$ and at most $\lfloor (r-k+1) * cx_{value} \rfloor$ $k$-mers which are present in the \textsf{KMC} database. \\ \\
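As a worked illustration of the length-dependent thresholds (the values $r = 100$ and $k = 25$ are assumed here purely for the example), such a read contains $r-k+1 = 76$ $k$-mers, so \textsf{-ci0.5} and \textsf{-cx1.0} give
\[
\lfloor 76 * 0.5 \rfloor = 38 \qquad \mbox{and} \qquad \lfloor 76 * 1.0 \rfloor = 76 ,
\]
i.e.\ the read is kept only if between 38 and 76 of its $k$-mers are present in the \textsf{KMC} database. \\ \\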
For the output set of reads there are additional parameters:
\begin{itemize}
	\item \textsf{-f$<$a/q$>$} --- output in FASTA format (-fa) or FASTQ format (-fq); default: same as the input.
\end{itemize}
\textsf{input\_read\_set} may be a single file or a file which contains a list of input files (one file per line); in the latter case the file name is prefixed with @, as in the last example below.

\section*{example}
 kmc\_tools filter kmc\_db -ci3 input.fastq -ci0.5 -cx1.0 filtered.fastq \\
 kmc\_tools filter kmc\_db input.fastq -ci10 -cx100 filtered.fastq \\
 kmc\_tools filter kmc\_db @input\_files.txt -ci10 -cx100 filtered.fastq
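
The syntax given earlier also allows trimming to be combined with a change of output format. As a sketch assembled only from the parameters documented in this section (the file names are placeholders chosen for illustration), trimming each read at the first $k$-mer whose counter is lower than 10, using only database $k$-mers occurring at least 3 times, and writing the result in FASTA format might look like: \\
 kmc\_tools filter -t kmc\_db -ci3 input.fastq -ci10 filtered.fasta -fa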