\clearpage
\section{transform}
\label{sec:transform}

This operation transforms a single \textsf{KMC} database into one or more \textsf{KMC} databases or text files. \\
Command-line syntax: \\
kmc\_tools [global\_params] transform $<$input$>$ [input\_params] $<$oper1 [oper\_params1] output1 [output\_params1]$>$ [$<$oper2 [oper\_params2] output2 [output\_params2]$>$...]

where:


\begin{itemize}
	\item \textsf{oper1, oper2, ...} --- transform operation to be performed on the input,
	\item \textsf{input} --- path to the database generated by \textsf{KMC} (\textsf{KMC} generates two files with the same name but different extensions; only the name without the extension should be given here),
	\item \textsf{output1, output2, ...} --- paths to the output file(s).
\end{itemize}

For input there are additional parameters which can be set:
\begin{itemize}
	\item \textsf{-ci$<$value$>$} --- exclude $k$-mers occurring fewer than $<$value$>$ times,
	\item \textsf{-cx$<$value$>$} --- exclude $k$-mers occurring more than $<$value$>$ times.
\end{itemize}

If these parameters are not given, their values are taken from the input database. \\
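
As a sketch of where the input parameters go on the command line, assume a database named \textsf{kmers} (the database name, the output file name and the thresholds 3 and 1000 are all hypothetical). The call below would read only $k$-mers occurring between 3 and 1000 times and dump them to a text file: \\
kmc\_tools transform kmers -ci3 -cx1000 dump kmers.txt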

Valid values for \textsf{oper1, oper2, ...} are:
\begin{itemize}
	\item \textsf{sort} --- convert a database produced by \textsf{KMC2.x} to the \textsf{KMC1.x} database format (which stores $k$-mers in sorted order),
	\item \textsf{reduce} --- exclude too rare and too frequent $k$-mers,
	\item \textsf{compact} --- remove the counters of $k$-mers,
	\item \textsf{histogram} --- produce a histogram of $k$-mer occurrences,
	\item \textsf{dump} --- produce a text dump of the \textsf{KMC} database.
\end{itemize}

For the \textsf{sort}, \textsf{reduce} and \textsf{dump} operations, additional \textsf{output\_params} are available:
\begin{itemize}
	\item \textsf{-ci$<$value$>$} --- exclude $k$-mers occurring fewer than $<$value$>$ times,
	\item \textsf{-cx$<$value$>$} --- exclude $k$-mers occurring more than $<$value$>$ times,
	\item \textsf{-cs$<$value$>$} --- maximal value of a counter.
\end{itemize}

If these parameters are not specified, they are deduced from the input database. \\
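
To illustrate the position of \textsf{output\_params}, which follow the output path in the syntax above, the sketch below writes a reduced database that keeps $k$-mers occurring between 2 and 100 times and sets the maximal counter value to 255. The database names and all three values are hypothetical and chosen only for the example: \\
kmc\_tools transform kmers reduce kmers\_reduced -ci2 -cx100 -cs255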

For the \textsf{histogram} operation, additional \textsf{output\_params} are available:
\begin{itemize}
	\item \textsf{-ci$<$value$>$} --- minimum value of a counter to be stored in the output file (the default is the cutoff min stored in the database),
	\item \textsf{-cx$<$value$>$} --- maximum value of a counter to be stored in the output file (the default is the minimum of three values: $10^4$, the cutoff max stored in the database, and $2^{8\mathrm{CS}}-1$, where CS is the number of bytes used to store counters in the database).
\end{itemize}
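
As a worked illustration of the default \textsf{-cx} for \textsf{histogram}, write $\mathrm{cx}_{\mathrm{db}}$ for the cutoff max stored in the database (a symbol introduced here only for this example) and assume, purely hypothetically, that $\mathrm{cx}_{\mathrm{db}} = 10^9$. The default is then
\[
\mathrm{cx}_{\mathrm{default}} = \min\left(10^4,\ \mathrm{cx}_{\mathrm{db}},\ 2^{8\mathrm{CS}}-1\right),
\]
which gives
\[
\mathrm{CS}=1:\quad \min\left(10^4,\ 10^9,\ 2^{8}-1\right) = 255, \qquad
\mathrm{CS}=2:\quad \min\left(10^4,\ 10^9,\ 2^{16}-1\right) = 10^4 .
\]
Under these assumptions the counter width limits the histogram only for one-byte counters; with two or more bytes, $2^{8\mathrm{CS}}-1 \ge 65535$ exceeds $10^4$, so the $10^4$ term (or a smaller stored cutoff max) decides.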

For the \textsf{dump} operation the following additional \textsf{oper\_params} are available:
\begin{itemize}
	\item \textsf{-s} --- force sorted output (default: false). \\
	For \textsf{KMC1.x} databases this parameter is irrelevant, as $k$-mers are stored in sorted order and that order is preserved in the produced text file. For \textsf{KMC2.x} databases, when this parameter is set, $k$-mers are sorted before being dumped to the text file.
\end{itemize}
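
For example, to request a sorted text dump, \textsf{-s} is placed between the operation name and the output path, which is where \textsf{oper\_params} appear in the syntax above. The database and file names below are hypothetical: \\
kmc\_tools transform kmers dump -s kmers\_sorted.txt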


\subsection*{example 1 - split $k$-mers into valid and invalid}
Suppose that $k$-mers occurring fewer than 11 times are erroneous due to sequencing errors. With \textsf{reduce} we can split the $k$-mer set into one set of valid $k$-mers and one set of invalid $k$-mers: \\
kmc\_tools transform kmers reduce valid\_kmers -ci11 reduce erroneous\_kmers -cx10
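
To see why no $k$-mer should land in both outputs, let $c$ denote the counter of a $k$-mer (notation introduced here only for illustration). By the definitions of \textsf{-ci} and \textsf{-cx} given earlier, the two \textsf{reduce} operations keep
\[
\textsf{valid\_kmers}: \; c \ge 11, \qquad \textsf{erroneous\_kmers}: \; c \le 10 .
\]
Since $c$ is an integer, exactly one of the two conditions holds for any given $k$-mer, so the two outputs are disjoint. The cutoffs not given on the command line (the upper one for \textsf{valid\_kmers}, the lower one for \textsf{erroneous\_kmers}) are, as stated earlier, deduced from the input database.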

\subsection*{example 2 - perform all operations}
kmc\_tools transform kmers reduce reduced -ci10 sort sorted compact without\_counters histogram histo.txt dump kmers.txt