\clearpage
\section{transform}
\label{sec:transform}
This operation transforms a single \textsf{KMC} database into one or more \textsf{KMC} databases or text files. \\
Command-line syntax: \\
kmc\_tools [global\_params] transform $<$input$>$ [input\_params] $<$oper1 [oper\_params1] output1 [output\_params1]$>$ [$<$oper2 [oper\_params2] output2 [output\_params2]$>$...]
where:
\begin{itemize}
\item \textsf{oper1, oper2, ...} --- transform operation to be performed on the input,
\item \textsf{input} --- path to the database generated by \textsf{KMC} (\textsf{KMC} generates 2 files with the same name but different extensions; only the name without extension should be given),
\item \textsf{output1, output2, ...} --- paths to the output file(s).
\end{itemize}
For input there are additional parameters which can be set:
\begin{itemize}
\item \textsf{-ci<value>} --- exclude $k$-mers occurring fewer than <value> times,
\item \textsf{-cx<value>} --- exclude $k$-mers occurring more than <value> times.
\end{itemize}
If these additional parameters are not given, they are taken from the input database. \\
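As a minimal illustration of where the input parameters go in the syntax above, the following sketch applies both cutoffs to the input before a \textsf{dump} operation (described below). The database name \textsf{kmers}, the output name \textsf{kmers.txt} and the cutoff values 3 and 1000 are placeholders chosen only for this example: \\
kmc\_tools transform kmers -ci3 -cx1000 dump kmers.txt \\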
Valid values for oper1, oper2,... are:
\begin{itemize}
\item \textsf{sort} --- converts a database produced by KMC2.x to the KMC1.x database format (which contains $k$-mers in sorted order),
\item \textsf{reduce} --- exclude too rare and too frequent $k$-mers,
\item \textsf{compact} --- remove counters of $k$-mers,
\item \textsf{histogram} --- produce a histogram of $k$-mer occurrences,
\item \textsf{dump} --- produce text dump of \textsf{KMC} database.
\end{itemize}
For \textsf{sort}, \textsf{reduce} and \textsf{dump} operations additional \textsf{output\_params} are available:
\begin{itemize}
\item \textsf{-ci<value>} --- exclude $k$-mers occurring fewer than <value> times,
\item \textsf{-cx<value>} --- exclude $k$-mers occurring more than <value> times,
\item \textsf{-cs<value>} --- maximal value of a counter.
\end{itemize}
If these parameters are not specified, they are deduced from the input database. \\
For \textsf{histogram} operation additional \textsf{output\_params} are available:
\begin{itemize}
\item \textsf{-ci<value>} --- minimum value of a counter to be stored in the output file (default value is the cutoff min stored in the database),
\item \textsf{-cx<value>} --- maximum value of a counter to be stored in the output file (default value is the minimum of three values: $10^4$, the cutoff max stored in the database, and $2^{8\mathrm{CS}}-1$, where CS is the number of bytes used to store counters in the database).
\end{itemize}
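As a worked illustration of the default \textsf{-cx} value for \textsf{histogram}, assume a hypothetical database whose counters occupy $\mathrm{CS}=2$ bytes and whose stored cutoff max is $10^9$ (both numbers are chosen only for this example):
\[
\min\left(10^4,\ 10^9,\ 2^{8\cdot 2}-1\right) = \min\left(10^4,\ 10^9,\ 65535\right) = 10^4 .
\]
With $\mathrm{CS}=1$ instead, the third term becomes $2^{8}-1=255$, so the default would be $255$. \\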
For the \textsf{dump} operation there are additional \textsf{oper\_params}:
\begin{itemize}
\item \textsf{-s} --- force sorted output (default: false). \\
For \textsf{KMC1.x} this parameter is irrelevant, as $k$-mers are stored in sorted order and this order is preserved in the produced text file. For \textsf{KMC2.x}, when this parameter is set, $k$-mers are sorted before dumping to the text file.
\end{itemize}
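Following the command-line syntax above, in which \textsf{oper\_params} are placed between the operation name and the output path, a sorted text dump could be requested as sketched below. The database name \textsf{kmers} and the output name \textsf{kmers.txt} are placeholders chosen only for this example: \\
kmc\_tools transform kmers dump -s kmers.txt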
\subsection*{example 1 - split $k$-mers into valid and invalid}
Suppose that $k$-mers occurring fewer than 11 times are erroneous due to sequencing errors. With \textsf{reduce} we can split the $k$-mer set into one set of valid $k$-mers and one set of invalid $k$-mers: \\
kmc\_tools transform kmers reduce valid\_kmers -ci11 reduce erroneous\_kmers -cx10
\subsection*{example 2 - perform all operations}
kmc\_tools transform kmers reduce reduced -ci10 sort sorted compact without\_counters histogram histo.txt dump kmers.txt