File: jellyfish.tex

package info (click to toggle)
jellyfish 2.2.6-1
links: PTS, VCS
area: main
in suites: stretch
size: 3,160 kB
ctags: 7,713
sloc: cpp: 35,152; ruby: 564; sh: 472; makefile: 355; python: 158; perl: 24
file content (205 lines) | stat: -rw-r--r-- 7,727 bytes
parent folder | download | duplicates (2)
\documentclass[english]{article}
\usepackage[latin1]{inputenc}
\usepackage{babel}
\usepackage{verbatim}

%% do we have the hyperref package?
\IfFileExists{hyperref.sty}{
   \usepackage[bookmarksopen,bookmarksnumbered]{hyperref}
}{}

%% do we have the fancyhdr package?
\IfFileExists{fancyhdr.sty}{
\usepackage[fancyhdr]{latex2man}
}{
%% do we have the fancyheadings package?
\IfFileExists{fancyheadings.sty}{
\usepackage[fancy]{latex2man}
}{
\usepackage[nofancy]{latex2man}
\message{no fancyhdr or fancyheadings package present, discard it}
}}
\usepackage[letterpaper]{geometry}

\newcommand{\ddash}[1]{-\,-#1}
\newcommand{\LOpt}[1]{\Opt{\ddash{#1}}}
\newcommand{\LoOpt}[1]{\oOpt{\ddash{#1}}}
\newcommand{\LOptArg}[2]{\OptArg{\ddash{#1}}{#2}}
\newcommand{\LoOptArg}[2]{\oOptArg{\ddash{#1}}{#2}}
\newcommand{\LoOptoArg}[2]{\oOptoArg{\ddash{#1}}{#2}}

\setVersion{1.1.4}
\setDate{2010/10/1}

\begin{document}

\begin{Name}{1}{jellyfish}{G. Marcais and C. Kingsford}{k-mer counter}{Jellyfish: A fast k-mer counter}

\Prog{Jellyfish} is a software to count $k$-mers in DNA sequences.

\end{Name}

\section{Synopsis}
\Prog{jellyfish count} \oOptArg{-o}{prefix} \oOptArg{-m}{merlength} \oOptArg{-t}{threads} \oOptArg{-s}{hashsize} \LoOpt{both-strands} \Arg{fasta} \oArg{fasta \Dots} \\
\Prog{jellyfish merge} \Arg{hash1} \Arg{hash2} \Dots \\
\Prog{jellyfish dump}  \Arg{hash} \\
\Prog{jellyfish stats} \Arg{hash} \\
\Prog{jellyfish histo} \oOptArg{-h}{high} \oOptArg{-l}{low} \oOptArg{-i}{increment} \Arg{hash} \\
\Prog{jellyfish query} \Arg{hash} \\
\Prog{jellyfish cite} \\


Plus equivalent version for \Prog{Quake} mode: \Prog{qhisto}, \Prog{qdump} and \Prog{qmerge}.

\section{Description}

\Prog{Jellyfish} is a $k$-mer counter based on a multi-threaded hash
table implementation.

\subsection{Counting and merging}

To count $k$-mers, use a command like:

\begin{verbatim}
jellyfish count -m 22 -o output -c 3 -s 10000000 -t 32 input.fasta
\end{verbatim}

This will count the the 22-mers in input.fasta with 32 threads. The
counter field in the hash uses only 3 bits and the hash has at least
10 million entries.

The output files will be named output\_0, output\_1, etc. (the prefix
is specified with the \Opt{-o} switch). If the hash is large enough
(has specified by the \Opt{-s} switch) to fit all the $k$-mers, there
will be only one output file named output\_0. If the hash filled up
before all the mers were read, the hash is dumped to disk, zeroed out
and reading in mers resumes. Multiple intermediary files will be
present on the disks, named output\_0, output\_1, etc.

To obtain correct results from the other sub-commands (such as histo,
stats, etc.), the multiple output files, if any, need to be merged into one
with the merge command. For example with the following command:

\begin{verbatim}
jellyfish merge -o output.jf output\_*
\end{verbatim}

Should you get many intermediary output files (say hundreds), the size
of the hash table is too small. Rerunning \Prog{Jellyfish} with a
larger size (option \Opt{-s}) is probably faster than merging all the
intermediary files.

\subsection{Orientation}
When the orientation of the sequences in the input fasta file is not
known, e.g. in sequencing reads, using \LOpt{both-strands} (\Opt{-C})
makes the most sense.

For any $k$-mer $m$, its canonical representation is $m$ itself or its
reverse-complement, whichever comes first lexicographically. With the
option \Opt{-C}, only the canonical representation of the mers are
stored in the hash and the count value is the number of occurrences of
both the mer and its reverse-complement.


\subsection{Choosing the hash size}

To achieve the best performance, a minimum number of intermediary
files should be written to disk. So the parameter \Opt{-s} should be
chosen to fit as many $k$-mers as possible (ideally all of them) while
still fitting in memory.

We consider to examples: counting mers in sequencing reads and in a
finished genome.

First, suppose we count $k$-mers in short sequencing reads:
there are $n$ reads and there is an average of 1 error per reads where
each error generates $k$ unique mers. If the genome size is $G$, the
size of the hash (option \Opt{-s}) to fit all $k$-mers at once is estimated to: $(G +
k*n)/0.8$. The division by $0.8$ compensates for the maximum usage of
approximately $80\%$ of the hash table.

On the other hand, when counting $k$-mers in an assembled sequence of
length $G$, setting \Opt{-s} to $G$ is appropriate.

As a matter of convenience, Jellyfish understands ISO suffixes for the
size of the hash. Hence '-s 10M' stands 10 million entries while '-s
50G' stands for 50 billion entries.

The actual memory usage of the hash table can be computed as
follow. The actual size of the hash will be rounded up to the next
power of 2: $s=2^l$. The parameter $r$ is such that the maximum
reprobe value (\Opt{-p}) plus one is less than $2^r$. Then the memory usage per
entry in the hash is (in bits, not bytes) $2k-l+r+1$. The total memory
usage of the hash table in bytes is: $2^l*(2k-l+r+1)/8$.

\subsection{Choosing the counting field size}
To save space, the hash table supports variable length counter, i.e. a
$k$-mer occurring only a few times will use a small counter, a $k$-mer
occurring many times will used multiple entries in the hash.

Important: the size of the couting field does NOT change the result,
it only impacts the amount of memory used. In particular, there is no
maximum value in the hash. Even if the counting field uses 5 bits, a
$k$-mer occurring 2 million times will have a value reported of 2
million (i.e., it is not capped at $2^5$).

The \Opt{-c} specify the length (in bits) of the counting field. The
trade off is as follows: a low value will save space per entry in the
hash but can potentially increase the number of entries used, hence
maybe requiring a larger hash.

In practice, use a value for \Opt{-c} so that most of you $k$-mers
require only 1 entry. For example, to count $k$-mers in a genome,
where most of the sequence is unique, use \OptArg{-c}{1} or
\OptArg{-c}{2}. For sequencing reads, use a value for \Opt{-c} large
enough to counts up to twice the coverage. For example, if the
coverage is 10X, choose a counter length of $5$ (\OptArg{-c}{5}) as $2^5 > 20$.


\section{Subcommands and options}
include(options.tex)

\section{Version}

Version: \Version\ of \Date

\section{Bugs}

\begin{itemize}
\item \Prog{jellyfish merge} has not been parallelized and is
  relatively slow.
\item The hash table does not grow in memory automatically and
  \Prog{jellyfish merge} is not called automatically on the
  intermediary files (if any).
\end{itemize}

\section{Copyright \& License}
\begin{description}
\item[Copyright] \copyright\ 2010, Guillaume Marcais \Email{guillaume@marcais.net} and Carl Kingsford \Email{carlk@umiacs.umd.edu}.

\item[License] This program is free software: you can redistribute it
  and/or modify it under the terms of the GNU General Public License
  as published by the Free Software Foundation, either version 3 of
  the License, or (at your option) any later version. \\
  This program is distributed in the hope that it will be useful, but
  WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  General Public License for more details. \\
  You should have received a copy of the GNU General Public License
  along with this program.  If not, see
  \URL{http://www.gnu.org/licenses/}. 
\end{description}

\section{Authors}
\noindent
Guillaume Marcais \\
University of Maryland \\
\Email{gmarcais@umd.edu}

\noindent
Carl Kingsford \\
University of Maryland \\
\Email{carlk@umiacs.umd.edu}

\LatexManEnd
\end{document}