1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335
|
\documentclass[12pt,titlepage]{article}
\usepackage[a4paper,top=30mm,bottom=30mm,left=20mm,right=20mm]{geometry}
\usepackage[utf8]{inputenc}
\usepackage{xspace}
\usepackage{listings}
\usepackage{optionman}
\usepackage{url}
\usepackage{booktabs}
\usepackage{xcolor}
%\usepackage[binary-units]{siunitx}
%\usepackage{TheSansUHH}
\newcommand{\Gdiff}{\textit{genomediff}\xspace}
\newcommand{\Suff}{\textit{suffixerator}\xspace}
\newcommand{\Mki}{\textit{packedindex}\xspace}
\newcommand{\Encseq}{\textit{encseq}\xspace}
\newcommand{\RGdiff}{\texttt{run\_genomediff.rb}\xspace}
\newcommand{\GenomeTools}{\textit{GenomeTools}\xspace}
\newcommand{\Gt}{\texttt{gt}\xspace}
\newcommand{\Kr}{\ensuremath{K_r}\xspace}
\newcommand{\Gtsuffixerator}{\texttt{gt suffixerator}\xspace}
\newcommand{\Gtpackedindex}{\texttt{gt packedindex mkindex}\xspace}
\newcommand{\ESA}{ESA\xspace}
\newcommand{\FastA}{FASTA\xspace}
\newcommand{\File}[1]{\texttt{\small #1}}
\newcommand{\ShuS}{\textit{shustrings}\xspace}
\lstset{language=bash,
basicstyle=\ttfamily
}
\definecolor{darkgreen}{rgb}{0.3,0.5,0.3}
\definecolor{darkblue}{rgb}{0.3,0.3,0.5}
\definecolor{darkred}{rgb}{0.5,0.3,0.3}
\lstdefinelanguage{LUA}{%
sensitive=true,%
columns=fixed,%
keywordstyle=[1]{\color{darkblue}\bfseries},%
keywordstyle=[2]{\color{darkgreen}\bfseries},%
morekeywords=[1]{local,if,then,else,end,while,do, coroutine,yield},% Official LUA keywords
morekeywords=[2]{units},% Your private keywords
otherkeywords={.,=,~,*,>,:},%
morestring=[b]",%
stringstyle={\color{darkred}\itshape},%
breaklines=true,%
linewidth=\textwidth,%
comment=[l]{--}%
}
\title{\Gdiff user manual}
\author{\begin{tabular}{c}
\textit{Dirk Willrodt}\\[1cm]
Research Group for Genome Informatics\\
Center for Bioinformatics\\
University of Hamburg\\
Bundesstrasse 43\\
20146 Hamburg\\
Germany\\[1cm]
\url{willrodt@zbh.uni-hamburg.de}\\
\end{tabular}}
\begin{document}
%\tsuhhfamily
\maketitle
\section*{This Manual}
Some text is highlighted by different fonts according to the following rules.
\begin{itemize}
\item \texttt{Typewriter font} is used for the names of software tools.
\item \File{Small typewriter font} is used for file names.
\item \begin{footnotesize}\texttt{Footnote sized typewriter font}
\end{footnotesize} with a leading
\begin{footnotesize}\texttt{'-'}\end{footnotesize}
is used for program options.
\item \Showoptionarg{small italic font} is used for the argument(s) of an
option.
\end{itemize}
\section{Introduction}
This document describes \Gdiff, a software tool for measuring evolutionary
distances between sets of closely related genomes. These distances are
Jukes-Cantor corrected divergence between the pairs of genomes, that is, the
number of mutations per base between them.
This distance is called \Kr and is based on so called \ShuS
\cite{HAU:DOM:WIE:2008,HAU:PFA:DOM:WIE:2009,HAU:REE:PFA:2011}. The calculation
of all pairwise distances is alignment free, but the resulting distances have
the same biological meaning as if calculated with a multiple sequence alignment.
This software is only able to process closely related distances, because \Kr is
only reliable for distances $<0.5$.
\Gdiff is written in C and it is based on the \GenomeTools
library~\cite{genometools}. It is called as part of the single binary named \Gt.
The source code can be compiled on 32-bit and 64-bit platforms without making
any changes to the sources.
\section{Building \Gdiff} \label{Building}
As \Gdiff is part of the \GenomeTools software suite, a source distribution of
\GenomeTools must be obtained, e.g.\@ via the \GenomeTools home page
(\url{http://genometools.org}), and decompressed into a source directory:
\begin{lstlisting}
$ tar -xzvf genometools-X.X.X.tar.gz
$ cd genometools-X.X.X
\end{lstlisting}
Where \lstinline!X.X.X! denotes the desired gt version.
Then, it suffices to call \lstinline!make! to compile the source using the
provided makefile.
It is recommended to use the 64bit-version of the \GenomeTools executable if
your system supports this. Pass the option \lstinline!64bit=yes! to enable 64
bit support.
The option \lstinline!amalgamation=yes! allows the compiler to use better
optimization.
\begin{lstlisting}
$ make 64bit=yes amalgamation=yes
\end{lstlisting}
After successful compilation, the \GenomeTools executable containing \Gdiff is
available in the \File{bin} subfolder of the root directory of the uncompressed
source. It can then be installed for system-wide use as follows (do this as
root):
\begin{lstlisting}
$ make 64bit=yes amalgamation=yes install
\end{lstlisting}
Make sure to use the same options as for the compilation step when using the
install target!
If a \texttt{prefix=<path>} option is appended to this line, a custom directory
can be specified as the installation target directory, e.g.\@
\begin{lstlisting}
$ make 64bit=yes amalgamation=yes install prefix=/home/user/gt
\end{lstlisting}
will install the \Gt binary in the \File{/home/user/gt/bin} directory. Please
also consult the \File{README} and \File{INSTALL} files in the root directory of
the uncompressed source tree for more information and troubleshooting advice.
\section{Usage}
\subsection{\Gdiff command line options}
Since \Gdiff is part of \GenomeTools, it is invoked as follows:
\texttt{gt genomediff [{\footnotesize options}] ({\small INDEX} |
{\footnotesize -indexname} \textit{\footnotesize NAME} {\small SEQFILE SEQFILE
[\ldots])}}
where \File{INDEX} is the path without file extension of an encoded
sequence containing the genomes to be compared and \Showoptionarg{NAME} is a
name for an encoded sequence to be built from the given \texttt{\small
SEQFILES}.
A short description of all possible options is given in Table \ref{tab:gdopts}.
\begin{table}[hbpt]
\centering
\caption{\Gdiff{} command line options}
\begin{footnotesize}
\label{tab:gdopts}
\begin{tabular}{lp{0.6\textwidth}}\hline
\Showoptiongroup{Input options}
\Showoption{indextype} \Showoptionarg{type} & Specify type of index, one of:
\Showoptionarg{esa\textbar{}pck\textbar{}encseq}. Where encseq is an encoded
sequence and an enhanced suffix array will be constructed only in memory.
default: \Showoptionarg{encseq}\\
\Showoption{unitfile} \Showoptionarg{filename} & Specifies genomic units,
see below for description. default: undefined\\
\Showoptiongroup{Output options}
\Showoption{indexname} \Showoptionarg{name} & Basename of encseq to
construct. default: undefined\\
\Showoptiongroup{ESA options}
\Showoption{mirrored} & Virtually append the reverse complement of each
sequence default: \Showoptionarg{no}\\
\Showoption{pl} \Showoptionarg{n} & Specify prefix length for bucket sort
recommendation: use without argument; then a reasonable prefix length is
automatically determined. default: \Showoptionarg{0}\\
\Showoption{dc} \Showoptionarg{n} & Specify difference cover value. default:
\Showoptionarg{0}\\
\Showoption{memlimit} \Showoptionarg{n} & Specify maximal amount of memory
to be used during index construction (in bytes, the keywords 'MB' and 'GB'
are allowed). default: undefined\\
\Showoptiongroup{Miscellaneous options}
\Showoption{v} & Be verbose. default: \Showoptionarg{no}\\
\Showoption{help} & Display help for basic options and exit.\\
\Showoption{help+} & Display help for all options and exit.\\
\Showoption{version} & Display version information and exit.\\\hline
\end{tabular}
\end{footnotesize}
\end{table}
\begin{lstlisting}[%
float=hbpt,%
showlines=true,%
frame=tb,%
caption={Example unitfile: {\small The section '\texttt{units}' is mandatory,
'\texttt{genome1/2}' are examples of names, filenames are paths as
given on the command line or during index construction.}
},%
label={code:lua}, language=LUA]
units = {
genome1 = { "file1.fas", "file2.fas" },
genome2 = { "path/file3.fas", "file4.fas" }
}
\end{lstlisting}
\subsection{Input files}
The tool \Gdiff can handle three types of prepared indices. The first is an
encoded sequence, which can be prepared by \Encseq. Given an encoded sequence,
\Gdiff will build an enhanced suffix array in memory and calculate \Kr using
that index. Second is an enhanced suffix array prepared by the tool \Suff (see
\texttt{gt suffixerator -help}) and third a compressed FM-index build by the
tool \Mki (see \texttt{gt packedindex mkindex -help}. The usage of FM-indices is
not recommended, because calculation of \Kr takes significantly longer.
Another way is to give the names of sequence files directly. Option
\Showoption{-indexname} is mandatory in this case. The given name will be used
to store an encoded sequence on disk. File format can be any sequence format
supported by \GenomeTools.
Either way, each given sequence file will be regarded as one genomic unit,
regardless of the number of sequences inside that file.
To give the genomic units other names than the filenames or to combine files to
single genomic units one can give a unitfile with option \Showoption{-unitfile}.
The format of an example unitfile is shown in Listing \ref{code:lua}.
\subsection{Output}
The output on the standard output stream consists of a line with the number of
genomes or units that were compared. It is followed by a quadratic matrix of
pairwise distances where each line consists of a file- or unitname and tabulator
separated distance values.
Depending on the options of the \Gt call there can be additional output where
each line is prefixed by '\texttt{\# }' and additional output prefixed
by'\texttt{debug: }' on the standard error stream.
\section{Example}
This section describes two example scenarios, the first being the comparison of
multiple genomes organised in separate multiple \FastA-files and the second
being the comparison of two genomes consisting of multiple files each.
\subsection{Compare genomes in separate files}
Consider three files \File{genome1.fas}, \File{genome2.fas} and
\File{genome3.fas} each of which could contain multiple \FastA entries. Our
machine has 2\,GiB RAM. Assuming the index construction would need 5\,GiB, we
need to split it in at least three parts of equal size or restrict maximal
memory requirements.
The simplest way to calculate the distance matrix for these three genomes
would be to call:
\begin{lstlisting}
gt genomediff -indexname 3genomes \
-memlimit 1500MB \
genome1.fas genome2.fas genome3.fas
\end{lstlisting}
\Showoption{-memlimit} should be reasonable less than available main memory.
This will output the distance matrix on the terminal and store an encoded
sequence with basename \File{3genomes} in the current directory.
In order to save the results to a file use terminal redirection:
\lstinline!gt genomediff ... > outfile!.
The file \File{outfile} might look like this:
\begin{verbatim}
3
genome1.fas 0.000000 0.115125 0.267473
genome2.fas 0.115125 0.000000 0.293082
genome3.fas 0.267473 0.293082 0.000000
\end{verbatim}
This tabulator separated table can be used for example with \textit{Phylip} or
\textit{R} to calculate a phylogenetic tree.
Another way to calculate the same distances if an enhanced suffix array of the
given files with name \File{3genomes\_idx} already exists on disk would be like
this:
\begin{lstlisting}
gt genomediff -indextype esa 3genomes_idx > outfile
\end{lstlisting}
To reuse an existing encoded sequence just give the basename of it:
\begin{lstlisting}
gt genomediff 3genomes > outfile
\end{lstlisting}
\subsection{Compare two genomes in multiple files}
Assume we have two genomes that consist of multiple chromosomes in separate
files. For example, genome1 consists of \File{g1\_chr1.fas} and
\File{g1\_chr2.fas} while the two files for genome2 are named accordingly. The
unitfile could be organized like this:
\begin{lstlisting}[language=lua]
units = {
genome1 = { "g1_chr1.fas", "g1_chr2.fas" },
genome2 = { "g2_chr1.fas", "g2_chr2.fas" }
}
\end{lstlisting}
The name of the unitfile in our example will be \File{units}.
Now we could call \Gdiff like this:
\begin{lstlisting}
gt genomediff -indexname 2genomes \
-unitfile units \
g1_chr1.fas g1_chr2.fas g2_chr1.fas g2.chr2.fas > output
\end{lstlisting}
File \File{output} could look like this:
\begin{verbatim}
2
genome1 0.000000 0.115125
genome2 0.115125 0.000000
\end{verbatim}
\section*{Bibliography}
\bibliographystyle{unsrt}
\bibliography{gtmanuals}
\end{document}
% vim:spell spelllang=en_gb
|