\documentclass[12pt,titlepage]{article}

\usepackage[a4paper,top=30mm,bottom=30mm,left=20mm,right=20mm]{geometry}
\usepackage[utf8]{inputenc}
\usepackage{xspace}
\usepackage{listings}
\usepackage{optionman}
\usepackage{url}
\usepackage{booktabs}
\usepackage{xcolor}
%\usepackage[binary-units]{siunitx}
%\usepackage{TheSansUHH}

\newcommand{\Gdiff}{\textit{genomediff}\xspace}
\newcommand{\Suff}{\textit{suffixerator}\xspace}
\newcommand{\Mki}{\textit{packedindex}\xspace}
\newcommand{\Encseq}{\textit{encseq}\xspace}
\newcommand{\RGdiff}{\texttt{run\_genomediff.rb}\xspace}
\newcommand{\GenomeTools}{\textit{GenomeTools}\xspace}
\newcommand{\Gt}{\texttt{gt}\xspace}
\newcommand{\Kr}{\ensuremath{K_r}\xspace}
\newcommand{\Gtsuffixerator}{\texttt{gt suffixerator}\xspace}
\newcommand{\Gtpackedindex}{\texttt{gt packedindex mkindex}\xspace}
\newcommand{\ESA}{ESA\xspace}
\newcommand{\FastA}{FASTA\xspace}
\newcommand{\File}[1]{\texttt{\small #1}}
\newcommand{\ShuS}{\textit{shustrings}\xspace}

\lstset{language=bash,
basicstyle=\ttfamily
}
\definecolor{darkgreen}{rgb}{0.3,0.5,0.3}
\definecolor{darkblue}{rgb}{0.3,0.3,0.5}
\definecolor{darkred}{rgb}{0.5,0.3,0.3}
\lstdefinelanguage{LUA}{%
sensitive=true,%
columns=fixed,%
keywordstyle=[1]{\color{darkblue}\bfseries},%
keywordstyle=[2]{\color{darkgreen}\bfseries},%
morekeywords=[1]{local,if,then,else,end,while,do, coroutine,yield},% Official LUA keywords
morekeywords=[2]{units},% Your private keywords
otherkeywords={.,=,~,*,>,:},%
morestring=[b]",%
stringstyle={\color{darkred}\itshape},%
breaklines=true,%
linewidth=\textwidth,%
comment=[l]{--}%
}
\title{\Gdiff user manual}

\author{\begin{tabular}{c}
  \textit{Dirk Willrodt}\\[1cm]
  Research Group for Genome Informatics\\
  Center for Bioinformatics\\
  University of Hamburg\\
  Bundesstrasse 43\\
  20146 Hamburg\\
  Germany\\[1cm]
  \url{willrodt@zbh.uni-hamburg.de}\\
\end{tabular}}

\begin{document}
%\tsuhhfamily
\maketitle

\section*{This Manual}
Some text is highlighted by different fonts according to the following rules.

\begin{itemize}
\item \texttt{Typewriter font} is used for the names of software tools.
\item \File{Small typewriter font} is used for file names.
\item \begin{footnotesize}\texttt{Footnote sized typewriter font}
      \end{footnotesize} with a leading
      \begin{footnotesize}\texttt{'-'}\end{footnotesize}
      is used for program options.
\item \Showoptionarg{small italic font} is used for the argument(s) of an
      option.
\end{itemize}


\section{Introduction}
This document describes \Gdiff, a software tool for measuring evolutionary
distances between sets of closely related genomes. Each distance is the
Jukes-Cantor corrected divergence between a pair of genomes, that is, the
number of mutations per base between them.

This distance is called \Kr and is based on so-called \ShuS
\cite{HAU:DOM:WIE:2008,HAU:PFA:DOM:WIE:2009,HAU:REE:PFA:2011}. The calculation
of all pairwise distances is alignment free, but the resulting distances have
the same biological meaning as if calculated with a multiple sequence alignment.

\Gdiff can only process closely related genomes, because \Kr is reliable only
for distances $<0.5$.
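
For orientation, the Jukes-Cantor correction in its standard textbook form is
sketched below. The symbols $d$ (the uncorrected fraction of differing
positions per base) and $K$ (the corrected distance) are introduced here for
illustration only; how the divergence is estimated from \ShuS lengths is
described in the publications cited above.

\[
  K = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\,d\right),
  \qquad 0 \le d < \frac{3}{4}
\]

For example, $d = 0.1$ yields $K \approx 0.107$, so the correction is small for
closely related sequences and grows as $d$ approaches $3/4$.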

\Gdiff is written in C and it is based on the \GenomeTools
library~\cite{genometools}. It is called as part of the single binary named \Gt.

The source code can be compiled on 32-bit and 64-bit platforms without making
any changes to the sources.

\section{Building \Gdiff} \label{Building}
As \Gdiff is part of the \GenomeTools software suite, a source distribution of
\GenomeTools must be obtained, e.g.\@ via the \GenomeTools home page
(\url{http://genometools.org}), and decompressed into a source directory:

\begin{lstlisting}
$ tar -xzvf genometools-X.X.X.tar.gz
$ cd genometools-X.X.X
\end{lstlisting}

Here, \lstinline!X.X.X! denotes the desired \GenomeTools version.

Then, it suffices to call \lstinline!make! to compile the source using the
provided makefile.

It is recommended to use the 64-bit version of the \GenomeTools executable if
your system supports this. Pass the option \lstinline!64bit=yes! to enable
64-bit support.

The option \lstinline!amalgamation=yes! allows the compiler to use better
optimization.

\begin{lstlisting}
$ make 64bit=yes amalgamation=yes
\end{lstlisting}

After successful compilation, the \GenomeTools executable containing \Gdiff is
available in the \File{bin} subfolder of the root directory of the uncompressed
source. It can then be installed for system-wide use as follows (do this as
root):

\begin{lstlisting}
$ make 64bit=yes amalgamation=yes install
\end{lstlisting}

Make sure to use the same options as for the compilation step when using the
install target!

If a \texttt{prefix=<path>} option is appended to this line, a custom directory
can be specified as the installation target directory, e.g.\@

\begin{lstlisting}
$ make 64bit=yes amalgamation=yes install prefix=/home/user/gt
\end{lstlisting}

will install the \Gt binary in the \File{/home/user/gt/bin} directory. Please
also consult the \File{README} and \File{INSTALL} files in the root directory of
the uncompressed source tree for more information and troubleshooting advice.

\section{Usage}
\subsection{\Gdiff command line options}
Since \Gdiff is part of \GenomeTools, it is invoked as follows:

\texttt{gt genomediff [{\footnotesize options}] ({\small INDEX} |
{\footnotesize -indexname} \textit{\footnotesize NAME} {\small SEQFILE SEQFILE
[\ldots])}}

where \File{INDEX} is the path without file extension of an encoded
sequence containing the genomes to be compared and \Showoptionarg{NAME} is a
name for an encoded sequence to be built from the given \texttt{\small
SEQFILES}.

A short description of all possible options is given in Table \ref{tab:gdopts}.

\begin{table}[hbpt]
  \centering
  \caption{\Gdiff{} command line options}
\begin{footnotesize}
  \label{tab:gdopts}
  \begin{tabular}{lp{0.6\textwidth}}\hline

    \Showoptiongroup{Input options}
    \Showoption{indextype} \Showoptionarg{type} & Specify type of index, one of:
    \Showoptionarg{esa\textbar{}pck\textbar{}encseq}. With
    \Showoptionarg{encseq}, the index is an encoded sequence and an enhanced
    suffix array is constructed in memory only.
    default: \Showoptionarg{encseq}\\
    \Showoption{unitfile} \Showoptionarg{filename} & Specifies genomic units,
    see below for description. default: undefined\\

    \Showoptiongroup{Output options}
    \Showoption{indexname} \Showoptionarg{name} & Basename of encseq to
    construct. default: undefined\\

    \Showoptiongroup{ESA options}
    \Showoption{mirrored} & Virtually append the reverse complement of each
    sequence. default: \Showoptionarg{no}\\
    \Showoption{pl} \Showoptionarg{n} & Specify prefix length for bucket sort.
    Recommendation: use without argument; then a reasonable prefix length is
    automatically determined. default: \Showoptionarg{0}\\
    \Showoption{dc} \Showoptionarg{n} & Specify difference cover value. default:
    \Showoptionarg{0}\\
    \Showoption{memlimit} \Showoptionarg{n} & Specify maximal amount of memory
    to be used during index construction (in bytes, the keywords 'MB' and 'GB'
    are allowed). default: undefined\\

    \Showoptiongroup{Miscellaneous options}
    \Showoption{v} & Be verbose. default: \Showoptionarg{no}\\
    \Showoption{help} & Display help for basic options and exit.\\
    \Showoption{help+} & Display help for all options and exit.\\
    \Showoption{version} & Display version information and exit.\\\hline
  \end{tabular}
\end{footnotesize}
\end{table}

\begin{lstlisting}[%
  float=hbpt,%
  showlines=true,%
  frame=tb,%
  caption={Example unitfile: {\small The section '\texttt{units}' is mandatory,
  '\texttt{genome1/2}' are examples of names, filenames are paths as
  given on the command line or during index construction.}
  },%
  label={code:lua}, language=LUA]
units = {
  genome1 = { "file1.fas", "file2.fas" },
  genome2 = { "path/file3.fas", "file4.fas" }
}
\end{lstlisting}

\subsection{Input files}
The tool \Gdiff can handle three types of prepared indices. The first is an
encoded sequence, which can be prepared by \Encseq. Given an encoded sequence,
\Gdiff will build an enhanced suffix array in memory and calculate \Kr using
that index. The second is an enhanced suffix array prepared by the tool \Suff
(see \texttt{gt suffixerator -help}), and the third is a compressed FM-index
built by the tool \Mki (see \texttt{gt packedindex mkindex -help}). The use of
FM-indices is not recommended, because the calculation of \Kr takes
significantly longer.

Alternatively, the names of sequence files can be given directly. Option
\Showoption{-indexname} is mandatory in this case. The given name will be used
to store an encoded sequence on disk. The files can be in any sequence format
supported by \GenomeTools.

Either way, each given sequence file will be regarded as one genomic unit,
regardless of the number of sequences inside that file.

To give the genomic units names other than the filenames, or to combine several
files into a single genomic unit, provide a unitfile with option
\Showoption{-unitfile}. An example unitfile is shown in
Listing \ref{code:lua}.

\subsection{Output}
The output on the standard output stream starts with a line containing the
number of genomes or units that were compared. It is followed by a square
matrix of pairwise distances, where each line consists of a file or unit name
and tab-separated distance values.

Depending on the options of the \Gt call, there can be additional output where
each line is prefixed by '\texttt{\# }', and additional output prefixed by
'\texttt{debug: }' on the standard error stream.
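
When the matrix is passed on to other programs, the additional lines may have
to be removed first. The following is a minimal sketch, assuming that the lines
prefixed by '\texttt{\# }' appear on the standard output stream together with
the matrix. The file names \File{outfile} and \File{matrix.tsv} as well as the
sample content, including the comment line, are made up for illustration.

\begin{lstlisting}
# Hypothetical sample output; the '# ' line is invented.
printf '# hypothetical verbose line\n3\n' > outfile
printf 'genome1.fas\t0.000000\t0.115125\t0.267473\n' >> outfile
printf 'genome2.fas\t0.115125\t0.000000\t0.293082\n' >> outfile
printf 'genome3.fas\t0.267473\t0.293082\t0.000000\n' >> outfile

# Keep only the count line and the matrix rows.
grep -v '^# ' outfile > matrix.tsv
\end{lstlisting}

The resulting file \File{matrix.tsv} then contains only the line with the
number of units followed by the distance matrix.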

\section{Example}
This section describes two example scenarios, the first being the comparison of
multiple genomes stored in separate multi-\FastA files and the second
being the comparison of two genomes consisting of multiple files each.

\subsection{Compare genomes in separate files}
Consider three files \File{genome1.fas}, \File{genome2.fas} and
\File{genome3.fas}, each of which may contain multiple \FastA entries. Our
machine has 2\,GiB RAM. Assuming the index construction would need 5\,GiB, we
need to split it into at least three parts of equal size or restrict the
maximal memory requirements.
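
The minimal number of parts follows from dividing the assumed index size by the
available memory and rounding up. This is only a back-of-the-envelope sketch
using the figures of this example:

\[
  \left\lceil \frac{5\,\mathrm{GiB}}{2\,\mathrm{GiB}} \right\rceil = 3,
  \qquad
  \frac{5\,\mathrm{GiB}}{3} \approx 1.67\,\mathrm{GiB}\ \mbox{per part}.
\]

Two parts of 2.5\,GiB each would already exceed the available memory.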

The simplest way to calculate the distance matrix for these three genomes
would be to call:

\begin{lstlisting}
gt genomediff -indexname 3genomes \
              -memlimit 1500MB    \
              genome1.fas genome2.fas genome3.fas
\end{lstlisting}

\Showoption{-memlimit} should be set reasonably lower than the available main
memory.

This will output the distance matrix on the terminal and store an encoded
sequence with basename \File{3genomes} in the current directory.

In order to save the results to a file use terminal redirection:
\lstinline!gt genomediff ... > outfile!.

The file \File{outfile} might look like this:
\begin{verbatim}
3
genome1.fas	0.000000	0.115125	0.267473
genome2.fas	0.115125	0.000000	0.293082
genome3.fas	0.267473	0.293082	0.000000
\end{verbatim}

This tab-separated table can be used, for example, with \textit{Phylip} or
\textit{R} to calculate a phylogenetic tree.
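
Some downstream tools expect one line per pair of genomes instead of a square
matrix. The following is a minimal sketch of such a conversion with
\texttt{awk}, assuming the output contains no additional lines besides the
count and the matrix. The sample file is recreated only to keep the sketch
self-contained, and the output name \File{pairs.tsv} is a made-up example.

\begin{lstlisting}
# Recreate the example output shown above.
printf '3\n' > outfile
printf 'genome1.fas\t0.000000\t0.115125\t0.267473\n' >> outfile
printf 'genome2.fas\t0.115125\t0.000000\t0.293082\n' >> outfile
printf 'genome3.fas\t0.267473\t0.293082\t0.000000\n' >> outfile

# Print each unordered pair once: name, name, distance.
awk -F '\t' '
  NR == 1 { next }                 # skip the count line
  { name[NR - 1] = $1              # first column is the name
    for (j = 2; j <= NF; j++) dist[NR - 1, j - 1] = $j }
  END { n = NR - 1
        for (i = 1; i <= n; i++)   # upper triangle only
          for (j = i + 1; j <= n; j++)
            printf "%s\t%s\t%s\n", name[i], name[j], dist[i, j] }
' outfile > pairs.tsv
\end{lstlisting}

Only the upper triangle is printed, because the matrix is symmetric and its
diagonal is zero.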

If an enhanced suffix array of the given files with name
\File{3genomes\_idx} already exists on disk, the same distances can be
calculated like this:

\begin{lstlisting}
gt genomediff -indextype esa 3genomes_idx > outfile
\end{lstlisting}
To reuse an existing encoded sequence, just give its basename:
\begin{lstlisting}
gt genomediff 3genomes > outfile
\end{lstlisting}

\subsection{Compare two genomes in multiple files}
Assume we have two genomes that consist of multiple chromosomes in separate
files. For example, genome1 consists of \File{g1\_chr1.fas} and
\File{g1\_chr2.fas} while the two files for genome2 are named accordingly. The
unitfile could be organized like this:

\begin{lstlisting}[language=lua]
units = {
  genome1 = { "g1_chr1.fas", "g1_chr2.fas" },
  genome2 = { "g2_chr1.fas", "g2_chr2.fas" }
}
\end{lstlisting}
The name of the unitfile in our example will be \File{units}.

Now we could call \Gdiff like this:

\begin{lstlisting}
gt genomediff -indexname 2genomes \
              -unitfile units     \
              g1_chr1.fas g1_chr2.fas g2_chr1.fas g2_chr2.fas > output
\end{lstlisting}
File \File{output} could look like this:
\begin{verbatim}
2
genome1	0.000000	0.115125
genome2	0.115125	0.000000
\end{verbatim}

\section*{Bibliography}
\bibliographystyle{unsrt}
\bibliography{gtmanuals}
\end{document}
% vim:spell spelllang=en_gb