1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356
|
%
% Copyright (c) 2003 by Stefan Kurtz and The Institute for
% Genomic Research. This is OSI Certified Open Source Software.
% Please see the file LICENSE for licensing information and
% the file ACKNOWLEDGEMENTS for names of contributors to the
% code base.
%
\documentclass[12pt]{article}
\usepackage{a4,alltt,xspace,times}
\usepackage{skaff}
\usepackage{optionman}
\newcommand{\MMthree}{\texttt{maxmat3}\xspace}
\newcommand{\MUM}[0]{\textit{MUM}\xspace}
\newcommand{\Ignore}[1]{}
\newcommand{\Subs}[3]{#1[#2\ldots#3]}
\newcommand{\Match}[3]{(#3,#1,#2)}
\author{Stefan Kurtz\thanks{\SKaffiliation} \thanks{Minor alterations
made by Adam Phillippy, The Institute for Genomic Research}}
\title{\textbf{A Program to find Maximal Matches}\\
\textbf{in Large Sequences}\\[2mm]
\textbf{Manual}\footnote{\copyright Stefan Kurtz and The
Institute for Genomic Research}}
\date{}
\begin{document}
\maketitle
\MMthree is a program to find maximal matches of some minimum length
between a reference-sequence and a query-sequence. It also allows for
the computation of \MUM-candidates as well as \MUM{s}. This document
describes the options of \MMthree and the output format. In Section
\ref{SecBasicNotions}, we formally define some basic notions to clarify
the semantics of \MMthree. However, the reader should be able to understand
the manual without reading that section. In the following, the notion
\emph{match} always refers to \emph{maximal matches} if not stated
otherwise.
\section{The Program and its Options}
The program is called as follows:
$\MMthree~~[\mathit{options}]~~\mathit{referencefile}~~
\mathit{queryfile}_{1}~~\cdots~~\mathit{queryfile}_{n}$
And here is a description of the options:
\begin{list}{}{}
\Option{mum}{}{
Compute \MUM{s}, i.e.\ matches that are unique in both sequences.
}
\Option{mumreference}{}{
Compute \MUM-candidates, i.e.\ matches that are unique
in the reference-sequence but not necessarily in the query-sequence.
}
\Option{maxmatch}{}{
Compute all maximal matches, regardless of their uniqueness.
}
\Option{n}{}{
Only match the characters $a$, $c$, $g$, or $t$. They can be either
in upper or in lower case.
}
\Option{l}{$q$}{
Set the minimum length $q$ of a match to be reported. $q$ must be a
positive integer. Only matches of length at least $q$ are reported.
If this option is not used, then the default value for $q$ is 20.
}
\Option{b}{}{
Compute direct matches and reverse complement matches.
}
\Option{r}{}{
Only compute reverse complement matches.
}
\Option{s}{}{
Show the matching substring.
}
\Option{c}{}{
Report the query-position of a reverse complement match
relative to the original query-sequence. Suppose \(x\) is the
reference-sequence and \(y\) is the query-sequence. By definition, a
reverse complement match \(\Match{i}{j}{l}\) of \(x\) and \(y\) is a
match of \(x\) and \(\overline{y}\), where \(\overline{y}\) is the
reverse complement of \(y\), see Section \ref{SecBasicNotions}. By
default, a match is reported as a triple
\begin{alltt}
i j l
\end{alltt}
Position \(j\) is relative to \(\overline{y}\). With this option,
not \(j\) but \(m-j+1\) is reported, where \(m\) is the length of the
query-sequence. Now note that position \(m-j+1\) in \(y\) corresponds
to position \(j\) in \(\overline{y}\).
}
\Option{F}{}{
Forces 4 column output format regardless of the number of reference
sequence inputs, i.e. prefixes every output match with its reference
sequence identifier.
}
\Option{L}{}{
Show the length of the query-sequence on the header line.
}
\Option{h}{}{
Show the possible options and exit.
}
\Option{help}{}{
Show the possible options and exit.
}
\end{list}
Options \Showoption{mum}, \Showoption{mumreference} and
\Showoption{maxmatch} cannot be combined. If none of these options are
used, then the program will default to \Showoption{mumreference}.
Option \Showoption{b} and \Showoption{r} exclude each other. If none
of these two options is used, then only direct matches are reported.
Option \Showoption{c} can only be used in combination with
option \Showoption{b} or option \Showoption{r}.
There must be exactly one reference-file given and at least one
query-file. The maximum number of query-files is 32.
The reference-file and the query-files must be in multiple fasta format.
The query-files are processed one after the other. The uniqueness condition
for the \MUM{s} refers to the entire set of reference-sequences but
only to one query-sequence. That is, if a \MUM is reported, the
matching substring is unique in all reference-sequences and in the currently
processed query-sequence. The uniqueness does not necessarily extend to
all query-sequences.
Input sequences can be over any set of lower or upper case
characters. Thus DNA sequences as well as protein sequences are allowed.
The characters are processed case insensitive. That is, a
lower case character is identified with the corresponding upper case
character. The sequence output via option \Showoption{s} is always
in lower case. If option \Showoption{n} is used,
then all characters except for $a$, $c$, $g$, and $t$ are replaced by
a unique character which is different for the reference-sequences and
the query-sequences. This prevents false matches involving, for
example, the wildcard symbols $s$, $w$, $r$, $y$, $m$, $k$, $b$,
$d$, $h$, $v$, and $n$ often occurring in DNA sequences.
We therefore recommend to use the option \Showoption{n}.
\section{Output format}
Suppose we have two fasta files \texttt{Data/U89959} and
\texttt{Data/at.est}. The first file contains a complete BAC-sequence from
Arabidopsis thaliana, while the second is a collection of
ESTs from the same organism. We want to use the first file as our
reference-file and the second as out query-file.
Now let us look for direct matches and reverse complement
matches in the reference and in the query sequences.
The length of the matches should be at least 18 (options \Showoption{b}
and \Showoption{l} 18).
We want to report the following:
\begin{itemize}
\item
the length of the query-sequences (option \Showoption{L})
\item
the matching sequence (option \Showoption{s})
\item
the query-positions relative to the original sequence
(option \Showoption{c}).
\end{itemize}
We also do not want to report matches involving wildcards
(option \Showoption{n}). The corresponding program call is as follows:
\begin{verbatim}
maxmat3.x -b -l 18 -L -s -c -n Data/U89959 Data/at.est
\end{verbatim}
Here is a part of the output:
\begin{small}
\begin{verbatim}
# reading input file "Data/U89959" of length 106973
# construct suffix tree for sequence of length 106973
# (maximal input length is 536870908)
# process 1725 characters per dot
# ..............................................................
# CONSTRUCTIONTIME maxmat3.x Data/U89959 0.11
# reading input file "Data/at.est" of length 772376
# matching query-file "Data/at.est"
# against reference-file "Data/U89959"
> gi|5587835|gb|AF078689.1|AF078689 Len = 275
90201 258 18
taaaaaaaaaaaaaaaaa
52836 258 18
taaaaaaaaaaaaaaaaa
> gi|5587835|gb|AF078689.1|AF078689 Reverse Len = 275
> gi|4714033|dbj|C99914.1|C99914 Len = 628
> gi|4714033|dbj|C99914.1|C99914 Reverse Len = 628
> gi|4714032|dbj|C99913.1|C99913 Len = 497
> gi|4714032|dbj|C99913.1|C99913 Reverse Len = 497
> gi|4714031|dbj|C99911.1|C99911 Len = 661
> gi|4714031|dbj|C99911.1|C99911 Reverse Len = 661
> gi|4714030|dbj|C99910.1|C99910 Len = 241
5066 23 19
agaagaagaagaagaagaa
5069 23 19
agaagaagaagaagaagaa
5072 23 19
agaagaagaagaagaagaa
5075 23 19
.....
> gi|2763999|gb|R30040.1|R30040 Len = 475
> gi|2763999|gb|R30040.1|R30040 Reverse Len = 475
# COMPLETETIME maxmat3.x Data/U89959 1.64
# SPACE maxmat3.x Data/U89959 2.71
\end{verbatim}
\end{small}
The lines starting with the symbol \texttt{\symbol{35}} report some useful
information about the matching process. They tell which files are input
and the length of the scanned sequences. They also report
the used computational resources, i.e.\ the time required for
constructing the suffix tree, the total time of the matching process
(in seconds) and the space requirement (in megabytes).
The line starting with
\texttt{\symbol{35}} and containing only dots shows the progress of the
suffix tree construction. This is useful to estimate the remaining
running time for very long sequences.
For each query-sequence the corresponding header line is (up to the
first white-space character) reported in a line beginning with the
symbol \texttt{\symbol{62}}. The length of the
query-sequence appears at the end of such a line. Below the header line
all matches against the corresponding query sequence
are reported as a triple of three numbers.
\begin{itemize}
\item
The first number is the position of the match in the reference-sequence.
\item
The second number is the position of the match in the query-sequence.
\item
The third number is the length of the match.
\end{itemize}
Reverse complement matches are reported after a header line containing the
keyword \texttt{Reverse}. A matching sequence is reported in
a separate line following the line containing the positions and the
length of the match.
If the reference-file is a multiple fasta file with more than one
sequence, then it is necessary to also report the header line of the
reference-sequence and the position of the match relative to the start
of the reference-sequence. This leads to a slightly different format
where each line containing the positions is prepended by the header of
the reference-sequence (also available with the \Showoption{F}
option). This format is shown in the following example, where we use
two files containing protein sequences: The file \texttt{Data/prot1}
is a file containing one protein sequence, while \texttt{Data/prot2}
contains many protein sequences: We use \texttt{Data/prot2} as our
reference-file and \texttt{Data/prot1} as our query-file:
\begin{verbatim}
maxmat3.x -L -l 7 Data/prot2 Data/prot1
\end{verbatim}
The output is as follows:
\begin{small}
\begin{verbatim}
# reading input file "Data/prot2" of length 40839
# construct suffix tree for sequence of length 40839
# (maximum input length is 536870908)
# CONSTRUCTIONTIME maxmat3.x Data/prot2 0.02
# reading input file "Data/prot1" of length 45543
# matching query-file "Data/prot1"
# against reference-file "Data/prot2"
> some_collection Len = 45543
sp|ALL2_BOVIN|ALLERGEN 139 1155 7
sp|ALG3_YEAST|PUTATIVE 425 2904 7
sp|ALF_TRYBB|FRUCTOSE-BISPHOSPHATE 177 4794 7
sp|ALGB_YEAST|ALG11 114 12599 7
sp|ALR_SYNY3|ALANINE 354 33261 7
sp|ALGB_PSEAE|ALGINATE 212 40991 7
sp|ALGB_PSEAE|ALGINATE 314 41093 7
sp|ALGB_PSEAE|ALGINATE 360 41138 7
sp|ALR_SYNY3|ALANINE 79 44186 7
sp|ALP_CEPAC|ALKALINE 366 45304 7
# COMPLETETIME maxmat3.x Data/prot2 0.06
# SPACE maxmat3.x Data/prot2 0.56
\end{verbatim}
\end{small}
\section{Basic Notions}\label{SecBasicNotions}
We assume that \(x\) is a sequence of length \(n\geq 1\) over some
alphabet \(\Sigma\).
$x[i]$ denotes the character at position $i$ in $x$,
for $i\in[1,n]$. For $i\leq j$, $\Subs{x}{i}{j}$ denotes the
substring of $x$ starting with the character at position $i$
and ending with the character at position $j$ in \(x\). If \(i>j\), then
\(\Subs{x}{i}{j}\) is the empty sequence.
Let \(y\) be a sequence of length \(m\). In this document we refer to
\(x\) as the \emph{reference-sequence} and to \(y\) as the
\emph{query-sequence}.
Let \(\ell>0\), \(i\in[1,n-\ell+1]\), and \(j\in[1,m-\ell+1]\).
\(\Match{i}{j}{\ell}\) is a \emph{match} of \(x\) and \(y\) if and only if
\(\Subs{x}{i}{i+\ell-1}=\Subs{y}{j}{j+\ell-1}\).
\(\ell\) is the \emph{length of the match} and \(\Subs{x}{i}{i+\ell-1}\)
is the \emph{matching substring}. Note that a match is contained in a
longer match if the characters to the left or to the right of the
occurrences in \(x\) and \(y\) are identical. To reduce redundancy,
we restrict attention to maximal matches.
A match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) is \emph{maximal} if
and only if the following holds:
\begin{itemize}
\item
\(i=1\) or \(j=1\) or \(x[i-1]\neq y[j-1]\)\hfill(left maximality)
\item
\(i+\ell=n+1\) or \(j+\ell=m+1\) or \(x[i+\ell]\neq
y[j+\ell]\)\hfill(right maximality)
\end{itemize}
A \emph{maximal unique match} of \(x\) and \(y\) (\MUM for short) is a
maximal match \(\Match{i}{j}{\ell}\) of \(x\) and \(y\) such that the
matching substring \(\Subs{x}{i}{i+\ell-1}\)
occurs exactly once in \(x\) and once in \(y\). A \emph{\MUM-candidate}
is a maximal match \(\Match{i}{j}{l}\) of \(x\) and \(y\)
such that the matching substring
\(\Subs{x}{i}{i+\ell-1}\) occurs exactly once in \(x\). Note that
any \MUM is also a \MUM-candidate, but not vice versa, since for a
\MUM-candidate the matching substring may occur more than once in \(y\).
If \(\Sigma\) is the DNA alphabet with the characters \(a,c,g,t\), then we
define a function \(wcc\) over \(\Sigma\) by the following equations:
\begin{eqnarray*}
wcc(a)&=&t\\
wcc(g)&=&c\\
wcc(c)&=&g\\
wcc(t)&=&a
\end{eqnarray*}
The reverse complement of
a DNA-sequence \(u=\Subs{u}{1}{k}\) is the sequence
\[wcc(u[k])wcc(u[k-1])\ldots wcc(u[2])wcc(u[1])\]
denoted by \(\overline{u}\).
A \emph{reverse complement match} of \(x\) and \(y\) is a
match of \(x\) and \(\overline{y}\). The notion of maximality
naturally extends to reverse complement matches. To distinguish reverse
complement matches from matches we often call the latter direct matches.
\end{document}
|