\documentclass[12pt]{article}
\usepackage[a4paper,top=20mm,bottom=20mm,left=20mm,right=20mm]{geometry}
\usepackage{url}
\usepackage{alltt}
\usepackage{xspace}
\usepackage{times}
\usepackage{listings}
\usepackage{verbatim}
\usepackage{ifthen}
\usepackage{optionman}
\newcommand{\Substring}[3]{#1[#2..#3]}
\newcommand{\Program}[0]{\texttt{matstat}\xspace}
\newcommand{\MS}[1]{\mathit{ms(s,#1)}}
\title{\Program: a program for computing\\
matching statistics\\
a manual}
\author{\begin{tabular}{c}
\textit{Stefan Kurtz}\\
Center for Bioinformatics,\\
University of Hamburg
\end{tabular}}
\begin{document}
\maketitle
\section{The program \Program}
The program \Program is called as follows:
\par
\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}]
\par
\Showoptionarg{files} is a whitespace-separated list of at least one
filename. Any sequence occurring in any file specified in \Showoptionarg{files}
is called a \textit{unit} in the following.
In addition to the mandatory option \Showoption{query}, the program
must be called with either option \Showoption{pck} or option \Showoption{esa},
which specifies that a packed index or an enhanced suffix array, respectively,
is used for a given set of subject sequences.
\Program computes the \textit{matching statistics} for each unit. That is,
for each position \(i\) in
each unit, say \(s\) of length \(n\), \(\MS{i}=(l,j)\) is computed. Here
\(l\) is the largest integer such that \(\Substring{s}{i}{i+l-1}\) matches
a substring represented by the index and \(j\) is a start position of the
matching substring in the index. We say that \(l\) is the length of \(\MS{i}\)
and \(j\) is the subject position of \(\MS{i}\).
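\par
\noindent The definition can be illustrated by a small hand-computed sketch.
It assumes that positions are counted from 0 and that the index represents
the single hypothetical subject sequence \(t=\texttt{acacgt}\); neither
the sequences nor the numbering are taken from an actual run of \Program.
For the unit \(s=\texttt{cgta}\) of length \(n=4\) we obtain:
\[
\begin{array}{c|l|l|l}
i & \Substring{s}{i}{n-1} & \mbox{longest match in } t & \MS{i}\\
\hline
0 & \texttt{cgta} & \texttt{cgt} & (3,3)\\
1 & \texttt{gta} & \texttt{gt} & (2,4)\\
2 & \texttt{ta} & \texttt{t} & (1,5)\\
3 & \texttt{a} & \texttt{a} & (1,0) \mbox{ or } (1,2)
\end{array}
\]
For \(i=3\) the match occurs at two positions in \(t\), which is why
\(j\) is defined as \emph{a} start position rather than \emph{the} start
position. Under these assumptions, a minimum length of 2 would restrict
the reported positions to \(i=0\) and \(i=1\).
\par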
The following options are available in \Program:
\begin{Justshowoptions}
\begin{comment}
\Option{fmi}{$\Showoptionarg{indexname}$}{
Use the old implementation of the FMindex. This option is not recommended.
}
\end{comment}
\Option{esa}{$\Showoptionarg{indexname}$}{
Use the given enhanced suffix array to compute the matches.
}
\Option{pck}{$\Showoptionarg{indexname}$}{
Use the packed index (an efficient representation of the FMindex)
to compute the matches.
}
\Option{query}{$\Showoptionarg{files}$}{
Specify a whitespace-separated list of query files containing the units.
At least one query file must be given. The files may be in
gzipped format, in which case they have to end with the suffix \texttt{.gz}.
}
\Option{min}{$\ell$}{
Specify the minimum value $\ell$ for the length of the matching statistics.
That is, for each unit \(s\) and each position \(i\) in \(s\), the program
reports all values \(i\) and \(\MS{i}\) if the
length of \(\MS{i}\) is at least \(\ell\).
}
\Option{max}{$\ell$}{
Specify the maximum value $\ell$ for the length of the matching statistics.
That is, for each unit \(s\) and each position \(i\) in \(s\), the program
reports the values \(i\) and \(\MS{i}\) if the length of \(\MS{i}\)
is at most \(\ell\).
}
\Option{output}{(\Showoptionkey{subjectpos}$\mid$\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{
Specify what to output. At least one of the three keywords
$\Showoptionkey{subjectpos}$,
$\Showoptionkey{querypos}$, and
$\Showoptionkey{sequence}$ must be used.
Using the keyword $\Showoptionkey{subjectpos}$ shows the
subject position of the matching statistics.
Using the keyword $\Showoptionkey{querypos}$ shows the query position.
Using the keyword $\Showoptionkey{sequence}$ shows the sequence content.
}
\Helpoption
\end{Justshowoptions}
The following conditions must be satisfied:
\begin{enumerate}
\item
At least one of the options \Showoption{min} and \Showoption{max} must be used.
\item
If both options \Showoption{min} and \Showoption{max} are used, then
the value specified by option \(\Showoption{min}\) must be smaller
than the value specified by option \(\Showoption{max}\).
\item
Exactly one of the options \Showoption{pck} and \Showoption{esa} must be used;
the two cannot be combined.
\end{enumerate}
\section{Examples}
Suppose that in some directory, say \texttt{homo-sapiens}, we have 25 gzipped
FASTA files containing all 24 human chromosomes plus one file with
mitochondrial sequences. These may have been downloaded from
\url{ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens_47_36i/dna}.
In the first step, we construct the packed index for the entire genome:
\begin{Output}
gt packedindex mkindex -dna -dir rev -parts 15 -bsize 10 -locfreq 32
-indexname human-all -db homo-sapiens/*.gz
\end{Output}
The program runs for almost two hours and delivers
an index \texttt{human-all} consisting of three files:
\begin{Output}
ls -lh human-all.*
-rw-r----- 1 kurtz gistaff 37 2008-01-24 00:47 human-all.al1
-rw-r----- 1 kurtz gistaff 1.9G 2008-01-24 02:37 human-all.bdx
-rw-r----- 1 kurtz gistaff 3.4K 2008-01-24 02:37 human-all.prj
\end{Output}
This index is used in the following call to the program \Program:
\begin{Output}
gt matstat -output subjectpos querypos sequence -min 20 -max 30
-query queryfile.fna -pck human-all
unit 0 (Mus musculus, chr 1, complete sequence)
22 20 390765125 actgtatctcaaaatataaa
253 21 258488266 gggaataaacatgtcattgag
254 20 258488267 ggaataaacatgtcattgag
275 20 900483549 taattctatttttctttctt
480 20 1008274536 gcttgaagatcatgatccag
..
\end{Output}
Here, the first column shows the relative positions in unit 0 for which the
length of the matching statistics is between 20 and 30. The second column is
the corresponding length value. The third column shows the position of the
matching sequence in the index, and the fourth shows the sequence content.
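\par
\noindent The second and third output lines illustrate a general property
of matching statistics, given here as a reader's note rather than as a
documented guarantee of \Program: if \(\MS{i}=(l,j)\) with \(l\geq 1\), then
\(\Substring{s}{i+1}{i+l-1}\) also occurs in the index, so the length of
\(\MS{i+1}\) is at least \(l-1\). In the example,
\[
\MS{253}=(21,258488266),\qquad \MS{254}=(20,258488267),
\]
that is, the match at position 254 is the match at position 253 without
its first character, and both the query position and the subject position
advance by one.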
\end{document}