\documentclass[12pt]{article}
\usepackage[a4paper,top=20mm,bottom=20mm,left=20mm,right=20mm]{geometry}
\usepackage{url}
\usepackage{alltt}
\usepackage{xspace}
\usepackage{times}
\usepackage{listings}

\usepackage{verbatim}
\usepackage{ifthen}
\usepackage{optionman}

\newcommand{\Substring}[3]{#1[#2..#3]}

\newcommand{\Program}[0]{\texttt{matstat}\xspace}
\newcommand{\MS}[1]{\mathit{ms(s,#1)}}
\title{\Program: a program for computing\\
       matching statistics\\
       a manual}

\author{\begin{tabular}{c}
         \textit{Stefan Kurtz}\\
         Center for Bioinformatics,\\
         University of Hamburg
        \end{tabular}}

\begin{document}
\maketitle

\section{The program \Program}

The program \Program is called as follows:
\par
\noindent\Program [\textit{options}] \Showoption{query} \Showoptionarg{files} [\textit{options}] 
\par
\Showoptionarg{files} is a whitespace-separated list of at least one
filename. Any sequence occurring in any file specified in \Showoptionarg{files}
is called a \textit{unit} in the following.
In addition to the mandatory option \Showoption{query}, the program
must be called with either option \Showoption{pck} or option \Showoption{esa},
which select a packed index or an enhanced suffix array, respectively,
for a given set of subject sequences.

\Program computes the  \textit{matching statistics} for each unit. That is, 
for each position \(i\) in 
each unit, say \(s\) of length \(n\), \(\MS{i}=(l,j)\) is computed. Here
\(l\) is the largest integer such that \(\Substring{s}{i}{i+l-1}\) matches
a substring represented by the index and \(j\) is a start position of the
matching substring in the index. We say that \(l\) is the length of \(\MS{i}\)
and \(j\) is the subject position of \(\MS{i}\).
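
As a small illustration of this definition, consider the following toy
example. It is constructed by hand and is not output of \Program; the
sequences are invented, and positions are assumed to be counted from 0.
Suppose the index represents the single subject sequence
\(u=\texttt{gattaca}\) and the unit is \(s=\texttt{tacg}\) of length \(n=4\).
Then the matching statistics of \(s\) are:

\begin{center}
\begin{tabular}{cccl}
\(i\) & \(l\) & \(j\) & matching substring\\
\hline
0 & 3 & 3 & \texttt{tac}\\
1 & 2 & 4 & \texttt{ac}\\
2 & 1 & 5 & \texttt{c}\\
3 & 1 & 0 & \texttt{g}\\
\end{tabular}
\end{center}

For instance, \(\MS{0}=(3,3)\) because \(\Substring{s}{0}{2}=\texttt{tac}\)
occurs in \(u\) at position 3, while \(\Substring{s}{0}{3}=\texttt{tacg}\)
does not occur in \(u\) at all.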

The following options are available in \Program:

\begin{Justshowoptions}
\begin{comment}
\Option{fmi}{$\Showoptionarg{indexname}$}{
Use the old implementation of the FMindex. This option is not recommended.
}
\end{comment}

\Option{esa}{$\Showoptionarg{indexname}$}{
Use the given enhanced suffix array to compute the matches.
}

\Option{pck}{$\Showoptionarg{indexname}$}{
Use the packed index (an efficient representation of the FMindex)
to compute the matches.
}


\Option{query}{$\Showoptionarg{files}$}{
Specify a whitespace-separated list of query files containing the units.
At least one query file must be given. The files may be in
gzipped format, in which case their names must end with the suffix \texttt{.gz}.
}

\Option{min}{$\ell$}{
Specify the minimum value $\ell$ for the length of the matching statistics.
That is, for each unit \(s\) and each position \(i\) in \(s\), the program 
reports all values \(i\) and \(\MS{i}\) if the 
length of \(\MS{i}\) is at least \(\ell\).
}

\Option{max}{$\ell$}{
Specify the maximum value $\ell$ for the length of the matching statistics.
That is, for each unit \(s\) and each position \(i\) in \(s\), the program
reports the values \(i\) and \(\MS{i}\) if the length of \(\MS{i}\)
is at most \(\ell\).
}

\Option{output}{(\Showoptionkey{subjectpos}$\mid$\Showoptionkey{querypos}$\mid$\Showoptionkey{sequence})}{
Specify what to output. At least one of the three keywords
$\Showoptionkey{subjectpos}$,
$\Showoptionkey{querypos}$, and
$\Showoptionkey{sequence}$ must be used.
Using the keyword $\Showoptionkey{subjectpos}$ shows the
subject position of the matching statistics.
Using the keyword $\Showoptionkey{querypos}$ shows the query position.
Using the keyword $\Showoptionkey{sequence}$ shows the sequence content.
}

\Helpoption

\end{Justshowoptions}
The following conditions must be satisfied:
\begin{enumerate}
\item
At least one of the options \Showoption{min} and \Showoption{max} must be used.
\item
If both options \Showoption{min} and \Showoption{max} are used, then
the value specified by option \(\Showoption{min}\) must be smaller
than the value specified by option \(\Showoption{max}\).
\item
Exactly one of the options \Showoption{pck} and \Showoption{esa} must be used;
the two cannot be combined.
\end{enumerate}
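
Read together, and assuming that both bounds are inclusive (as the wording
``at least'' and ``at most'' in the option descriptions suggests), the two
length options act as a filter on the length \(l\) of \(\MS{i}\). If
\Showoption{min} is given the value \(\ell_{\min}\) and \Showoption{max} the
value \(\ell_{\max}\) (symbols introduced here for illustration only), then
position \(i\) is reported exactly when
\[
\ell_{\min} \leq l \leq \ell_{\max}.
\]
If only one of the two options is used, presumably only the corresponding
inequality applies.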

\section{Examples}

Suppose that in some directory, say \texttt{homo-sapiens}, we have 25 gzipped
FASTA files containing all 24 human chromosomes plus one file with
mitochondrial sequences. These may have been downloaded from
\url{ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens_47_36i/dna}.

In the first step, we construct the packed index for the entire genome:

\begin{Output}
gt packedindex mkindex -dna -dir rev -parts 15 -bsize 10 -locfreq 32
                       -indexname human-all -db homo-sapiens/*.gz
\end{Output}

The program runs for almost two hours and delivers 
an index \texttt{human-all} consisting of three files:

\begin{Output}
ls -lh human-all.*
-rw-r----- 1 kurtz gistaff   37 2008-01-24 00:47 human-all.al1
-rw-r----- 1 kurtz gistaff 1.9G 2008-01-24 02:37 human-all.bdx
-rw-r----- 1 kurtz gistaff 3.4K 2008-01-24 02:37 human-all.prj
\end{Output}

This index is used in the following call to the program \Program:

\begin{Output}
gt matstat -output subjectpos querypos sequence -min 20 -max 30 
           -query queryfile.fna -pck human-all
unit 0 (Mus musculus, chr 1, complete sequence)
22 20 390765125 actgtatctcaaaatataaa
253 21 258488266 gggaataaacatgtcattgag
254 20 258488267 ggaataaacatgtcattgag
275 20 900483549 taattctatttttctttctt
480 20 1008274536 gcttgaagatcatgatccag
..
\end{Output}
Here, the first column shows the relative positions in unit 0 for which the
length of the matching statistics is between 20 and 30. The second column is
the corresponding length value. The third column shows the position of the
matching sequence in the index, and the fourth shows the sequence content.
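
The two consecutive lines for positions 253 and 254 illustrate a general
property of matching statistics. It is stated here as a side remark, not as a
description of how \Program works internally. If \(\MS{i}=(l,j)\) with
\(l\geq 1\), then \(\Substring{s}{i+1}{i+l-1}\) occurs in the index at
position \(j+1\), so the length of \(\MS{i+1}\) is at least \(l-1\). Writing
\(l_{i}\) for the length of \(\MS{i}\) (a notation used only in this remark),
this reads
\[
l_{i+1} \geq l_{i}-1.
\]
In the output above, \(l_{253}=21\) and \(l_{254}=20=l_{253}-1\), and the
reported subject positions 258488266 and 258488267 differ by exactly one.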

\end{document}