<html>
<head>
<title>SPALN information</title>
<meta HTTP-EQUIV="OWNER" CONTENT="O.Gotoh">
<link REL="made" HREF="mailto:o.gotoh@aist.go.jp">
</head>
<!--
<body BGCOLOR="#00336633" link="#FFFF00" vlink="#77FFFF77" alink="#FFFF00" text="#FFFFFF">
-->
<a NAME="Top"></a>
<DIV style="background-color:#FFF0C4;">
<br>
<h1><center>SPALN information</center></h1>
<br>
<br>
<H2 ALIGN=left>
<ul>
<li><a href='#Ov'>Overview</a>
<li><a href="#Ref">References</a>
<!-- <li><a href="../cgi-bin/spaln.pl">Online Server</a> -->
<!-- <li><a href="./pr_tut.html">Tutorial</a> -->
<li><a href="#In">Install</a>
<li><a href="#Seq">Sequence data formation</a>
<li><a href="#Usr">Execution</a>
<li><a href="#Exl">Example</a>
<li><a href="#Chg">Change from previous version</a>
</ul>
</H2>
<SPAN style="padding-bottom:20px;"></SPAN>
<hr>
</DIV>
<body BGCOLOR="#FFFFF0FF">
<H2 ALIGN=left> <a name="Ov">Overview</a> </H2>
<B>Spaln</B> (space-efficient spliced alignment) is a
stand-alone program that maps and aligns a set of cDNA or protein sequences
onto a whole genomic sequence in a single job. <b>Spaln</b> also performs spliced
or ordinary alignment after rapid similarity search against a protein sequence
database, if a genomic segment or an amino acid sequence is given as a query.
From Version 1.4, spaln supports a combination of protein sequence database and
a given genomic segment. From Version 2.2, spaln also performs rapid similarity
search and (semi-)global alignment of a set of protein sequence queries against
a protein sequence database. <b>Spaln</b> adopts multi-phase
heuristics that make it possible to perform the job on a conventional personal
computer running under Unix/Linux with limited memory. The program is written
in C++ and distributed as source code and also as executables for a few platforms.
Unless a binary is provided for their platform, users must compile the program on their
own system. Although the program has been tested only on a Linux operating
system, it is likely to be portable to most Unix systems with little or no
modification. The accessory program <b>sortgrcd</b> sorts the gene loci found
by <b>spaln</b> in the order of chromosomal position and orientation. From
version 2.3.2, <b>spaln</b> and <b>sortgrcd</b> can handle some gzipped files
without prior expansion if USE_ZLIB mode is activated upon compilation.
From version 2.3.2a, compressed query sequence file(s) may also be accepted.
From version 2.4.0, multiple files corresponding to different output
formats can be generated in a single run.
<p>
<h2><a name="Ref">References</a></h2>
<P>
[1] Gotoh, O. "<a href='http://nar.oxfordjournals.org/cgi/content/abstract/gkn105?ijkey=N2yLVza41RuShAg&keytype=ref'>
A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence</a>"
<i>Nucleic Acids Research</i> <b>36</b> (8) 2630-2638 (2008).
<BR>
[2] Gotoh, O. "<a href='http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn460?ijkey=XajuzvyHlcQZoQd&keytype=ref'>
Direct mapping and alignment of protein sequences onto genomic sequence</a>"
<i>Bioinformatics</i> <b>24</b> (21) 2438-2444 (2008).
<BR>
[3] Iwata, H. and Gotoh, O. "<a href='http://nar.oxfordjournals.org/content/40/20/e161'>
Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features</a>"
<i>Nucleic Acids Research</i> <b>40</b> (20) e161 (2012)
<BR>
[4] Gotoh, O. "<a href='https://doi.org/10.1093/bioinformatics/16.3.190'>
Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps</a>" <i>Bioinformatics</i> <b>16</b> (3) 190-202 (2000)
<BR>
[5] Nagasaki, H., Arita, M., Nishizawa, T., Suwa, M., Gotoh, O. "<a href='https://academic.oup.com/bioinformatics/article/22/10/1211/236993'>
Automated classification of alternative splicing and transcriptional initiation and construction of a visual database of the classified patterns</a>" <i>Bioinformatics</i> <b>22</b> (10) 1211-1216 (2006)
<BR>
[6] Gotoh, O.
Cooperation of Spaln and Prrn5 for construction of gene-structure-aware multiple sequence alignment. <i>Methods in Molecular Biology</i>, in press.
<BR>
<p>
<h2><a href="#Chg">Present Version: 2.4.1</a>, Last update: 2020-10-09 </h2>
<p>
<h2><a name="In">Install</a></h2>
<h3>From source</h3>
To compile the source code with the default settings, follow the instructions below.
If you download the source file (<a href="../archive/spaln2.4.0.tar.gz">spaln2.4.0</a>) in the directory <font color="#40A040"><em>download</em></font>, five directories will be generated under <em><font color="#40A040">download</font>/spalnXX/</em> after installation, where XX is a version code. We assume <font color="#40A040"><em>work</em></font> is your workspace, which may or may not be identical to <em><font color="#40A040">download</font></em>.
<p>
<ul>
<li> bin : binaries
<li> doc : documents
<li> seqdb : sample sequences. In this directory you should format genomic or database files
<li> src : source codes
<li> table : parameter files used by <B>spaln</B>
</ul>
<p>
To change the location of executables and/or other settings, run 'configure --help' at step 6 below to see the available options. (<b>Warning</b>: Full path names rather than relative path names must be given for executables or other directories passed as arguments to the <B>configure</B> command.) These locations are hard-coded in <B>spaln</B>. The locations of the 'seqdb' and 'table' directories are denoted below by <em><font color="#40A040">seqdb</font></em> and <em><font color="#40A040">table</font></em>, respectively. Hence, <em><font color="#40A040">seqdb</font></em>=<em><font color="#40A040">download</font>/spalnXX/seqdb</em>, and <font color="#40A040"><em>table</em></font>=<em><font color="#40A040">download</font>/spalnXX/table</em> in the default settings.
<p>
<ol>
<li> <TT>% mkdir <em><font color="#40A040">download</font></em></TT>
<li> <TT>% cd <em><font color="#40A040">download</font></em></TT>
<li> Download spalnXX.tar.gz
<li> <TT>% tar xfz spalnXX.tar.gz</TT>
<li> <TT>% cd ./spalnXX/src</TT>
<li> <TT>% ./configure [--help]</TT>
<ul>
<li>If $(CC) does not indicate a C++ compiler, either edit the Makefile manually or run
<li><TT>% CXX=g++ ./configure [other options]</TT>
<li><a name="compile"></a>To make <b>spaln</b> and <b>sortgrcd</b> handle gzipped files, run the ./configure command with --use_zlib=1. Alternatively, you may manually edit the generated Makefile so that -DUSE_ZLIB=1 is included in the compile options.
</ul>
<li> <TT>% make</TT>
<li> <TT>% make install</TT>
<ul>
<li>Executables are copied to ../bin
<li>makmdm program makes mutation data matrices of various PAM levels in the ../table directory
</ul>
<li> <TT>% make clearall</TT>
<li> Add <em><font color="#40A040">download</font>/spalnXX/bin</em> to your PATH
<ul>
<TT>% setenv PATH $PATH:<em><font color="#40A040">download</font>/spalnXX/bin</em> (csh/tcsh)</TT><BR>
<TT>$ export PATH=$PATH:<em><font color="#40A040">download</font>/spalnXX/bin</em> (sh/bash)</TT> <BR>
</ul>
Preferably, add the appropriate line above to your startup rc file (<i>e.g.</i> ~/.bashrc).<BR>
Alternatively, move or copy <em><font color="#40A040">download</font>/spalnXX/bin/*</em> to a directory on your PATH, if you have not specified the location of executables at step 6 above.
<li> If you have changed the location of <em><font color="#40A040">table</font></em> and/or <em><font color="#40A040">seqdb</font></em> directory after installation, set the env variables ALN_TAB and/or ALN_DBS as explained in the following subsection.
<li> Proceed to <a href="#Seq">Sequence data formation</a>.
</ol>
<BR>
<h3>From binaries</h3>
Binaries for a 32-bit (<a href="../archive/spaln2.0.4.linux32.tar.gz">spaln2.0.4.linux32</a>) or 64-bit
(<a href="../archive/spaln2.4.0.linux64.tar.gz">spaln2.4.0.linux64</a>) Linux machine are available. The
executable also runs in the 64-bit Windows 10 WSL environment without any modification.
To use the binaries, follow the instructions below.
<p>
Case I: Assume the directory <font color="#40A040"><em>work</em></font> is your workspace where all materials are stored. In this case, <font color="#40A040"><em>seqdb</em></font>=<em><font color="#40A040">work</font></em>.
<ol>
<li> <TT>% mkdir <em><font color="#40A040">work</font></em></TT>
<li> <TT>% cd <em><font color="#40A040">work</font></em></TT>
<li> Download spalnXX.PC.tar.gz, where PC is a platform code
<li> <TT>% tar xfz spalnXX.PC.tar.gz</TT>
<li> Add <em><font color="#40A040">work</font>/bin</em> to your PATH <BR>
Or move or copy <em><font color="#40A040">work</font>/bin/*</em> to a directory on your PATH
<li> <TT>% mv ./table/* .; rmdir ./table</TT>
<li> <TT>% mv ./seqdb/* .; rmdir ./seqdb</TT>
<li> Now proceed to <a href="#Seq">Sequence data formation</a>.
</ol>
<p>
Case II: Assume your workspace <font color="#40A040"><em>work</em></font> is distinct from <font color="#40A040"><em>seqdb</em></font>
<ol>
<li> <TT>% mkdir <em><font color="#40A040">download</font></em></TT>
<li> <TT>% cd <em><font color="#40A040">download</font></em></TT>
<li> Download spalnXX.PC.tar.gz, where PC is a platform code
<li> <TT>% tar xfz spalnXX.PC.tar.gz</TT>
<li> Add <em><font color="#40A040">download</font>/bin</em> to your PATH<BR>
Or move or copy <em><font color="#40A040">download</font>/bin/*</em> to a directory on your PATH
<li> <TT>% setenv ALN_TAB <font color="#40A040"><em>download</em></font>/table</TT> (csh/tcsh)
<BR> <TT>$ export ALN_TAB=<font color="#40A040"><em>download</em></font>/table</TT> (sh/bash)
<li> <TT>% setenv ALN_DBS <font color="#40A040"><em>download</em></font>/seqdb</TT> (csh/tcsh)
<BR> <TT>$ export ALN_DBS=<font color="#40A040"><em>download</em></font>/seqdb</TT> (sh/bash)
<li> Add the above lines to your rc file, so that you don't have to repeat the commands at every login time.
<li> Now proceed to <a href="#Seq">Sequence data formation </a>
</ol>
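<p>
How ALN_DBS comes into play can be illustrated with a small sketch. It follows the lookup order documented for the -R option in the Execution section (the current directory, then the directory named by ALN_DBS, then the 'seqdb' directory fixed at compile time). The function name <TT>find_db_file</TT> and the directory layout used in the demo are hypothetical; this is not <b>spaln</b>'s actual code.
</p>

```python
import tempfile
from pathlib import Path


def find_db_file(name, cwd, environ, compiled_seqdb):
    """Return the first existing path for `name`, or None if it is not found.

    Hypothetical helper mirroring the documented order:
    current directory -> $ALN_DBS -> compile-time seqdb directory.
    """
    candidates = [Path(cwd)]
    if environ.get("ALN_DBS"):
        candidates.append(Path(environ["ALN_DBS"]))
    candidates.append(Path(compiled_seqdb))
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None


# Demo with throw-away directories (all names are made up for illustration).
with tempfile.TemporaryDirectory() as root:
    work = Path(root, "work")
    env_dbs = Path(root, "download", "seqdb")
    builtin = Path(root, "builtin_seqdb")
    for d in (work, env_dbs, builtin):
        d.mkdir(parents=True)
    (env_dbs / "xxxgnm.bkn").touch()
    (builtin / "xxxgnm.bkn").touch()

    # With ALN_DBS set, its directory is consulted before the built-in one.
    found = find_db_file("xxxgnm.bkn", work, {"ALN_DBS": str(env_dbs)}, builtin)
    found_in_env_dir = found is not None and found.parent == env_dbs
    # Without ALN_DBS, the compile-time directory is the fallback.
    fallback = find_db_file("xxxgnm.bkn", work, {}, builtin)
    fallback_is_builtin = fallback is not None and fallback.parent == builtin
    missing = find_db_file("absent.bkp", work, {}, builtin)
```

<p>
Passing the environment as a plain dictionary keeps the sketch easy to try without touching the real shell settings.
</p>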
<p>
<h2><a name="Seq">Sequence data formation</a></h2>
If you do not need genome mapping or database search, you may skip this section. All sequence files should be in (multi-)fasta format.<BR>
To perform genome mapping, the genomic sequence must be formatted before use. Formatting is optional for amino acid sequence database search.
<ol>
<li> <TT>% cd <font color="#40A040"><em>seqdb</em></font></TT></li>
<li> Download or copy genomic sequences or protein database sequences in multi-fasta format. If <b>spaln</b> has been <a href='#compile'>compiled</a> accordingly, gzipped files need not be uncompressed (the file name must be <i>X</i>.gz).</li>
<li> To use the 'makeidx.pl' command, chromosomal sequences must be concatenated into a single file. For the 'make' command to work, the extension of the genomic sequence file must be '.mfa' or '.gf', and that of the protein database sequence file must be '.faa'. These restrictions do not apply to the 'spaln -W' command. Hereafter, the file name is assumed to be xxxgnm.gf or prosdb.faa. <del> The total number of residues in a file must not be greater than or equal to 2**32.</del> </li>
<li>To format xxxgnm.gf(.gz), run either of the following two commands, which are equivalent to each other except that the former is faster, accepts multiple input files, and does not need Makefile.
<ul>
<TT>% spaln -W -K[D|P] [-X<i>GMAX_GENE</i>] [other spaln options] xxxgnm.gf(.gz) ...</TT><BR>
<TT>% ./makeidx.pl -i[n|p|np] [-X<i>GMAX_GENE</i>] [other spaln options] xxxgnm.gf(.gz)</TT><BR>
</ul>
To format protein database sequence, use either of the following two commands:
<ul>
<TT>% spaln -W -KA [other spaln options] prosdb.faa(.gz) ...</TT><BR>
<TT>% ./makeidx.pl -ia [other spaln options] prosdb.faa(.gz)</TT><BR>
</ul>
<ul>
<li> -K<i>X</i> (or corresponding -i<i>x</i>) option specifies the "block file" xxxgnm.bk<i>x</i> to be constructed, where <i>X</i> is 'A', 'D' or 'P' and <i>x</i> is 'a', 'n' or 'p'. The -inp option will construct both xxxgnm.bkn (for cDNA queries)
and xxxgnm.bkp (for protein queries) together with the xxxgnm.idx and associated files.
-K<i>X</i> option is mandatory. If -i<i>x</i> is omitted or <i>x</i> is empty, xxxgnm.idx and associated files are created but no block file is constructed.
<li>The block size and <i>k</i>-mer size are estimated from the genome size, unless explicitly specified (see below).</li>
<li>If <i>MAX_GENE</i> (the length of the plausibly longest gene on the genome) is
not specified, <i>MAX_GENE</i> is also estimated from the genome size.</li>
<li><em><b>WARNING:</b> <u>Don't forget to specify <i>MAX_GENE</i> if xxxgnm.gf represents only a part of the genome (<i>e.g.</i> a single chromosome)!</u></em> Otherwise, <i>MAX_GENE</i> may be seriously underestimated. <i>MAX_GENE</i> can have the suffix 'k' or 'M' to indicate that the number is measured in kbp or Mbp, respectively.</li>
<li>-g option generates gzipped outputs.</li>
<li>-t<i>N</i> option enables <i>N</i>-thread operation.</li>
<li>-E option generates local lookup table.</li>
</ul>
As the <TT>spaln -W</TT> command accepts multiple input files and generates all necessary files in a single operation, you can <a href="#Usr">skip</a> the following instructions.<BR><BR>
The makeidx.pl command performs the following series of operations (steps 5-6) if the input is a single sequence file.
<li> <TT>% make xxxgnm.idx</TT> (for genomic sequence) or<BR>
<TT>% make prosdb.idx</TT> (for protein database sequence)
<ul>
<li> This command converts the sequence into a binary format. Four or five files, xxxgnm.seq, xxxgnm.idx, xxxgnm.ent, xxxgnm.grp, and optionally xxxgnm.odr, are constructed (with prosdb instead of xxxgnm in the case of make prosdb.idx). It may take several tens of minutes to construct the files for a mammalian genome.</li>
</ul></li>
<li> <TT>% make xxxgnm.bkn</TT> (for cDNA queries) or<BR>
<TT>% make xxxgnm.bkp</TT> (for protein queries) or<BR>
<TT>% make prosdb.bka</TT> (for protein database)<BR>
<ul>
<li> This command makes the block index table. This process may take another several tens of minutes.
<li> Internally, the make command invokes<BR>
'<TT>spaln -Wxxxgnm.bkn -KD [Options] xxxgnm.gf</TT>' or<BR>
'<TT>spaln -Wxxxgnm.bkp -KP [Options] xxxgnm.gf</TT>' or<BR>
'<TT>spaln -Wprosdb.bka -KA [Options] prosdb.faa</TT>'.
<li> If xxxgnm.grp or prosdb.grp was successfully constructed at step 5 above, the option values below are automatically calculated by the script makblk.pl.
<li> Options: (default value)
<ul>
<li> -XA<i>N</i>: Alphabet size of the reduced amino acids: 6 &lt; <i>N</i> &lt;= 20 (20)
<li> -XB<i>S</i>: Bit patterns of the spaced seeds. The pattern should be asymmetric when the number of patterns &gt; 2.
<li> -XC<i>N</i>: Number of seed patterns: 0 &lt;= <i>N</i> &lt;= 5 (0: contiguous seed)
<li> -XG<i>N</i>: Maximum gene length (262144)
<li> -Xa<i>N</i>: A parameter used to filter excessively abundant words (10)
<li> -Xb<i>N</i>: Block size (4096)
<del><i>N</i> &lt; 65536 and 'genome size' &lt; 65536 * <i>N</i> must be satisfied.</del> An estimate of <i>N</i> is sqrt(genome size). For mammals, <i>N</i> is approximately 54000.
<li> -Xg<i>N</i>: Maximal distance in block number between 5' terminal and 3' terminal blocks (16)
<li> -Xk<i>N</i>: Word size (11 for DNA, 5 for protein)
<li> -Xs<i>N</i>: Distance between neighboring seeds (= <i>k</i>) <BR>
</ul>
</ul>
<li> It is possible to generate xxxgnm.idx and the other three files directly from the input files without concatenation:<BR>
<ul>
<TT>% makdbs -nxxxgnm -KD file1 ... fileN</TT><BR>
<TT>% make xxxgnm.bkn</TT> (for cDNA queries) or<BR>
<TT>% make xxxgnm.bkp</TT> (for protein queries)<BR>
</ul>
This method is particularly useful when the concatenation might yield a file too large to be dealt with by the OS.
</ol>
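<p>
The two rules of thumb given above, namely that <i>MAX_GENE</i> accepts the suffixes 'k' and 'M' and that the block size can be estimated as sqrt(genome size), can be tried out with a short sketch. The function names are hypothetical, and the suffixes are assumed here to be decimal multipliers (1 kbp = 1,000 bp); the heuristics actually used inside <b>spaln</b> may differ.
</p>

```python
import math

# Assumption: 'k' and 'M' are decimal multipliers (kbp = 1,000 bp, Mbp = 1,000,000 bp).
SUFFIX_MULTIPLIER = {"k": 1_000, "M": 1_000_000}


def parse_length(text):
    """Convert a length such as '262144', '250k' or '2M' to base pairs."""
    text = text.strip()
    if text and text[-1] in SUFFIX_MULTIPLIER:
        return int(text[:-1]) * SUFFIX_MULTIPLIER[text[-1]]
    return int(text)


def estimate_block_size(genome_size):
    """Rough block size: the integer square root of the genome size."""
    return math.isqrt(genome_size)


max_gene = parse_length("250k")
block_size = estimate_block_size(3_000_000_000)  # a mammalian-sized genome
# Print the values in the form of the -XG and -Xb options listed above.
print(f"-XG{max_gene} -Xb{block_size}")  # prints "-XG250000 -Xb54772"
```

<p>
For a genome of about 3 Gbp the estimate is close to the value of 54000 quoted above for mammals.
</p>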
<h2><a name="Usr">Execution</a></h2>
<ol>
<li> Prepare protein, cDNA, or genomic segment sequence(s) in (multi-)fasta
or extended (multi-)fasta (see -O6 option)
format (denoted by <i>query</i> below). From 2.3.2a, gzipped fasta file(s) may
be used as the query without prior expansion. Note, however, that a compressed
query can considerably slow down execution.
<li> Store <i>query</i> in <em><font color="#40A040">work</font></em>
<li> <TT>% cd <em><font color="#40A040">work</font></em></TT>
<li> Run <B>spaln</B> in one of the following four modes. <B>Spaln</B>
does not support comparison between two genomic segments.<BR>
(A) <TT>% spaln -Q[0|1|2|3] [-O<i>N</i>] [other options]
<i>genome_segment query</i></TT><BR>
(B) <TT>% spaln -Q[4|5|6|7] [-O<i>N</i>] [other options]
-[d|D] <i>xxxgnm query</i></TT><BR>
(C) <TT>% spaln -Q[4|5|6|7] [-O<i>N</i>] [other options]
-[a|A] <i>prosdb query</i></TT><BR>
(D) <TT>% spaln -Q[4|5|6|7] [-O<i>N</i>] [other options]
<i>prosdb.faa query</i></TT><BR>
In the last case, <i>prosdb.faa</i> will be internally formatted, and the formatted results will be discarded after the end of execution.<BR>
<p>
Only a subset of queries may be examined if <i>query</i> is replaced with '<i>query</i> (from to)', where 'from' and 'to' are the first and last entry numbers in <i>query</i> to be examined.<BR> To run <B>spaln</B> on multiple CPUs, for example, the following commands may be used and the results may be summarized with <B>sortgrcd</B>, as explained later.<BR>
(a) <TT>% spaln -Q7 -O12 -oxxxO1 -dxxxgnm '<i>query</i> (1 1000)'</TT><BR>
(b) <TT>% spaln -Q7 -O12 -oxxxO2 -dxxxgnm '<i>query</i> (1001 2000)'</TT><BR>
(c) <TT>% spaln -Q7 ...</TT><BR>
However, the procedure will be simplified if a multi-thread operation is
used as follows:<br>
(d) <TT>% spaln -Q7 -O<i>N</i> -oxxx -dxxxgnm -t[<i>N</i>] query</TT><p>
Options: (default value)<BR>
<ul>
<li> -C<i>N</i>: Use the genetic code specified by the "transl_table number" defined in <a href="http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi">NCBI transl_table</a> (1).
<li> -E: Use local lookup table.
<li> -H<i>N</i>: Output is suppressed if the alignment score is less than <i>N</i>. See also -pw. (35)
<li> -LS: Smith-Waterman-type local alignment. This option may prune out weakly matched terminal regions.
<li> -M[<i>N</i>[.<i>M</i>]]: Single or multiple output for each query
<ul>
<li> No option (default): single locus
<li> No argument: Multiple loci up to the maximum number specified by the program (4 in the present implementation)
<li> <i>N</i>=1: Re-search the <i>query</i> region not aligned in the first trial. May be useful to detect chimeras or fragmented genomic regions
<li> <i>N</i>&gt;1: Output multiple loci maximally up to <i>N</i>
<li> <i>M</i>: Maximal number of candidate loci to be subjected to spliced alignment (4). If <i>M</i> &lt; <i>N</i>, <i>M</i> is reset to <i>N</i>.
</ul>
<li> -O<i>N</i>[,<i>N</i><sub>2</sub>,<i>N</i><sub>3</sub>...]: Select output format for genome vs cDNA or aa (4)<BR>
It is possible to output multiple files with extensions of .O<i>N</i> in a single run by repeating this option, or by concatenating the format numbers with commas (,) or colons (:), e.g. -O0,1,4. See also -o option.
<ul>
<li> <i>N</i>=0: Gff3 gene format
<li> <i>N</i>=1: Alignment
<li> <i>N</i>=2: Gff3 match format
<li> <i>N</i>=3: Bed format
<li> <i>N</i>=4: Exon-oriented format similar to the output of megablast -D 3
<li> <i>N</i>=5: Intron-oriented output
<li> <i>N</i>=6: Concatenated exon sequence in extended (multi-)fasta format, in which the exon-intron structure of the parental gene is supplied by one
or more comment lines starting with ';C', such as<BR>
<TT>;C complement(join(1232555..1232760,1233786..1233849,1233947..1234119,</TT><BR>
<TT>;C 1234206..1234392))</TT><BR>
<li> <i>N</i>=7: Translated amino-acid sequence in extended (multi-)fasta format. Presently not very useful for cDNA queries because the entire exon rather than an ORF is translated
<li> <i>N</i>=8: Mapping (block) information only. Use with -Q4
<li> <i>N</i>=10: SAM format
<li> <i>N</i>=12: Output the same information as -O4 in the
binary format. If -oOutput is set, three files named
Output.grd, Output.erd, and Output.qrd will be created. Otherwise,
query.grd, query.erd, and query.qrd will be created.
</ul>
<li> -O<i>N</i>[,<i>N</i><sub>2</sub>,<i>N</i><sub>3</sub>...]: Select output format for aa vs aa (4)<BR>
It is possible to output multiple files with extensions of .O<i>N</i> in a single run by repeating this option, or by concatenating the format numbers with commas (,) or colons (:), e.g. -O0,1,4. See also -o option.
<ul>
<li> <i>N</i>=0: statistics (%divergence alignment_score #match #mismatch #gap #unpaired)
<li> <i>N</i>=1: Alignment
<li> <i>N</i>=2: Sugar format
<li> <i>N</i>=3: Psl format
<li> <i>N</i>=4: XYL = Coordinate + match length
<li> <i>N</i>=5: statistics + XYL
<li> <i>N</i>=8: Cigar format
<li> <i>N</i>=9: Vulgar format
<li> <i>N</i>=10: SAM format
</ul>
<li> -Q<i>N</i>: Select algorithm (3)
<ul>
<li> 0&lt;=<i>N</i>&lt;=3: Genomic segment in the fasta format given by the first argument vs. <i>query</i> given by the following arguments. See also -i option. One may skip the formatting step described above if only this mode of operation is used.
<li> 4&lt;=<i>N</i>&lt;=7: Genome mapping and alignment. The genomic sequence must be formatted beforehand.
<li> <i>N</i>=0,4: DP procedure without HSP search. Considerably slow
<li> <i>N</i>=1,2,3,5,6,7: Recursive HSP search up to the level of (<i>N</i> % 4)
</ul>
<li> -R<i>S</i>: Read block index table from file <i>S</i>.
If omitted, the xxxgnm.bkn, xxxgnm.bkp,
or prosdb.bka file will be read depending on the type of query. The
appropriate file is searched for in the current directory, the directory
specified by the <span class=SpellE>env</span> variable 'ALN_DBS', and the
'seqdb' directory specified at the compile time in this order.
<li> -S<i>N</i>: Orientation of the DNA query sequence (3)
<ul>
<li> <i>N</i>=0: The
orientation is inferred from the phrases (e.g. 5' end) in the header
line of each entry within a fasta file. If no information is available,
both orientations are examined, and the result with the better score is
reported. Terminal polyA or polyT sequence is not trimmed.
<li> <i>N</i>=1: Forward orientation only. PolyA tail may be trimmed off.
<li> <i>N</i>=2: Reverse-complement orientation only. Leading polyT may be trimmed off.
<li> <i>N</i>=3: Examine both orientations. Terminal polyA or polyT sequence may be trimmed off.
</ul>
<li> -T<i>S</i>: Specify the species-specific parameter set. <i>S</i> corresponds to the subdirectory in the <em><font color="#40A040">table</font></em> directory. Alternatively, <i>S</i> may be the 1st or the 3rd term in <em><font color="#40A040">table/gnm2tab</font></em> file, where the 2nd term on the line indicates the subdirectory.
<li> -V<i>N</i>: Minimum space to induce Hirschberg's algorithm (16M)
<li> -W<i>S</i>: Write block index table to file <i>S</i>.bk<i>x</i>. If <i>S</i> is omitted, the file name (without directory and extension) of the first argument is used as <i>S</i>.
<li> -g: gzipped output used in combination with -W or -O12 option.
<li> -i[a|p]: Input mode with -Q[0..3].
<ul>
<li> -ia: Alternative mode; a genomic segment of an odd numbered entry in the input file is aligned with the query of the following entry.
<li> -ip: Parallel mode; the i-th entry in the file specified by the first argument is aligned with the i-th entry in the file specified by the second argument.
<li> default: The genomic segment specified by the first argument is aligned with each entry in the file specified by the following arguments.
</ul>
<li> -o<i>S</i>: Destination of output file name (stdout). If multiple output formats are specified by -O option(s), <i>S</i> specifies the directory or prefix to which the file names with .O<i>N</i> extensions are concatenated.
<li> -pa<i>N</i>: Terminal polyA or polyT sequence longer than <i>N</i> (12) is trimmed off and the orientation is fixed accordingly. If <i>N</i> = 0 or empty, these functionalities are disabled.
<li> -pi: Mark exon-intron junctions by color in the alignment (-O1).
<li> -pq: Suppress warning messages sent to <i>stderr</i>.
<li> -pw: Report result even if alignment score is below threshold value.
<li> -px: Suppress self-comparisons in the execution mode (C) or (D).
<li> -xB<i>S</i>: Bit pattern of seeds used for HSP search at level 1
<li> -xb<i>S</i>: Bit pattern of seeds used for HSP search at level 3
<li> -u<i>N</i>: Gap-extension penalty (3, 2, 2)
<li> -v<i>N</i>: Gap-opening penalty (8, 6, 9)
<li> -ya<i>N</i>: Dinucleotide pairs at the ends of an intron (0)
<ul>
<li> <i>N</i>=0: Accept only the canonical pairs (GT..AG,GC..AG,AT..AC)
<li> <i>N</i>=1: accept also AT..AN
<li> <i>N</i>=2: allow up to one mismatch from GT..AG
<li> <i>N</i>=3: accept any pairs. An omission of <i>N</i> implies <i>N</i> = 3
</ul>
<li> -yi<i>N</i>: Intron penalty (11, 8, 11)
<li> -yj<i>N</i>: Incline of long gap penalty (0.6)
<li> -yk<i>N</i>: Flex point where the incline of gap penalty changes (7)
<li> -yl<i>N</i>: Double affine gap penalty if <i>N</i>=3; otherwise affine gap penalty
<li> -ym<i>N</i>: Score for a nucleotide match (2, 2)
<li> -yn<i>N</i>: Penalty for nucleotide mismatch (6, 2)
<li> -yo<i>N</i>: Penalty for an in-frame termination codon (100)
<li> -yp<i>N</i>: PAM level used in the alignment (third) phase (150)
<li> -yq<i>N</i>: PAM level used in the second phase (50)
<li> -yx<i>N</i>: Penalty for a frame shift (100)
<li> -yy<i>N</i>: Relative contribution of splicing signal (8)
<li> -yz<i>N</i>: Relative contribution of coding potential (2)
<li> -yA<i>N</i>: Relative contribution of the translational initiation or termination signal (8)
<li> -yB<i>N</i>: Relative contribution of branch point signal (0)
<li> -yE<i>N</i>: Minimum exon length (2)
<li> -yI<i>S</i>: Intron distribution parameters
<li> -yJ<i>N</i>: Relative contribution of the bonus given
to a conserved intron position
<li> -yL<i>N</i>: Minimum intron length (30, 30)
<li> -yS<i>N</i>: <i>N</i> specifies the percentile
contribution of the species-specific splice signal. The other part is
derived from the universal signal given to the dinucleotides at the ends
of an intron. An omission of <i>N</i> implies <i>N</i> = 100.
<li> -yX<i>N</i>: <i>N</i> = 0: set parameter values for intra-species
comparison. <i>N</i> = 1: set parameter values for cross-species comparison. The default value for <i>N</i> is 0 or 1 for DNA or protein query, respectively.
<li> -yY<i>N</i>: Relative contribution of length-dependent part of intron penalty (8)
<li> -yZ<i>N</i>: Relative contribution of oligomer composition within an intron (0)
</ul>
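<p>
The ';C' comment lines of the extended (multi-)fasta format (-O6) can be read back with a few lines of code. The sketch below is a hypothetical reader, not part of the <b>spaln</b> distribution: it assumes that continuation lines each start with ';C' and that coordinates are plain integers of the form start..end, as in the example shown for -O6.
</p>

```python
import re


def parse_exon_structure(comment_lines):
    """Parse ';C' comment lines into (strand, [(start, end), ...]).

    Exons are returned in the order in which they are listed.
    """
    # Drop the ';C' prefix and glue continuation lines together.
    text = "".join(line[2:].strip() for line in comment_lines
                   if line.startswith(";C"))
    strand = "-" if text.startswith("complement(") else "+"
    exons = [(int(start), int(end))
             for start, end in re.findall(r"(\d+)\.\.(\d+)", text)]
    return strand, exons


lines = [
    ";C complement(join(1232555..1232760,1233786..1233849,1233947..1234119,",
    ";C 1234206..1234392))",
]
strand, exons = parse_exon_structure(lines)
# Total exon length, counting both end positions of every exon.
exon_total = sum(end - start + 1 for start, end in exons)
print(strand, len(exons), exon_total)  # prints "- 4 630"
```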
<li> Sortgrcd</li><p>
<B>Sortgrcd</B> is used to recover the output of <B>spaln</B> with -O12 option, to apply some filtering, and also to rearrange the output of multiple <B>spaln</B> runs. It is invoked by:<br>
<TT>% sortgrcd [options] X.grd [Y.grd ...]</TT> or<BR>
<TT>% sortgrcd [options] X.grd.gz [Y.grd.gz ...]</TT><BR>
<ul>
<li> Options:
<ul>
<li> -C<i>N</i>: Minimum cover rate = % nucleotides in predicted exons / length of <i>query</i> (x 3 if query is protein) (0-100)
<li> -E<i>N</i>: Report only the best (<i>N</i>=1) or all (<i>N</i>=2) results per gene locus (1)
<li> -F<i>N</i>: Filter level (<i>N</i>=0: no; <i>N</i>=1: mild; <i>N</i>=2: medium; <i>N</i>=3: stringent)
<li> -I<i>N</i>: Minimum sequence identity (0-100)
<li> -H<i>N</i>: Minimum alignment score (35)
<li> -O<i>N</i>: Output mode. <i>N</i>=0 or 3-7: same as -O<i>N</i> of spaln; <i>N</i>=15: -O5 format for unique introns (default: <i>N</i>=4)
<li> -S<i>C</i>: Sort chromosomes/contigs in the order of <i>C</i>=a: alphabetical, b: abundance, c: appearance in genome database, r: reverse order for minus strand
<li> -V<i>N</i>: Internal memory size used for core sort. If the
data size is greater than <i>N</i>, the sorting procedure will be
done in pieces.
<li> -m<i>N</i>: Maximum number of mismatches within 20 bp from the nearest exon-intron boundary
<li> -n<i>N</i>: Maximum number of non-canonical (other than GT..AG, GC..AG, AT..AC) intron ends
<li> -u<i>N</i>: Maximum number of unpaired (gap) sites within 20 bp from the nearest exon-intron boundary
</ul>
<li> By default, no filter listed above is applied.
<li> When the output of <B>spaln</B> is separated into several files, the combined results are subjected to the sorting. Although *.grd files are assigned as the argument, there must be corresponding *.erd and *.qrd files in the same directory.
<li> In the default output format, the gene structure corresponding to each transcript is delimited by a line starting with '@', whereas each gene locus is delimited by a line starting with '!' [4]. Two transcripts belong to the same locus if their corresponding genomic regions overlap by at least one nucleotide on the same strand.
<li> With the -O0 option, the output follows the GFF3 specification (<a href='http://www.sequenceontology.org/gff3.shtml'>http://www.sequenceontology.org/gff3.shtml</a>), where a gene locus is defined as described above.
</ul>
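<p>
The locus definition given above (transcripts overlapping by at least one nucleotide on the same strand) can be sketched as a simple sweep over sorted intervals. This is an illustration only, not the code of <b>sortgrcd</b>: the tuple layout is hypothetical, coordinates are assumed to be 1-based and inclusive, and overlaps are treated transitively, so a chain of overlapping transcripts ends up in one locus.
</p>

```python
def group_into_loci(transcripts):
    """Group (chrom, strand, start, end) tuples into loci.

    With inclusive coordinates, two regions overlap when the next
    start is less than or equal to the current end.
    """
    loci = []
    current_key = None
    current_end = None
    for chrom, strand, start, end in sorted(transcripts):
        if (chrom, strand) == current_key and start <= current_end:
            # Same chromosome and strand, and at least one shared position.
            loci[-1].append((chrom, strand, start, end))
            current_end = max(current_end, end)
        else:
            loci.append([(chrom, strand, start, end)])
            current_key = (chrom, strand)
            current_end = end
    return loci


transcripts = [
    ("chr1", "+", 100, 200),
    ("chr1", "+", 200, 300),   # shares position 200 with the first transcript
    ("chr1", "+", 301, 400),   # starts one base after the previous end
    ("chr1", "-", 150, 250),   # opposite strand, so a separate locus
]
loci = group_into_loci(transcripts)
print([len(locus) for locus in loci])  # prints "[2, 1, 1]"
```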
</ol>
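<p>
The multi-CPU procedure described under Execution, commands (a) to (c), amounts to cutting the query entries into consecutive ranges and launching one <B>spaln</B> job per range. A minimal sketch of how such command lines could be generated is given below. The function name, the query file name and the batch size are hypothetical; the options (-Q7, -O12, -o, -d and the '<i>query</i> (from to)' form) are those shown in the examples above. The commands are only built and printed, not executed.
</p>

```python
import shlex


def batch_commands(query, n_entries, batch_size,
                   genome="xxxgnm", out_prefix="xxxO"):
    """Build one spaln argument list per consecutive range of query entries."""
    commands = []
    starts = range(1, n_entries + 1, batch_size)
    for index, first in enumerate(starts, start=1):
        last = min(first + batch_size - 1, n_entries)
        commands.append([
            "spaln", "-Q7", "-O12",
            f"-o{out_prefix}{index}",      # output prefix, one per batch
            f"-d{genome}",                 # formatted genome database
            f"{query} ({first} {last})",   # 'query (from to)' entry range
        ])
    return commands


commands = batch_commands("query.fa", n_entries=2500, batch_size=1000)
for argv in commands:
    # shlex.join quotes the 'query (from to)' argument for the shell.
    print(shlex.join(argv))
```

<p>
Each job then writes its own .grd, .erd and .qrd files, which can be combined afterwards with <B>sortgrcd</B> as described above.
</p>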
<p>
<h2><a name="Exl">Example</a></h2>
<ul>
<li> To experience the flow of procedures with the samples in <em><font color="#40A040">seqdb</font></em>, type in the following series of commands after moving to <em><font color="#40A040">seqdb</font></em>.<BR>
<TT>% make dictdisc.cf</TT><BR>
<TT>% make dictdisc.faa</TT><BR>
<TT>% make dictdisc_g.gf</TT><BR>
<TT>% ./makeidx.pl -inp dictdisc_g.gf</TT><BR>
<TT>% make dictdisc.srd</TT><BR>
<TT>% make dictdisc.spn</TT><BR>
</ul>
<p>
<h2><a name="Chg">Change from previous versions</a></h2>
Added/modified in Ver. 2.4.1 (2020-10-09):
<p>
<ol>
<li>The algorithm for delimiting a genic region has been modified to find remote terminal coding exon(s) separated by long (up to 99.6% quantile) intron(s) from the main body of the gene.
<li>The -yx0 option now tries to search for missing internal micro exons and terminal very short coding exons.
<li>Selenocysteine (denoted by U) is now regarded as the 21st amino acid, which favorably matches an in-frame TGA termination codon (U in the Tron code) upon DNA vs amino acid sequence alignment.
<li>Gene candidates are now sorted according to the final alignment score rather than the intermediate chained HSP score. This modification has improved the chance of obtaining true orthologous hits rather than paralogous hits at the expense of a slight increase in computational load.
<li>Compared with the previous versions, a larger number of species-specific parameter sets (247, up from 102) are provided to support more species (1495, up from 688). <b>Note</b> that some parameter-set identifiers have changed. Please use eight-character species identifiers (e.g. zea_mays) rather than the former parameter-set identifiers (e.g. Magnolio) as the argument of the -T option.
</ol>
<p>
Added/modified in Ver. 2.4.0 (2019-11-18):
<p>
<ol>
<li><b>Spaln</b> can now directly format genomic sequences without relying on 'make' command. See <a href="#Seq">Sequence data formation</a>.
<li>The internal format of index files is slightly modified. Although previously-formatted files can be used by the new version, the opposite is not true. Note that use of older files with the new version can lead to a slight loss in sensitivity.
<li>The above change has been done to facilitate multi-thread operation at the format time.
<li>Multiple output formats can be produced in a single run. See -O and -o options.
<li>The traditional bidirectional Hirschberg algorithm is changed to the unidirectional variant.
<li>Also, the bidirectional 'sandwich' or 'attack by both sides' spliced alignment algorithm has been changed to unidirectional 'skipped' spliced alignment algorithm. This and the preceding changes have considerably reduced code complexity.
<li>A local lookup table (xxxgnm.lun or xxxgnm.lup) is generated and used with the -E option. Use this option with caution, as a large amount of disk space is required to store the generated file, and a large amount of memory is required at runtime.</li>
<li>Many small bugs have been fixed.
</ol>
<hr>
<p>
Copyright (c) 2007-2020 Osamu Gotoh all rights reserved
<p>
</body>
</html>