<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.6: http://docutils.sourceforge.net/" />
<title>last-train</title>
<style type="text/css">

/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 5951 2009-05-18 18:03:10Z milde $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.

See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left, .figure.align-left{
  clear: left ;
  float: left ;
  margin-right: 1em }

img.align-right, .figure.align-right {
  clear: right ;
  float: right ;
  margin-left: 1em }

.align-left {
  text-align: left }

.align-center {
  clear: both ;
  text-align: center }

.align-right {
  text-align: right }

/* reset inner alignment in figures */
div.align-right {
  text-align: left }

/* div.align-center * { */
/*   text-align: left } */

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font: inherit }

pre.literal-block, pre.doctest-block {
  margin-left: 2em ;
  margin-right: 2em }

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
<style type="text/css">

/* Style sheet for LAST HTML documents */
h1 { color: navy }
h2 { color: teal }
div.document { margin-left: auto; margin-right: auto; max-width: 45em }
strong { color: red }
.option-list td { padding-bottom: 1em }
table.field-list { border: thin solid green }

</style>
</head>
<body>
<div class="document" id="last-train">
<h1 class="title">last-train</h1>

<p>last-train finds the rates (probabilities) of insertions, deletions,
and substitutions between two sets of sequences.  It thereby finds
suitable substitution and gap scores for aligning them.</p>
<p>It (probabilistically) aligns the sequences using some initial score
parameters, then estimates better score parameters based on the
alignments, and repeats this procedure until the parameters stop
changing.</p>
<p>The usage is like this:</p>
<pre class="literal-block">
lastdb mydb reference.fasta
last-train mydb queries.fasta
</pre>
<p>last-train prints a summary of each alignment step, followed by the
final score parameters, in a format that can be read by <a class="reference external" href="lastal.html#score-options">lastal's -p
option</a>.</p>
<p>last-train can read .gz files, or read the queries from a pipe:</p>
<pre class="literal-block">
bzcat queries.fasta.bz2 | last-train mydb
</pre>
<div class="section" id="options">
<h2>Options</h2>
<blockquote>
<table class="docutils option-list" frame="void" rules="none">
<col class="option" />
<col class="description" />
<tbody valign="top">
<tr><td class="option-group">
<kbd><span class="option">-h</span>, <span class="option">--help</span></kbd></td>
<td>Show a help message, with default option values, and exit.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-v</span>, <span class="option">--verbose</span></kbd></td>
<td>Show more details of intermediate steps.</td></tr>
</tbody>
</table>
</blockquote>
<div class="section" id="training-options">
<h3>Training options</h3>
<blockquote>
<table class="docutils option-list" frame="void" rules="none">
<col class="option" />
<col class="description" />
<tbody valign="top">
<tr><td class="option-group">
<kbd><span class="option">--revsym</span></kbd></td>
<td>Force the substitution scores to have reverse-complement
symmetry, e.g. score(A→G) = score(T→C).  This is often
appropriate, if neither strand is &quot;special&quot;.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--matsym</span></kbd></td>
<td>Force the substitution scores to have directional symmetry,
e.g. score(A→G) = score(G→A).</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--gapsym</span></kbd></td>
<td>Force the insertion costs to equal the deletion costs.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--pid=<var>PID</var></span></kbd></td>
<td>Ignore alignments with &gt; PID% identity.  This aims to
optimize the parameters for low-similarity alignments
(similarly to the BLOSUM matrices).</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--postmask=<var>NUMBER</var></span></kbd></td>
<td>By default, last-train ignores alignments of mostly-lowercase
sequence (by using <a class="reference external" href="last-postmask.html">last-postmask</a>).
To turn this off, do <tt class="docutils literal"><span class="pre">--postmask=0</span></tt>.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--sample-number=<var>N</var></span></kbd></td>
<td>Use N randomly-chosen chunks of the query sequences.  The
queries are chopped into fixed-length chunks (as if they were
first concatenated into one long sequence).  If there are ≤ N
chunks, all are picked.  Otherwise, if the final chunk is
shorter than the others, it is never picked.  N = 0 means use everything.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--sample-length=<var>L</var></span></kbd></td>
<td>Use randomly-chosen chunks of length L.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--scale=<var>S</var></span></kbd></td>
<td>Output scores in units of 1/S bits.  Traditional values
include 2 for half-bit scores and 3 for 1/3-bit scores.
(Note that 1/3-bit scores essentially equal Phred scores
a.k.a. decibans, because log10(2) ≈ 3/10.)  The default is to
infer a scale from the initial score parameters.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">--codon</span></kbd></td>
<td>Do training for DNA query sequences versus protein reference
sequences.  These options will be ignored: <tt class="docutils literal"><span class="pre">--revsym</span>
<span class="pre">--matsym</span> <span class="pre">--gapsym</span> <span class="pre">--pid</span> <span class="pre">--postmask</span> <span class="pre">-q</span> <span class="pre">-p</span> <span class="pre">-S</span></tt>.  If
<tt class="docutils literal"><span class="pre">--codon</span></tt> is used, the &quot;initial parameter options&quot; are
initial probabilities, not scores/costs.</td></tr>
</tbody>
</table>
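<p>As an illustration of the two substitution-score symmetries above,
here is a minimal Python sketch.  It is not part of LAST: the function
names and the toy score matrix are made up for this example.</p>
<pre class="literal-block">
# Illustrative only: these helper names are not part of LAST.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
LETTERS = "ACGT"


def has_revsym(score):
    """True if score[x][y] == score[comp(x)][comp(y)] for all x, y."""
    return all(score[x][y] == score[COMPLEMENT[x]][COMPLEMENT[y]]
               for x in LETTERS for y in LETTERS)


def has_matsym(score):
    """True if score[x][y] == score[y][x] for all x, y."""
    return all(score[x][y] == score[y][x]
               for x in LETTERS for y in LETTERS)


# A toy matrix (hypothetical values): toy[x][y] is the score for x to y.
toy = {
    "A": {"A": 5, "C": -4, "G": -1, "T": -4},
    "C": {"A": -4, "C": 6, "G": -4, "T": -2},
    "G": {"A": -2, "C": -4, "G": 6, "T": -4},
    "T": {"A": -4, "C": -1, "G": -4, "T": 5},
}

print(has_revsym(toy))
print(has_matsym(toy))
</pre>
<p>For this toy matrix the first check succeeds and the second fails,
because score(A→G) = score(T→C) but score(A→G) ≠ score(G→A).</p>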
</blockquote>
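<p>The chunk sampling described for <tt class="docutils literal"><span class="pre">--sample-number</span></tt> and
<tt class="docutils literal"><span class="pre">--sample-length</span></tt> can be sketched in Python as follows.  This only
illustrates the description above and is not last-train's actual
code: the function name, its arguments, and the use of Python's
<tt class="docutils literal">random</tt> module are assumptions.</p>
<pre class="literal-block">
import random


def sample_chunks(total_length, chunk_length, sample_number, seed=None):
    """Pick chunk ranges (start, end) from queries of total_length letters.

    The queries are treated as one long concatenated sequence.
    """
    starts = range(0, total_length, chunk_length)
    chunks = [(s, min(s + chunk_length, total_length)) for s in starts]
    number_to_pick = min(sample_number, len(chunks))
    if sample_number == 0 or number_to_pick == len(chunks):
        return chunks  # use everything, or no more than N chunks exist
    # More than N chunks: a shorter final chunk is never picked.
    full = [c for c in chunks if c[1] - c[0] == chunk_length]
    return sorted(random.Random(seed).sample(full, sample_number))


print(sample_chunks(2500, 1000, 5))
print(sample_chunks(2500, 1000, 2, seed=1))
</pre>
<p>With 2500 letters and chunks of length 1000 there are three chunks,
the last one shorter.  Asking for 5 chunks returns all three; asking
for 2 can only return the two full-length chunks.</p>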
<p>All options below this point are passed to lastal to do the
alignments: they are described in more detail at <a class="reference external" href="lastal.html">lastal.html</a>.</p>
</div>
<div class="section" id="initial-parameter-options">
<h3>Initial parameter options</h3>
<blockquote>
<table class="docutils option-list" frame="void" rules="none">
<col class="option" />
<col class="description" />
<tbody valign="top">
<tr><td class="option-group">
<kbd><span class="option">-r <var>SCORE</var></span></kbd></td>
<td>Initial match score.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-q <var>COST</var></span></kbd></td>
<td>Initial mismatch cost.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-p <var>NAME</var></span></kbd></td>
<td>Initial match/mismatch score matrix.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-a <var>COST</var></span></kbd></td>
<td>Initial gap existence cost.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-b <var>COST</var></span></kbd></td>
<td>Initial gap extension cost.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-A <var>COST</var></span></kbd></td>
<td>Initial insertion existence cost.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-B <var>COST</var></span></kbd></td>
<td>Initial insertion extension cost.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-F <var>LIST</var></span></kbd></td>
<td>Initial frameshift probabilities (only used with <tt class="docutils literal"><span class="pre">--codon</span></tt>).</td></tr>
</tbody>
</table>
</blockquote>
</div>
<div class="section" id="alignment-options">
<h3>Alignment options</h3>
<blockquote>
<table class="docutils option-list" frame="void" rules="none">
<col class="option" />
<col class="description" />
<tbody valign="top">
<tr><td class="option-group">
<kbd><span class="option">-D <var>LENGTH</var></span></kbd></td>
<td>Query letters per random alignment.  (See <a class="reference external" href="last-evalues.html">here</a>.)</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-E <var>EG2</var></span></kbd></td>
<td>Maximum expected alignments per square giga.  (See <a class="reference external" href="last-evalues.html">here</a>.)</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-s <var>NUMBER</var></span></kbd></td>
<td>Which query strand to use: 0=reverse, 1=forward, 2=both.
If specified, this parameter is written in last-train's
output, so it will override lastal's default.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-S <var>NUMBER</var></span></kbd></td>
<td><p class="first">Specify how to use the substitution score matrix for
reverse strands.  If you use <tt class="docutils literal"><span class="pre">--revsym</span></tt>, this makes no
difference.  &quot;0&quot; means that the matrix is used as-is for
all alignments.  &quot;1&quot; (the default) means that the matrix
is used as-is for alignments of query sequence forward
strands, and the complemented matrix is used for query
sequence reverse strands.</p>
<p class="last">This parameter is always written in last-train's output,
so it will override lastal's default.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-C <var>COUNT</var></span></kbd></td>
<td>Before extending gapped alignments, discard any gapless
alignment whose query range lies in COUNT other gapless
alignments with higher score-per-length.  This aims to
reduce run time.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-T <var>NUMBER</var></span></kbd></td>
<td>Type of alignment: 0=local, 1=overlap.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-m <var>COUNT</var></span></kbd></td>
<td>Maximum number of initial matches per query position.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-k <var>STEP</var></span></kbd></td>
<td>Look for initial matches starting only at every STEP-th
position in each query.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-P <var>COUNT</var></span></kbd></td>
<td>Number of parallel threads.</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-X <var>NUMBER</var></span></kbd></td>
<td><p class="first">How to score a match/mismatch involving N (for DNA) or X
(otherwise).  By default, the lowest match/mismatch score
is used. 0 means the default; 1 means treat reference
Ns/Xs as fully-ambiguous letters; 2 means treat query
Ns/Xs as ambiguous; 3 means treat reference and query
Ns/Xs as ambiguous.</p>
<p class="last">If specified, this parameter is written in last-train's
output, so it will override lastal's default.</p>
</td></tr>
<tr><td class="option-group">
<kbd><span class="option">-Q <var>NAME</var></span></kbd></td>
<td><p class="first">How to read the query sequences (the NAME is not
case-sensitive):</p>
<pre class="literal-block">
Default         fasta
&quot;0&quot;, &quot;fastx&quot;    fasta or fastq: discard per-base quality data
&quot;1&quot;, &quot;sanger&quot;   fastq-sanger
</pre>
<p>The <tt class="docutils literal">fastq</tt> formats are described here:
<a class="reference external" href="lastal.html">lastal.html</a>.  last-train assumes the per-base
quality codes indicate substitution error probabilities,
<em>not</em> insertion or deletion error probabilities.  If this
assumption is dubious (e.g. for data with many insertion
or deletion errors), it may be better to discard the
quality data.  For <tt class="docutils literal"><span class="pre">fastq-sanger</span></tt>, last-train finds the
rates of substitutions not explained by the quality data
(ideally, real substitutions as opposed to errors).</p>
<p class="last">If specified, this parameter is written in last-train's
output, so it will override lastal's default.</p>
</td></tr>
</tbody>
</table>
</blockquote>
</div>
</div>
<div class="section" id="details">
<h2>Details</h2>
<ul>
<li><p class="first">last-train (and lastal) uses &quot;Model A&quot;, in Figure 5A of <a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>.</p>
</li>
<li><p class="first">last-train (and lastal) converts between path and alignment
parameters as in Supplementary Section 3.1 of <a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>.</p>
</li>
<li><p class="first">last-train uses parameters with &quot;homogeneous letter probabilities&quot;
and &quot;balanced length probability&quot; (<a class="reference external" href="https://doi.org/10.1093/bioinformatics/btz576">btz576</a>).</p>
</li>
<li><p class="first">last-train rounds the scores to integers, which makes them slightly
inaccurate.  It then finds an adjusted scale factor (without
changing the scores), which makes the integer-rounded scores
correspond to homogeneous letter probabilities and balanced length
probability.  It writes this adjusted scale (in nats, not bits) as a
&quot;-t&quot; option for lastal, e.g. &quot;-t4.4363&quot;.</p>
</li>
<li><p class="first">In rare cases, it may be impossible to find such an adjusted scale
factor.  If that happens, last-train doubles the original scale (to
reduce the inaccuracy of integer rounding), until the problem goes
away.</p>
</li>
</ul>
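<p>The units mentioned for <tt class="docutils literal"><span class="pre">--scale</span></tt> and for the &quot;-t&quot; option can be
related by a short Python sketch.  It assumes (this is not stated
above) that a score in units of 1/S bits is S × log2 of a probability
ratio, so that the same score equals t × ln(ratio) with t = S / ln(2).
The function names are made up for illustration.</p>
<pre class="literal-block">
import math


def score_in_bit_units(ratio, scale):
    """Score in units of 1/scale bits (assumed: scale * log2(ratio))."""
    return scale * math.log2(ratio)


def nats_scale(scale):
    """Equivalent scale t such that score == t * ln(ratio)."""
    return scale / math.log(2)


def phred(ratio):
    """Phred-style (deciban) score: 10 * log10(ratio)."""
    return 10 * math.log10(ratio)


ratio = 4.0  # an arbitrary example probability ratio
print(score_in_bit_units(ratio, 3))  # score in 1/3-bit units
print(phred(ratio))                  # nearly the same number
print(round(nats_scale(3), 4))       # unadjusted t for a scale of 3
</pre>
<p>Under this assumption, a scale of 3 corresponds to an unadjusted t of
about 4.328.  The adjusted value that last-train writes would
generally differ somewhat from such an unadjusted value.</p>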
</div>
<div class="section" id="bugs">
<h2>Bugs</h2>
<ul>
<li><p class="first">last-train assumes that gap lengths roughly follow a geometric
distribution.  If they do not (which is often the case), the results
may be poor.</p>
</li>
<li><p class="first">last-train can fail for various reasons, e.g. if the sequences are
too dissimilar.  If it fails to find any alignments, you could try
reducing the alignment <a class="reference external" href="last-evalues.html">significance</a> threshold with option <tt class="docutils literal"><span class="pre">-D</span></tt>.</p>
</li>
</ul>
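<p>The geometric assumption can be made concrete with a small Python
sketch.  It assumes the usual affine gap cost, where a gap of length k
costs a + b × k, and assumes that a cost c corresponds to a
probability ratio of exp(-c / t).  The function name and the numeric
values are arbitrary choices for illustration.</p>
<pre class="literal-block">
import math


def gap_ratio(length, exist_cost, extend_cost, t):
    """Probability ratio implied by an affine gap cost (assumed model)."""
    return math.exp(-(exist_cost + extend_cost * length) / t)


# Example values, chosen arbitrarily for illustration.
a, b, t = 7, 1, 4.3281
ratios = [gap_ratio(k, a, b, t) for k in range(1, 6)]
steps = [ratios[i + 1] / ratios[i] for i in range(4)]
print(steps)  # every step equals exp(-b / t): a geometric decay
</pre>
<p>Each extra gap letter multiplies the ratio by the same factor, which
is what a geometric length distribution means.  Gap lengths with a
heavier tail cannot be matched by any choice of a and b.</p>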
</div>
</div>
</body>
</html>