File: last-tutorial.txt

package info (click to toggle)
last-align 1179-1
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 4,004 kB
  • sloc: cpp: 43,317; python: 3,352; ansic: 1,874; makefile: 495; sh: 305
file content (405 lines) | stat: -rw-r--r-- 15,554 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
LAST Cookbook
=============

LAST_ is used by running commands in a terminal / command line.  It
has many options: unfortunately, the LAST developers don't know the
best options for every possible alignment task.  Here are some
reasonable starting points.  Feel free to optimize (and share!) search
protocols.

A minimal example: compare human & fugu mitochondrial genomes
-------------------------------------------------------------

Let's find and align similar regions between the human and fugu
mitochondrial genomes.  These FASTA-format files are in LAST's
examples directory: humanMito.fa and fuguMito.fa.  The simplest
possible usage is::

  lastdb humdb humanMito.fa
  lastal humdb fuguMito.fa > myalns.maf

The lastdb_ command creates several files whose names begin with
"humdb".  The lastal_ command then compares fuguMito.fa to the humdb
files, and writes the alignments to a file called "myalns.maf".

Understanding the output
------------------------

The output has very long lines, so you need to view it without
line-wrapping.  For example, you can use::

  less -S myalns.maf

Each alignment looks like this (MAF_ format)::

  a score=27 EG2=4.7e+04 E=2.6e-05
  s humanMito 2170 145 + 16571 AGTAGGCCTAAAAGCAGCCACCAATTAAGAAAGCGTT...
  s fuguMito  1648 142 + 16447 AGTAGGCTTAGAAGCAGCCACCA--CAAGAAAGCGTT...

The score is a measure of how significant the similarity is.  EG2 and
E are explained at last-evalues_.  Lines starting with "s" contain:
the sequence name, the start coordinate of the alignment, the number
of bases spanned by the alignment, the strand, the sequence length,
and the aligned bases.

The start coordinates are zero-based.  This means that, if the
alignment begins right at the start of a sequence, the coordinate is
0.  If the strand is "-", the start coordinate is the coordinate in
the reverse-complemented sequence (the same as if you were to
reverse-complement the sequence before giving it to LAST).

You can convert MAF to other formats with maf-convert_, or use lastal_
option ``-f`` to get a few other formats.

More accurate: learn substitution & gap rates
---------------------------------------------

We can get more accurate alignments between the human and fugu
mitochondrial genomes like this::

  lastdb humdb humanMito.fa
  last-train humdb fuguMito.fa > hufu.train
  lastal -p hufu.train humdb fuguMito.fa > myalns.maf

The last-train_ command finds the rates of deletion, insertion, and
each kind of substitution between these sequences, and writes them to
a file "hufu.train".  lastal's ``-p`` option uses this file to get
more-accurate alignments.

Comparing protein sequences
---------------------------

We can compare some query proteins to some reference proteins like
this::

  lastdb -p -c -R01 refdb ref-prots.fa
  lastal refdb query-prots.fa > prot-alns.maf

``-p`` tells it the sequences are proteins.  (If you forget ``-p`` and
the sequences look proteinaceous, you'll get a warning message.)

The other options suppress alignments caused by simple sequence such
as ``APPSPAPPSPAPPSPAPPSPAP``:

* ``-R01`` converts the sequence letters to uppercase, then finds
  simple regions and converts them to lowercase.  This will be done
  for both ref-prots and query-prots.

* ``-c`` omits alignments that lack a significant amount of
  uppercase-to-uppercase alignment.

You can also use last-train_, but we've hardly tested it for
protein-protein alignment, so we're not sure if it helps.

Find high-similarity, and short, protein alignments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If we just want high-similarity alignments, we can use the PAM30 (or
PAM10) `scoring scheme`_::

  lastdb -p -c -R01 refdb ref-prots.fa
  lastal -p PAM30 refdb query-prots.fa > prot-alns.maf

This has two advantages:

* It omits low-similarity alignments, or alignment parts.

* It can find short similarities, which would be deemed insignificant
  (likely to occur by chance between random sequences) unless we focus
  the search on high-similarity.

Comparing DNA to proteins
-------------------------

We can find related regions of DNA and proteins, allowing for nonsense
mutations and frameshifts.  For example, let's find DNA regions
related to transposon proteins::

  lastdb -q -c -R01 trandb transposon-prots.fa
  last-train --codon trandb dna.fa > codon.train
  lastal -p codon.train -m100 -D1e9 -K1 trandb dna.fa > out.maf

``-q`` appends ``*`` meaning STOP to each protein, and treats ``*`` as
a 21st protein letter.

``--codon`` makes it do DNA-versus-protein.  Here, last-train_ tries
to learn 21x64 substitution rates, so it needs a fairly large amount
of data (e.g. a chromosome).

``-m100`` makes it more slow-and-sensitive than the default (which is
``-m10``), see lastal_.

``-D1e9`` sets a strict significance_ threshold.  It means: only
report strong similarities that would be expected to occur by chance,
between random sequences, at most once per 10^9 base-pairs.  The
default is 1e6.

``-K1`` streamlines the output by omitting any alignment whose DNA
range lies in that of a higher-scoring alignment.

Another possibility is to add last-train_ option ``-X1``, which treats
matches to ``X`` (unknown) amino acids as neutral instead of
disfavored.

You can reuse ``last-train`` output for different alignment tasks, if
you expect the rates to be roughly the same.

Aligning high-indel-error long DNA reads to a genome
----------------------------------------------------

Suppose we have DNA reads in either FASTA or FASTQ_ format.  This is
sensitive but slow::

  lastdb -P8 -uNEAR -R01 mydb genome.fa
  last-train -P8 -Q0 mydb reads.fastq > reads.train
  lastal -P8 -p reads.train mydb reads.fastq | last-split > out.maf

``-P8`` uses 8 parallel threads, adjust as appropriate for your
computer.  This has no effect on the results.

``-uNEAR`` selects a `seeding scheme`_ that's better at finding
alignments with few substitutions and/or many gaps.

``-Q0`` makes it discard the fastq_ quality information (or you can
keep-but-ignore it with ``-Qkeep``).

last-split_ finds a unique best alignment for each part of each read.

Here we used ``-R01`` to lowercase simple sequence like
``cacacacacacacacacacacaca``.  But we didn't suppress it with ``-c``,
so as not to hide anything from last-split_.  If desired, you can
filter lowercase with last-postmask_.

You can go faster by sacrificing a bit of sensitivity.  It depends on
your aim, e.g. slow-and-sensitive seems necessary to find intricate
rearrangements of repeats.  Suggested ways to go faster:

* `Mask repeats`_.  This has often worked well.

* Add lastal option ``-k8`` (or ``-k16`` etc).  This makes it faster,
  by only finding initial matches starting at every 8th (or 16th etc)
  position in the reads.

* Replace ``-uNEAR`` with ``-uRY32`` (or ``-uRY16``, ``-uRY8``,
  ``-uRY4``).  This makes it check for initial matches starting at
  only ~1/32 (or ~1/16 etc) of positions, in both reads and genome.
  Compared to ``-k``: this harms sensitivity slightly more, but
  reduces memory use and makes lastdb faster.

Which genome version to use?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some genome versions (e.g. for human) have artificial
exactly-duplicated regions, which makes it hard to align reads
uniquely.  To avoid that, look for a genome version called something
like "analysis set".

Aligning low-error long DNA reads to a genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can do this the same way as for high-error reads, but perhaps
accelerate more aggressively (e.g. ``-uRY32``).

If repeats are not masked, lastal_ option ``-C2`` may reduce run time
with little effect on accuracy.

Aligning potentially-spliced RNA or cDNA long reads to a genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See here_.  (For low-error reads, you can probably omit ``-d90`` and
``-m20``.)

Aligning Illumina DNA reads to a genome
---------------------------------------

::

  lastdb -P8 -uNEAR -R01 -C2 mydb genome.fasta
  last-train -P8 -Q1 mydb reads.fastq.gz > reads.train
  lastal -P8 -p reads.train mydb reads.fastq.gz | last-split | gzip > out.maf.gz

Most LAST commands accept ``.gz`` compressed files, and you can
compress output with ``gzip`` as above.

lastdb_ option ``-C2`` makes the alignment a bit faster, but uses more
memory.  This has no effect on the results.  (You could use it in the
other examples too, but it might not be faster.)

``-Q1`` makes it use the fastq_ quality information to improve the
training and alignment.  LAST **assumes** that the qualities reflect
substitution errors, not insertion/deletion errors.  (For long
non-Illumina reads, we suspect this assumption doesn't hold, so we
didn't use this option.)

This recipe may be excessively slow-and-sensitive.  Adding lastal_
option ``-C2`` may make it faster with negligible accuracy loss.  You
can accelerate with e.g. ``-uRY16`` or ``-k16`` as above.

Finding very short DNA alignments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, LAST only reports significant_ alignments that will rarely
occur by chance.  In the preceding example, the minimum alignment
length is about 26 bases for a human-size genome (less for smaller
genomes).  To find shorter alignments, add lastal_ option ``-D100``
(say), to get alignments that could occur by chance once per hundred
query letters (the default is once per million.)  This makes the
minimum alignment length about 20 bases for a human-size genome.

Aligning paired-end Illumina DNA reads to a genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You could use last-pair-probs_.  It has a disadvantage: it doesn't
allow different parts of one read (i.e. one "end") to align to
different parts of the genome.  Alternatively, you could align the
reads individually, ignoring the pair relationships::

  fastq-interleave reads1.fq reads2.fq | lastal -P8 -p reads.train mydb | last-split > out.maf

``fastq-interleave`` ensures that each read has a unique name (by
appending "/1" and "/2" if necessary).

Aligning potentially-spliced Illumina reads to a genome
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See last-split_ (and last-pair-probs_).

Aligning human & chimp genomes
------------------------------

This is very slow-and-sensitive::

  lastdb -P8 -uNEAR -R01 humdb human_no_alt_analysis_set.fa
  last-train -P8 --revsym -E0.05 -C2 humdb chimp.fa > humchi.train
  lastal -E0.05 -C2 -p humchi.train humdb chimp.fa | last-split > humchi1.maf

``--revsym`` makes the substitution rates the same on both strands.
For example, it makes A→G equal T→C (because A→G on one strand means
T→C on the other strand).  This is usually appropriate for
genome-genome comparison (but maybe not for mitochondria which have
asymmetric "heavy" and "light" strands).

``-E0.05`` means only get significant_ alignments that would be
expected to occur by chance at a rate ≤ 0.05 times per pair of random
sequences of length 1 billion each.

The result so far is asymmetric: each part of the chimp genome is
aligned to at most one part of the human genome, but not vice-versa.
We can get one-to-one alignments like this::

  maf-swap humchi1.maf | last-split > humchi2.maf

Then we can discard less-confident alignments, and convert_ to a
compact tabular format::

  last-postmask humchi2.maf | maf-convert -n tab | awk -F= '$2 <= 1e-5' > humchi.tab

last-postmask_ discards alignments caused by simple sequence.  The
``awk`` command gets alignments with `mismap probability`_ ≤ 10^-5.
Finally, we can make a dotplot_::

  last-dotplot humchi.tab humchi.png

To go faster, with minor accuracy loss: replace ``-uNEAR`` with
``-uRY32`` and/or `mask repeats`_.

To squeeze out the last 0.000...1% of accuracy: add ``-m50`` to the
lastal_ options.

Aligning human & mouse genomes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can do this in the same way as human/chimp, except that ``-uNEAR``
should be omitted.  To increase sensitivity, but also time and memory
use, add lastdb seeding_ option ``-uMAM4`` or or ``-uMAM8``.  To
increase them even more, add lastal_ option ``-m100`` (or as high as
you can bear).

Large reference sequences
-------------------------

If the sequences that you give to lastdb exceed ~4 billion letters,
consider using 8-byte LAST (lastdb8_ and lastal8_).  Ordinary (4-byte)
LAST can't handle so much sequence at once, so lastdb_ splits it into
"volumes", which may be inefficient.  8-byte LAST avoids voluming, but
uses more memory.  So lastdb8_ works well with a memory-reducing
option: ``-uRY`` or ``-w`` or ``-W``.

Moar faster
-----------

* `Using multiple CPUs / cores <last-parallel.html>`_
* `Various speed & memory options <last-tuning.html>`_

Ambiguity of alignment columns
------------------------------

Consider this alignment::

  TGAAGTTAAAGGTATATGAATTCCAATTCTTAACCCCCCTATTAAACGAATATCTTG
  |||||||| ||||||  |  ||  | |  |    || ||||||   |||||||||||
  TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG

The middle section has such weak similarity that its precise alignment
cannot be confidently inferred.  We can see the confidence of each
alignment column with lastal_ option ``-j4``::

  lastal -j4 -p hufu.train humdb fuguMito.fa > myalns.maf

The output looks like this::

  a score=17 EG2=9.3e+09 E=5e-06
  s seqX 0 57 + 57 TGAAGTTAAAGGTATATGAATTCCAATTCTTAACCCCCCTATTAAACGAATATCTTG
  s seqY 0 51 + 51 TGAAGTTAGAGGTAT--GGTTTTGAGTAGT----CCTCCTATTTTTCGAATATCTTG
  p                %*.14442011.(%##"%$$$$###""!!!""""&'(*,340.,,.~~~~~~~~~~~

The "p" line indicates the probability that each column is wrongly
aligned, using a compact code (the same as fastq_ format):

  ======  =================   ======  =================
  Symbol  Error probability   Symbol  Error probability
  ------  -----------------   ------  -----------------
  ``!``   0.79 -- 1           ``0``   0.025 -- 0.032
  ``"``   0.63 -- 0.79        ``1``   0.02  -- 0.025
  ``#``   0.5  -- 0.63        ``2``   0.016 -- 0.02
  ``$``   0.4  -- 0.5         ``3``   0.013 -- 0.016
  ``%``   0.32 -- 0.4         ``4``   0.01  -- 0.013
  ``&``   0.25 -- 0.32        ``5``   0.0079 -- 0.01
  ``'``   0.2  -- 0.25        ``6``   0.0063 -- 0.0079
  ``(``   0.16 -- 0.2         ``7``   0.005  -- 0.0063
  ``)``   0.13 -- 0.16        ``8``   0.004  -- 0.005
  ``*``   0.1  -- 0.13        ``9``   0.0032 -- 0.004
  ``+``   0.079 -- 0.1        ``:``   0.0025 -- 0.0032
  ``,``   0.063 -- 0.079      ``;``   0.002  -- 0.0025
  ``-``   0.05  -- 0.063      ``<``   0.0016 -- 0.002
  ``.``   0.04  -- 0.05       ``=``   0.0013 -- 0.0016
  ``/``   0.032 -- 0.04       ``>``   0.001  -- 0.0013
  ======  =================   ======  =================

Note that each alignment is grown from a "core" region, and the
ambiguity estimates assume that the core is correctly aligned.  The
core is indicated by "~" symbols, and it contains exact matches only.

.. _last: last.html
.. _lastdb8:
.. _lastdb: lastdb.html
.. _lastal8:
.. _lastal: lastal.html
.. _dotplot: last-dotplot.html
.. _last-pair-probs: last-pair-probs.html
.. _last-postmask: last-postmask.html
.. _mismap probability:
.. _last-split: last-split.html
.. _last-train: last-train.html
.. _convert:
.. _maf-convert: maf-convert.html
.. _scoring scheme: last-matrices.html
.. _seeding scheme:
.. _seeding: last-seeds.html
.. _last-evalues:
.. _significant:
.. _significance: last-evalues.html
.. _fastq: https://doi.org/10.1093/nar/gkp1137
.. _here:
.. _mask repeats: https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md
.. _MAF: http://genome.ucsc.edu/FAQ/FAQformat.html#format5