Running LAST in parallel
========================

You can make LAST faster by running it on multiple CPU cores.  The
easiest way is with lastal's -P option::

  lastal -P4 my-index queries.fasta > out.maf

This will use 4 parallel threads.  If you specify -P0, it will use as
many threads as your computer claims it can handle simultaneously.
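
To get a rough idea of how many threads -P0 might use, you can ask the
operating system how many processors are online.  This is only a
sketch: the helper name ``cpu_count`` is made up for illustration, and
lastal may determine the number in a different way.

```shell
# Hypothetical helper: print the number of online processors.
cpu_count() {
    # nproc comes with GNU coreutils; otherwise fall back to getconf,
    # which is widely available (e.g. on Linux and macOS).
    if command -v nproc >/dev/null 2>&1; then
        nproc
    else
        getconf _NPROCESSORS_ONLN
    fi
}

cpu_count
```

If the number is larger than you want LAST to use, give -P an explicit
thread count instead of 0.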

This works by aligning different query sequences in different threads,
so if you have only one query you won't get any parallelization!
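
A quick way to check how many query sequences a FASTA file holds is to
count its header lines (those starting with ">").  This is a minimal
sketch: the function name ``count_fasta_records`` is an assumption for
illustration, not part of LAST.

```shell
# Hypothetical helper: count FASTA records in the file given as $1.
count_fasta_records() {
    # Each record begins with a header line starting with ">".
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c '^>' "$1"
}
```

For example, ``count_fasta_records queries.fasta`` prints the number of
queries.  If it prints 1, the -P option alone is unlikely to help.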

Dealing with very long query sequences
--------------------------------------

lastal aligns one "batch" of queries at a time, so if the batch has
only one query (for example, because the query is very long) you won't
get any parallelization.  This can be fixed by increasing the batch
size, with option -i::

  lastal -P4 -i3G my-index queries.fasta > out.maf

This specifies a batch size of 3 gibibytes.  The downside is that
more memory is needed to hold the batch and its alignments.

Dealing with pipelines
----------------------

If you have a multi-command "pipeline", such as::

  lastal -P4 my-index queries.fasta | last-split > out.maf

then the -P option may help, because lastal is often the slowest step,
but it would be nice to parallelize the whole thing.  Unfortunately,
last-split doesn't have a -P option, and even if it did, the pipe
between the commands would become a bottleneck.

You can use parallel-fasta and parallel-fastq (which accompany LAST,
but require `GNU parallel <http://www.gnu.org/software/parallel/>`_ to
be installed).  These commands read sequence data, split it into
blocks (with a whole number of sequences per block), and run the
blocks in parallel through any command or pipeline you specify, using
all your CPU cores.  Here are some examples.

Instead of this::

  lastal mydb queries.fa > myalns.maf

try this::

  parallel-fasta "lastal mydb" < queries.fa > myalns.maf

Instead of this::

  lastal -Q1 -D100 db q.fastq | last-split > out.maf

try this::

  parallel-fastq "lastal -Q1 -D100 db | last-split" < q.fastq > out.maf

Instead of this::

  zcat queries.fa.gz | lastal mydb > myalns.maf

try this::

  zcat queries.fa.gz | parallel-fasta "lastal mydb" > myalns.maf

Notes:

* parallel-fasta and parallel-fastq simply execute GNU parallel with a
  few options for fasta or fastq: you can specify other GNU parallel
  options to control the number of simultaneous jobs (-j), use remote
  computers (-S), get the output in the same order as the input (-k),
  etc.

* parallel-fastq assumes that each fastq record is 4 lines, so there
  should be no line wrapping or blank lines.
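
The 4-line assumption can be checked before running parallel-fastq.
The sketch below is not part of LAST, and the function name
``check_fastq_4line`` is made up for illustration.  It tests only the
basic layout: no blank lines, a total line count that is a multiple
of 4, "@" at the start of every first line, and "+" at the start of
every third line.

```shell
# Hypothetical helper: succeed (exit status 0) only if the file given
# as $1 looks like strict 4-line fastq.
check_fastq_4line() {
    awk '
        length($0) == 0                         { bad = 1 }  # blank line
        NR % 4 == 1 && substr($0, 1, 1) != "@"  { bad = 1 }  # header line
        NR % 4 == 3 && substr($0, 1, 1) != "+"  { bad = 1 }  # separator line
        END { if (bad || NR % 4 != 0) exit 1 }
    ' "$1"
}
```

For example, ``check_fastq_4line q.fastq && echo ok`` prints "ok" only
if the file passes these checks.  A file with wrapped sequence lines
will normally fail, because the "@" and "+" lines no longer fall in the
expected positions.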