File: Running.html

package info (click to toggle)
shasta 0.14.0-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 29,636 kB
  • sloc: cpp: 82,262; python: 2,348; makefile: 222; sh: 143
file content (653 lines) | stat: -rw-r--r-- 21,580 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
<!DOCTYPE html>
<html>

<head>
<link rel=stylesheet href=style.css />
<link rel=icon href=CZI-new-logo.png />
</head>

<body>
<main>
<div class="goto-index"><a href="index.html">Table of contents</a></div>
<h1>Running an assembly</h1>

<p>
Refer the <a href="QuickStart.html">quick start</a> guide to download an executable or
<a href="BuildingFromSource.html">build</a> one from the source code.

<h2 id=Configuration>Selecting assembly options</h2>
<p>
You can invoke the Shasta executable without options or with
<code>--help</code> to get a description of the options.
A list of available options can also be found 
<a href=CommandLineOptions.html>here</a>.

<p>
The only mandatory option is <code>--input</code> 
which must be used to specify the input FASTA or FASTQ files
containing reads to be used for the assembly.
If there is more than one file, the names
should be specified separated by white space,
entering <code>--input</code> only once, like this:

<pre>
--input a.fasta b.fasta c.fasta
</pre>

<p>
The Shasta assembly process is controlled by many
parameters whose values can be specified by 
using <a href=CommandLineOptions.html>command line options</a> 
or a <a href=Configurations.html#ConfigFile>configuration file</a>.
Configuration files for some common situations are
provided in <code>shasta/conf</code> 
or <code>shasta-install/conf</code>
(download a <code>tar</code> file from a Shasta release 
if you don't have access to these directories). 
Creating options for a new type of data or genome can require
some work, experimentation, and some knowledge of Shasta 
computational methods. 
If you are unable to get a satisfactory assembly for
your data, feel free to 
<a href="https://github.com/paoloshasta/shasta/issues">file an issue</a> 
in the Shasta GitHub repository and we will help.
You can also use an experimental script, 
<code>GenerateConfig.py</code>,
that will help you create a starting configuration file. 
This script is available in 
<code>shasta/scripts</code> 
or <code>shasta-install/bin</code>
(download a <code>tar</code> file from a Shasta release 
if you don't have access to these directories).

<p>
There is also another experimental script,
<code>GenerateFeedback.py</code>,
that can be used after an assembly is complete
to assess some (limited) aspects of assembly quality.
If you decide to file an issue on the Shasta GitHub repository,
including the output of this script would be helpful.



<h2 id=MemoryRequirements>Memory requirements</h2>

<p><b><i>
Note that in this section "performance" refers to assembly time only.
</i></b>

<p>
For best performance, the Shasta assembler uses a single large
machine rather than a cluster of smaller machines, 
and operates in memory,
with no access to data on disk except during
initial input of the reads, the final output of the assembly,
and for small output files containing summary assembly information.
As a result, for optimal performance, Shasta memory requirements
are higher than
comparable tools that keep some or most of their data on disk.

<p>
Memory requirements for optimal performance are roughly proportional
to genome size and coverage and 
are around 4 to 6 bytes per input base.
This only counts input bases that are used in the assembly -
that is, excluding reads that were discarded
because they were too short or for other reasons.
For a human-size genome (&#8776;3 Gb) at coverage 60x,
this works out to around 
1 TB of memory. 

<p>
Machines with 1 to 4 TB of memory have become available
at reasonable prices in the last few years, and they are widely available
on cloud computing platforms. 
For example, for human assemblies using Shasta, we have routinely
been using AWS <code>x1.32xlarge</code> instances with
1952 TB of memory and 128 virtual processors 
(64 cores with hyperthreading). 
These machines are available at around $13/hour as on demand
instances and they complete a typical human assembly at coverage
60x in about 4 hours, at a compute cost of around $50 per assembly,
much lower than current sequencing costs to produce the input reads. 
Much lower prices (2x to 3x lower) are also available for
reserved instances and on the AWS spot market,
which means that a production facility could achieve
a compute cost of around $20 per genome.

<h2 id=LowMemory>Running with less than optimal memory</h2>

<p><b><i>
Note that in this section "performance" refers to assembly time only.
The memory options discussed here don't affect assembly results in any way.
</i></b>

<p>
Shasta also supports a mode of operation
with data structures physically on disk
or other storage systems,
mapped to virtual memory.
In this mode of operation, the operating system
dynamically moves pages to disk from memory and back as needed,
which can result in a huge performance penalty.
For this reason, this mode of operation is not suggested,
except in the following conditions:

<ul>
<li>For very small genomes.
<li>If the amount of memory available is not much smaller
than the amount required for optimal performance.
<li>If your machine has 
<a href="https://en.wikipedia.org/wiki/Solid-state_drive">SSD</a>
or, better, 
<a href="https://en.wikipedia.org/wiki/NVM_Express">NVMe</a> storage.
<a href="https://en.wikipedia.org/wiki/Hard_disk_drive">HDD storage (hard disk)</a>
is not suitable because of its high latency for random access.
</ul>

<p>
Under these conditions, this mode of operation can be
selected using the following Shasta command line options:
<br>
<code>
--memoryMode filesystem --memoryBacking disk
</code>

<p>
See 
<a href=#MemoryModes>the next section</a>
for more information on these options.
In addition to specifying these options, make sure the assembly directory
(command line option <code>--assemblyDirectory</code>)
is on a filesystem backed by the storage type 
(SSD or NVMe) you want to use. 
If you are using a local or departmental machine, likely that a filesystem is already mounted on that storage.
If you are using an AWS instance, you can use the directions
<a href='https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html'>here</a>
and 
<a href='https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html'>here</a>
to make your SSD or NVMe storage accessible.

<p>
To illustrate the effect of using less than optimal memory,
the table below summarizes benchmark results for a
human assembly at coverage 45x. 
This assembly requires about 650 GB 
to run in memory - see the first row in the table.
The remaining rows show alternative scenarios
on machines with less memory. 
These data show that Shasta still runs in reasonable
amounts of time with less than optimal memory,
as long as NVMe or SSD storage is available.
Note the competitive assembly time and AWS cost of the ARM Graviton2 
assembly.

Some notes on the table:
<ul>
<li>The benchmarks were run on AWS EC2 - details are provided
in the right section of the table.
<li>All assemblies ran with the default number of threads,
equal to the number of virtual processors.
<li>Costs are shown as on demand AWS prices in the AWS us-west-2
region (Oregon).
<li>For Intel
machines, there is one core for each two virtual processors,
while for ARM processors
there is one core for every virtual processor.
<li>A Shasta executable for ARM is currently available starting with Shasta Release 0.7.0.
</ul>

<table>

<tr>
<th rowspan=2>Architecture
<th rowspan=2>Virtual<br>processors
<th rowspan=2>Memory<br>
(<a href="https://en.wikipedia.org/wiki/Gibibyte">GiB</a>)
<th rowspan=2>Data location
<th rowspan=2>Elapsed<br>Time<br>(h)
<th colspan=4>AWS benchmark information

<tr>
<th>Instance<br>type
<th>Data location
<th>Hourly<br>cost ($)
<th>Assembly<br>cost ($)

<tr>
<td>Intel Skylake-SP
<td class=centered>96
<td class=centered>768
<td>Memory (2 MB pages)
<td class=centered>2.5
<td>r5.24xlarge
<td>Memory (2 MB pages)
<td class=centered>6.05
<td class=centered>15.12

<tr>
<td>Intel Skylake-SP
<td class=centered>96
<td class=centered>384
<td>NVMe
<td class=centered>4.5
<td>m5d.24xlarge
<td>Local NVMe volume
<td class=centered>5.42
<td class=centered>24.51

<tr>
<td>Intel Skylake-SP
<td class=centered>96
<td class=centered>384
<td>SSD
<td class=centered>8.0
<td>m5.24xlarge
<td>EBS gp2 volume
<td class=centered>4.61
<td class=centered>36.82

<tr>
<td>Intel Skylake-SP
<td class=centered>96
<td class=centered>384
<td>Disk
<td class=centered>Too slow
<td>m5.24xlarge
<td>EBS st1 volume
<td class=centered>
<td class=centered>

<tr>
<td>Intel Skylake-SP
<td class=centered>48
<td class=centered>192
<td>NVMe
<td class=centered>6.6
<td>m5d.12xlarge
<td>Local NVMe volume
<td class=centered>2.71
<td class=centered>18.00

<tr>
<td>ARM Graviton2
<td class=centered>48
<td class=centered>384
<td>NVMe
<td class=centered>4.1
<td>r6gd.12xlarge
<td>Local NVMe volume
<td class=centered>2.76
<td class=centered>11.39

</table>



<h2 id=MemoryModes>Memory modes</h2>

<p><b><i>
Note that in this section "performance" refers to assembly time only.
The memory options described here don't affect assembly results in any way.
</i></b>


<p>
For performance, the Shasta executable operates in memory,
with no access to data on disk except during
initial input of the reads, final output of the assembly,
and for small output files containing summary assembly information.
All large memory areas are allocated using <code>mmap</code> 
calls in one of several different modes of operation
described below. 
The choice of the optimal mode of operation is dependent
on many factors and decribed below.
The default mode of operation works reasonably well in most cases
and does not require root privilege. 
However, it does not deliver the best possible performance.

<p>
The memory modes are controlled by two command line options:

<ul>
<li><code>--memoryMode</code> controls whether <code>mmap</code>
allocates anonymous memory or memory mapped to a filesystem.
It can take one of the following values:
<ul>
<li><code>anonymous</code> (the default value)
<li><code>filesystem</code>
</ul> 

<li><code>--memoryBacking</code> specifies the physical backing
of pages allocated via <code>mmap</code> and can take
one of the following values:
<ul>
<li><code>disk</code>: <code>mmap</code> uses standard 4 KB pages
mapped to the existing filesystem on disk in the current directory.
<li><code>4K</code> (the default value): <code>mmap</code> uses standard 4 KB pages
(anonymous or on a <code>tmpfs</code> filesystem, depending
on the setting of <code>--memoryBacking</code>).
<li><code>2M</code>: <code>mmap</code> uses large 2 MB pages
(anonymous or on a <code>hugetlbfs</code> filesystem, depending
on the setting of <code>--memoryBacking</code>).
The 2MB pages are often referred to as "huge pages".
</ul>
</ul>


<p>
There are a total six possible combinations of these two options, summarized
in the table below.



<table>

<tr>
<td colspan=2 rowspan=2>
<td colspan=2 class=centered><code>--memoryMode</code>

<tr>
<td class=centered><code>anonymous</code><br>(default)
<td class=centered><code>filesystem</code>

<tr>
<td rowspan=3><code>--memoryBacking</code>
<td class=centered><code>disk</code>
<td class="centered error">Not allowed
<td class="success">
Memory allocated by <code>mmap</code> uses 4 KB pages
on a the filesystem on disk that
the run output directory is in.
<b>This mode of operation can incur severe performance degradation
and is therefore not generally suggested</b>.
After the run terminates, binary data are permanently available 
on disk
and you can use the http server or
the Python API to inspect assembly results.

<tr>
<td class=centered><code>4K</code><br>(default)
<td class="success">
The default option.
Memory allocated by <code>mmap</code> uses anonymous 4 KB pages.
After the run terminates, binary data are destroyed,
which means you cannot use the http server or
the Python API to inspect assembly results.
<b>Performance is less than optimal</b> (typically 30% degradation 
on a large run).
<td class="warning">
Memory allocated by <code>mmap</code> uses 4 KB pages
on a <code>tmpfs</code> filesystem which is created
and mounted on the <code>Data</code> directory
under the run output directory.
After the run terminates, binary data are available 
(until the next reboot)
and you can use the http server or
the Python API to inspect assembly results.
<b>Performance is less than optimal.</b>
When done using the binary data, you can free
the  memory using the following command:
<code>shasta --command cleanupBinaryData</code>.

<tr>
<td class=centered><code>2M</code>
<td class="warning">
Memory allocated by <code>mmap</code> uses anonymous 2 MB pages.
After the run terminates, binary data are destroyed,
which means you cannot use the http server or
the Python API to inspect assembly results.
<b>Performance is less than optimal</b> (typically 30% degradation 
on a large run).
<td class="warning">
Memory allocated by <code>mmap</code> uses 2 MB pages
on a <code>hugetlbfs</code> filesystem which is created
and mounted on the <code>Data</code> directory
under the run output directory.
After the run terminates, binary data are available 
(until the next reboot)
and you can use the http server or
the Python API to inspect assembly results.
<b>This mode of operation delivers optimal performance.</b>
When done using the binary data, you can free
the  memory using the following command:
<code>shasta --command cleanupBinaryData</code>.

</table>

<div class="table-legend">
The table is color coded with the following meaning:

<div>
        <div class="color-box error"></div>
This combination is not allowed.
</div>
<div>
    <div class="color-box warning"></div>
This combination requires root privilege to be acquired
via <code>sudo</code>.
Depending on <code>sudo</code> settings, this
may fail or ask for a user password.
</div>
<div>
        <div class="color-box success"></div>
This combination is allowed and does not require
root privilege.
</div>
</div>


<p>
In summary:
<ul>
<li>
<b>For large assemblies</b> it is best to 
make sure to have root privilege and use
<code>--memoryMode filesystem --memoryBacking 2M</code>.
Remember to use <code>shasta --command cleanupBinaryData</code>
to free up the memory when done using the binary data!

<li>
<b>For small assemblies for which performance is not important</b> 
use the default mode
<code>--memoryMode anonymous --memoryBacking 4K</code>.
However, if access to binary data is required after the assembly completes
to inspect assembly results using the http server or the Python API, use
<code>--memoryMode filesystem --memoryBacking disk</code>.
</ul>


<p>
See 
<a href=ChooseMemoryOptions.html>here</a> for interactive help to
select these options.


<h2 id=ScriptedApproaches>Scripted approaches to running an assembly</h2>
<p>
The Shasta assembler provides a
<a href=Python.html>Python API</a> that can be used for scripting.
This makes it possible to write Python scripts to run assemblies
that have more flexibility or functionality than allowed by
the Shasta executable. 
For example, these scripts allow rerunning only a portion 
of an assembly, which can be useful for development
of new assembly functionality.

<p>
These scripts use a Python <code>import</code> to import the Shasta
shared library, <code>shasta.so</code>, which provides
Python bindings to Shasta functionality. 
The scripts and the library are in <code>shasta-install/bin</code>,
but they will normally not be needed during basic operation of 
the Shasta assembler. They are more likely to be needed
for debugging, testing, or development.

<p>
The Python API also allows you to write your own assembly scripts.
For such a script to work, and under the assumption that the
script is not located at <code>shasta-install/bin</code>,
it will be necessary to set environment variable
<code>PYTHONPATH</code> to <code>(actual path)/shasta-install/bin</code>,
so the Python interpreter can locate the Shasta shared library 
during <code>import</code>.



<h2 id=InputFiles>Input files</h2>
<p>
The Shasta
assembler uses as input one or more 
files containing reads in one of the supported formats listed below.
Shasta uses the file extension to deduce the file format, as described below.

<ul>
<li>
<a href="https://en.wikipedia.org/wiki/FASTA">FASTA</a>:
input files in this format must be named with a file extension
<code>.fasta</code>,
<code>.fa</code>,
<code>.FASTA</code>, or
<code>.FA</code>. 
Older versions of Shasta had strict restrictions on Fasta files,
but these restrictions were eliminated. 
In particular, multiline reads are now supported.
Reads containing no-called or invalid bases are also allowed
but are discarded on input. The file can have either Unix-style (LF) or
Windows-style (CR+LF) line ends.

<li>
<a href="https://en.wikipedia.org/wiki/FASTQ">FASTQ</a>:
input files in this format must be named with a file extension
<code>.fastq</code>,
<code>.fq</code>,
<code>.FASTQ</code>, or
<code>.FQ</code>. 
Shasta requires the stricter FASTQ format with exactly four lines per read.
The third line must consist of just the "+" sign.
The file must have Unix-style (LF) line ends.
Windows-style (CR+LF) line ends are not supported.

</ul>

<p>
Input from compressed files is not supported.
if your input files are compressed, you have to decompress them
first using the appropriate decompression utilities.

<p>
Any reads shorter
than <code>Reads.minReadLength</code> bases (default 10000) 
present in the input files are discarded on input.
Reads that contain bases with repeat counts
greater than 255 are also discarded. 
This is a consequence of the fact that repeat counts
are stored using one byte, and therefore there would
be no way to store such reads. Reads with such
long repeat counts are extremely rare, however,
and when they occur they are of suspicious quality.



<h2 id=OutputFiles>Output files</h2>
<p>
The Shasta executable creates output in a directory with a name
specified by the <code>--assemblyDirectory</code> option (default <code>ShastaRun</code>).
The directory is created automatically at the beginning of the run. 
The run stops if the directory already exists.
This reduces the possibility of unwanted deletion of data.

<p>
Contents of the output directory after a successful run
include the following:

<ul>

<li>
<code>Assembly.fasta</code>: 
The assembly results in FASTA format.
The id of each assembled segment is the same as the
edge id of the corresponding edge in the assembly graph.

<li><code>Assembly.gfa</code>: 
The assembly results in 
<a href=https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md>GFA 1.0</a> format.
This contains the same sequences in the FASTA file (as GFA segments),
plus their connectivity information (as GFA links). 
A convenient tool to inspect and study these files is
<a href='https://rrwick.github.io/Bandage'>Bandage</a>.
Segment ids in the GFA file correspond to FASTA ids 
in the FASTA file and also to assembly graph edge ids.

<li><code>Assembly-BothStrands.gfa</code>:
An alternative GFA output file for the assembly.
This contains both strands of the assembly.
This can be useful in some cases to clarify connectivity
of assembled segments. 
See <code>AssemblySummary.csv</code> to find the 
id of the reverse complement of each assembled segment.

<li>
<code>AssemblySummary.html</code>: 
An html file summarizing many assembly metrics.
It can be viewed using an Internet browser.
It has the same content shown by the summary page
of the Shasta http server (<code>--command explore</code>).

<li><code>shasta.conf</code>: 
a configuration file containing
the values of all assembly parameters used. 
This file uses a format that can also be used as input for 
a subsequent Shasta run using option <code>--config</code>.

<li><code>ReadLengthHistogram.csv</code>:
A spreadsheet file containing statistics of the read length distribution.
This only includes reads that were used by the assembler.
The assembler discards reads shorter than 
<code>Reads.minReadLength</code> bases (default 10000)
and reads that contain bases repeated more than 255 times.
The fifth field of the last line of this file
contains the total number of input bases
used by the assembler in this run.

<li><code>Binned-ReadLengthHistogram.csv</code>:
Similar to <code>ReadLengthHistogram.csv</code>,
but using 1 Kb bins of read lengths.

<li><code>Data</code>:
A directory containing binary data
that can later be used by the Shasta http server
or with the Shasta Python API.
This is only created if option
<code>--memoryMode filesystem</code> 
was used for the run.
Keep in mind that, unless you used
option <code>--memoryBacking disk</code>,
these data are in memory, not on disk,
and will disappear at next reboot.
If you want to save them permanently, you can use
script <code>shasta-install/bin/SaveRun.py</code>
to create a copy on disk of the binary data directory named
<code>DataOnDisk</code>. 
(You can also make the copy yourself using the <code>cp</code> command).
To free up the memory used without rebooting, you can invoke the 
Shasta executable again using the following options:

<pre>
--command cleanupBinaryData --assemblyDirectory outputDirectoryName
</pre>
Here, <code>outputDirectoryName</code>
should specify the same value used in the assembly run
(that is, the name of the directory containing the <code>Data</code>
directory).
Equivalently, you can just <code>umount</code> the <code>Data</code>
directory, then remove it. 

</ul>


<div class="goto-index"><a href="index.html">Table of contents</a></div>
</main>
</body>
</html>