<!DOCTYPE html>
<html>
<head>
<link rel=stylesheet href=style.css />
<link rel=icon href=CZI-new-logo.png />
</head>
<body>
<main>
<div class="goto-index"><a href="index.html">Table of contents</a></div>
<h1>Assembly configurations</h1>
<p>
Shasta provides a number of
<a href="CommandLineOptions.html">command line options</a>
that can be used to set computational parameters and thresholds
for assemblies.
All of these options have default values,
but the default values are not necessarily optimal for
a particular combination of the following factors:
<ul>
<li>The technology used to generate the reads.
Technologies currently available to generate the long reads
supported by Shasta are
<a href="https://nanoporetech.com/">Oxford Nanopore</a> (ONT) and
<a href="https://www.pacb.com">Pacific Biosciences</a> (HiFi and others).
<li>The amount of coverage available (average number of reads
overlapping each genome region).
<li>The characteristics of the genome being sequenced, including
heterozygosity, ploidy, and repeat content.
</ul>
Adjusting options to account for these and other factors
is generally necessary to achieve
good quality assemblies.
To facilitate the process of generating useful assembly options
for a particular situation, Shasta uses <b>assembly configurations</b>.
An assembly configuration is a predefined set of assembly options
that can be stored in a <b>configuration file</b>
in a format defined
<a href="#ConfigFile">below</a>.
A number of sample configuration files applicable to specific situations
are provided in <code>shasta/conf</code>.
The applicability of each of the files is described in comments
embedded in each file.
<p>
Shasta command line option <code>--config</code>
is used to specify the configuration to be used, as described
in detail below. This option is <b>mandatory</b>
when running an assembly.
If an option is specified both in a configuration
and explicitly on the command line, the value
on the command line takes precedence.
This allows you to use a configuration as a useful
set of defaults, while still overriding some of its
options as desired.
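<p>
The precedence rule can be pictured with a minimal Python sketch.
This is an illustration only, not Shasta code:
the function name <code>effective_options</code> and the option
values are hypothetical, and options are modeled as plain dictionaries.

```python
def effective_options(config_values, command_line_values):
    """Merge two option dictionaries (hypothetical helper, not Shasta code).

    Values given on the command line are applied last,
    so they replace values with the same name from the configuration.
    """
    merged = dict(config_values)
    merged.update(command_line_values)
    return merged


if __name__ == "__main__":
    # Hypothetical values standing in for a configuration.
    from_config = {"MarkerGraph.minCoverage": "10"}
    # Hypothetical override given explicitly on the command line.
    from_command_line = {"MarkerGraph.minCoverage": "0"}
    print(effective_options(from_config, from_command_line))
```

<p>
Options that appear only in the configuration are kept unchanged,
which is what makes a configuration usable as a set of defaults.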
<p>
In addition to configuration files, Shasta also provides
a set of built-in configurations that are compiled
in the Shasta executable. These built-in configurations
can be used without the need for a configuration file.
Each built-in configuration has a corresponding configuration
file with the same name in <code>shasta/conf</code>, with
an extension <code>.conf</code>.
For example, configuration <code>Nanopore-May2022</code>
can be specified in one of two ways:
<pre>
shasta --config Nanopore-May2022
</pre>
or
<pre>
shasta --config .../shasta/conf/Nanopore-May2022.conf
</pre>
When using the second form, the file must be available,
and the <code>...</code> should be replaced depending on the
location of the <code>shasta</code> directory.
<p>
To obtain a list of available built-in configurations,
use Shasta command <code>listConfigurations</code> as follows:
<br>
<pre>
shasta --command listConfigurations
</pre>
At the time of writing (May 2024), this outputs the following
list of built-in configurations:
<pre>
Nanopore-Dec2019
Nanopore-UL-Dec2019
Nanopore-Sep2020
Nanopore-UL-Sep2020
Nanopore-UL-iterative-Sep2020
Nanopore-OldGuppy-Sep2020
Nanopore-Plants-Apr2021
Nanopore-Oct2021
Nanopore-UL-Oct2021
HiFi-Oct2021
Nanopore-UL-Jan2022
Nanopore-Phased-Jan2022
Nanopore-UL-Phased-Jan2022
Nanopore-May2022
Nanopore-Phased-May2022
Nanopore-UL-May2022
Nanopore-UL-Phased-May2022
Nanopore-Human-SingleFlowcell-May2022
Nanopore-Human-SingleFlowcell-Phased-May2022
Nanopore-UL-Phased-Nov2022
Nanopore-R10-Fast-Nov2022
Nanopore-R10-Slow-Nov2022
Nanopore-Phased-R10-Fast-Nov2022
Nanopore-Phased-R10-Slow-Nov2022
Nanopore-ncm23-May2024
Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024
Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024
</pre>
<p>
The following table summarizes configurations
recommended at the time of writing (November 2022, Shasta 0.11.0)
under the following conditions:
<ul>
<li>Human assemblies.
<li>Oxford Nanopore reads.
<li>Guppy 5 or 6 basecaller with "super" accuracy.
</ul>
<p>
<table>
<tr><th>ONT chemistry<th>Read length<th>Coverage<th>Haploid assembly<th>Phased assembly
<tr><th>R9<th>Standard<th>40x to 80x
<td class=centered><code>Nanopore-May2022</code>
<td class=centered><code>Nanopore-Phased-May2022</code>
<tr><th>R9<th class=centered>Ultra-Long (UL)
<th>40x to 80x
<td class=centered><code>Nanopore-UL-May2022</code>
<td class=centered><code>Nanopore-UL-Phased-Nov2022</code>
<tr><th>R9<th>Standard<th>Human genome with a single flowcell
(about 30x)
<td class=centered><code>Nanopore-Human-SingleFlowcell-May2022</code>
<td class=centered><code>Nanopore-Human-SingleFlowcell-Phased-May2022</code>
<tr><th>R10, fast mode<th>Standard<th>Human genome with a single flowcell
(about 30x)
<td class=centered><code>Nanopore-R10-Fast-Nov2022</code>
<td class=centered><code>Nanopore-Phased-R10-Fast-Nov2022</code>
<tr><th>R10, slow mode<br>(no longer in use)<th>Standard<th>Human genome with two flowcells
(about 45x)
<td class=centered><code>Nanopore-R10-Slow-Nov2022</code>
<td class=centered><code>Nanopore-Phased-R10-Slow-Nov2022</code>
<tr>
<th><a href='https://labs.epi2me.io/gm24385_ncm23_preview/'>
ONT December 2023 Data release</a><br>
(<i>"Experimental extremely high-accuracy, ultra-long
sequencing kit"</i>)
<th>Ultra-Long (UL)
<th>Tested at 40x to 60x but may be functional outside this range
<td>
<td class=centered><code>Nanopore-ncm23-May2024</code>
<tr>
<th>
r10.4.1_e8.2-400bps_sup, error corrected with Herro
<th>
Ultra-Long (UL)
<th>
Tested at 45x but may be functional at higher or lower coverage
<td>
<td class=centered><code>Nanopore-r10.4.1_e8.2-400bps_sup-Herro-Sep2024</code>
<tr>
<th>
r10.4.1_e8.2-400bps_sup, without error correction
<th>
Ultra-Long (UL)
<th>
Tested at 45x but may be functional at higher or lower coverage
<td>
<td class=centered><code>Nanopore-r10.4.1_e8.2-400bps_sup-Raw-Sep2024</code>
</table>
<p>
To get details of a specific built-in configuration,
use Shasta command <code>listConfiguration</code> as follows,
specifying the built-in configuration of interest after <code>--config</code>:
<pre>
shasta --command listConfiguration --config Nanopore-May2022
</pre>
<p>
This output includes comments that describe the
applicability of the selected configuration.
Details of the configuration are written out in the configuration
file format defined below. This allows you to
create your own configuration file using a built-in configuration
as a starting point.
<p>
Shasta command line option <code>--config</code> must be used
to specify the configuration to be used for an assembly.
The option must specify either a built-in configuration
or a path to a configuration file.
<h2 id=ConfigFile>Configuration file</h2>
<p>
Some options are only allowed on the command line,
but most of them can also optionally be specified using a configuration file.
Values specified on the command line take precedence over
values specified in the configuration file.
This makes it easy to override specific values in a
configuration file.
<p>
Options that can be specified both on the command line
and in a configuration file are of the form
<code>--SectionName.optionName</code>. The format of the configuration file
is as follows:
<pre id=ConfigFile>
[SectionA]
option1 = valueA1
option2 = valueA2
[SectionB]
option1 = valueB1
option2 = valueB2
</pre>
The above is equivalent to using the following command line options:
<pre>
--SectionA.option1 valueA1
--SectionA.option2 valueA2
--SectionB.option1 valueB1
--SectionB.option2 valueB2
</pre>
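<p>
The correspondence between the two forms can be sketched in Python
with the standard <code>configparser</code> module.
This is a rough illustration under stated assumptions, not Shasta's own parser:
the function name <code>config_to_command_line</code> is hypothetical,
and boolean switches (described below) are not handled.

```python
import configparser

# Sample text in the configuration file format shown above.
SAMPLE = """
# Lines beginning with # and blank lines are ignored.
[SectionA]
option1 = valueA1
option2 = valueA2

[SectionB]
option1 = valueB1
option2 = valueB2
"""


def config_to_command_line(text):
    """Return the command line options equivalent to configuration text.

    Hypothetical helper for illustration; it does not handle boolean switches.
    """
    parser = configparser.ConfigParser(
        interpolation=None,       # Treat % characters literally.
        comment_prefixes=("#",),  # Only # starts a comment line.
        delimiters=("=",),        # Only = separates a name from its value.
    )
    # configparser lowercases option names by default; keep them as written.
    parser.optionxform = str
    parser.read_string(text)

    options = []
    for section in parser.sections():
        for name, value in parser.items(section):
            options.append(f"--{section}.{name} {value}")
    return options


if __name__ == "__main__":
    for option in config_to_command_line(SAMPLE):
        print(option)
```

<p>
Each section name becomes the prefix of the option name,
so the same option name can appear in more than one section without conflict.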
<p>
For example, the value for option <code>MarkerGraph.minCoverage</code>
can be specified in the <code>[MarkerGraph]</code>
section of the configuration file as follows:
<pre>
[MarkerGraph]
minCoverage = 0
</pre>
<p>
In the configuration file, blank lines and lines beginning with <code>#</code>
are ignored and can be used to add comments and to improve readability
of the configuration file.
<h2 id=BooleanSwitches>Boolean switches</h2>
<p>
Some command line options are boolean switches,
that is, control options that can be turned on or off
rather than be given a value.
<p>
To turn on one of these switches on the command line,
just add it to the command line without any value, for
example <code>--Assembly.storeCoverageData</code>.
To turn it off, just omit it from the command line
(switches are turned off by default).
<p>
To turn on one of these switches in a
configuration file, you can either enter it without a value
<pre>
storeCoverageData =
</pre>
or assign to it one of the following values:
<code>1, true, True, yes, Yes</code>.
To turn off one of these switches in a
configuration file, assign to it one of the following values:
<code>0, false, False, no, No</code>.
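<p>
The rules above for switch values in a configuration file can be summarized in a
short Python sketch. It is an illustration only, not Shasta code:
the function name <code>parse_switch</code> is hypothetical, and
rejecting any other value with an error is an assumption
rather than documented Shasta behavior.

```python
# Values that turn a switch on; the empty string covers "storeCoverageData =".
TRUE_VALUES = {"", "1", "true", "True", "yes", "Yes"}
# Values that turn a switch off.
FALSE_VALUES = {"0", "false", "False", "no", "No"}


def parse_switch(value):
    """Interpret the text assigned to a boolean switch in a configuration file.

    Hypothetical helper. Raising ValueError for unlisted values is an assumption.
    """
    text = value.strip()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"Not a recognized boolean switch value: {value!r}")


if __name__ == "__main__":
    for sample in ["", "1", "Yes", "0", "False", "no"]:
        print(repr(sample), parse_switch(sample))
```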
<p>
Boolean switches are indicated as such in the Description column in the tables below.
<div class="goto-index"><a href="index.html">Table of contents</a></div>
</main>
</body>
</html>