.. _celera-assembler: `Celera Assembler <http://wgs-assembler.sourceforge.net>`
.. _tutorial:
Canu Tutorial
=============
Canu assembles reads from PacBio RS II or Oxford Nanopore MinION instruments into
uniquely-assemblable contigs (unitigs). Canu owes much of its design and code to
`Celera Assembler <http://wgs-assembler.sourceforge.net>`_.
Canu can be run using hardware of nearly any shape or size, anywhere from laptops to computational
grids with thousands of nodes. Obviously, larger assemblies will take a long time to compute on
laptops, and smaller assemblies can't take advantage of hundreds of nodes, so what is being
assembled plays some part in determining what hardware can be effectively used.
Most algorithms in canu have been multi-threaded (to use all the cores on a single node),
parallelized (to use all the nodes in a grid), or both (all the cores on all the nodes).
.. _canu-command:
Canu, the command
~~~~~~~~~~~~~~~~~~~~~~
The **canu** command is the 'executive' program that runs all modules of the assembler. It oversees
each of the three top-level tasks (correction, trimming, unitig construction), each of which
consists of many steps. Canu ensures that input files for each step exist, that each step
successfully finished, and that the output for each step exists. It does minor bits of processing,
such as reformatting files, but generally just executes other programs.
::

  canu [-correct | -trim | -assemble | -trim-assemble] \
    [-s <assembly-specifications-file>] \
    -p <assembly-prefix> \
    -d <assembly-directory> \
    genomeSize=<number>[g|m|k] \
    [other-options] \
    [-pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq

The -p option, which sets the file name prefix of intermediate and output files, is mandatory. If -d
is not supplied, canu will run in the current directory; otherwise, Canu will create the
`assembly-directory` and run in that directory. It is *not* possible to run two different
assemblies in the same directory.
The -s option will import a list of parameters from the supplied specification ('spec') file. These
parameters will be applied before any from the command line are used, providing a method for
setting commonly used parameters, but overriding them for specific assemblies.
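The precedence described above can be pictured with a short sketch. It is not canu's own code: the spec-file syntax assumed here (one key=value pair per line, with blank lines and '#' comment lines ignored), the helper names, and the parameter values are all assumptions made for illustration.

```python
def parse_pairs(lines):
    """Parse key=value strings into a dict; later entries replace earlier ones."""
    params = {}
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value pair: {line!r}")
        params[key.strip()] = value.strip()
    return params


def effective_parameters(spec_lines, command_line_pairs):
    """Apply spec-file parameters first, then let the command line override them."""
    params = parse_pairs(spec_lines)
    params.update(parse_pairs(command_line_pairs))
    return params


# Hypothetical spec-file contents and command-line parameters.
spec = ["# site defaults", "useGrid=false", "minReadLength=1000"]
cli = ["genomeSize=4.7m", "minReadLength=2500"]
merged = effective_parameters(spec, cli)
```

Here ``merged`` keeps useGrid from the spec file, while minReadLength takes the command-line value.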
By default, all three top-level tasks are performed. It is possible to run exactly one task by
using the -correct, -trim or -assemble options. These options can be useful if you want to correct
reads once and try many different assemblies. We do exactly that in the :ref:`quickstart`.
Additionally, supplying pre-corrected reads with -pacbio-corrected or -nanopore-corrected
will run only the trimming (-trim) and assembling (-assemble) stages.
Parameters are key=value pairs that configure the assembler. They set run time parameters (e.g.,
memory, threads, grid), algorithmic parameters (e.g., error rates, trimming aggressiveness), and
enable or disable entire processing steps (e.g., don't correct errors, don't search for subreads).
They are described later. One parameter is required: the genomeSize (in bases, with common
SI prefixes allowed, for example, 4.7m or 2.8g; see :ref:`genomeSize`). Parameters are listed in
the :ref:`parameter-reference`, but the common ones are described in this document.
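As an illustration of the genomeSize notation, the following sketch converts a value such as 4.7m into a number of bases. The function name is hypothetical and the code is not taken from canu; it only handles the k, m and g suffixes shown in the command synopsis.

```python
from decimal import Decimal

# Multipliers for the suffixes shown in the synopsis: genomeSize=<number>[g|m|k].
SUFFIXES = {"k": 10**3, "m": 10**6, "g": 10**9}


def genome_size_in_bases(text):
    """Convert a value such as '4.7m' or '2.8g' to a whole number of bases."""
    text = text.strip()
    multiplier = 1
    if text and text[-1] in SUFFIXES:
        multiplier = SUFFIXES[text[-1]]
        text = text[:-1]
    # Decimal avoids binary floating-point rounding in values like 4.7.
    return int(Decimal(text) * multiplier)
```

With this sketch, '4.7m' becomes 4700000 bases and '2.8g' becomes 2800000000 bases.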
Reads are supplied to canu by options that describe how the reads were generated and what level
of quality they are. For example, -pacbio-raw indicates the reads were generated on a PacBio RS II
instrument and have had no processing done to them. Each file of reads supplied this way becomes a
'library' of reads. The reads in a library should have been (physically) generated all at the same
time using the same steps, but perhaps sequenced in multiple batches. In canu, each library has a
set of options setting various algorithmic parameters, for example, how aggressively to trim. To
explicitly set library parameters, a text 'gkp' file describing the library and the input files must
be created. Don't worry too much about this yet; it's an advanced feature, fully described in
Section :ref:`gkp-files`.
The read-files contain sequence data in either FASTA or FASTQ format (or both! A quirk of the
implementation allows files that contain both FASTA and FASTQ format reads). The files can be
uncompressed, gzip, bzip2 or xz compressed. We've found that "gzip -1" provides good compression
that is fast to both compress and decompress. For 'archival' purposes, we use "xz -9".
.. _canu-pipeline:
Canu, the pipeline
~~~~~~~~~~~~~~~~~~~~~~
The canu pipeline, that is, what it actually computes, consists of computing overlaps and
processing the overlaps to produce some result. Each of the three tasks (read correction, read
trimming and unitig construction) follows the same pattern:
* Load reads into the read database, gkpStore.
* Compute k-mer counts in preparation for the overlap computation.
* Compute overlaps.
* Load overlaps into the overlap database, ovlStore.
* Do something interesting with the reads and overlaps.

  * The read correction task will replace the original noisy read sequences with consensus sequences
    computed from overlapping reads.
  * The read trimming task will use overlapping reads to decide what regions of each read are
    high-quality sequence, and what regions should be trimmed. After trimming, the single largest
    high-quality chunk of sequence is retained.
  * The unitig construction task finds sets of overlaps that are consistent, and uses those to place
    reads into a multialignment layout. The layout is then used to generate a consensus sequence
    for the unitig.

.. _module-tags:
Module Tags
~~~~~~~~~~~~~~~~~~~~~~
Because each of the three tasks shares common algorithms (all compute overlaps, two compute
consensus sequences, etc.), parameters are differentiated by a short prefix 'tag' string. This lets
canu have one generic parameter that can be set to different values for each stage in each task.
For example, "corOvlMemory" will set memory usage for overlaps being generated for read correction;
"obtOvlMemory" for overlaps generated for Overlap Based Trimming; "utgOvlMemory" for overlaps
generated for unitig construction.
The tags are:
+--------+-------------------------------------------------------------------+
|Tag     | Usage                                                             |
+========+===================================================================+
|master  | the canu script itself, and any components that it runs directly  |
+--------+-------------------------------------------------------------------+
|cns     | unitig consensus generation                                       |
+--------+-------------------------------------------------------------------+
|cor     | read correction generation                                        |
+--------+-------------------------------------------------------------------+
|red     | read error detection                                              |
+--------+-------------------------------------------------------------------+
|oea     | overlap error adjustment                                          |
+--------+-------------------------------------------------------------------+
|ovl     | the standard overlapper                                           |
+--------+-------------------------------------------------------------------+
|corovl  | the standard overlapper, as used in the correction phase          |
+--------+-------------------------------------------------------------------+
|obtovl  | the standard overlapper, as used in the trimming phase            |
+--------+-------------------------------------------------------------------+
|utgovl  | the standard overlapper, as used in the assembly phase            |
+--------+-------------------------------------------------------------------+
|mhap    | the mhap overlapper                                               |
+--------+-------------------------------------------------------------------+
|cormhap | the mhap overlapper, as used in the correction phase              |
+--------+-------------------------------------------------------------------+
|obtmhap | the mhap overlapper, as used in the trimming phase                |
+--------+-------------------------------------------------------------------+
|utgmhap | the mhap overlapper, as used in the assembly phase                |
+--------+-------------------------------------------------------------------+
|mmap    | the `minimap <https://github.com/lh3/minimap>`_ overlapper        |
+--------+-------------------------------------------------------------------+
|cormmap | the minimap overlapper, as used in the correction phase           |
+--------+-------------------------------------------------------------------+
|obtmmap | the minimap overlapper, as used in the trimming phase             |
+--------+-------------------------------------------------------------------+
|utgmmap | the minimap overlapper, as used in the assembly phase             |
+--------+-------------------------------------------------------------------+
|ovb     | the bucketizing phase of overlap store building                   |
+--------+-------------------------------------------------------------------+
|ovs     | the sort phase of overlap store building                          |
+--------+-------------------------------------------------------------------+
We'll get to the details eventually.
.. _execution:
Execution Configuration
~~~~~~~~~~~~~~~~~~~~~~~~
There are two modes that canu runs in: locally, using just one machine, or grid-enabled, using
multiple hosts managed by a grid engine. LSF, PBS/Torque, PBSPro, Sun Grid Engine (and
derivations), and Slurm are supported, though LSF has had limited testing. Section
:ref:`grid-engine-config` has a few hints on how to set up a new grid engine.
By default, if a grid is detected the canu pipeline will immediately submit itself to the grid and
run entirely under grid control. If no grid is detected, or if option ``useGrid=false`` is set,
canu will run on the local machine.
In both cases, Canu will auto-detect available resources and configure job sizes based on those
resources and the size of the genome you're assembling. Thus, most users should be able to run the command
without modifying the defaults. Some advanced options are outlined below. Each stage has the same
five configuration options, and tags are used to specialize the option to a specific stage. The
options are:
useGrid<tag>=boolean
Run this stage on the grid, usually in parallel.
gridOptions<tag>=string
Supply this string to the grid submit command.
<tag>Memory=integer
Use this many gigabytes of memory, per process.
<tag>Threads=integer
Use this many compute threads per process.
<tag>Concurrency=integer
If not on the grid, run this many jobs at the same time.
Global grid options, applied to every job submitted to the grid, can be set with 'gridOptions'.
This can be used to add accounting information or access credentials.
A name can be associated with this compute using 'gridOptionsJobName'. Canu will work just fine
with no name set, but if multiple canu assemblies are running at the same time, they will tend to
wait for each other's jobs to finish. For example, if two assemblies are running, at some point both
will have overlap jobs running. Each assembly will be waiting for all jobs named 'ovl_asm' to
finish. Had the assemblies specified job names, gridOptionsJobName=apple and
gridOptionsJobName=orange, then one would be waiting for jobs named 'ovl_asm_apple', and the other
would be waiting for jobs named 'ovl_asm_orange'.
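The effect of gridOptionsJobName can be illustrated with a toy model. The functions below are hypothetical; they only mimic the naming pattern in the example above ('ovl_asm' versus 'ovl_asm_apple') and do not show how canu builds or tracks job names internally.

```python
def job_name(base, suffix=None):
    """Return base unchanged, or base_suffix when a job-name suffix is given."""
    return f"{base}_{suffix}" if suffix else base


def jobs_waited_on(queue, name):
    """Select the queued jobs an assembly waits for: every job with this name."""
    return [job for job in queue if job == name]


# Two assemblies without a job name, two overlap jobs each:
# both assemblies end up waiting on all four 'ovl_asm' jobs.
unnamed_queue = [job_name("ovl_asm")] * 4

# The same two assemblies named 'apple' and 'orange':
# each one waits only on its own two jobs.
named_queue = [job_name("ovl_asm", "apple")] * 2 + [job_name("ovl_asm", "orange")] * 2
```

Calling ``jobs_waited_on(unnamed_queue, "ovl_asm")`` selects all four jobs, while ``jobs_waited_on(named_queue, "ovl_asm_apple")`` selects only the two that belong to that assembly.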
.. _error-rates:
Error Rates
~~~~~~~~~~~~~~~~~~~~~~
Canu expects all error rates to be reported as fraction error, not as percent error. We're not sure
exactly why this is so. Previously, it used a mix of fraction error and percent error (or both!),
and was a little confusing. Here's a handy table you can print out that converts between fraction
error and percent error. Not all values are shown (it'd be quite a large table) but we have every
confidence you can figure out the missing values:
============== =============
Fraction Error Percent Error
============== =============
0.01           1%
0.02           2%
0.03           3%
.              .
.              .
0.12           12%
.              .
.              .
============== =============
Canu error rates always refer to the percent difference in an alignment of two reads, not the
percent error in a single read, and not the amount of variation in your reads. These error rates
are used in two different ways: they are used to limit what overlaps are generated, e.g., don't
compute overlaps that have more than 5% difference; and they are used to tell algorithms what
overlaps to use, e.g., even though overlaps were computed to 5% difference, don't trust any above 3%
difference.
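The two uses can be sketched as two successive filters. The overlap records and the helper name below are made up for illustration; only the 5% and 3% thresholds come from the example above, written as fractions in the form canu expects.

```python
def percent_to_fraction(percent):
    """Convert a percent error (e.g. 5 for 5%) to the fraction form canu expects."""
    return percent / 100.0


# Hypothetical overlaps: (read_a, read_b, fraction error of the alignment).
candidates = [(1, 2, 0.012), (1, 3, 0.029), (2, 3, 0.041), (2, 4, 0.063)]

compute_limit = percent_to_fraction(5)  # limits which overlaps are generated
trust_limit = percent_to_fraction(3)    # limits which overlaps an algorithm uses

# First use: overlaps above the compute limit are never generated.
computed = [o for o in candidates if o[2] <= compute_limit]

# Second use: of the computed overlaps, only those within the trust limit are used.
trusted = [o for o in computed if o[2] <= trust_limit]
```

In this toy data set three overlaps are computed and two of them are trusted.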
There are seven error rates. Three error rates control overlap creation (:ref:`corOvlErrorRate
<ovlErrorRate>`, :ref:`obtOvlErrorRate <ovlErrorRate>` and :ref:`utgOvlErrorRate <ovlErrorRate>`),
and four error rates control algorithms (:ref:`corErrorRate <corErrorRate>`, :ref:`obtErrorRate
<obtErrorRate>`, :ref:`utgErrorRate <utgErrorRate>`, :ref:`cnsErrorRate <cnsErrorRate>`).
The three error rates for overlap creation apply to the `ovl` overlap algorithm and the
:ref:`mhapReAlign <mhapReAlign>` option used to generate alignments from `mhap` or `minimap`
overlaps. Since `mhap` is used for generating correction overlaps, the :ref:`corOvlErrorRate
<ovlErrorRate>` parameter is not used by default. Overlaps for trimming and assembling use the
`ovl` algorithm, therefore, :ref:`obtOvlErrorRate <ovlErrorRate>` and :ref:`utgOvlErrorRate
<ovlErrorRate>` are used.
The four algorithm error rates are used to select which overlaps can be used for correcting reads
(:ref:`corErrorRate <corErrorRate>`); which overlaps can be used for trimming reads
(:ref:`obtErrorRate <obtErrorRate>`); and which overlaps can be used for assembling reads
(:ref:`utgErrorRate <utgErrorRate>`). The last error rate, :ref:`cnsErrorRate <cnsErrorRate>`,
tells the consensus algorithm to not trust read alignments above that value.
For convenience, two meta options set the error rates used with uncorrected reads
(:ref:`rawErrorRate <rawErrorRate>`) or used with corrected reads (:ref:`correctedErrorRate
<correctedErrorRate>`). The default depends on the type of read being assembled.
================== ====== ========
Parameter          PacBio Nanopore
================== ====== ========
rawErrorRate       0.300  0.500
correctedErrorRate 0.045  0.144
================== ====== ========
In practice, only :ref:`correctedErrorRate <correctedErrorRate>` is usually changed. The :ref:`faq`
has :ref:`specific suggestions <tweak>` on when to change this.
Canu v1.4 and earlier used the :ref:`errorRate <errorRate>` parameter, which set the expected
rate of error in a single corrected read.
.. _minimum-lengths:
Minimum Lengths
~~~~~~~~~~~~~~~~~~~~~~
Two minimum sizes are known:
minReadLength
Discard reads shorter than this when loading into the assembler, and when trimming reads.
minOverlapLength
Do not save overlaps shorter than this.
Overlap configuration
~~~~~~~~~~~~~~~~~~~~~~
The largest compute of the assembler is also the most complicated to configure. As shown in the
'module tags' section, there are up to twelve (!) different overlapper configurations. For
each overlapper ('ovl', 'mhap' or 'minimap') there is a global configuration, and three specializations
that apply to each stage in the pipeline (correction, trimming or assembly).
As with execution configuration, overlap configuration uses a 'tag' prefix applied to each option. The
tags in this instance are 'cor', 'obt' and 'utg'.
For example:
- To change the k-mer size for all instances of the ovl overlapper, 'ovlMerSize=23' would be used.
- To change the k-mer size for just the ovl overlapper used during correction, 'corOvlMerSize=16' would be used.
- To change the mhap k-mer size for all instances, 'mhapMerSize=18' would be used.
- To change the mhap k-mer size just during correction, 'corMhapMerSize=15' would be used.
- To use minimap for overlap computation just during correction, 'corOverlapper=minimap' would be used. The minimap2 executable must be symlinked from the Canu binary folder ('Linux-amd64/bin' or 'Darwin-amd64/bin' depending on your system).
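One way to think about the global and per-stage settings is as a lookup that prefers the stage-specific name and falls back to the global one. This is a conceptual sketch, not canu's implementation: the function name and the dictionary of settings are assumptions, and the k-mer sizes are the example values from the list above.

```python
def resolve(settings, stage, option):
    """Return the stage-specific value if set, otherwise the global value.

    stage is 'cor', 'obt' or 'utg'; option is a global name such as 'mhapMerSize'.
    """
    # 'cor' + 'MhapMerSize' -> 'corMhapMerSize'
    specific = stage + option[0].upper() + option[1:]
    if specific in settings:
        return settings[specific]
    return settings[option]


# Global mhap k-mer size, with a specialization for the correction stage only.
settings = {"mhapMerSize": 18, "corMhapMerSize": 15}
```

With these settings, the correction stage resolves to 15, while trimming and assembly fall back to the global value of 18.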
Ovl Overlapper Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<tag>Overlapper
select the overlap algorithm to use: 'ovl', 'mhap' or 'minimap'.
Ovl Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
<tag>ovlHashBlockLength
how many bases of reads to include in the hash table; directly controls process size
<tag>ovlRefBlockSize
how many reads to compute overlaps for in one process; directly controls process time
<tag>ovlRefBlockLength
same, but use 'bases in reads' instead of 'number of reads'
<tag>ovlHashBits
size of the hash table (SHOULD BE REMOVED AND COMPUTED, MAYBE TWO PASS)
<tag>ovlHashLoad
how much to fill the hash table before computing overlaps (SHOULD BE REMOVED)
<tag>ovlMerSize
size of kmer seed; smaller is more sensitive, but slower
The overlapper will not use frequent kmers to seed overlaps. These are computed by the 'meryl' program,
and can be selected in one of three ways.
Terminology. A k-mer is a contiguous sequence of k bases. The read 'ACTTA' has two 4-mers: ACTT
and CTTA. To account for reverse-complement sequence, a 'canonical kmer' is the lexicographically
smaller of the forward and reverse-complemented kmer sequence. Kmer ACTT, with reverse complement
AAGT, has a canonical kmer AAGT. Kmer CTTA, reverse-complement TAAG, has canonical kmer CTTA.
A 'distinct' kmer is the kmer sequence with no count associated with it. A 'total' kmer (for lack
of a better term) is the kmer with its count. The sequence TCGTTTTTTTCGTCG has 12 'total' 4-mers
and 8 'distinct' kmers.
::

  TCGTTTTTTTCGTCG  count
  TCGT             2 distinct-1
   CGTT            1 distinct-2
    GTTT           1 distinct-3
     TTTT          4 distinct-4
      TTTT         4 copy of distinct-4
       TTTT        4 copy of distinct-4
        TTTT       4 copy of distinct-4
         TTTC      1 distinct-5
          TTCG     1 distinct-6
           TCGT    2 copy of distinct-1
            CGTC   1 distinct-7
             GTCG  1 distinct-8

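The terminology can be checked with a few lines of Python. This sketch is independent of meryl, and the function names are chosen here for illustration. Like the table above, it counts k-mers as they appear on the forward strand.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the result."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(sequence, k):
    """Count every k-mer in the sequence as it is read on the forward strand."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = count_kmers("TCGTTTTTTTCGTCG", 4)
total = sum(counts.values())   # number of 'total' k-mers
distinct = len(counts)         # number of 'distinct' k-mers
```

For the example sequence this gives 12 total and 8 distinct 4-mers, and ``canonical("ACTT")`` returns AAGT, matching the text.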
<tag>MerThreshold
any kmer with count higher than N is not used
<tag>MerDistinct
pick a threshold so as to seed overlaps using this fraction of all distinct kmers in the input. In the example above,
fraction 0.875 of the k-mers (7/8) will be at or below threshold 2.
<tag>MerTotal
pick a threshold so as to seed overlaps using this fraction of all kmers in the input. In the example above,
fraction 0.667 of the k-mers (8/12) will be at or below threshold 2.
<tag>FrequentMers
don't compute frequent kmers, use those listed in this fasta file
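The fractions quoted for MerDistinct and MerTotal can be reproduced from the k-mer counts of the example sequence. The sketch below only evaluates the fraction at a given threshold; how canu searches for the threshold that meets a requested fraction is not shown, and the function names are illustrative.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count each k-mer of the sequence on the forward strand."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def fraction_distinct(counts, threshold):
    """Fraction of distinct k-mers whose count is at or below the threshold."""
    kept = sum(1 for c in counts.values() if c <= threshold)
    return kept / len(counts)


def fraction_total(counts, threshold):
    """Fraction of all k-mer occurrences whose k-mer count is at or below the threshold."""
    kept = sum(c for c in counts.values() if c <= threshold)
    return kept / sum(counts.values())


counts = kmer_counts("TCGTTTTTTTCGTCG", 4)
```

At threshold 2, ``fraction_distinct`` gives 7/8 and ``fraction_total`` gives 8/12, because only TTTT (count 4) exceeds the threshold.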
Mhap Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
<tag>MhapBlockSize
Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
<tag>MhapMerSize
Use k-mers of this size for detecting overlaps.
<tag>ReAlign
After computing overlaps with mhap, compute a sequence alignment for each overlap.
<tag>MhapSensitivity
Either 'normal', 'high', or 'fast'.
Mhap also will down-weight frequent kmers (using tf-idf), but its selection of frequent kmers is not exposed.
Minimap Overlapper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
<tag>MMapBlockSize
Chunk of reads that can fit into 1GB of memory. Combined with memory to compute the size of chunk the reads are split into.
<tag>MMapMerSize
Use k-mers of this size for detecting overlaps.
Minimap also will ignore high-frequency minimizers, but its selection of frequent minimizers is not exposed.
.. _outputs:
Outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~
As Canu runs, it outputs status messages, execution logs, and some analysis to the console. Most of
the analysis is captured in ``<prefix>.report`` as well.
LOGGING
<prefix>.report
Most of the analysis reported during assembly. This will report the histogram of read lengths, the histogram of k-mers in the raw and corrected reads, the summary of corrected data, the summary of overlaps, and the summary of contig lengths.
You can use the k-mer histograms of the corrected reads with tools like `GenomeScope <http://qb.cshl.edu/genomescope/>`_ to estimate heterozygosity and genome size. In particular, histograms with more than one peak likely indicate a heterozygous genome. See the :ref:`FAQ` for some suggested parameters.
The corrected read report gives a summary of the fate of all input reads. The first part::

  --                             original      original
  --                            raw reads     raw reads
  --   category                w/overlaps  w/o/overlaps
  --   -------------------- ------------- -------------
  --   Number of Reads             250609           477
  --   Number of Bases         2238902045       1896925
  --   Coverage                    97.344         0.082
  --   Median                        6534          2360
  --   Mean                          8933          3976
  --   N50                          11291          5756
  --   Minimum                       1012             0
  --   Maximum                      60664         41278

reports the fraction of reads which had an overlap. In this case, the majority had at least one overlap, which is good. Next::

  --                                        --------corrected---------
  --                             evidence                    expected
  --   category                     reads           raw     corrected
  --   -------------------- ------------- ------------- -------------
  --   Number of Reads             229397         48006         48006
  --   Number of Bases         2134291652     993586222     920001699
  --   Coverage                    92.795        43.199        40.000
  --   Median                        6842         15330         14106
  --   Mean                          9303         20697         19164
  --   N50                          11512         28066         26840
  --   Minimum                       1045         10184         10183
  --   Maximum                      60664         60664         59063
  --

reports that a total of 92.8x of raw bases are candidates for correction. By default, Canu only selects the longest 40x for correction. In this case, it selects 43.2x of raw read data, which it estimates will result in 40x of corrected reads. Not all raw reads survive full-length through correction::

  --                        ----------rescued----------
  --                                           expected
  --   category                       raw     corrected
  --   -------------------- ------------- -------------
  --   Number of Reads              20030         20030
  --   Number of Bases           90137165      61903752
  --   Coverage                     3.919         2.691
  --   Median                        3324          2682
  --   Mean                          4500          3090
  --   N50                           5529          3659
  --   Minimum                       1012           501
  --   Maximum                      41475         10179

The rescued reads are those which would not have contributed to the correction of the selected longest 40x subset. These could be short plasmids, mitochondria, etc. Canu includes them, even though they are too short to make the 40x cutoff, to avoid losing sequence during assembly. Lastly::

  --                        --------uncorrected--------
  --                                           expected
  --   category                       raw     corrected
  --   -------------------- ------------- -------------
  --   Number of Reads             183050        183050
  --   Number of Bases         1157075583     951438105
  --   Coverage                    50.308        41.367
  --   Median                        5729          5086
  --   Mean                          6321          5197
  --   N50                           7467          6490
  --   Minimum                          0             0
  --   Maximum                      50522         10183

are the reads which were deemed too short to correct. If you increase ``corOutCoverage``, you could get up to 41x more corrected sequence. However, unless the genome is very heterozygous, this does not typically improve the assembly and increases the running time.
The assembly statistics (NG50, etc) are reported before and after consensus calling.
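The Coverage rows in the corrected read report are consistent with the number of bases divided by the genome size. The report excerpt does not state the genome size; the 23 Mbp used below is an assumption inferred from the numbers (2238902045 bases at 97.344x), so treat this as a sanity check rather than a description of canu's code.

```python
def coverage(number_of_bases, genome_size):
    """Depth of coverage: total bases divided by genome size, both in bases."""
    return number_of_bases / genome_size


GENOME_SIZE = 23_000_000  # assumption: inferred from the report, not stated in it

# (category, number of bases, coverage printed in the report)
rows = [
    ("raw reads with overlaps", 2238902045, 97.344),
    ("corrected, raw", 993586222, 43.199),
    ("corrected, expected", 920001699, 40.000),
    ("uncorrected, expected", 951438105, 41.367),
]

for name, bases, reported in rows:
    print(f"{name:28s} {coverage(bases, GENOME_SIZE):8.3f}  (report: {reported:.3f})")
```

The same arithmetic explains why selecting the longest 40x of a genome of this size corresponds to roughly 920 Mbp of corrected sequence.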
READS
<prefix>.correctedReads.fasta.gz
The reads after correction.
<prefix>.trimmedReads.fasta.gz
The corrected reads after overlap based trimming.
SEQUENCE
<prefix>.contigs.fasta
Everything that could be assembled; this is the primary assembly, including both unique
and repetitive elements.
<prefix>.unitigs.fasta
Contigs, split at alternate paths in the graph.
<prefix>.unassembled.fasta
Reads and low-coverage contigs which could not be incorporated into the primary assembly.
The header line for each sequence provides some metadata on the sequence::

  >tig######## len=<integer> reads=<integer> covStat=<float> gappedBases=<yes|no> class=<contig|bubble|unassm> suggestRepeat=<yes|no> suggestCircular=<yes|no>

len
Length of the sequence, in bp.
reads
Number of reads used to form the contig.
covStat
The log of the ratio of the contig being unique versus being two-copy, based on the read arrival rate. Positive values indicate more likely to be unique, while negative values indicate more likely to be repetitive. See `Footnote 24 <http://science.sciencemag.org/content/287/5461/2196.full#ref-24>`_ in `Myers et al., A Whole-Genome Assembly of Drosophila <http://science.sciencemag.org/content/287/5461/2196.full>`_.
gappedBases
If yes, the sequence includes all gaps in the multialignment.
class
Type of sequence. Unassembled sequences are primarily low-coverage sequences spanned by a single read.
suggestRepeat
If yes, sequence was detected as a repeat based on graph topology or read overlaps to other sequences.
suggestCircular
If yes, sequence is likely circular. Not implemented.
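For downstream scripts, the header can be split into a name and a dictionary of fields. The parser below is a sketch based only on the header layout shown above; the function name is illustrative and the sample header values are invented.

```python
def parse_header(line):
    """Split a FASTA header line into the sequence name and its key=value fields."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    name, *fields = line[1:].split()
    meta = dict(field.split("=", 1) for field in fields)
    # Convert the documented numeric and yes/no fields to Python types.
    for key in ("len", "reads"):
        if key in meta:
            meta[key] = int(meta[key])
    if "covStat" in meta:
        meta["covStat"] = float(meta["covStat"])
    for key in ("gappedBases", "suggestRepeat", "suggestCircular"):
        if key in meta:
            meta[key] = meta[key] == "yes"
    return name, meta


# Invented example values, laid out as in the header format above.
example = (">tig00000001 len=4641652 reads=8123 covStat=3215.70 gappedBases=no "
           "class=contig suggestRepeat=no suggestCircular=no")
name, meta = parse_header(example)
```

A script could then, for instance, keep only sequences where ``meta["class"]`` is 'contig' and ``meta["suggestRepeat"]`` is false.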
GRAPHS
<prefix>.contigs.gfa
Unused or ambiguous edges between contig sequences. Bubble edges cannot be represented in this format.
<prefix>.unitigs.gfa
Contigs split at bubble intersections.
<prefix>.unitigs.bed
The position of each unitig in a contig.
METADATA
The layout provides information on where each read ended up in the final assembly, including
contig and positions. It also includes the consensus sequence for each contig.
<prefix>.contigs.layout, <prefix>.unitigs.layout
(undocumented)
<prefix>.contigs.layout.readToTig, <prefix>.unitigs.layout.readToTig
The position of each read in a contig (unitig).
<prefix>.contigs.layout.tigInfo, <prefix>.unitigs.layout.tigInfo
A list of the contigs (unitigs), lengths, coverage, number of reads and other metadata.
Essentially the same information provided in the FASTA header line.