1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
|
Sketches
========
For sequences to be compared with :code:`mash`, they must first be `sketched`,
which creates vastly reduced representations of them. This will happen
automatically if :code:`mash dist` is given raw sequences. However, if multiple
comparisons will be performed, it is more efficient to create sketches with
:code:`mash sketch` first and provide them to :code:`mash dist` in place of the
raw sequences. Sketching parameters can be provided to either tool via
command line options.
Reduced representations with MinHash tables
-------------------------------------------
Sketches are used by the `MinHash` algorithm to allow fast distance estimations
with low storage and memory requirements. To make a sketch, each k-mer in a
sequence is `hashed`, which creates a pseudo-random identifier. By sorting these
identifiers (`hashes`), a small subset from the top of the sorted list can
represent the entire sequence (these are `min-hashes`). The more similar another
sequence is, the more min-hashes it is likely to share.
k-mer size
''''''''''
As in any k-mer based method, larger k-mers will provide more specificity, while
smaller k-mers will provide more sensitivity. Larger genomes will also require
larger k-mers to avoid k-mers that are shared by chance. K-mer size is
specified with :code:`-k`, and sketch files must have the same k-mer size to be
compared with :code:`mash dist`. When :code:`mash sketch` is run, it
automatically assesses the specified k-mer size against the sizes of input
genomes by estimating the probability of a random match as:
.. math::
p = \frac 1 {\frac {\left(\overline\Sigma\right)^k} g + 1}
...where :math:`g` is the genome size and :math:`\Sigma` is the alphabet (ACGT
by default). If this probability exceeds a threshold (specified by
:code:`-w`; 0.01 by default) for any input genomes, a warning will be given
with the minimum k-mer size needed to get within the threshold.
For large collections of sketches, memory and storage may also be a
consideration when choosing a k-mer size. Mash will use 32-bit hashes, rather
than 64-bit, if they can encompass the full k-mer space for the alphabet in use.
This will (roughly) halve the size of the size of the sketch file on disk and
the memory it uses when loaded for :code:`mash dist`. The criterion for using a
32-bit hash is:
.. math::
\left({\overline\Sigma}\right)^k \leq 2^{32}
...which becomes :math:`k \leq 16` for nucleotides (the default) and
:math:`k \leq 7` for amino acids.
sketch size
'''''''''''
Sketch size corresponds to the number of (non-redundant) min-hashes that are
kept. Larger sketches will better represent the sequence, but at the cost of
larger sketch files and longer comparison times. The error bound of a distance
estimation for a given sketch size :math:`s` is formulated as:
.. math::
\sqrt{\frac{1}{s}}
Sketch size is specified with :code:`-s`. Sketches of different sizes can be
compared with :code:`mash dist`, although the comparison will be restricted to
the smaller of the two sizes.
Strand and alphabet
-------------------
By default, :code:`mash` uses a nucleotide alphabet (ACGT), is case-insensitive,
and will ignore strandedness by using canonical k-mers, as done in
`Jellyfish`_. This works by using the reverse complement of a k-mer if it comes
before the original k-mer alphabetically. Strandedness can be preserved with
:code:`-n` (non-canonical) and case can be preserved with :code:`-Z`. Note that
the default nucleotide alphabet does not include lowercase and thus will filter
out k-mers with lowercase nucleotides if :code:`-Z` is specified. The amino acid
alphabet can be specified with :code:`-a`, which also changes the default k-mer
size to reflect the denser information. A completely custom alphabet can also be
specified with :code:`-z`. Note that alphabet size affects p-value calculation
and hash size (see `Assessing significance with p-values <distances.html#assessing-significance-with-p-values>`_ and `k-mer size`_).
Sketching read sets
-------------------
When sketching reads instead of complete genomes or assemblies, :code:`-r`
should be specified, which will estimate genome size from k-mer content
rather than total sequence length, allowing more accurate p-vlaues. Genome
size can also be specified directly with :code:`-g`. Additionally, Since
MinHash is a k-mer based method, removing unique or low-copy k-mers usually
improves results for read sets, since these k-mers are likely to represent
sequencing error. The minimum copies of each k-mer required can be specified
with :code:`-m` (e.g. :code:`-m 2` to filter unique). However, this could
lead to high memory usage if genome size is high and coverage is low, such as
in metagenomic read sets. In these cases a Bloom filter can be used (:code:`-b`)
to filter out most unique k-mers with constant memory. If coverage is high (e.g.
>100x), it can be helpful to limit it to save time and to avoid repeat errors
appearing as legitimate k-mers. This can be done with :code:`-c`, which stops
sketching reads once the estimated average coverage (based on k-mer
multiplicity) reaches the target.
Working with sketch files
-------------------------
The sketch or sketches stored in a sketch file, and their parameters, can be
inspected with :code:`mash info`. If sketch files have matching k-mer sizes,
their sketches can be combined into a single file with :code:`mash paste`. This
allows simple pairwise comparisons with :code:`mash dist`, and allows sketching
of multiple files to be parallelized.
.. _Jellyfish: http://www.cbcb.umd.edu/software/jellyfish/
|