1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
|
# Minimisers {#tutorial_minimiser}
<!-- SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
SPDX-License-Identifier: CC-BY-4.0
-->
[TOC]
This tutorial introduces minimisers. Minimisers are a compact representation of a DNA or RNA sequence that is closely
related to, but more efficient than, a representation by k-mers. Minimisers are implemented as a view in SeqAn3. For
more information about views and how to implement your own view, please see
\ref tutorial_ranges and \ref howto_write_a_view .
\tutorial_head{Easy, 20 min, , \ref tutorial_ranges\, \ref howto_write_a_view}
# Motivation
A common way to work with sequences is to obtain their k-mers, but even with a large size of k, the number of k-mers is
non-trivial, therefore storing k-mers is costly. At the same time, due to the overlap of consecutive k-mers, the
information they contain is highly redundant. That is why minimisers come in handy. Minimisers are k-mers, which have
a minimal value in a window of a specified size. Minimal could for example mean lexicographically smallest. By storing
only these minimal k-mers, the storage cost is significantly reduced while maintaining a similar amount of information.
# Minimiser Workflow
Because minimisers are minimal k-mers, they depend on a given k-mer size, a given shape (specifying which positions
should be considered) and a window size, which has to be greater than or equal to the k-mer size. If all these values are
given, then the minimisers can be obtained by determining all k-mers in the forward and in the backward strand for one
window. Only the lexicographically smallest k-mer in one window is saved, then the window is shifted by one and the
procedure is repeated until all windows have been processed. If two consecutive windows share the same minimiser, it is
stored only once.
The following figure shows three examples, where a) and b) differ only in their window size, while c) introduces
a gapped k-mer. All positions marked with a ’.’ are gaps and all k-mers in lower case come from the reverse strand.
As the example shows, k-mer size, shape and window size influence the total number of minimisers.
\image html minimisers.png
### Non-lexicographical Minimisers
When sliding the window over the sequence, consecutive minimisers might differ only slightly.
For instance, when a minimiser starts with a repetition of A’s, it is highly likely that the minimiser of the next
window will start with a repetition of A’s too, which will then just be one A shorter. This may go on for multiple
window shifts, depending on how long the repetition is. Saving these only slightly different minimisers makes no sense
because they contain no new information about the underlying sequence.
Additionally, sequences with a repetition of A’s will be seen as more similar to each other than they actually are.
As [Marçais et al.](https://doi.org/10.1093/bioinformatics/btx235) have shown, randomizing the order of the k-mers
can solve this problem.
### Robust Winnowing
In case there are multiple minimal values within one window, the minimum and therefore the minimiser is ambiguous.
We choose the rightmost value as the minimiser of the window, and when shifting the window, the minimiser is only
changed if there appears a value that is strictly smaller than the current minimum. This approach is termed
*robust winnowing* by [Chirag et al.](https://doi.org/10.1093/bioinformatics/btaa435) and is
proven to work especially well on repeat regions.
# Usage in SeqAn3
The minimisers are implemented in `seqan3::views::minimiser_hash`. The function requires three arguments: The sequence,
the shape and the window size. The above-mentioned k-mer size is implicitly given by the size of the shape.
That is all the information you need to obtain minimisers with SeqAn3, so let's just dive right into the first
assignment!
\assignment{Assignment 1: Fun with minimisers I}
Task: Obtain the minimisers for `CCACGTCGACGGTT` with an ungapped shape of size 4 and a window size of 8.
\endassignment
\solution
\include doc/tutorial/06_minimisers/minimisers_solution1.cpp
\endsolution
If you have completed the assignment, you're probably wondering what those large numbers mean. As explained above,
lexicographical ordering is less than optimal. Therefore, SeqAn3 uses a seed to randomize the order. To do so, SeqAn3
simply XORs the hash values with a random seed (Default: `0x8F3F73B5CF1C9ADE`). How would you get back the actual hash
values then?
Well, you just use XOR again!
\include doc/tutorial/06_minimisers/seed_example.cpp
From these hash values, you can obtain the sequence they represent by transforming the numbers to base 4. (For
example, 134 is "2012" in base four and therefore represents `GACG`.)
Now take a closer look at the resulting minimiser sequences. Are they what you
expected? Probably not, since they do not correspond to the minimsers computed in our original example. Can you figure
out why?
\hint
The example above is based on a lexicographical ordering, our first assignment is not.
\endhint
So, what would we need to change to achieve the results from our example? We need to use a different seed. SeqAn3
makes this simple by allowing us to add another input parameter of type `seqan3::seed`.
\assignment{Assignment 2: Fun with minimisers II}
Task: Reproduce the results from the example above by obtaining the minimisers for `CCACGTCGACGGTT`.
To do so, you have to know which seed to use.
\hint
Because the randomization is based on an XOR, you have to use a value X which XOR-ed with a hash value results in the
same hash value.
\endhint
\hint
10001 XOR 0 = 10001
\endhint
\endassignment
\solution
\include doc/tutorial/06_minimisers/minimisers_solution2.cpp
\endsolution
### Ignoring the backward strand
So far, we have always considered the forward and the backward strand of a sequence, but there might be cases where
the backward strand should not be considered. If this is the desired behaviour, then the `seqan3::views::minimiser` needs
to be used. Unlike the `seqan3::views::minimiser_hash`, `seqan3::views::minimiser` does not hash the values for you, so
you have to do this yourself. But fear not, `seqan3::views::kmer_hash` makes this really easy for you!
\snippet doc/tutorial/06_minimisers/minimisers_snippets.cpp minimiser
This syntax will result in minimisers with k-mer size 4 and a window-length of 8 (`5 + 4 - 1`). (So, to determine the
right window size you always have to determine the window size you would like to use in `seqan3::views::minimiser_hash`
and subtract it by the k-mer size and add 1, here: `8 - 4 + 1 = 5`.)
Got it? Then let's make a simple test, what is the actual window size of the following:
```cpp
auto minimisers = text | seqan3::views::kmer_hash(seqan3::ungapped{4}) | seqan3::views::minimiser(1);
```
\hint
`1 + 4 - 1 = 4`. So, k-mer size and window size are equal.
\endhint
Wait a second, what does it mean, if k-mer size and window size are equal and we are not considering the backward
strand?
\hint
`seqan3::views::kmer_hash(seqan3::ungapped{4}) | seqan3::views::minimiser(1)` will return the same values as
`seqan3::views::kmer_hash` because if k-mer size and window size are equal, there is only one comparison to determine
the minimal k-mer, the comparison between forward and backward strand. However, if the backward strand is not
considered, the minimisers are all k-mers from the given sequence.
\endhint
To ensure that this is the desired behaviour, using `seqan3::views::minimiser(1)` is prohibited. Please use
`seqan3::views::kmer_hash` instead.
Last but not least, `seqan3::views::kmer_hash` and `seqan3::views::minimiser` do not have a seed parameter. So, in order
to obtain a random ordering, you have to XOR the view yourself. This can be done with the following command:
\snippet doc/tutorial/06_minimisers/minimisers_snippets.cpp minimiser_seed
\assignment{Assignment 3: Fun with minimisers III}
Task: Repeat assignment 2, but this time do not consider the backward strand.
\endassignment
\solution
\include doc/tutorial/06_minimisers/minimisers_solution3.cpp
\endsolution
|