File: manual.tex

package info (click to toggle)
minia 3.2.1%2Bgit20200522.4960a99-1
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 3,012 kB
  • sloc: cpp: 1,376; sh: 397; python: 208; makefile: 8
file content (171 lines) | stat: -rw-r--r-- 9,147 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
\documentclass[a4paper]{article}
\usepackage{fancyvrb}
\usepackage{pdfpages}
\usepackage{hyperref}
\usepackage{url}
\begin{document}

\newcommand\vitem[1][]{\SaveVerb[% to use verb in description
    aftersave={\item[\textnormal{\UseVerb[#1]{vsave}}]}]{vsave}}

\title{\Huge \texttt{Minia-GATB} --- Short manual}

\author{R. Chikhi \& G. Rizk \& E. Drezen \& D. Lavenier\\
        {\small{rayan.chikhi@ens-cachan.org}}}
\maketitle

\begin{abstract}
\noindent {\normalsize Minia is a software for ultra-low memory DNA sequence assembly. It takes as input a set of short genomic sequences (typically, data produced by the Illumina DNA sequencer). Its output is a set of contigs (assembled sequences), forming an approximation of the expected genome. Minia is based on a succinct representation of the de Bruijn graph. The computational resources required to run Minia are significantly lower than that of other assemblers.}
\end{abstract}

\tableofcontents

\section{Forewords}

Minia 3 is a completely new version that no longer uses Bloom filters.
In terms of goals, not much has changed: Minia remains a contigs assembler using very low memory. Notable changes since Minia 1/2 are:
\begin{itemize}
    \item Faster $k$-mer counting step
    \item Different graph construction step (uses BCALM software to construct unitigs)
    \item Different graph representation (not a Bloom filter, but an indexed list of unitigs)
    \item SPAdes-style graph simplifications
\end{itemize}

\section{Installation}

See Github page.

\section{Parameters}

The basic usage is:\\

\begin{verbatim}
    ./minia -in [input file] -kmer-size [kmer size] -out [prefix]
\end{verbatim}

Not much has changed since Minia 2 but we recommend that you still read the next section for an updated explanation of parameters.

An example command line is:\\


\verb+./minia -in reads.fastq -kmer-size 31 -out minia_assembly_k31+\\

The main parameters are:

\begin{enumerate}

    \item \verb+in+ -- the input file(s) (see Section 5 for inputting multiple files)

    \item \verb+kmer-size+  -- k-mer length (integer)

\item \verb+prefix+ -- any prefix string to store output contigs as well as temporary files for this assembly

\end{enumerate}

\section{Explanation of parameters}
\begin{description}

\vitem+kmer-size+
        The $k$-mer length is the length of the nodes in the de Bruijn graph. It strongly depends on the input dataset. For proper assembly, we recommend that you use the Minia-pipeline (\url{https://github.com/GATB/gatb-minia-pipeline}) that runs Minia multiple times, using an iterative multi-k algorithm (similar to IDBA, Megahit). That way, you won't need to choose k. If you insist on running with a single k value, the KmerGenie software can automatically find the best $k$ for your dataset.

\vitem+abundance-min+
The \verb+abundance-min+ is a hard cut-off to remove likely erroneous, low-abundance $k$-mers. In Minia 3, it is recommended to set it to 2 unless you have a good reason not to. Minia 3 also implements other ways to remove erroneous k-mers, using a more adequate relative cut-off, in the graph simplifications step.

Setting \emph{abundance-min} to $1$ is not recommended, as too many erroneous $k$-mer will be kept, which will likely result in a very large memory usage. If the dataset has high coverage, it may be worth it to try larger values. 

\vitem+prefix+
The \verb+prefix+ parameter is any arbitrary output files name, for example, \verb+test_assembly+.

\end{description}

These are the main two parameters that control the quality of assembly. Minia command line has a few more parameters but they are either for advanced usage, or just minor tweaks. Some are described in the rest of this manual.


\section{Input}

\begin{description}
\item \emph{Larger $k$-mer lengths}

Minia supports arbitrary large $k$-mer lengths. To compile Minia from the source, to support $k$-mer lengths up to, say, 320, type this in the build folder:
\begin{verbatim}
rm -Rf CMake*
cmake -DKSIZE_LIST="32   64   96  128  160  192  224  256  320" ..
make -j 4
\end{verbatim}

The list of kmers should only contain multiples of 32. Intermediate values are used to create optimized code for k values that are shorter than the max. The last k value specifies the highest kmer size that can effectively be used in this compiled version of Minia. Apart from that, whatever k values are in this list do not affect the assembly quality in any way.

\item \emph{FASTA/FASTQ}

Minia assembles any type of Illumina reads, given in the FASTA or FASTQ format. Giving paired or mate-pairs reads as input is OK, but keep in mind that Minia won't use pairing information.
\item \emph{Multipe Files}

 Minia can assemble multiple input files. Just create a text file containing the list of read files, one file name per line, and pass this list as the first parameter of Minia (instead of a FASTA/FASTQ file). Therefore the parameter \verb+input_file+ can be either (i) the read file itself (FASTA/FASTQ/compressed), or (ii) a file containing a list of file names.

\item \emph{Line format}

 In FASTA files, each read can be split into multiple lines, whereas in FASTQ, each read sequence must be in a single line.

\item \emph{Gzip compression}

Minia can direclty read files compressed with gzip. Compressed files should end with '.gz'. Input files of different types can be mixed (i.e. gzipped or not, in FASTA or FASTQ)

\end{description}

\section{Output}

\subsection{Fasta}

The main output of Minia is a set of contigs in the FASTA format, in the file \verb+[prefix].contigs.fa+. 

\subsection{Assembly graph}

The contigs FASTA file can be converted to GFA using \href{https://github.com/GATB/bcalm/blob/master/scripts/convertToGFA.py}{BCALM's convertToGFA.py script}.

\subsection*{Unitigs}

Minia will also output unitigs, in the FASTA format. They correspond to non-branching paths in the de Bruijn graph prior to any graph simplification. File: \verb+[prefix].unitigs.fa+. 

\section{Selection of graph simplification steps}

By default, Minia performs 3 types of graph simplifications: tip removal, bulges removal, and erroneous connections removal. Those are similar to SPAdes. You can selectively ask to not perform any (or all) of those operation through those command line arguments: \verb+-no-tip-removal+, \verb+-no-bulge-removal+, \verb+-no-ec-removal+.

\section{Fine-tuning of graph simplification steps}

Most of the graph simplification steps are highly inspired by the SPAdes assembler.

A regular user needs not modify any of the graph simplifications parameters.
Yet, for advanced users, it is possible to do so in order to increase or reduce the aggresiveness of variant/error removal.

Tip removal implements two modes: "short tips" removal and "longer tips" removal. Short tips are removed no matter what their coverage is. By default, a tip is considered short if it is of length $\leq 3.5k$. Longer tips are of length $\leq 10k$. They are removed if their coverage is smaller than the average coverage of the neighbor unitigs, multiplied by a constant factor.

To make tip clipping more conservative, I would recommend specifying a high \verb+-tip-rctc-cutoff+ cutoff, such as \verb+-tip-rctc-cutoff 20+. This will make sure that the coverage of a tip is at least 20 times smaller than the coverage of its neighbors.

Explanation all the other simplifications steps would make this manual a bit longer. For now you can try to guess using slide 11 of \url{http://cristal.univ-lille.fr/~chikhi/pdf/2017-august-3-biata.pdf}. Some information can also be found in the SPAdes paper, where the term \emph{bulge} is defined.

\section{Memory usage}

We estimate that the memory usage of Minia is in the order of $1$ GB of RAM per gigabases in the target genome to assemble. It is independent of the coverage of the input dataset, provided that the \verb!abundance-min! parameter is not too low.

\section{Disk usage}

Minia writes large temporary files during the k-mer counting phase. These files are written in the working directory when you launched Minia. For better performance, run Minia on a local hard drive, SSD, or (very large) ram disk, around 2-3x time the space of the input.

\section{Reproducibility}

Default Minia parameters do not ensure a deterministic execution: due to e.g. multi-threading, the results may change a little bit between executions (but should still give more or less the same assembly).
To make sure that results are perfectly identical between runs on the same data and the same parameters, run with these options:
\verb!-nb-cores 1 -nb-glue-partitions 200!. Also make sure that the same compiler version was used.

\section{Influences}

This version of Minia could not exist without inspiration taken from these great pieces of software and methods:
\begin{itemize}
    \item khmer \url{https://github.com/dib-lab/khmer}
    \item KMC2 \url{https://github.com/marekkokot/KMC}
    \item SPAdes \url{http://bioinf.spbau.ru/spades}
    \item MSP \url{http://www.vldb.org/pvldb/vol6/p169-li.pdf}
    \item Meraculous \url{https://jgi.doe.gov/data-and-tools/meraculous/}
\end{itemize}
\end{document}