File: README.md

package info (click to toggle)
mapsembler2 2.2.4%2Bdfsg1-3
  • links: PTS, VCS
  • area: main
  • in suites: bullseye
  • size: 7,288 kB
  • sloc: cpp: 51,204; ansic: 13,165; sh: 542; makefile: 394; asm: 271; python: 28
file content (338 lines) | stat: -rw-r--r-- 14,682 bytes parent folder | download | duplicates (8)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
# Maspsembler 2 pipeline

> Mapsembler 2 extends a DNA sequence of interest given a set of reads, by computing a targeted micro assembly. 
This package contains 4 tools, Mapsembler2_extremities, Mapsembler2_extend, kissreads and kissreadsGraph, that are combined into a pipeline.

## Quick start:
Use the script *run_mapsembler2_pipeline.sh* which pipelines the needed tools.

###Input

* one or several starter sequence(s) (e.g. contig(s), or conserved sequence(s) from a related organism, or anything you want)  
* one or several  NGS read set(s)  

###Output
Depending on the user choice:

 * The assemblies are presented as sequences: the output are classical fasta files. 
 
 or
 * The assemblies are presented as assembly graphs: the output are json graph files. In this case it can be visualized using an internet browser opening the page `visu/GSV/index.html`

###An important subtlety: 

> Each input starter is extended using its two extreme *k*-mers. In case we had use only these two *k*-mers to perform the extensions, it would have been quite simple. However, as we use all *k*-mers distant by at most one error (substitution or indel) from the original extreme *k*-mers, there can be more than one extreme left *k*-mer and one extreme right *k*-mer.  

>The visualization tool thus proposes to select a couple of extreme *k*-mers while opening the visualization of a graph. 

## My first (toy) example 
### Compile the tools:
> `./compile_all_tools.sh`

### Run pipeline basic example 

Run the script 
> `./run_mapsembler2_pipeline.sh -s sample_example/starter.fa -r "sample_example/reads1.fa.gz sample_example/reads2.fa.gz" -t 3 -p res` 

### Result and visualization

The pipeline generates tree json file, and in particular:

> `res_k_31_c_5_t_3_modified_and_covered.json`


This graph can be visualized with Graph of Sequence Viewer using an internet browser opening the page `visu/GSV/index.html` 

When the visualizer tool has been launched, click on `Select file`, select this file to visualize the graph.


### Same example generating a fasta file:
Run the script (same than before with -t 1 option instead of -t 3)
> `./run_mapsembler2_pipeline.sh -s sample_example/starter.fa -r "sample_example/reads1.fa.gz sample_example/reads2.fa.gz" -t 1 -p res` 

The pipeline generates the following file
> `res_coherent_k_31_c_5_t_1.fasta`

## Mapsembler2 web page

[http://colibread.inria.fr/tools/mapsembler2](http://colibread.inria.fr/tools/mapsembler2 "Mapsembler2 web page") 


## Build instructions

Run building script with `./compile_all_tools.sh` or can build each tool separately with the command `make` in directory of each tool (mapsembler2_extremitires, mapsembler2_extend, kissreads and kissreads_graph).  

If you wish to use the stand alone Graph Sequences Viewer in addition of the others tools, run `./compile_all_tools.sh -D`, or execute the script `./buildGSV.sh`. In the two case an Internet connection is required to download the software `nodeWebkit` uses to transform web app in a desktop app. Remember that you can also visualize graphs from any (decent) internet browser.


## Run full pipeline

The full pipeline can be run with the script: `./run_mapsembler_and_phaser.sh <options>`.

***Mandatory:***

*   `-s file`: Input file containing starters (fasta). 
        Example: -s data_sample/starters.fasta
*   `-r list of reads`: List of reads separated by space, surrounded by the '"' character. Note that reads may be in fasta or fastq format, gzipped or not. 
        Example: -r "data_sample/reads_sequence1.fasta   data_sample\reads_sequence2.fasta.gz"
*   `-t kind of assembly`:  
    `1`: a strict sequence: any branching stops the extension (unitig).  
    `2`: a consensus sequence: contiging approach (contig).  
    `3`: a strict graph: any branching is conserved in the graph (unitig).  
    `4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).  
        Example -t 3
		
***Options:***
		
*   `-p prefix`: All output files will start with this prefix. 
		Example: -p resultat_
*   `-k` value. Set the length of used kmers. Must fit the compiled value. 
		Example -k 31
*   `-c` value. Set the minimal coverage: Used by Phaser (don't use kmers with lower coverage). 
		Example: -c 5
*   `-d` value. Set the number of authorized substitutions used while mapping reads on finding SNPs (kissreads and kissreads_graph).
		Example: -d 1
*   `-g` value. Estimated genome size. Used only to control memory usage. e.g. 3 billion (3000000000) uses 4Gb of RAM. 
		Example: -g 10000000
*   `-f` value. Set the process of search in the graph (1=Breadth  and 2=Depth).
		Example: -f 1

*   `-x` value. Set the maximal length of nodes. 
		Example: -x 40
		 
*   `-y` value. Set the maximal depth of the graph. 
		Example: -y 10000
*   `-h` Prints this message and exist

**Default:**

* `-k 31`
* `-c 4`
* `-d 1`
* `-g 10000000`
* `-f 1`
* `-x 40`
* `-y 10000`  

**Note:** By default `-x` is fixed with small value (40) to allow he visualization in the Graph Sequences Viewer. With lots of nodes,  the graph visualization is difficult. 



## Run Graph of Sequences Viewer (GSV)

GSV allows to visualize output files (graph files only: `-t 3` or `-t 4` options).
It can be executed with a `double-click` on GSVDesktop shortcut or with the command `./GSVDesktop`.

This app is a web app so it's possible to put source files on web server. The sources are in `./visu/GSV/` directory.
The index.html file in the `./visu/GSV/` directory, opened with a web browser is an alternative to GSVDesktop app. 
For some browser it's necessary to allow access to local files to use properly this web app:

### Safari
Enable the develop menu using the preferences panel, under Advanced -> "Show develop menu in menu bar"

Then from the safari "Develop" menu, select "Disable local file restrictions", it is also worth noting safari has some odd behaviour with caches, so it is advisable to use the "Disable caches" option in the same menu; if you are editing & debugging using safari.

### Chrome
Close all running chrome instances first. Then start Chrome executable with a command line flag:

```
chrome --allow-file-access-from-files
```

### Firefox
1. Go to `about:config`
2. Find `security.fileuri.strict_origin_policy` parameter
3. Set it to `false`

For more information on how to use GSV see doc_GSV.pdf file in docs directory. 


## Contacts
> See more details on : [http://colibread.inria.fr/tools/mapsembler2](http://colibread.inria.fr/tools/mapsembler2 "Mapsembler2 web page")  
> Any further question, contact us: 

> * <pierre.peterlongo@inria.fr>

## License
> Copyright INRIA - A-GPL License 

---

## Running tools alone (expert mode)
Mapsembler2 is composed of a set of 4 tools, pipelined thanks to the *run_mapsembler2_pipeline.sh* script. However, any of the tool can be run alone. This sections explains how to run each of these tools.
### Run Mapsembler2_extremities

This modules performs the first step of Mapsembler2_extend. For each starter, it outputs a list of extremities of the starter (of length *k*) are found in the reads, up to 1 indel or mismatch.

The algorithm goes as follows:

		for each starter;
			for each kmer u at the extremity of this starter
				for each k-mer u' where u' is u with an indel or a subsitution or just u
					see how many times u' is in the reads
					if u' is present more than M times (M=3 by default)
						then output u' as a extremal substarter for this starter

It can be run alone with the command:`./mapsembler2_extremities <options>`.
 
***Options:*** 

* `--k size_kmers`: kmer size that will be used for mapsembler2.
		Example: --k 31
* `--starters input_file_name`: input starters file (fasta).
		Example: --starters starters.fasta
* `--reads input_file_name`: reads dataset file name. Several reads sets can be provided, surrounded by " and separated by a space.
		Example: --reads "reads1.fa reads2.fasta"
* `--output output_file_name`: output substarters file name.
		Example: --output starters_extremities.fasta
* `--min-solid-subkmer`: minimim abundance to keep a subkmer.
* `-debug`: debugging.
* `-nb-cores`: number of cores.
* `-verbose`: verbosity level. 
* `-help`: display help and exit.

***Defaults:*** 

* `--min-solid-subkmer 3`
* `-nb-cores 1`
* `-verbose 1`

***Mandatory:***

* `--k`:  define the kmer size.
* `--starters`: define the input starters file.
* `--reads`: define inputs sets of reads.
* `--output`: define the name of the output file.

### Run Mapsembler2_extend alone

Mapsembler2 is a targeted assembly software that allows extending reference sequences from each side with one or more sets of reads. The outputs can be presented either as a single sequence (fasta) either as an assembly graphs (json).
It can be run alone with the following command: `./mapsembler2_extend <starters_extremities.fasta> <reads.fasta> <option> `.  

***Options:*** 

*   `-t extension_type`:  
    `1`: a strict sequence: any branching stops the extension (unitig).  
    `2`: a consensus sequence: contiging approach (contig).  
    `3`: a strict graph: any branching is conserved in the graph (unitig).  
    `4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).  
		Example: -t 1
*   `-k size_kmers`: size of the k-mers used during the extension phase. Accepted range, depends on the compilation (make k=42 for instance).
		Example: -k 31 
*   `-c min_coverage`: a sequence is covered by at least min_coverage coherent reads.
		Example: -c 2 
*   `-g estimated_genome_size`: estimation of the size of the genome whose reads come from. It is in bp, does not need to be accurate, only controls memory usage.
		Example: -g 30000000
*   `-f search_process`: set the process of search in the graph (1=Breadth  and 2=Depth).
		Example: -f 1
*   `-x max_length`: set the maximal length of nodes. 
		Example: -x 1000000		 
*   `-y max_depth`: set the maximal depth of the graph. 
		Example: -y 10000
*   `-i index_name`: stores the index files in files starting with this prefix name. Can be re-used latter. Default: `index` IF THE FILE `index_name.bloom` EXISTS: the index is not re-created
		Example: -i "index" 
*   `-o file_name_prefix`: where to write outputs.
		Example: -o "res_mapsembler"
*   `-h` print the help message and exit.

***Defaults:*** 

* `-t 1`
* `-k 31`
* `-c 2`
* `-g 30000000`
* `-f 1`
* `-x 1000000`
* `-y 10000`
* `-i "index"`
* `-o "res_mapsembler_k_31_c_5.fasta"` or `-o "res_mapsembler_k_31_c_5.json"`


***Mandatory:***

* `-t` : needed to define type of output file. 

### Run Kissreads_graph alone

KissreadsGraph maps the provided reads on the graph, for example with output graph files of Mapsembler2. It can be run alone with the command: `./kissreads_graph <input_graph> <readsC1.fasta/fastq[.gz]> [<readsC2.fasta/fastq[.gz]> [<readsC3.fasta/fastq[.gz] ...] <option> `.  

***Options:*** 

*   `-M`: the input is considered as a Mapsembler output, thus composed of multiple independent graphs.
*   `-t type`:   
    * `"coverage"` or `"c"`: outputs an equivalent graph removing uncovered edges and adding:
         - for each node: the coverage per sample and per position
         - for each edge: the number of mapped reads per sample using this edge
    * `"modify"` or `"m"`: outputs the same simplified graph:
	   - removing low covered edges and nodes (less that min_coverage)
	   - then phasing simple non branching paths
*   `-o file_name`: write obtained graph. Default: standard output.	
        Example: -o "res_kissreadsGraph.json"
*   `-k size_seed`: will use seeds of length size_seed.
	Example: -k 25
*   `-c min_coverage`: Will consider an edge as coherent if coverage (number of reads from all sets using this edge) is at least min_coverage. 
        Example: -c 2
*   `-d max_substitutions`: Will map a read on the graph with at most max_substitutions substitutions.
        Example: -d 1

***Default:***  

* `-k 25`
* `-c 2`
* `-d 1`
* `-o input_file_name.json`

***Mandatory:***
* `-t type` : needed to define type of operation to perform

### Run Kissreads alone

Kissreads checks for each sequence contained into a fasta file if it is read coherent (each position is covered by at least "min_coverage" read(s)) with sets of reads. A sequence from the fasta file is treated as follows:
  if a sequence is coherent with at least one read set, the sequence is writting in output file as follows:
		>original fasta comment|C1:min<avg-coverage_avg<max|C2:min<avg-coverage_avg<max|C3...
		>sequence
		>...  
Where : 

* `min` : value of the position having minimal coverage in the reads file. 
* `avg`: average coverage in the reads file. 
* `coverage_avg`: R-squarred corrected average in the reads file.
* `max`: value of the position having maximal coverage in the reads file.

The coverage is the number of reads that perfectly mapped a position.

Kissreads can be run alone with the command: `./kissreads <toCheck.fasta> <readsC1.fasta> [<readsC2.fasta> [<readsC3.fasta] ...] <options>`.   

***Options:*** 

* `-k size_seed`: Will use seeds of length size_seed. 
		Example: 25
* `-c min_coverage`: A sequence is covered by at least min_coverage coherent reads. 
		Example: 2
* `-d max_substitutions`: Maximal number of substitutions authorized between a read and a fragment. Note that no substitution is allowed on the central position while anaylizing the kissnp output. 
		Example: 1
* `-o file_name`: write read-coherent outputs. 
		Example: "res_kissreads_coherent.fasta" 
* `-u file_name`: write unread-coherent outputs. 
		Example: "res_kissreads_uncoherent.fasta"
* `-n`: the input file is a kissnp output (incompatible with -I option).

    in this case: 1/ only the upper characters are considered (no mapping done on the extensions) and 2/ the central position (where the SNP occurs) is strictly mapped, no subsitution is authorized on this position.
* `-I`: the input file is an Intl output (incompatible with -n option). 
* `-i index_stride`: This is a heuristic for limiting the memory footprint. Instead of indexing each kmer of the sequences contained into the toCheck.fasta, kissreads indexes kmers occurring each "index_stride" position. 
		Example = 1
* `-t max_nb_threads`: max number of threads (also limited by number of input files).
		Example = 1
* `-m align_file`: write a file of reads mapped to sequences in file align_file.
* `-f` outputs coherent events in a standard fasta file format.
* `-s` silent mode.
* `-h` print the help message and exit.
* `--version` get the kissReads version and exit.

***Default :***  
* `-k 25`
* `-c 2`
* `-d 1`
* `-i 1` (no heuristic)
* `-t 1`