## Mapsembler 2 and Phaser pipeline example

#### Build instructions

First, run the build script `./compile_all_tools.sh`.  
This requires an Internet connection to download `nodeWebkit`, the software used to turn the web visualizer tool (Graph Sequence Viewer) into a desktop app. Without an Internet connection, Graph Sequence Viewer will not be built.

#### Run pipeline basic example

Next, run the launch script `./run_mapsembler_and_phaser.sh -t 3` without any input files or other options.  
The script runs the pipeline on the starter and read files located in the `sample_example` directory, with some options adjusted for this example. The options used are:
* `-k 31`: size of kmers (default value)
* `-c 5`: minimal coverage
* `-d 1`: estimated number of errors per read (default value)
* `-g 10000000`: estimated genome size (default value)
* `-f 1`: breadth search mode (default value)
* `-x 40`: node length limit (default value)
* `-y 10000`: graph depth limit (default value)

So running `./run_mapsembler_and_phaser.sh -t 3` is the same as running:  
`./run_mapsembler_and_phaser.sh -s sample_example/fragments.fa -r sample_example/reads.fa -t 3 -k 31 -c 5 -g 10000000 -x 40 -y 10000 -f 1`  
or  
`./run_mapsembler_and_phaser.sh -s sample_example/fragments.fa -r sample_example/reads.fa -t 3`
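The equivalence above can be sketched as a dry run that prints the fully explicit command line instead of executing it. The helper name `pipeline_cmd` is hypothetical and not part of Mapsembler 2; the flags and values are the ones listed above for the `sample_example` data.

```shell
#!/bin/sh
# Dry-run sketch: print the explicit command line for the basic example.
# pipeline_cmd is a hypothetical helper, not part of the pipeline itself.
# $1: number of threads passed to -t.
pipeline_cmd() {
    threads="$1"
    printf '%s' "./run_mapsembler_and_phaser.sh"
    printf ' -s %s -r %s' "sample_example/fragments.fa" "sample_example/reads.fa"
    printf ' -t %s -k 31 -c 5 -g 10000000 -x 40 -y 10000 -f 1\n' "$threads"
}

# Print the command that `./run_mapsembler_and_phaser.sh -t 3` stands for.
pipeline_cmd 3
```

Printing the command first makes it easy to review or adjust a single option before launching the real run.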

#### Run pipeline paper example

In the paper, the files used are `starter.fa` as the starter and `coliI1_reads.fasta` and `coliI2_reads.fasta` as two sets of reads.
To obtain the same result, the pipeline must be run with these options:
* `-k 31`: size of kmers (default value)
* `-c 5`: minimal coverage
* `-d 1`: estimated number of errors per read (default value)
* `-g 4639675`: estimated genome size
* `-f 1`: breadth search mode (default value)
* `-x 20`: node length limit
* `-y 10000`: graph depth limit  (default value)

To do so, run the command:  
`sh ./run_mapsembler_and_phaser.sh -s starters.fa -r "coliI1_reads.fasta coliI2_reads.fasta" -t 3 -p test_double_snp -g 4639675 -c 5 -x 20`  
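In the command above, the two read files are passed to `-r` inside one pair of double quotes. The sketch below illustrates what the quotes change at the shell level; `count_args` is a hypothetical helper used only for the demonstration, and how the pipeline script itself parses the `-r` value is not shown here.

```shell
#!/bin/sh
# count_args is a hypothetical helper that reports how many arguments it gets.
count_args() {
    echo "$#"
}

reads="coliI1_reads.fasta coliI2_reads.fasta"

# Quoted: both file names travel together as ONE argument after -r.
count_args -r "$reads"    # prints 2

# Unquoted: the shell splits on the space, so -r is followed by TWO arguments.
count_args -r $reads      # prints 3
```

Keeping the quotes therefore matches the form used in the command above, where the whole list of read files is a single value for `-r`.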

The `-x` option is kept small to limit the number of nodes, so that the result can be visualized with Graph Sequence Viewer (GSV). With many nodes, a graph takes a long time to load in GSV.

#### Result and visualization

As a result, the pipeline generates three JSON files:
* `res_k_31_q_25_c_2_t_3.json`: original graphs generated by Mapsembler 2.
* `res_k_31_q_25_c_2_t_3_modified.json`: graphs modified by the phaser (edges and nodes that do not map to the provided reads are removed).
* `res_k_31_q_25_c_2_t_3_modified_and_covered.json`: graphs modified by the phaser, with average coverage added.
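A quick way to confirm that a run produced the three files listed above is sketched below. The helper `check_results` is hypothetical, and the directory in which the pipeline writes its results is an assumption (the current directory is used in the example call); only the file names come from the list above.

```shell
#!/bin/sh
# check_results is a hypothetical helper, not part of the pipeline.
# $1: directory assumed to hold the results, $2: common file name prefix.
# Returns the number of missing (or empty) files, so 0 means all three exist.
check_results() {
    dir="$1"
    prefix="$2"
    missing=0
    for suffix in "" "_modified" "_modified_and_covered"; do
        file="${dir}/${prefix}${suffix}.json"
        if [ -s "$file" ]; then
            echo "found:   $file"
        else
            echo "missing: $file"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example call; "." is an assumption about where the results are written.
check_results . res_k_31_q_25_c_2_t_3 || echo "some result files are missing"
```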


These files represent several graphs that can be visualized with Graph Sequence Viewer.
To run Graph Sequence Viewer, double-click the GSVDesktop shortcut or run the command `./GSVDesktop`.
Once the visualizer has launched, click `Select file` and select one of these files.  
In these files, each graph is associated with one starter and one substarter. So, after the file has been loaded, select one starter and one substarter to load one of the graphs contained in the JSON file.