# Mapsembler2 and Phaser pipeline

With this pipeline, Mapsembler2 takes as input a set of raw NGS reads (fasta or fastq, gzipped or not) and a set of input sequences (starters). It first determines whether each starter is read-coherent, i.e. whether the reads confirm the presence of the starter in the original sequence. Then, for each read-coherent starter, Mapsembler2 outputs its sequence neighborhood as a linear sequence or as a graph, depending on the user's choice. After that, the Phaser (KissreadsGraph) maps the provided reads on the Mapsembler2 output graph, removes low-coverage edges and nodes, and then phases simple non-branching paths.
The aim is to obtain assembly sequences with good coverage in the form of a graph.
Finally, the output graph of this pipeline can be visualized with Graph Sequences Viewer (GSV).


## ANR Colib'Read

You can find more details on Mapsembler and the other software of the ANR Colib'Read at https://colibread.inria.fr/en.


## Build instructions

Run the build script with `./compile_all_tools.sh`, or build each tool separately with the command `make` in the directory of each tool (Mapsembler, Minia and KissreadGraph) and with the script `./buildGSV_desktop.sh` for Graph Sequences Viewer. The script `./buildGSV_desktop.sh` requires an Internet connection to download `nodeWebkit`, the software used to turn the web app into a desktop app.


## Run full pipeline

The full pipeline can be run with the script: `./run_mapsembler_and_phaser.sh <options>`.

***Options available:***
*   `-s file`: Input file containing starters (fasta). 
        Example: -s data_sample/starters.fasta
*   `-r list of reads`: List of reads separated by space, surrounded by the '"' character. Note that reads may be in fasta or fastq format, gzipped or not. 
        Example: -r "data_sample/reads_sequence1.fasta data_sample/reads_sequence2.fasta.gz"
*   `-t kind of assembly`:  
    `1`: a strict sequence: any branching stops the extension (unitig).  
    `2`: a consensus sequence: contiging approach (contig).  
    `3`: a strict graph: any branching is conserved in the graph (unitig).  
    `4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).  
        Example: -t 3
*   `-E`: do not check the read coherence, simply extend starters.
*   `-p prefix`: All output files will start with this prefix. 
		Example: -p resultat_
*   `-k` value. Set the length of the k-mers used. Must be compatible with the value set at compilation. 
        Example: -k 31
*   `-c` value. Set the minimal coverage: Used by Phaser (don't use kmers with lower coverage). 
		Example: -c 5
*   `-d` value. Set the number of authorized substitutions used while mapping reads on finding SNPs.
		Example: -d 1
*   `-g` value. Estimated genome size. Used only to control memory usage. For example, 3 billion (3000000000) uses 4 GB of RAM. 
		Example: -g 10000000
*   `-f` value. Set the search strategy in the graph (1 = breadth-first, 2 = depth-first).
		Example: -f 1

*   `-x` value. Set the maximal length of nodes. 
		Example: -x 40
		 
*   `-y` value. Set the maximal depth of the graph. 
		Example: -y 10000
*   `-h` Prints this message and exits.

***By default the pipeline uses:***
* `-k 31`
* `-c 4`
* `-d 1`
* `-g 10000000`
* `-f 1`
* `-x 40`
* `-y 10000`  

By default `-x` is set to a small value (40) so that the graph can be visualized in Graph Sequences Viewer: with many nodes, a graph takes a long time to load in GSV. To obtain files with larger graphs, increase this value. To be sure to get full graphs, set it to 1000000.

***Mandatory options are:***
* `-s` and `-r`: needed to define input files.
* `-t type` : needed to define type of output file.
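
As an illustration, the options above could be combined as follows. This is a minimal sketch, not part of the distributed scripts: the `run` helper and the `DRY_RUN` variable are hypothetical, and the file names are the sample paths from the option list. By default the helper only prints the command; setting `DRY_RUN=0` would execute it.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mandatory options: -s (starters), -r (quoted list of reads), -t (assembly type).
# -t 3 asks for a strict graph, one of the two types GSV can display.
# -p, -k and -c are optional: -k 31 repeats the default, -c 5 raises the minimal coverage.
run ./run_mapsembler_and_phaser.sh \
    -s data_sample/starters.fasta \
    -r "data_sample/reads_sequence1.fasta data_sample/reads_sequence2.fasta.gz" \
    -t 3 -p resultat_ -k 31 -c 5
```

The printed line drops the quotes around the `-r` list. They are preserved when the command is actually executed, so the list of reads still reaches the script as a single argument.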

## Run Graph Sequences Viewer (GSV)

GSV visualizes the output files (graph files only: `-t 3` or `-t 4` options).
It can be started with a double-click on the GSVDesktop shortcut or with the command `./GSVDesktop`.

This app is a web app, so it is possible to put the source files on a web server. The sources are in the `./visu/GSV/` directory.
This directory contains an `index.html` file, so an alternative to the GSVDesktop app is to open this file with a browser. The browser then loads the files as local files.
For some browsers, access to local files must be allowed for this web app to work properly:

###### Safari
Enable the Develop menu in the preferences panel, under Advanced -> "Show Develop menu in menu bar".

Then, from the Safari "Develop" menu, select "Disable local file restrictions". Safari also has some odd behaviour with caches, so if you are editing and debugging with Safari it is advisable to use the "Disable caches" option in the same menu.

###### Chrome
Close all running Chrome instances first. Then start the Chrome executable with this command-line flag:

```
chrome --allow-file-access-from-files
```

###### Firefox
1. Go to `about:config`
2. Find `security.fileuri.strict_origin_policy` parameter
3. Set it to `false`

For more information on how to use GSV, see the `doc_GSV.pdf` file in the `docs` directory.

## Run Mapsembler2 alone

Mapsembler2 is a targeted assembly software. It can be run alone with the command: `./mapsembler <starters.fasta> <reads.fasta> <option>`.  

***Options available:*** 
*   `-E` Extend only: avoid the mapping+substarter generation phase.
*   `-t extension_type`:  
    `1`: a strict sequence: any branching stops the extension (unitig).  
    `2`: a consensus sequence: contiging approach (contig).  
    `3`: a strict graph: any branching is conserved in the graph (unitig).  
    `4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).  
		Example: -t 1
*   `-q size_seed`: will use seeds of length size_seed during the mapping process. The value should be in [5-31].
		Example: -q 25
*   `-k size_kmers`: Size of the k-mers used during the extension phase. The accepted range depends on the compilation (`make k=42` for instance).
		Example: -k 31 
*   `-c min_coverage`: a sequence is covered by at least min_coverage coherent reads.
		Example: -c 2 
*   `-d authorized_distance`: a substarter is distant by at most authorized_distance substitutions.
		Example: -d 2 
*   `-e error_threshold`: a nucleotide is corrected if occurs less than error_threshold times.
		Example: -e 2 
*   `-g estimated_genome_size`: estimated size of the genome the reads come from. It is given in bp, does not need to be accurate, and only controls memory usage.
		Example: -g 30000000
*   `-f` value. Set the search strategy in the graph (1 = breadth-first, 2 = depth-first).
		Example: -f 1

*   `-x` value. Set the maximal length of nodes. 
		Example: -x 1000000
		 
*   `-y` value. Set the maximal depth of the graph. 
		Example: -y 10000
*   `-i index_name`: stores the index files in files starting with this prefix name. Can be re-used later. Default: `index`. If the file `index_name.bloom` exists, the index is not re-created.
		Example: -i "index" 
*   `-o file_name_prefix`: where to write outputs.
		Example: -o "res_mapsembler"
*   `-m file_name`: write in file "file_name" the reads mapped on starters.
*   `-h` prints this message and exits.

***By default Mapsembler2 uses:*** 
* `-t 1`
* `-q 25`
* `-k 31`
* `-c 2`
* `-d 2`
* `-e 2`
* `-g 30000000`
* `-f 1`
* `-x 1000000`
* `-y 10000`
* `-i "index"`
* `-o "res_mapsembler"`


***Mandatory options are:***
* `-t` : needed to define type of output file. 
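
A call that spells out the documented defaults could be prepared as below. This is only a sketch: the `check_range` helper and the starters and reads paths are assumptions made for illustration, and the script prints the command rather than running it. The check reflects the documented constraints that `-t` is one of 1 to 4 and that `-q` lies in [5-31].

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is an integer in [$2, $3].
check_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric values
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

extension_type=3   # -t is mandatory: 1, 2 (sequences) or 3, 4 (graphs)
seed_size=25       # -q must lie in [5-31]

if check_range "$extension_type" 1 4 && check_range "$seed_size" 5 31; then
    # Paths are placeholders; the remaining values repeat the documented defaults.
    echo ./mapsembler starters.fasta reads.fasta \
        -t "$extension_type" -q "$seed_size" -k 31 -c 2 -d 2 -e 2 \
        -g 30000000 -i index -o res_mapsembler
else
    echo "invalid -t or -q value" >&2
fi
```

Removing the `echo` would run the command from the directory that contains the `mapsembler` binary.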

## Run Phaser (KissreadsGraph) alone

Phaser (KissreadsGraph) maps the provided reads on a graph, for example the output graph files of Mapsembler2. It can be run alone with the command: `./phaser <input_graph> <readsC1.fasta/fastq[.gz]> [<readsC2.fasta/fastq[.gz]> [<readsC3.fasta/fastq[.gz]> ...]] <option>`.  

***Options available :*** 
*   `-M` the input is considered as a Mapsembler output, thus composed of multiple independent graphs.
*   `-t type`:   
    * `"coverage"` or `"c"`: outputs an equivalent graph, removing uncovered edges and adding:
        - for each node: the coverage per sample and per position
        - for each edge: the number of mapped reads per sample using this edge
    * `"modify"` or `"m"`: outputs the same graph, simplified by:
        - removing low-coverage edges and nodes (less than min_coverage)
        - then phasing simple non-branching paths
*   `-o file_name`: write obtained graph. Default: standard output.	
        Example: -o "res_phaser"
*   `-k size_seed`: will use seeds of length size_seed.
	Example: -k 25
*   `-c min_coverage`: Will consider an edge as coherent if coverage (number of reads from all sets using this edge) is at least min_coverage. 
        Example: -c 2
*	`-d max_substitutions`: Will map a read on the graph with at most max_substitutions substitutions.
        Example: -d 1

***By default Phaser uses:*** 
* `-o standard output` 
* `-k 25`
* `-c 2`
* `-d 1`

***Mandatory options are:***
* `-t type` : needed to define type of output file.
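
For instance, the two output types might be requested as follows on a graph produced by Mapsembler2. This is a sketch under assumptions: `graph_file` and the reads paths are placeholders (the name of the Mapsembler2 graph file is not specified here), `phaser_cmd` is a hypothetical helper, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Placeholder inputs: replace with a real graph file and real read sets.
graph_file="graph_from_mapsembler"
reads="readsC1.fasta readsC2.fastq.gz"

# Hypothetical helper: print a Phaser command for output type $1, written to file $2.
# -M marks the input as a Mapsembler output (multiple independent graphs).
# -k, -c and -d repeat the documented defaults.
phaser_cmd() {
    echo "./phaser $graph_file $reads -M -t $1 -o $2 -k 25 -c 2 -d 1"
}

# "coverage": annotate nodes and edges with per-sample coverage.
phaser_cmd coverage res_phaser_coverage
# "modify": drop low-coverage edges and nodes, then phase simple non-branching paths.
phaser_cmd modify res_phaser_modify
```

Without `-o`, Phaser writes the resulting graph to standard output, so the result could also be redirected to a file.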


### License
Copyright INRIA - CeCILL License