# Mapsembler2 and Phaser pipeline
With this pipeline, Mapsembler2 takes as input a set of raw NGS reads (fasta or fastq, gzipped or not) and a set of input sequences (starters). It first determines whether each starter is read-coherent, i.e. whether the reads confirm the presence of the starter in the original sequence. Then, for each read-coherent starter, Mapsembler2 outputs its sequence neighborhood as a linear sequence or as a graph, depending on the user's choice. After that, the Phaser (KissreadsGraph) maps the provided reads onto the Mapsembler2 output graph to remove low-coverage edges and nodes, and then phases simple non-branching paths.
The aim is to obtain assembled sequences with good coverage, in the form of a graph.
Finally, the output graph of this pipeline can be visualized with Graph Sequences Viewer (GSV).
## ANR Colib'Read
More details on Mapsembler and the other ANR Colib'Read software are available at https://colibread.inria.fr/en.
## Build instructions
Run the build script with `./compile_all_tools.sh`, or build each tool separately with the command `make` in that tool's directory (Mapsembler, Minia and KissreadGraph) and with the script `./buildGSV_desktop.sh` for Graph Sequences Viewer. The script `./buildGSV_desktop.sh` requires an Internet connection to download `nodeWebkit`, the software used to turn the web app into a desktop app.
## Run full pipeline
The full pipeline can be run with the script: `./run_mapsembler_and_phaser.sh <options>`.
***Options available:***
* `-s file`: Input file containing starters (fasta).
Example: -s data_sample/starters.fasta
* `-r list of reads`: List of reads separated by space, surrounded by the '"' character. Note that reads may be in fasta or fastq format, gzipped or not.
Example: -r "data_sample/reads_sequence1.fasta data_sample/reads_sequence2.fasta.gz"
* `-t kind of assembly`:
`1`: a strict sequence: any branching stops the extension (unitig).
`2`: a consensus sequence: contiging approach (contig).
`3`: a strict graph: any branching is conserved in the graph (unitig).
`4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).
Example: -t 3
* `-E`: do not check the read coherence, simply extend starters.
* `-p prefix`: All output files will start with this prefix.
Example: -p resultat_
* `-k` value. Set the length of used kmers. Must fit the compiled value.
Example: -k 31
* `-c` value. Set the minimal coverage: Used by Phaser (don't use kmers with lower coverage).
Example: -c 5
* `-d` value. Set the number of authorized substitutions used while mapping reads on finding SNPs.
Example: -d 1
* `-g` value. Estimated genome size. Used only to control memory usage. e.g. 3 billion (3000000000) uses 4Gb of RAM.
Example: -g 10000000
* `-f` value. Set the process of search in the graph (1=Breadth and 2=Depth).
Example: -f 1
* `-x` value. Set the maximal length of nodes.
Example: -x 40
* `-y` value. Set the maximal depth of the graph.
Example: -y 10000
* `-h` Prints this message and exits.
***By default the pipeline uses:***
* `-k 31`
* `-c 4`
* `-d 1`
* `-g 10000000`
* `-f 1`
* `-x 40`
* `-y 10000`
By default, `-x` is set to a small value (40) so that the graph can be visualized in Graph Sequences Viewer: with many nodes, a graph takes a long time to load in GSV. To produce files with larger graphs, increase this value. To be sure to get full graphs, set it to 1000000.
***Mandatory options are:***
* `-s` and `-r`: needed to define input files.
* `-t type` : needed to define type of output file.
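
As an illustration of how the mandatory options fit together, here is a minimal dry-run sketch. It only prints a pipeline command built from the sample paths used in the option examples above; the helper name `pipeline_cmd` is hypothetical and not part of the distribution.

```shell
# Dry-run sketch: print the pipeline command instead of executing it.
# pipeline_cmd is a hypothetical helper; the input paths are the sample
# files from the option examples above.
pipeline_cmd() {
    starters="data_sample/starters.fasta"
    reads="data_sample/reads_sequence1.fasta data_sample/reads_sequence2.fasta.gz"
    # -r takes the whole list of read files as ONE quoted argument.
    printf './run_mapsembler_and_phaser.sh -s %s -r "%s" -t 3 -p resultat_\n' \
        "$starters" "$reads"
}
pipeline_cmd
```

The printed line can be copied into a terminal opened in the directory that contains `run_mapsembler_and_phaser.sh`. Options that are not given (`-k`, `-c`, `-d`, ...) keep the default values listed above.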
## Run Graph Sequences Viewer (GSV)
GSV visualizes the output files (graph files only: `-t 3` or `-t 4` options).
It can be launched by double-clicking the GSVDesktop shortcut or with the command `./GSVDesktop`.
This app is a web app, so its source files can also be placed on a web server. The sources are in the `./visu/GSV/` directory.
This directory contains an index.html file, so an alternative to the GSVDesktop app is to open this HTML file in a browser. The browser then loads the files as local files.
Some browsers must be allowed to access local files for this web app to work properly:
###### Safari
Enable the Develop menu in the preferences panel, under Advanced -> "Show Develop menu in menu bar".
Then, from the Safari "Develop" menu, select "Disable local file restrictions". Safari also has some odd caching behaviour, so if you are editing and debugging with Safari, it is advisable to use the "Disable caches" option in the same menu.
###### Chrome
Close all running Chrome instances first. Then start the Chrome executable with a command line flag:
```
chrome --allow-file-access-from-files
```
###### Firefox
1. Go to `about:config`
2. Find `security.fileuri.strict_origin_policy` parameter
3. Set it to `false`
For more information on how to use GSV, see the doc_GSV.pdf file in the docs directory.
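
Since the GSV sources can be placed on a web server, one way to avoid changing browser settings is to serve `./visu/GSV/` over HTTP locally. The sketch below only prints such a command. It assumes `python3` (3.7 or later, for the `--directory` option of `http.server`) is installed, which this project does not require; the helper name `gsv_serve_cmd` and the port 8000 are arbitrary choices.

```shell
# Dry-run sketch: print a command that serves the GSV sources over HTTP.
# Assumptions: python3 >= 3.7 is available; gsv_serve_cmd is a hypothetical
# helper; the default port 8000 is arbitrary.
gsv_serve_cmd() {
    port="${1:-8000}"
    printf 'python3 -m http.server %s --directory ./visu/GSV/\n' "$port"
}
gsv_serve_cmd
```

After running the printed command, the viewer should be reachable in a browser at `http://localhost:8000/index.html`.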
## Run Mapsembler2 alone
Mapsembler2 is targeted assembly software. It can be run alone with the command: `./mapsembler <starters.fasta> <reads.fasta> <options>`.
***Options available:***
* `-E` Extend only: avoid the mapping+substarter generation phase.
* `-t extension_type`:
`1`: a strict sequence: any branching stops the extension (unitig).
`2`: a consensus sequence: contiging approach (contig).
`3`: a strict graph: any branching is conserved in the graph (unitig).
`4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).
Example: -t 1
* `-q size_seed`: will use seeds of length size_seed during the mapping process. The value should be in [5-31].
Example: -q 25
* `-k size_kmers`: Size of the k-mers used during the extension phase. The accepted range depends on the compilation (make k=42 for instance).
Example: -k 31
* `-c min_coverage`: a sequence is covered by at least min_coverage coherent reads.
Example: -c 2
* `-d authorized_distance`: a substarter is distant by at most authorized_distance substitutions.
Example: -d 2
* `-e error_threshold`: a nucleotide is corrected if it occurs fewer than error_threshold times.
Example: -e 2
* `-g estimated_genome_size`: estimated size, in bp, of the genome the reads come from. It does not need to be accurate; it only controls memory usage.
Example: -g 30000000
* `-f` value. Set the process of search in the graph (1=Breadth and 2=Depth).
Example: -f 1
* `-x` value. Set the maximal length of nodes.
Example: -x 1000000
* `-y` value. Set the maximal depth of the graph.
Example: -y 10000
* `-i index_name`: stores the index in files starting with this prefix. They can be re-used later. Default: `index`. If the file `index_name.bloom` exists, the index is not re-created.
Example: -i "index"
* `-o file_name_prefix`: where to write outputs.
Example: -o "res_mapsembler"
* `-m` file_name: write in file "file_name" the reads mapped on starters.
* `-h` prints this message and exits.
***By default Mapsembler2 uses:***
* `-t 1`
* `-q 25`
* `-k 31`
* `-c 2`
* `-d 2`
* `-e 2`
* `-g 30000000`
* `-f 1`
* `-x 1000000`
* `-y 10000`
* `-i "index"`
* `-o "res_mapsembler"`
***Mandatory options are:***
* `-t` : needed to define type of output file.
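
The following dry-run sketch shows one way to put a Mapsembler2 call together. It checks that the extension type is one of the four documented values and prints the command rather than running it. The helper name `mapsembler_cmd` is hypothetical, and the input paths are the sample files used earlier in this README.

```shell
# Dry-run sketch: validate -t and print a Mapsembler2 command.
# mapsembler_cmd is a hypothetical helper, not part of the distribution.
mapsembler_cmd() {
    # $1: starters file, $2: reads file, $3: extension type (1-4)
    case "$3" in
        1|2|3|4) ;;
        *) echo "error: -t must be 1, 2, 3 or 4" >&2; return 1 ;;
    esac
    printf './mapsembler %s %s -t %s -k 31 -c 2 -o res_mapsembler\n' "$1" "$2" "$3"
}
mapsembler_cmd data_sample/starters.fasta data_sample/reads_sequence1.fasta 3
```

Here `-k`, `-c` and `-o` are written out with their default values only to make the call explicit; they can be omitted.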
## Run Phaser (KissreadsGraph) alone
Phaser (KissreadsGraph) maps the provided reads on a graph, for example the output graph files of Mapsembler2. It can be run alone with the command: `./phaser <input_graph> <readsC1.fasta/fastq[.gz]> [<readsC2.fasta/fastq[.gz]> [<readsC3.fasta/fastq[.gz]> ...]] <options>`.
***Options available :***
* `-M` the input is considered as a Mapsembler output, thus composed of multiple independent graphs.
* `-t type`:
* `"coverage"` or `"c"`: outputs an equivalent graph removing uncovered edges and adding:
- for each node: the coverage per sample and per position
- for each edge: the number of mapped reads per sample using this edge
* `"modify"` or `"m"`: outputs the same simplified graph:
- removing low covered edges and nodes (less than min_coverage)
- then phasing simple non branching paths
* `-o file_name`: write obtained graph. Default: standard output.
Example: -o "res_phaser"
* `-k size_seed`: will use seeds of length size_seed.
Example: -k 25
* `-c min_coverage`: Will consider an edge as coherent if coverage (number of reads from all sets using this edge) is at least min_coverage.
Example: -c 2
* `-d max_substitutions`: Will map a read on the graph with at most max_substitutions substitutions.
Example: -d 1
***By default Phaser uses:***
* `-o standard output`
* `-k 25`
* `-c 2`
* `-d 1`
***Mandatory options are:***
* `-t type` : needed to define type of output file.
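
As a sketch of a standalone Phaser call on a Mapsembler2 graph with several read sets, the helper below checks the `-t` mode and prints the command without running it. The helper name `phaser_cmd` and the file names `my_graph_file`, `reads1.fastq.gz` and `reads2.fastq.gz` are placeholders, not names produced by the tools.

```shell
# Dry-run sketch: validate the -t mode and print a Phaser command.
# phaser_cmd and all file names used below are hypothetical placeholders.
phaser_cmd() {
    # $1: mode ("c"/"coverage" or "m"/"modify"), $2: input graph,
    # remaining arguments: one or more read files
    mode="$1"; graph="$2"; shift 2
    case "$mode" in
        c|coverage|m|modify) ;;
        *) echo "error: -t must be coverage (c) or modify (m)" >&2; return 1 ;;
    esac
    # "$*" joins the read files with spaces (paths without spaces assumed).
    printf './phaser %s %s -M -t %s -o res_phaser -c 2 -d 1\n' "$graph" "$*" "$mode"
}
phaser_cmd m my_graph_file reads1.fastq.gz reads2.fastq.gz
```

The `-M` flag is included because the graph is assumed to come from Mapsembler2; drop it for a single-graph input.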
### License
Copyright INRIA - CeCILL License