1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338
|
# Maspsembler 2 pipeline
> Mapsembler 2 extends a DNA sequence of interest given a set of reads, by computing a targeted micro assembly.
This package contains 4 tools, Mapsembler2_extremities, Mapsembler2_extend, kissreads and kissreadsGraph, that are combined into a pipeline.
## Quick start:
Use the script *run_mapsembler2_pipeline.sh* which pipelines the needed tools.
###Input
* one or several starter sequence(s) (e.g. contig(s), or conserved sequence(s) from a related organism, or anything you want)
* one or several NGS read set(s)
###Output
Depending on the user choice:
* The assemblies are presented as sequences: the output are classical fasta files.
or
* The assemblies are presented as assembly graphs: the output are json graph files. In this case it can be visualized using an internet browser opening the page `visu/GSV/index.html`
###An important subtlety:
> Each input starter is extended using its two extreme *k*-mers. In case we had use only these two *k*-mers to perform the extensions, it would have been quite simple. However, as we use all *k*-mers distant by at most one error (substitution or indel) from the original extreme *k*-mers, there can be more than one extreme left *k*-mer and one extreme right *k*-mer.
>The visualization tool thus proposes to select a couple of extreme *k*-mers while opening the visualization of a graph.
## My first (toy) example
### Compile the tools:
> `./compile_all_tools.sh`
### Run pipeline basic example
Run the script
> `./run_mapsembler2_pipeline.sh -s sample_example/starter.fa -r "sample_example/reads1.fa.gz sample_example/reads2.fa.gz" -t 3 -p res`
### Result and visualization
The pipeline generates tree json file, and in particular:
> `res_k_31_c_5_t_3_modified_and_covered.json`
This graph can be visualized with Graph of Sequence Viewer using an internet browser opening the page `visu/GSV/index.html`
When the visualizer tool has been launched, click on `Select file`, select this file to visualize the graph.
### Same example generating a fasta file:
Run the script (same than before with -t 1 option instead of -t 3)
> `./run_mapsembler2_pipeline.sh -s sample_example/starter.fa -r "sample_example/reads1.fa.gz sample_example/reads2.fa.gz" -t 1 -p res`
The pipeline generates the following file
> `res_coherent_k_31_c_5_t_1.fasta`
## Mapsembler2 web page
[http://colibread.inria.fr/tools/mapsembler2](http://colibread.inria.fr/tools/mapsembler2 "Mapsembler2 web page")
## Build instructions
Run building script with `./compile_all_tools.sh` or can build each tool separately with the command `make` in directory of each tool (mapsembler2_extremitires, mapsembler2_extend, kissreads and kissreads_graph).
If you wish to use the stand alone Graph Sequences Viewer in addition of the others tools, run `./compile_all_tools.sh -D`, or execute the script `./buildGSV.sh`. In the two case an Internet connection is required to download the software `nodeWebkit` uses to transform web app in a desktop app. Remember that you can also visualize graphs from any (decent) internet browser.
## Run full pipeline
The full pipeline can be run with the script: `./run_mapsembler_and_phaser.sh <options>`.
***Mandatory:***
* `-s file`: Input file containing starters (fasta).
Example: -s data_sample/starters.fasta
* `-r list of reads`: List of reads separated by space, surrounded by the '"' character. Note that reads may be in fasta or fastq format, gzipped or not.
Example: -r "data_sample/reads_sequence1.fasta data_sample\reads_sequence2.fasta.gz"
* `-t kind of assembly`:
`1`: a strict sequence: any branching stops the extension (unitig).
`2`: a consensus sequence: contiging approach (contig).
`3`: a strict graph: any branching is conserved in the graph (unitig).
`4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).
Example -t 3
***Options:***
* `-p prefix`: All output files will start with this prefix.
Example: -p resultat_
* `-k` value. Set the length of used kmers. Must fit the compiled value.
Example -k 31
* `-c` value. Set the minimal coverage: Used by Phaser (don't use kmers with lower coverage).
Example: -c 5
* `-d` value. Set the number of authorized substitutions used while mapping reads on finding SNPs (kissreads and kissreads_graph).
Example: -d 1
* `-g` value. Estimated genome size. Used only to control memory usage. e.g. 3 billion (3000000000) uses 4Gb of RAM.
Example: -g 10000000
* `-f` value. Set the process of search in the graph (1=Breadth and 2=Depth).
Example: -f 1
* `-x` value. Set the maximal length of nodes.
Example: -x 40
* `-y` value. Set the maximal depth of the graph.
Example: -y 10000
* `-h` Prints this message and exist
**Default:**
* `-k 31`
* `-c 4`
* `-d 1`
* `-g 10000000`
* `-f 1`
* `-x 40`
* `-y 10000`
**Note:** By default `-x` is fixed with small value (40) to allow he visualization in the Graph Sequences Viewer. With lots of nodes, the graph visualization is difficult.
## Run Graph of Sequences Viewer (GSV)
GSV allows to visualize output files (graph files only: `-t 3` or `-t 4` options).
It can be executed with a `double-click` on GSVDesktop shortcut or with the command `./GSVDesktop`.
This app is a web app so it's possible to put source files on web server. The sources are in `./visu/GSV/` directory.
The index.html file in the `./visu/GSV/` directory, opened with a web browser is an alternative to GSVDesktop app.
For some browser it's necessary to allow access to local files to use properly this web app:
### Safari
Enable the develop menu using the preferences panel, under Advanced -> "Show develop menu in menu bar"
Then from the safari "Develop" menu, select "Disable local file restrictions", it is also worth noting safari has some odd behaviour with caches, so it is advisable to use the "Disable caches" option in the same menu; if you are editing & debugging using safari.
### Chrome
Close all running chrome instances first. Then start Chrome executable with a command line flag:
```
chrome --allow-file-access-from-files
```
### Firefox
1. Go to `about:config`
2. Find `security.fileuri.strict_origin_policy` parameter
3. Set it to `false`
For more information on how to use GSV see doc_GSV.pdf file in docs directory.
## Contacts
> See more details on : [http://colibread.inria.fr/tools/mapsembler2](http://colibread.inria.fr/tools/mapsembler2 "Mapsembler2 web page")
> Any further question, contact us:
> * <pierre.peterlongo@inria.fr>
## License
> Copyright INRIA - A-GPL License
---
## Running tools alone (expert mode)
Mapsembler2 is composed of a set of 4 tools, pipelined thanks to the *run_mapsembler2_pipeline.sh* script. However, any of the tool can be run alone. This sections explains how to run each of these tools.
### Run Mapsembler2_extremities
This modules performs the first step of Mapsembler2_extend. For each starter, it outputs a list of extremities of the starter (of length *k*) are found in the reads, up to 1 indel or mismatch.
The algorithm goes as follows:
for each starter;
for each kmer u at the extremity of this starter
for each k-mer u' where u' is u with an indel or a subsitution or just u
see how many times u' is in the reads
if u' is present more than M times (M=3 by default)
then output u' as a extremal substarter for this starter
It can be run alone with the command:`./mapsembler2_extremities <options>`.
***Options:***
* `--k size_kmers`: kmer size that will be used for mapsembler2.
Example: --k 31
* `--starters input_file_name`: input starters file (fasta).
Example: --starters starters.fasta
* `--reads input_file_name`: reads dataset file name. Several reads sets can be provided, surrounded by " and separated by a space.
Example: --reads "reads1.fa reads2.fasta"
* `--output output_file_name`: output substarters file name.
Example: --output starters_extremities.fasta
* `--min-solid-subkmer`: minimim abundance to keep a subkmer.
* `-debug`: debugging.
* `-nb-cores`: number of cores.
* `-verbose`: verbosity level.
* `-help`: display help and exit.
***Defaults:***
* `--min-solid-subkmer 3`
* `-nb-cores 1`
* `-verbose 1`
***Mandatory:***
* `--k`: define the kmer size.
* `--starters`: define the input starters file.
* `--reads`: define inputs sets of reads.
* `--output`: define the name of the output file.
### Run Mapsembler2_extend alone
Mapsembler2 is a targeted assembly software that allows extending reference sequences from each side with one or more sets of reads. The outputs can be presented either as a single sequence (fasta) either as an assembly graphs (json).
It can be run alone with the following command: `./mapsembler2_extend <starters_extremities.fasta> <reads.fasta> <option> `.
***Options:***
* `-t extension_type`:
`1`: a strict sequence: any branching stops the extension (unitig).
`2`: a consensus sequence: contiging approach (contig).
`3`: a strict graph: any branching is conserved in the graph (unitig).
`4`: a consensus graph: "small" polymorphism is merged, but "large" structures are represented (contig).
Example: -t 1
* `-k size_kmers`: size of the k-mers used during the extension phase. Accepted range, depends on the compilation (make k=42 for instance).
Example: -k 31
* `-c min_coverage`: a sequence is covered by at least min_coverage coherent reads.
Example: -c 2
* `-g estimated_genome_size`: estimation of the size of the genome whose reads come from. It is in bp, does not need to be accurate, only controls memory usage.
Example: -g 30000000
* `-f search_process`: set the process of search in the graph (1=Breadth and 2=Depth).
Example: -f 1
* `-x max_length`: set the maximal length of nodes.
Example: -x 1000000
* `-y max_depth`: set the maximal depth of the graph.
Example: -y 10000
* `-i index_name`: stores the index files in files starting with this prefix name. Can be re-used latter. Default: `index` IF THE FILE `index_name.bloom` EXISTS: the index is not re-created
Example: -i "index"
* `-o file_name_prefix`: where to write outputs.
Example: -o "res_mapsembler"
* `-h` print the help message and exit.
***Defaults:***
* `-t 1`
* `-k 31`
* `-c 2`
* `-g 30000000`
* `-f 1`
* `-x 1000000`
* `-y 10000`
* `-i "index"`
* `-o "res_mapsembler_k_31_c_5.fasta"` or `-o "res_mapsembler_k_31_c_5.json"`
***Mandatory:***
* `-t` : needed to define type of output file.
### Run Kissreads_graph alone
KissreadsGraph maps the provided reads on the graph, for example with output graph files of Mapsembler2. It can be run alone with the command: `./kissreads_graph <input_graph> <readsC1.fasta/fastq[.gz]> [<readsC2.fasta/fastq[.gz]> [<readsC3.fasta/fastq[.gz] ...] <option> `.
***Options:***
* `-M`: the input is considered as a Mapsembler output, thus composed of multiple independent graphs.
* `-t type`:
* `"coverage"` or `"c"`: outputs an equivalent graph removing uncovered edges and adding:
- for each node: the coverage per sample and per position
- for each edge: the number of mapped reads per sample using this edge
* `"modify"` or `"m"`: outputs the same simplified graph:
- removing low covered edges and nodes (less that min_coverage)
- then phasing simple non branching paths
* `-o file_name`: write obtained graph. Default: standard output.
Example: -o "res_kissreadsGraph.json"
* `-k size_seed`: will use seeds of length size_seed.
Example: -k 25
* `-c min_coverage`: Will consider an edge as coherent if coverage (number of reads from all sets using this edge) is at least min_coverage.
Example: -c 2
* `-d max_substitutions`: Will map a read on the graph with at most max_substitutions substitutions.
Example: -d 1
***Default:***
* `-k 25`
* `-c 2`
* `-d 1`
* `-o input_file_name.json`
***Mandatory:***
* `-t type` : needed to define type of operation to perform
### Run Kissreads alone
Kissreads checks for each sequence contained into a fasta file if it is read coherent (each position is covered by at least "min_coverage" read(s)) with sets of reads. A sequence from the fasta file is treated as follows:
if a sequence is coherent with at least one read set, the sequence is writting in output file as follows:
>original fasta comment|C1:min<avg-coverage_avg<max|C2:min<avg-coverage_avg<max|C3...
>sequence
>...
Where :
* `min` : value of the position having minimal coverage in the reads file.
* `avg`: average coverage in the reads file.
* `coverage_avg`: R-squarred corrected average in the reads file.
* `max`: value of the position having maximal coverage in the reads file.
The coverage is the number of reads that perfectly mapped a position.
Kissreads can be run alone with the command: `./kissreads <toCheck.fasta> <readsC1.fasta> [<readsC2.fasta> [<readsC3.fasta] ...] <options>`.
***Options:***
* `-k size_seed`: Will use seeds of length size_seed.
Example: 25
* `-c min_coverage`: A sequence is covered by at least min_coverage coherent reads.
Example: 2
* `-d max_substitutions`: Maximal number of substitutions authorized between a read and a fragment. Note that no substitution is allowed on the central position while anaylizing the kissnp output.
Example: 1
* `-o file_name`: write read-coherent outputs.
Example: "res_kissreads_coherent.fasta"
* `-u file_name`: write unread-coherent outputs.
Example: "res_kissreads_uncoherent.fasta"
* `-n`: the input file is a kissnp output (incompatible with -I option).
in this case: 1/ only the upper characters are considered (no mapping done on the extensions) and 2/ the central position (where the SNP occurs) is strictly mapped, no subsitution is authorized on this position.
* `-I`: the input file is an Intl output (incompatible with -n option).
* `-i index_stride`: This is a heuristic for limiting the memory footprint. Instead of indexing each kmer of the sequences contained into the toCheck.fasta, kissreads indexes kmers occurring each "index_stride" position.
Example = 1
* `-t max_nb_threads`: max number of threads (also limited by number of input files).
Example = 1
* `-m align_file`: write a file of reads mapped to sequences in file align_file.
* `-f` outputs coherent events in a standard fasta file format.
* `-s` silent mode.
* `-h` print the help message and exit.
* `--version` get the kissReads version and exit.
***Default :***
* `-k 25`
* `-c 2`
* `-d 1`
* `-i 1` (no heuristic)
* `-t 1`
|