1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458
|
Basic usage
===========
In this manual it is assumed that "Sibelia" is properly installed and the
directory with "Sibelia.py" is in your "PATH" environment variable or input
files are in the same folder with "Sibelia" executable.
Directory "examples/Sibelia" contains two sets of bacterial genomes. The easiest
way to run "Sibelia" is to type:
Sibelia -s loose <input FASTA file(s)>
For example:
Sibelia -s loose Helicobacter_pylori.fasta
Important note -- "Sibelia" requires some free space on HDD to run. If you
experience any problems and see error messages that mention temporary files,
try to change directory used for temporary files (see section "Directory for
temporary files").
Above commands will run "Sibelia" on the file "Helicobacter_pylori.fasta" with
the "loose" simplification parameters. There are two another simplification
parameters sets, called "fine" and "far". To run "Sibelia" on "Helicobacter_pylori.fasta"
with "fine" parameters set, type:
Sibelia -s fine Helicobacter_pylori.fasta
The difference between "loose" and "fine" set is that "loose" usually produces
fewer blocks, but longer. And it may lose some small synteny blocks
(shorter than 15 000 BP), while "fine" option produces more blocks (but shorter)
and their coverage is worse. Usually "loose" is the best choice, but if you do
not want to lose information about small-scale rearrangements, use "fine". See
also section "Output description" for detailed depiction of the output format.
The "far" parameters set can be used for analysis of distantly-related genomes.
However, usage of this set requires more computational time and space and may
be not appropriate in case of many genomes.
If you are not satisfied by the results (poor coverage, for example), try to set
simplification parameters manually (see section "Fine tuning").
By default, Sibelia filters out synteny blocks shorter than 5 000 BP. You can
change this behaviour, see section "Minimum block size".
You also may be interested in blocks that occur exactly once in each input
sequence, such blocks are used as input for MGRA algorithm, for example. To get
such output use option "-a":
Sibelia -s loose -a Helicobacter_pylori.fasta
Synteny blocks are visualized with an interactive diagram (see section "d3"
visualization"). The blocks are also can be visualized with "Circos" (see
section "Circos" visualization"). While "Circos" is better for publications,
"d3" diagram is more suitable for analysis. Please note that sequences that do
not contain any synteny blocks instances are not shown on these diagrams.
Genomes from the "examples/Sibelia" dir were taken from [5, 6]. Note that you
can specify multiple FASTA files, just separate them with spaces.
Technical parameters
====================
Directory for output files
--------------------------
Default directory = "." You can change this by setting cmd parameter:
-o <dir name> or --outdir <dir name>
By default, "Sibelia" places output files in current working directory. Setting
this parameter will change output directory.
Directory for temporary files
-----------------------------
Default directory is output directory. You can change this by setting cmd parameter:
-t <dir name> or --tempdir <dir name>
"Sibelia" creates some temporary files while running. By default these files
are placed in the output directory. If you want to place temporary files in
another folder due to some reasons, use this parameter. Although the files
exist only for a very short period of time, they can be quite big -- ~20*N
bytes, where N is the total size of all input genomes. You can also use switch
-r or --inram
that will force "Sibelia" to not create any temporary files and store all it's
data in RAM.
Output description
==================
By default, "Sibelia" produces following files:
1. Blocks coordinates
2. Genomes represented as permutations of the synteny blocks
3. Coverage report
4. Files for generating a "Circos" picture
5. Interactive html-diagram of synteny blocks
There are also optional output files:
1. Sequences file
2. Dot file with resulting de Bruijn graph
All these files are described below in details.
Blocks coordinates
------------------
File name = "blocks_coords.txt". First part of this file lists input sequences,
their IDs, sizes and descriptions. IDs are just index numbers of sequences (in
the same order as they apper in the input files).
Second part of the file describes synteny blocks in sections separated by
dashes. Each section starts with the synteny block ID. Then all instances
of this block are listed in tabular format. Each row of a table depicts
one instance of this synteny block. Columns of the table designate following:
1. Seq_id -- ID of the sequence, that the instance belongs to
2. Strand -- strand of the synteny block instance, either '+' or '-'. Input
sequences are treated as positive strands
3. Start -- one-based index of the first base pair of the instance
4. End -- one-based index of the last base pair of the instance
5. Length -- length of the instance of the synteny block
Note that all coordinates are given relatively to POSITIVE strand of the
sequence. If an instance of a synteny block is located on the positive strand,
then start < end, otherwise start > end.
If you wish, you can obtain coordinates in GFF format. If you use the flag:
-gff
Then coordinates of synteny blocks are listed in file "blocks_coords.gff". For
description of this format, see:
https://cgwb.nci.nih.gov/FAQ/FAQformat.html#format3
Each record represents different copies of a synteny block. Copies having the
same number in the "tag" field (last column) are instances of the same synteny
block.
Genomes permutations
--------------------
File name = "genomes_permutations.txt".
This file contains input sequences represented as permutations of the synteny
block. It has two lines for each input sequence:
1. Header line -- FASTA header of the sequence (starting with '>')
2. Genome line -- sequence of synteny blocks instances as they appear on the
positive strand of the sequence. Each instance is represented as a signed
integer, where '+' sign depicts direct version of the block, and '-' depicts
reversed block
Coverage report
---------------
File name = "coverage_report.txt".
The file describes portion of the genomes, that found synteny block cover.
First part of the file describes input sequencess (see "Blocks coordinates"
section). Second part of the file is a table with the following columns:
1. Degree -- multiplicity of a synteny block. For example, if a synteny block
has degree = 3, then the are three instances of this block in the input
sequence
2. Count -- number of synteny blocks with a given degree
3. Total -- portion of all the input sequences that are covered by the blocks
with a given degree
4. Seq <n> -- portion of the sequence with id <n> that is covered by blocks
with a given degree
Here is an example of such table from a report file.
| Degree | Count | Total | Seq 1 | Seq 2 | Seq 3 | Seq 4 |
| :----- | :---: | :-----: | :-----: | :-----: | :-----: | :-----: |
| 2 | 11 | 3.82% | 6.59% | 2.41% | 2.96% | 3.30% |
| 3 | 4 | 1.68% | 2.24% | 2.19% | 1.44% | 0.85% |
| 4 | 21 | 91.93% | 91.34% | 94.71% | 87.54% | 94.53% |
| All | 36 | 95.66% | 97.44% | 97.98% | 90.67% | 96.89% |
This table contains one row for each degree (2, 3, 4) and one ("All") row for
the overall coverage. It means that there are 11 blocks with degree = 2, i.e.
11 * 2 instances, and they cover 3.82% of all four genomes, 6.59% of Seq 1 and
2.41% of Seq 2. And also there are 21 synteny blocks with degree = 4, i.e.
4 * 21 instances and they cover 91.93% of all genomes. All the blocks cover
95.66% of all the input sequences.
Sequences file
--------------
File name = "blocks_sequences.fasta". By default this file is not written. To
output this file, set cmd parameter:
-q or --sequencesfile
This FASTA file contains sequences of instances of the synteny block. Each
sequence has header in following format:
>Seq="<header>",Strand=<sign>,Block_id=<block id>,Start=<x>,End=<y>
Where "<header>" is a header of the FASTA sequence where the block instance
is located. Other fields are described in section "Coordinates file".
Sequences of synteny blocks are also written in SAM format in the file
"blocks_sequences.sam".
"d3" visualization
------------------
File name = "d3_blocks_diagram.html".
"Sibelia" generates an interactive html diagram that shows found synteny blocks.
Coordinates follow the same convention as described in section "Coordinates
file".
"Circos" visualization
----------------------
You can visualize synteny blocks with a colorful circular diagram by using
the "Circos" software [3]. Files for generating such diagram are written in
directory "circos" inside the output directory. To generate "Circos" diagram
do following:
1. Download and install "Circos" software
2. Run "Sibelia"
3. Run "Circos" in "circos" directory
For example of such diagrams (generated from "Helicobacter_pylori.fasta),
see "examples/Sibelia/Helicobacter_pylori/circos/circos.png". Also note that
such diagrams can become very piled with larger number of genomes. To overcome
this, plot only big blocks, see section "Minimum block size". Blocks located
on the positive strand are colored green, while blocks from negative strand
are red.
By default, "Sibelia" plots only blocks obtained after the last stage.
You can also view blocks at the intermediate stages by using switch:
-v or --visualize
On the resulting diagram the outermost circle shows blocks obtained at the
first stage, then the second stage and so on. Please note that this option
slows down the computation.
Resulting de Bruijn graph
-------------------------
File name = "de_bruijn_graph.dot". By default this file is not written. To
output this file, set cmd parameter:
-g or --graphfile
If you are a curious person, you can also view condensed de Bruijn graph that
is used for generating synteny blocks. To understand the graph, see [1].
Condensed means that only bifurcations in the graph are plotted and
non-branching paths are collapsed into a single edge. Blue edges are generated
from positive strand and red edges are from negative strand respectively. Note
that this graph is generated for K = min(Kn, MinimumBlockSize) or for value of
"--lastk" cmd parameter if it is set, where Kn is the value of K used for the
last stage (see section "Fine tuning").
If one is interested in graph output only, he or she can use the "--noblocks"
option:
--noblocks
In this case Sibelia doesn't compute the synteny blocks and doesn't ouput them,
but can output the graph. For example, to get only non-modified compressed de
Bruijn grahp for k=25, one can use the following command line:
Sibelia -k run.stage --noblocks -g -m 25 <input_file>
Where "run.stage" contains single number "0".
Fine tuning
===========
Here we will describe parameters that can affect computation results.
Minimum block size
------------------
Default value = 5000. To change this value, set cmd parameter:
-m <integer> or --minblocksize <integer>
If you are interested only in big synteny blocks, like > 100 000 BP, set
this parameter to an appropriate value.
Outputting only shared blocks
-----------------------------
Default = not set. Add flag to cmd parameters to set:
-a or --sharedonly
Output only blocks that occur exactly once in each input sequence. This option
assumes that all input genomes contain single chromosome.
Postprocessing
--------------
By default, synteny blocks are postprocessed after computation by gluing
"stripes" consisting of the same synteny blocks. For example, if each
occurence of synteny block 1 is followed by synteny block 2 and vice a versa,
their directions are consistent, then they are "glued" together to form a
single synteny block. Postprocessing could be turned off by specifying flag:
--nopostprocess
Output blocks from all stages
-----------------------------
"Sibelia" performs computations in multiple stages. Every stage produces it's
own synteny blocks. You can get blocks from all stages by specifying flag:
--allstages
Files "blocks_coordsN.(txt|gff)" will contain their coordinates, where N is the
number of stage. Zero corresponds to blocks obtained without any simplification.
Boundaries correction
---------------------
Algorithm of "Sibelia" depends on presence of solid k-mers within syntenic
regions. If such regions contain variations close to their borders, they will
be truncated. In case of two genomes, some synteny blocks (of multiplicity two)
could be corrected using local alinment algorithms. Use flag:
--correctboundaries
This flag is used by "C-Sibelia".
Parameters set
--------------
Default value = not set. To select the parameters set, use cmd parameter:
-s <loose|fine|far> or --parameters <loose|fine>
This option is incompatible with "-k", you must specify one of these, not both.
Approach used in "Sibelia" is parameter dependent. To understand the details,
please see the next section and [1]. The "loose" option produces longer blocks
and better coverage, while "fine" can capture small-scale rearrangements, for
example, inversions of size < 15 000 BP. The "far" set is for distant genomes.
Custom parameters set
---------------------
Default value = not set. To specify the file that contains custom parameters
set, use cmd parameter:
-k <file name> or --stagefile <file name>
The algorithm consists of several stages of computations. Each stage has two
parameters, K and D. Let's call K-mer a substring of length K. At each stage
"Sibelia" constructs so called de Bruijn graph, graph of K-mers that occur
in the genome, and simplifies it by removing special type of undirected cycles
called "bulges", see [1] for more details.
Graph is a good model for describing the algorithm, but to understand "physical
meaning" of the parameters it is useful to consider operations that are
actually performed with the genome behind the graph model. Suppose that
somewhere in the genome exist two pairs of K-mers K1 and K2:
1st pair: ... K1 ABCD K2 ...
2nd pair: ... K1 FGHE K2 ...
If the distance between K1 and K2 within each pair is less than D, then "Sibelia"
replaces FGHE with ABCD to obtain longer "synteny block":
1st pair: ... K1 ABCD K2 ...
2nd pair: ... K1 ABCD K2 ...
More concrete example. Suppose that K = 3, D = 5 and somewhere in the genome we
find:
1st pair: ... act gaga ggc ...
2nd pair: ... act gatg ggc ...
As we see, distance between "act" and "ggc" is less than 5 nucleotides so we
replace "gatg" by "gaga":
1st pair: ... act gaga ggc ...
2nd pair: ... act gaga ggc ...
"Sibelia" keeps track of all changes so it is able to locate original locations
of the synteny blocks obtained by the simplification. This process continues
step by step, we start with small values of K to obtain longer K-mers shared
between synteny regions and then increase K and D. The "loose" parameters set
has 4 stages:
| K | D |
| :------- | --------: |
| 30 | 150 |
| 100 | 1000 |
| 1000 | 5000 |
| 5000 | 15000 |
The "fine" set consists of 3 stages and it's final values are less:
| K | D |
| :------- | --------: |
| 30 | 150 |
| 100 | 500 |
| 500 | 1500 |
The "far" set is for distant genomes:
| K | D |
| :------- | --------: |
| 15 | 120 |
| 100 | 500 |
| 500 | 1500 |
As you can see, "loose" set is more aggressive -- at it's final stage it glues
together 5000-mers that are separated from each other by at most 15000 symbols.
Although this description is very simplified and lacks many important technical
details, it is enough to infer your own parameter set. Stage file that you may
use to specify your own parameters has following simple format:
M
K1 D1
K2 D2
...
KM KM
Where M is the number of stages. So, running with the stage file:
4
30 150
100 1000
1000 5000
5000 15000
Is equivalent to running with the -s "loose" cmd option. As you may notice, the
algorithm relies on exact K-mers shared between the genomes. If input genomes
doesn't have such shared substrings, then "Sibelia" won't be able to locate the
synteny blocks.
If you cannot find synteny blocks with the default parameters, try to start
with smaller values of K (~20), increase D values or vary number of stages.
Last value of K
---------------
In "Sibelia" de Bruijn graph is constructed (N + 1) times, where N is the
number of simplification stages. De Bruijn graph which is constructed last is
used for inferring synteny blocks. Value of K for this grahp is determined by
min(Kn, MinimumBlockSize) where Kn is the value of K used for the last stage.
You can override this value by setting the "--lastk" parameter:
--lastk <integer > 1>
Maximum number of iterations
----------------------------
Default value = 4. Tho change this value, set cmd parameter:
-i <integer> or --maxiterations <integer>
Maximum number of iterations during a stage of simplification. Increasing
this parameter may slightly improve coverage.
References
==========
1. Ilya Minkin, Nikolay Vyahhi, Son Pham. "SyntenyFinder: A Synteny Blocks
Generation and Genome Comparison Tool" (poster), WABI 2012
http://bioinf.spbau.ru/sites/default/files/SyntenyFinder.pdf
2. Max A. Alekseyev and Pavel A. Pevzner. "Breakpoint graphs and ancestral
genome reconstructions", Genome Res. 2009. 19: 943-957.
3. Circos. http://circos.ca
4. D3. http://d3js.org/
5. Helicobacter pylori. http://www.ncbi.nlm.nih.gov/genome/169
6. Staphylococcus aureus. http://www.ncbi.nlm.nih.gov/genome/154
|