File: SIBELIA.md

Basic usage
===========
This manual assumes that "Sibelia" is properly installed and that either the
directory containing the "Sibelia" executable is in your "PATH" environment
variable, or the input files are in the same folder as the "Sibelia" executable.

The directory "examples/Sibelia" contains two sets of bacterial genomes. The
easiest way to run "Sibelia" is to type:

	Sibelia -s loose <input FASTA file(s)>

For example:

	Sibelia -s loose Helicobacter_pylori.fasta

Important note -- "Sibelia" requires some free disk space to run. If you
experience any problems and see error messages that mention temporary files,
try changing the directory used for temporary files (see section "Directory for
temporary files").

The command above runs "Sibelia" on the file "Helicobacter_pylori.fasta" with
the "loose" simplification parameters. There are two other simplification
parameter sets, called "fine" and "far". To run "Sibelia" on
"Helicobacter_pylori.fasta" with the "fine" parameter set, type:

	Sibelia -s fine Helicobacter_pylori.fasta

The difference between the "loose" and "fine" sets is that "loose" usually
produces fewer but longer blocks and may lose some small synteny blocks
(shorter than 15 000 BP), while "fine" produces more but shorter blocks with
worse coverage. Usually "loose" is the best choice, but if you do not want to
lose information about small-scale rearrangements, use "fine". See also section
"Output description" for a detailed description of the output format.
The "far" parameter set can be used for analysis of distantly related genomes.
However, this set requires more computational time and space and may not be
appropriate when there are many genomes.

If you are not satisfied with the results (poor coverage, for example), try
setting the simplification parameters manually (see section "Fine tuning").

By default, "Sibelia" filters out synteny blocks shorter than 5 000 BP. You can
change this behaviour; see section "Minimum block size".

You may also be interested in blocks that occur exactly once in each input
sequence; such blocks are used as input for the MGRA algorithm [2], for example.
To get such output, use the option "-a":

	Sibelia -s loose -a Helicobacter_pylori.fasta

Synteny blocks are visualized with an interactive diagram (see section '"d3"
visualization'). The blocks can also be visualized with "Circos" (see
section '"Circos" visualization'). While "Circos" is better for publications,
the "d3" diagram is more suitable for analysis. Please note that sequences that
do not contain any synteny block instances are not shown on these diagrams.

Genomes from the "examples/Sibelia" directory were taken from [5, 6]. Note that
you can specify multiple FASTA files; just separate them with spaces.

Technical parameters
====================

Directory for output files
--------------------------
Default directory = "." You can change this by setting cmd parameter:

	-o <dir name> or --outdir <dir name>

By default, "Sibelia" places output files in the current working directory.
Setting this parameter changes the output directory.

Directory for temporary files
-----------------------------
Default directory is output directory. You can change this by setting cmd parameter:

	-t <dir name> or --tempdir <dir name>

"Sibelia" creates some temporary files while running. By default these files
are placed in the output directory. If you want to place temporary files in
another folder, use this parameter. Although the files exist only for a very
short period of time, they can be quite big -- ~20*N bytes, where N is the
total size of all input genomes. You can also use the switch

	-r or --inram
	
that forces "Sibelia" to store all its data in RAM instead of creating
temporary files.
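
To get a feeling for the ~20*N rule of thumb above, the following Python sketch estimates the temporary space for a set of FASTA inputs. It is only an illustration: the function names are made up for this example, N is taken to be the number of sequence characters (header lines excluded), and the factor 20 is simply the approximate figure quoted above.

```python
def total_genome_size(fasta_text):
    """Count sequence characters in FASTA text, ignoring header lines."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


def estimated_temp_bytes(fasta_texts, bytes_per_base=20):
    """Apply the ~20*N estimate, where N is the total size of all inputs."""
    n = sum(total_genome_size(text) for text in fasta_texts)
    return bytes_per_base * n


# Toy input: 8 + 4 + 4 = 16 bases in total.
example = ">seq1 toy genome\nACGTACGT\nACGT\n>seq2\nGGCC\n"
print(estimated_temp_bytes([example]))  # 20 * 16 = 320
```

For real genomes you would read the file contents from disk and pass them in the same way.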

Output description
==================
By default, "Sibelia" produces the following files:

1. Blocks coordinates
2. Genomes represented as permutations of the synteny blocks
3. Coverage report
4. Files for generating a "Circos" picture
5. Interactive html-diagram of synteny blocks

There are also optional output files:

1. Sequences file
2. Dot file with resulting de Bruijn graph

All these files are described in detail below.

Blocks coordinates
------------------
File name = "blocks_coords.txt". The first part of this file lists the input
sequences, their IDs, sizes and descriptions. IDs are just index numbers of
sequences (in the same order as they appear in the input files).

The second part of the file describes synteny blocks in sections separated by
dashes. Each section starts with the synteny block ID. Then all instances
of this block are listed in tabular format. Each row of the table describes
one instance of this synteny block. The columns of the table mean the following:

1. Seq_id -- ID of the sequence that the instance belongs to
2. Strand -- strand of the synteny block instance, either '+' or '-'. Input
sequences are treated as positive strands
3. Start -- one-based index of the first base pair of the instance
4. End -- one-based index of the last base pair of the instance
5. Length -- length of the instance of the synteny block

Note that all coordinates are given relative to the POSITIVE strand of the
sequence. If an instance of a synteny block is located on the positive strand,
then start < end, otherwise start > end.
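
The start/end convention above can be decoded with a few lines of Python. This is a minimal sketch, not part of "Sibelia": the function name is hypothetical, and the length formula assumes that both coordinates are one-based and inclusive, as the column descriptions suggest.

```python
def describe_instance(start, end):
    """Return (strand, leftmost, rightmost, length) for one block instance.

    Both coordinates are one-based positions on the positive strand.
    start < end means the positive strand, start > end the negative one.
    """
    strand = "+" if start < end else "-"
    left, right = min(start, end), max(start, end)
    # Assumption: both ends are inclusive, so the length is right - left + 1.
    return strand, left, right, right - left + 1


print(describe_instance(100, 250))  # ('+', 100, 250, 151)
print(describe_instance(250, 100))  # ('-', 100, 250, 151)
```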

If you wish, you can obtain coordinates in GFF format. If you use the flag:

	--gff

then the coordinates of synteny blocks are listed in the file
"blocks_coords.gff". For a description of this format, see:

	https://cgwb.nci.nih.gov/FAQ/FAQformat.html#format3

Each record represents one copy of a synteny block. Copies having the
same number in the "tag" field (last column) are instances of the same synteny
block.
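
As a sketch of how the "tag" field can be used, the Python snippet below groups GFF records by their last column. It assumes tab-separated columns, as is usual for GFF; the function name and the sample records are invented for illustration and are not real "Sibelia" output.

```python
from collections import defaultdict


def group_by_block(gff_text):
    """Group GFF records (lists of fields) by the value of the last column."""
    groups = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        groups[fields[-1]].append(fields)
    return dict(groups)


# Made-up records: two copies tagged "1" and one copy tagged "2".
sample = "\n".join([
    "\t".join(["seq1", "src", "block", "100", "5200", ".", "+", ".", "1"]),
    "\t".join(["seq2", "src", "block", "700", "5800", ".", "-", ".", "1"]),
    "\t".join(["seq1", "src", "block", "9000", "15000", ".", "+", ".", "2"]),
])
for tag, records in group_by_block(sample).items():
    print(tag, len(records))
```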

Genomes permutations
--------------------
File name = "genomes_permutations.txt".

This file contains input sequences represented as permutations of the synteny
blocks. It has two lines for each input sequence:

1. Header line -- FASTA header of the sequence (starting with '>')
2. Genome line -- sequence of synteny block instances as they appear on the
positive strand of the sequence. Each instance is represented as a signed
integer, where the '+' sign denotes the direct version of the block and '-'
denotes the reversed block
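
A small Python sketch for reading this two-line layout is shown below. It assumes that the signed integers on the genome line are separated by whitespace and silently skips any token that is not an integer; the function name and the sample text are hypothetical.

```python
def parse_permutations(text):
    """Return a list of (header, [signed block ids]) pairs."""
    genomes = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            blocks = []
            for token in line.split():
                try:
                    blocks.append(int(token))  # int() accepts "+3" and "-2"
                except ValueError:
                    pass  # ignore anything that is not a signed integer
            genomes.append((header, blocks))
            header = None
    return genomes


sample = ">genome A\n+1 -2 +3\n>genome B\n+3 +1\n"
print(parse_permutations(sample))
```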

Coverage report
---------------
File name = "coverage_report.txt".

The file describes the portion of the genomes that the found synteny blocks
cover. The first part of the file describes the input sequences (see section
"Blocks coordinates"). The second part of the file is a table with the
following columns:

1. Degree -- multiplicity of a synteny block. For example, if a synteny block
has degree = 3, then there are three instances of this block in the input
sequences
2. Count -- number of synteny blocks with a given degree
3. Total -- portion of all the input sequences that are covered by the blocks
with a given degree
4. Seq <n> -- portion of the sequence with id <n> that is covered by blocks
with a given degree

Here is an example of such table from a report file.

| Degree | Count | Total   | Seq 1   | Seq 2   | Seq 3   | Seq 4   |
| :----- | :---: | :-----: | :-----: | :-----: | :-----: | :-----: |
| 2      | 11    | 3.82%   | 6.59%   | 2.41%   | 2.96%   | 3.30%   |
| 3      | 4     | 1.68%   | 2.24%   | 2.19%   | 1.44%   | 0.85%   |
| 4      | 21    | 91.93%  | 91.34%  | 94.71%  | 87.54%  | 94.53%  |
| All    | 36    | 95.66%  | 97.44%  | 97.98%  | 90.67%  | 96.89%  |

This table contains one row for each degree (2, 3, 4) and one ("All") row for
the overall coverage. It means that there are 11 blocks with degree = 2, i.e.
11 * 2 instances, and they cover 3.82% of all four genomes, 6.59% of Seq 1 and
2.41% of Seq 2. And also there are 21 synteny blocks with degree = 4, i.e.
4 * 21 instances and they cover 91.93% of all genomes. All the blocks cover
95.66% of all the input sequences.
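
The instance counts mentioned above follow directly from the "Degree" and "Count" columns. The short Python sketch below redoes this arithmetic for the example table; the variable names are chosen for this illustration only.

```python
# Degree -> Count, copied from the example table above.
blocks_per_degree = {2: 11, 3: 4, 4: 21}

# A block of degree d contributes d instances.
instances_per_degree = {
    degree: degree * count for degree, count in blocks_per_degree.items()
}
total_blocks = sum(blocks_per_degree.values())
total_instances = sum(instances_per_degree.values())

print(instances_per_degree)  # {2: 22, 3: 12, 4: 84}
print(total_blocks)          # 36, the Count value of the "All" row
print(total_instances)       # 118
```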

Sequences file
--------------
File name = "blocks_sequences.fasta". By default this file is not written. To
output this file, set cmd parameter:

	-q or --sequencesfile

This FASTA file contains sequences of instances of the synteny blocks. Each
sequence has a header in the following format:

	>Seq="<header>",Strand=<sign>,Block_id=<block id>,Start=<x>,End=<y>

Where "<header>" is the header of the FASTA sequence where the block instance
is located. Other fields are described in section "Blocks coordinates".
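
Headers in this format can be split into fields with a regular expression. The following Python sketch is one possible way to do it; the function name is hypothetical, and it assumes that the strand is written as '+' or '-' and that the block ID and coordinates are plain non-negative integers.

```python
import re

# Mirrors: >Seq="<header>",Strand=<sign>,Block_id=<block id>,Start=<x>,End=<y>
HEADER_RE = re.compile(
    r'^>Seq="(?P<seq>.*)",Strand=(?P<strand>[+-]),'
    r'Block_id=(?P<block_id>\d+),Start=(?P<start>\d+),End=(?P<end>\d+)$'
)


def parse_block_header(line):
    """Return the header fields as a dict with integer id and coordinates."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a block header: %r" % line)
    fields = match.groupdict()
    for key in ("block_id", "start", "end"):
        fields[key] = int(fields[key])
    return fields


# Made-up header used only to exercise the parser.
sample = '>Seq="toy sequence 1",Strand=-,Block_id=7,Start=2500,End=1200'
print(parse_block_header(sample))
```
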
Sequences of synteny blocks are also written in SAM format in the file
"blocks_sequences.sam".

"d3" visualization
------------------
File name = "d3_blocks_diagram.html".

"Sibelia" generates an interactive html diagram that shows the synteny blocks
found. Coordinates follow the same convention as described in section "Blocks
coordinates".

"Circos" visualization
----------------------
You can visualize synteny blocks with a colorful circular diagram by using
the "Circos" software [3]. Files for generating such a diagram are written to
the directory "circos" inside the output directory. To generate a "Circos"
diagram, do the following:

1. Download and install "Circos" software
2. Run "Sibelia"
3. Run "Circos" in "circos" directory

For an example of such a diagram (generated from "Helicobacter_pylori.fasta"),
see "examples/Sibelia/Helicobacter_pylori/circos/circos.png". Also note that
such diagrams can become very cluttered with a larger number of genomes. To
overcome this, plot only big blocks; see section "Minimum block size". Blocks
located on the positive strand are colored green, while blocks from the
negative strand are red.

By default, "Sibelia" plots only blocks obtained after the last stage.
You can also view blocks at the intermediate stages by using switch:

	-v or --visualize

On the resulting diagram the outermost circle shows blocks obtained at the
first stage, then the second stage and so on. Please note that this option
slows down the computation.

Resulting de Bruijn graph
-------------------------
File name = "de_bruijn_graph.dot". By default this file is not written. To
output this file, set cmd parameter:

	-g or --graphfile
	
If you are a curious person, you can also view the condensed de Bruijn graph
that is used for generating synteny blocks. To understand the graph, see [1].
Condensed means that only bifurcations in the graph are plotted and
non-branching paths are collapsed into a single edge. Blue edges are generated
from the positive strand and red edges from the negative strand. Note
that this graph is generated for K = min(Kn, MinimumBlockSize), or for the value
of the "--lastk" cmd parameter if it is set, where Kn is the value of K used for
the last stage (see section "Fine tuning").

If you are interested in the graph output only, you can use the "--noblocks"
option:

	--noblocks

In this case "Sibelia" does not compute or output the synteny blocks, but it
can still output the graph. For example, to get only the unmodified compressed
de Bruijn graph for k=25, you can use the following command line:

	Sibelia -k run.stage --noblocks -g -m 25 <input_file>

Where "run.stage" contains the single number "0".

Fine tuning
===========
Here we will describe parameters that can affect computation results.

Minimum block size
------------------
Default value = 5000. To change this value, set cmd parameter:

	-m <integer> or --minblocksize <integer>

If you are interested only in big synteny blocks, like > 100 000 BP, set
this parameter to an appropriate value.

Outputting only shared blocks
-----------------------------
Default = not set. Add flag to cmd parameters to set:

	-a or --sharedonly

Output only blocks that occur exactly once in each input sequence. This option
assumes that each input genome consists of a single chromosome.

Postprocessing
--------------
By default, synteny blocks are postprocessed after computation by gluing
"stripes" consisting of the same synteny blocks. For example, if each
occurrence of synteny block 1 is followed by synteny block 2 and vice versa,
and their directions are consistent, then they are "glued" together to form a
single synteny block. Postprocessing can be turned off by specifying the flag:

	--nopostprocess
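
The gluing condition can be illustrated on signed permutations such as those in "genomes_permutations.txt". The Python sketch below is a simplified reading of the rule, not the actual "Sibelia" postprocessing code: the function name is hypothetical, sequences are treated as linear, and "followed by" is interpreted on the negative strand as the mirrored order (-b immediately before -a).

```python
def can_glue(permutations, a, b):
    """Check that block a is always directly followed by block b.

    a and b are positive block ids. Every +a must be followed by +b and
    every -a must be preceded by -b, and the same must hold the other way
    round for every occurrence of b.
    """
    for perm in permutations:
        for i, block in enumerate(perm):
            prev_block = perm[i - 1] if i > 0 else None
            next_block = perm[i + 1] if i + 1 < len(perm) else None
            if block == a and next_block != b:
                return False
            if block == -a and prev_block != -b:
                return False
            if block == b and prev_block != a:
                return False
            if block == -b and next_block != -a:
                return False
    return True


genomes = [[1, 2, 3], [-2, -1, 3]]
print(can_glue(genomes, 1, 2))  # True: "1 2" and its reverse "-2 -1"
print(can_glue(genomes, 2, 3))  # False: -2 is not preceded by -3
```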

Output blocks from all stages
-----------------------------
"Sibelia" performs computations in multiple stages. Every stage produces its
own synteny blocks. You can get blocks from all stages by specifying the flag:

	--allstages

Files "blocks_coordsN.(txt|gff)" will contain their coordinates, where N is the
number of stage. Zero corresponds to blocks obtained without any simplification.

Boundaries correction
---------------------
The algorithm of "Sibelia" depends on the presence of solid k-mers within
syntenic regions. If such regions contain variations close to their borders,
they will be truncated. In the case of two genomes, some synteny blocks (of
multiplicity two) can be corrected using local alignment algorithms. Use the
flag:

	--correctboundaries

This flag is used by "C-Sibelia".

Parameters set
--------------
Default value = not set. To select the parameters set, use cmd parameter:

	-s <loose|fine|far> or --parameters <loose|fine|far>

This option is incompatible with "-k"; you must specify one of these, not both.
The approach used in "Sibelia" is parameter dependent. To understand the
details, please see the next section and [1]. The "loose" option produces
longer blocks and better coverage, while "fine" can capture small-scale
rearrangements, for example, inversions of size < 15 000 BP. The "far" set is
for distant genomes.
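
If you launch "Sibelia" from a script, the rule that exactly one of "-s" and "-k" must be given can be checked before the program is started. The Python sketch below only builds the argument list; the function name is hypothetical, and it uses only options described in this manual ("-s", "-k", "-m", "-o").

```python
def build_sibelia_command(fasta_files, parameters=None, stage_file=None,
                          min_block_size=None, out_dir=None):
    """Build an argument list for Sibelia, enforcing -s XOR -k."""
    if (parameters is None) == (stage_file is None):
        raise ValueError("specify exactly one of -s and -k")
    command = ["Sibelia"]
    if parameters is not None:
        if parameters not in ("loose", "fine", "far"):
            raise ValueError("unknown parameter set: %r" % parameters)
        command += ["-s", parameters]
    else:
        command += ["-k", stage_file]
    if min_block_size is not None:
        command += ["-m", str(min_block_size)]
    if out_dir is not None:
        command += ["-o", out_dir]
    return command + list(fasta_files)


print(build_sibelia_command(["Helicobacter_pylori.fasta"], parameters="loose"))
```

The resulting list can be passed to "subprocess.run" from the Python standard library.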

Custom parameters set
---------------------
Default value = not set. To specify the file that contains custom parameters
set, use cmd parameter:

	-k <file name> or --stagefile <file name>

The algorithm consists of several stages of computation. Each stage has two
parameters, K and D. Let us call a substring of length K a K-mer. At each stage
"Sibelia" constructs a so-called de Bruijn graph, a graph of the K-mers that
occur in the genome, and simplifies it by removing a special type of undirected
cycles called "bulges"; see [1] for more details.

A graph is a good model for describing the algorithm, but to understand the
"physical meaning" of the parameters it is useful to consider the operations
that are actually performed on the genome behind the graph model. Suppose that
two pairs of K-mers K1 and K2 exist somewhere in the genome:

1st pair: ... K1 ABCD K2 ...  
2nd pair: ... K1 FGHE K2 ...  

If the distance between K1 and K2 within each pair is less than D, then
"Sibelia" replaces FGHE with ABCD to obtain a longer "synteny block":

1st pair: ... K1 ABCD K2 ...  
2nd pair: ... K1 ABCD K2 ...  

A more concrete example: suppose that K = 3, D = 5 and somewhere in the genome
we find:

1st pair: ... act gaga ggc ...  
2nd pair: ... act gatg ggc ...  

As we see, the distance between "act" and "ggc" is less than 5 nucleotides, so
we replace "gatg" with "gaga":

1st pair: ... act gaga ggc ...  
2nd pair: ... act gaga ggc ...  
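
The replacement step can be mimicked on plain strings. The Python sketch below is a toy illustration only: "Sibelia" works on the de Bruijn graph, not on strings, and the function names are invented here. The "distance" between the two K-mers is taken to be the number of characters between them.

```python
def gap_between(sequence, k1, k2):
    """Return (start, end) of the text between the first k1 and the next k2."""
    i = sequence.find(k1)
    if i < 0:
        return None
    gap_start = i + len(k1)
    j = sequence.find(k2, gap_start)
    if j < 0:
        return None
    return gap_start, j


def collapse_bulge(first, second, k1, k2, d):
    """Copy the k1..k2 gap of `first` into `second` if both gaps are < d."""
    span_a = gap_between(first, k1, k2)
    span_b = gap_between(second, k1, k2)
    if span_a is None or span_b is None:
        return second
    if span_a[1] - span_a[0] >= d or span_b[1] - span_b[0] >= d:
        return second
    return second[:span_b[0]] + first[span_a[0]:span_a[1]] + second[span_b[1]:]


# K = 3, D = 5 as in the example above; flanking letters are arbitrary.
print(collapse_bulge("ttactgagaggcaa", "ccactgatgggctt", "act", "ggc", 5))
# prints ccactgagaggctt
```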

"Sibelia" keeps track of all changes so it is able to locate original locations
of the synteny blocks obtained by the simplification. This process continues
step by step: we start with small values of K to obtain longer K-mers shared
between synteny regions, and then increase K and D. The "loose" parameter set
has 4 stages:

| K        | D         |
| :------- | --------: |
| 30       | 150       |
| 100      | 1000      |
| 1000     | 5000      |
| 5000     | 15000     |

The "fine" set consists of 3 stages and its final values are smaller:

| K        | D         |
| :------- | --------: |
| 30       | 150       |
| 100      | 500       |
| 500      | 1500      |

The "far" set is for distant genomes:

| K        | D         |
| :------- | --------: |
| 15       | 120       |
| 100      | 500       |
| 500      | 1500      |

As you can see, the "loose" set is more aggressive -- at its final stage it
glues together 5000-mers that are separated from each other by at most 15000
symbols. Although this description is very simplified and lacks many important
technical details, it is enough to devise your own parameter set. The stage
file that you may use to specify your own parameters has the following simple
format:

M  
K1 D1  
K2 D2  
...  
KM DM  

Where M is the number of stages. So, running with the stage file:

4  
30 150  
100 1000  
1000 5000  
5000 15000  

is equivalent to running with the -s "loose" cmd option. As you may notice, the
algorithm relies on exact K-mers shared between the genomes. If the input
genomes do not have such shared substrings, then "Sibelia" will not be able to
locate the synteny blocks.
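
The stage file format is simple enough to check with a short script before running "Sibelia". The Python sketch below reads the format described above (M followed by M pairs of K and D); the function name is hypothetical and the parser is more permissive about whitespace than "Sibelia" itself may be.

```python
def parse_stage_file(text):
    """Return a list of (K, D) pairs from stage file contents."""
    numbers = [int(token) for token in text.split()]
    if not numbers:
        raise ValueError("empty stage file")
    m, values = numbers[0], numbers[1:]
    if len(values) != 2 * m:
        raise ValueError("expected %d K/D pairs" % m)
    return [(values[i], values[i + 1]) for i in range(0, 2 * m, 2)]


loose = "4\n30 150\n100 1000\n1000 5000\n5000 15000\n"
print(parse_stage_file(loose))
# [(30, 150), (100, 1000), (1000, 5000), (5000, 15000)]
```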

If you cannot find synteny blocks with the default parameters, try starting
with smaller values of K (~20), increasing the D values or varying the number
of stages.

Last value of K
---------------
In "Sibelia" the de Bruijn graph is constructed (N + 1) times, where N is the
number of simplification stages. The de Bruijn graph that is constructed last
is used for inferring synteny blocks. The value of K for this graph is
determined by min(Kn, MinimumBlockSize), where Kn is the value of K used for
the last stage. You can override this value by setting the "--lastk" parameter:

   --lastk <integer > 1>
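
The rule for the last value of K can be written down in a few lines of Python. This is only a sketch of the rule as stated in this manual, with a hypothetical function name. The case of a stage file with zero stages is an inference from the "--noblocks" example (which uses "-m 25" to get a graph for k=25), not a documented rule.

```python
def last_k(stages, min_block_size=5000, lastk=None):
    """Return the K used for the final de Bruijn graph.

    stages is a list of (K, D) pairs; lastk mirrors the --lastk option.
    """
    if lastk is not None:
        return lastk  # an explicit --lastk overrides everything
    if not stages:
        # Assumption: with zero stages only the minimum block size applies.
        return min_block_size
    kn = stages[-1][0]
    return min(kn, min_block_size)


loose = [(30, 150), (100, 1000), (1000, 5000), (5000, 15000)]
fine = [(30, 150), (100, 500), (500, 1500)]
print(last_k(loose))            # 5000
print(last_k(fine))             # 500
print(last_k(loose, lastk=31))  # 31
```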


Maximum number of iterations
----------------------------
Default value = 4. To change this value, set cmd parameter:

	-i <integer> or --maxiterations <integer>

Maximum number of iterations during a stage of simplification. Increasing
this parameter may slightly improve coverage.

References
==========
1. Ilya Minkin, Nikolay Vyahhi, Son Pham. "SyntenyFinder: A Synteny Blocks 
Generation and Genome Comparison Tool" (poster), WABI 2012
http://bioinf.spbau.ru/sites/default/files/SyntenyFinder.pdf
2. Max A. Alekseyev and Pavel A. Pevzner. "Breakpoint graphs and ancestral
genome reconstructions", Genome Res. 2009. 19: 943-957.
3. Circos. http://circos.ca
4. D3. http://d3js.org/
5. Helicobacter pylori. http://www.ncbi.nlm.nih.gov/genome/169
6. Staphylococcus aureus. http://www.ncbi.nlm.nih.gov/genome/154