1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399
|
Gustaf - Generic mUlti-SpliT Alignment Finder
https://www.seqan.de/apps/gustaf.html
Version 1.0
August, 2014
---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
1. Overview
2. Installation
3. Usage
4. Output Format
5. Example
6. Gustaf paired-end joining app gustaf_mate_joining
7. Contact
---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------
Gustaf is a tool primarily designed for multi-split mapping of sequencing reads.
We present GUSTAF, a sound generic multi-split detection method implemented in
the C++ library SeqAn. GUSTAF uses SeqAn's exact local aligner Stellar to find
partial read alignments. Compatible partial alignments are identified, and a
split-read graph storing all compatibility information is constructed for each
read. Vertices in the graph represent partial alignments, edges represent
possible split positions. Using an exact dynamic programming approach, we
refine the alignments around possible split positions to determine precise
breakpoint locations at single-nucleotide level. We use a DAG shortest path
algorithm to determine the best combination of refined alignments, and report
those breakpoints supported by multiple reads.
Usage note: STELLAR is not a read mapper, and hence, GUSTAF is not designed to
replace any read mapper pipeline with SV detection on top. We recommend doing
read mapping with your favourit read mapper and then first run STELLAR and then
GUSTAF on the remaining unmappable reads.
---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------
Precompiled binaries (Linux 64-bit, Windows, Mac OS X) of Gustaf can be
downloaded from the SeqAn projects download page:
https://www.seqan.de/downloads/projects.html
Gustaf is distributed with SeqAn - The C++ Sequence Analysis Library (see
https://www.seqan.de). To build Gustaf yourself, you can check out the latest
Git version of Gustaf and SeqAn with:
1) git clone https://github.com/seqan/seqan.git
2) mkdir seqan/buld; cd seqan/build
3) cmake .. -DCMAKE_BUILD_TYPE=Release
4) make gustaf
5) ./bin/gustaf --help
If succesful, an executable file gustaf was built and a brief usage
description is dumped.
---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------
STELLAR is not a read mapper, and hence, GUSTAF is not designed to
replace any read mapper pipeline with SV detection on top. We recommend doing
read mapping with your favourit read mapper and then first run STELLAR and then
GUSTAF on the remaining unmappable reads.
To get a short usage description of Gustaf, you can execute gustaf -h or
gustaf --help.
Usage: gustaf <GENOME FASTA FILE> <READ FASTA FILE> [Options]
gustaf <GENOME FASTA FILE> <READ FASTA FILE> -m <GFF MATCH FILE> [Options]
gustaf <GENOME FASTA FILE> <MATES1 FASTA FILE> <MATES2 FASTA FILE> -m <GFF MATCH FILE> [Options]
Using the first line, Gustaf will first run Stellar internally on the
given input files. Please have a look at the Stellar options and default
values for match length, error rate, etc.
Using the --matchfile (-m) option, Gustaf skips running Stellar and
directly uses the matches, preferable precalculated using Stellar, from
the given GFF file. We recommend using this option for larger data sets.
Given two read pair input files, Gustaf is run in paired-end mode.
Note that option -m is mandatory in this case.
Remarks: There are different versions paired-end (or mate pair) data can
come in. Gustaf always expect two files, and in default expects the second
mates (MATES2) to be reverse complemented in relation to the first mates
(MATES1), see option -rc to turn this behaviour off.
---------------------------------------------------------------------------
3.1. Non-optional arguments
---------------------------------------------------------------------------
Gustaf always expects both a database and a query file in Fasta format.
All query sequences will be compared to all database sequences.
Important: Following conventions, all Ids (line starting
with '>') have to be unique already within the first part until the
first whitespace! For example:
>Gustaf|group=6|reads=9 readId=1
>Gustaf|group=6|reads=9 readId=2
are not unique.
Without any additional parameters, Gustaf would call Stellar and then chain
those matches of each read, that have either an overlap of at least 0.5 (50%
of each match length) or a distance between the matches of at most 10 bp,
and that miss at most 15 bp at the end or beginning of the read.
The program calls Stellar with default options: Stellar would
compare the query sequence(s) to both strands of the database sequence(s)
with an error rate of 0.05 (i.e. 5% errors, an identity of 95%) and a
minimal length of 100, and dump all local alignments in an output file named
"stellar.gff" (see Stellar Readme).
Two output files will be generated: breakpoint.gff includes all structural
variant breakpoints in GFF format, breakpoint.vcf in VCF format.
The default behaviour can be modified by adding the following options to
the command line.
---------------------------------------------------------------------------
3.2. Main Options
---------------------------------------------------------------------------
---------------------------------------------------------------------------
3.2.1. Gustaf-Specific Main Options
---------------------------------------------------------------------------
[ -m FILE ], [ --matchfile FILE ]
Set the name of a file containing matches in GFF format.
When setting option --matchfile (-m), Gustaf uses the matches from the
given input file instead of calling Stellar (practical for large datasets
or running multiple tests on the same data). See also sample file stellar.gff.
[ -oth REAL ], [ --overlapThresh REAL ]
Required overlap for matches of one read to be considered for chaining
(main criterion, default 0.5).
[ -gth NUM ], [ --gapThresh NUM ]
Maximal allowed distance between two matches to be considered for
chaining. This criterion only applies if the matches do not overlap
(default 10).
[ -ith NUM], [ --initGapThresh NUM ]
Maximal allowed length of leading or ending gap for the whole read(!)
(default 15 for each).
[ -st NUM], [ --support NUM ]
Required number of supporting reads (or contigs) for a breakpoint to be
written to the output file (see breakpoint output, default 2).
[ -tp NUM], [ --transPen NUM ]
Interchromosomal translocation penalty (default 5). Added to edge weight
between matches that map to different database sequences.
[ -ip NUM], [ --invPen NUM ]
Inversion penalty (default 5). Added to edge weight between matches that
map to different strands of the same database sequence. Note: Penalty is
only added if there is no transPen already on the edge.
[ -op NUM], [ --orderPen NUM ]
Order penalty (default 5). Added to edge weight between matches that
map to a database sequence in a different order than they are in the read
sequence. Note: Penalty is only added if there is no transPen or invPen
already on the edge.
[ -ll NUM], [ --libLength NUM ]
Insert size of mate pairs when running paired-end mode.
[ -le NUM], [ --libError NUM ]
Maximal deviation in bp from the insert size.
---------------------------------------------------------------------------
3.2.2. Stellar-Inherited Options
---------------------------------------------------------------------------
This is only a tiny snapshot showing those Stellars options most crucial
to get started using Gustaf. For more Stellar options, please see Stellar
documentation and README.
[ -e NUM ], [ --epsilon NUM ]
Set the maximal error rate for local alignments. NUM must be a floating
point number between 0 and 0.25. The default value is 0.05. NUM is the
number of edit operations needed to transform the aligned query substring
to the aligned database substring divided by the length of the local
alignment. For example specify '-e 0.1' for a maximal error rate of 10%
(90% identity).
[ -l NUM ], [ --minLength NUM ]
Set the minimal length of a local alignment. The default value is 100.
NUM is the minimal length of an alignment, not the length of the
aligned substrings.
[ -f ], [ --forward ]
Only compare query sequence(s) to the positive/forward strand of the
database sequence(s). By default, both strands are scanned.
[ -r ], [ --reverse ]
Only compare query sequence(s) to the negative/reverse-complemented
database sequence(s). By default, both strands are scanned.
---------------------------------------------------------------------------
3.3 Input and Output Options
---------------------------------------------------------------------------
[ -nth ], [ --numThreads ]
Specify number of threads for parallelization of I/O.
[ -gff ], [ --gffOut ]
Option to set name of GFF output file.
[ -vcf ], [ --vcfOut ]
Option to set name of VCF output file.
[ -j ], [ --jobName ]
Optional jobname that will be added to all dot output files in the format
read#_jobname.dot. # stands for the sequence number of the read in the
query input file.
[ -do ], [ --dots ]
Switches output of DOT files on and off (default off). Each DOT file
includes the graph representation of one read/query.
---------------------------------------------------------------------------
4. Output Formats
---------------------------------------------------------------------------
Gustaf currently supports the GFF and the VCF output format for reporting
breakpoints.
---------------------------------------------------------------------------
4.1. General Feature Format (GFF)
---------------------------------------------------------------------------
The General Feature Format is specified by the Sanger Institute as a tab-
delimited text format with the following columns:
<seqname> <src> <feat> <start> <end> <score> <strand> <frame> [attr] [cmts]
See also: https://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
Consistent with this specification Gustaf GFF output looks as follows:
GNAME Gustaf SVTYPE GBEGIN GEND . GSTRAND . ATTRIBUTES
StartSeqId Label SV type sPos sPos . +/- .
Match value description:
GNAME Name of the genome sequence (see --genome-naming)
GUSTAF Constant
SVTYPE SV type (deletion, insertion, inversion, translocation)
GBEGIN Beginning position in the genome sequence
(positions are counted from 1)
GEND End position in the genome sequence (included!)
. Constant
GSTRAND '+'=forward strand or '-'=reverse strand
. Constant
ATTRIBUTES A list of attributes in the format <tag_name>[=<tag>]
separated by ';'
Attributes are:
ID= Random number as identifier
size= Indel size (deletion and insertion only), "-" before the
size indicates an insertion
seq= Sequence content of insertion (obviously insertion only).
endChr= Name of the second genome sequence (inversion and
interchromosomal translocation only)
endPos= First position in second genome sequence (inversion and
interchromosomal translocation only)
endStrand= '+'=forward strand or '-'=reverse strand of second genome
sequence (inversion and interchromosomal translocation only)
support= Number of supporting reads for this breakpoint
supportIds= IDs of supporting reads
breakpoint: Breakpoint position within read
For matches on the reverse strand, GBEGIN and GEND are positions on the
related forward strand. It holds GBEGIN < GEND, regardless of GSTRAND.
---------------------------------------------------------------------------
4.2. Variant Call Format (VCF)
---------------------------------------------------------------------------
See https://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-
variant-call-format-version-41
for information about the VCF file format specifications.
---------------------------------------------------------------------------
5. Example
---------------------------------------------------------------------------
See the 'example' folder for example runs of Gustaf. There is a genome file
'adeno.fa' and a modified genome file 'adeno_modified.fa', from the latter
we created the read file 'adeno_modified_reads.fa' for single-end mode and
the files 'adeno_modified_reads_mates1' and 'adeno_modified_reads_mates2'
for paired-end mode.
---------------------------------------------------------------------------
5.1. Single-end Mode
---------------------------------------------------------------------------
There is a pre-calculated Stellar output file 'stellar.gff' for the
single-end read file computed by calling
./stellar adeno.fa adeno_modified_reads.fa -l 30 -o stellar.gff
The default calls for Gustaf would then be
./gustaf adeno.fa adeno_modified_reads.fa -st 1 -l 30 -gff st1_l30.gff \
-vcf st1_l30.vcf
or
./gustaf adeno.fa adeno_modified_reads.fa -st 1 -m stellar.gff \
-gff st1_l30_m.gff -vcf st1_l30.vcf
Both calls produce an output file containing the same breakpoints.
In the first run, Gustaf internally calls Stellar with parameter -l 30.
In the second run, Gustaf used the pre-calculated file with Stellar matches
(use this option for larger datasets where you want to run Stellar separately
and reuse the Stellar output for multiple Gustaf runs).
---------------------------------------------------------------------------
5.2. Paired-end Mode
---------------------------------------------------------------------------
In paired-end mode, we join both read pair files before calling Stellar.
This can be done using the app 'gustaf_mate_joining' by calling
./gustaf_mate_joining adeno_modified_reads_mates1.fa adeno_modified_reads_mates2.fa \
-rc -o adeno_modified_reads_joinedMates.fa
There is a pre-calculated Stellar output file 'stellar_joinedMates.gff' for
the paired-end read file 'adeno_modified_reads_joinedMates.fa computed by calling
./stellar adeno.fa adeno.fa adeno_modified_reads_joinedMates.fa -l 30 \
-o stellar_joinedMates_l30.gff
The Gustaf call would then look like this
./gustaf adeno.fa adeno_modified_reads_mates1.fa adeno_modified_reads_mates2.fa \
-m stellar_joinedMates_l30.gff -st 1 -ll 1000 -le 30 -rc \
-gff gustaf_adeno_pairedend_ll1000le30.gff \
-vcf gustaf_adeno_pairedend_ll1000le30.vcf
Note the -rc parameter. In this simulated data, mate1 and mate2 have the
same orientation, so we prevent the second file from beeing reverse
complemented.
---------------------------------------------------------------------------
6. Joining Paired-end Reads With Gustaf_mate_joining
---------------------------------------------------------------------------
The Gustaf directory includes another app called gustaf_mate_joining that can
be build using the make gustaf_mate_joining command. Gustaf_mate_joining is a
small app that helps prepare paired-end data for usage with Gustaf.
This simple program takes as input two mate pair or paired-end files and outputs
a file where both mate sequences have been joined together. The FASTA file with
joined mates is an required input file for the paired-end mode of Gustaf.
The tool assumes the mates in the second file to be reverse complemented
compared to the first file. This behaviour can be turned off using the command
line argument "-rc".
Given only one input file and two output files, the program will split the reads
from the input files at half length, and write the first half of each sequence
as mates1 into the first output file and the reversed complemented second half
of each sequence as mates2 into the second output file. Reverse complementing
the sequences can again be turned off using "-rc".
To prepare the joined mate file for the paired-end example above, call
./gustaf_mate_joining adeno_modified_reads_mates1.fa \
adeno_modified_reads_mates2.fa -rc -o adeno_modified_reads_joinedMates.fa
The mates in this small example are both from the same strand, so we avoid reverse
complementing the second input file by using -rc.
---------------------------------------------------------------------------
7. Contact
---------------------------------------------------------------------------
For questions or comments, contact:
Kathrin Trappe <kathrin.trappe@fu-berlin.de>
|