1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139
|
=================================
RABEMA - Read Alignment BEnchMArk
=================================
Rabema is a program that supports a new read mapper benchmark methodology. The
methodology is based on a strict definition of the read mapping problem and
allows the evaluation of arbitrary read mapping programs that create SAM
output.
The paper gives an explanation of the theory behind the benchmark methodology
with an example evaluatio
For installation instruction and online material, see the project website:
https://seqan.de/projects/rabema.html
For the Impatient (Getting Started)
-----------------------------------
After building the Rabema and Razers 3 programs (currently in “extras”, soon
to go to “core”), you can build the gold standard for one of the example
datasets from the rabema-data.tar.bz file (see above). We have to do the
following: (1) Create a perfect mapping SAM file using RazerS 3. (2) Prepare
SAM file (by replacing * for duplicative sequences by the actual sequence. (3)
Obtain a copy of the file sorted by coordinate. Sorting is easiest done using
samtools after conversion to BAM. (4) Build the gold standard using Rabema
(rabema_build_gold_standard). Details on the parameters can be found in the
Rabema manual.
./bin/razers3 --dont-shrink-alignments --verbose \
--recognition-rate 100 \
--percent-identity 92 \
--max-hits 10000000 \
--output out_gold.sam \
rabema-data/data/saccharomyces/genome.fasta \
rabema-data/data/saccharomyces/reads_454/SRR000853.10k.fastq
./bin/rabema_prepare_sam \
-i out_gold.sam -o out_gold.prep.sam
Note that by default rabema_prepare_sam assumes that the input SAM file is
sorted by query name in the same order as produced by samtools sort -n. If
your file is not sorted in this way then rabema_prepare_sam will stop with a
failing sanity check. In this case, call it with the --dont-check-sorting
parameter like this:
./bin/rabema_prepare_sam -i out_gold.sam \
--dont-check-sorting -o out_gold.prep.sam.
samtools view -Sb out_gold.prep.sam >out_gold.bam
samtools sort out_gold.bam out_gold.by_coord
./bin/rabema_build_gold_standard \
--max-error 8 \
--distance-metric edit \
--out-gsi gold_standard.gsi \
--reference rabema-data/data/saccharomyces/genome.fasta \
--in-bam out_gold.by_coord.bam
Next, compare the results of another read mapper against the gold
standard. Here, we use the RazerS with default parameters. First, run RazerS.
./bin/razers3 -vv -of sam -o out_default.sam \
rabema-data/data/saccharomyces/genome.fasta \
rabema-data/data/saccharomyces/reads_454/SRR000853.10k.fastq
Then, compare the result with the gold standard. Note that the error rate used
here is independent of the one built in the standard and only has to be less
than or equal to the error rate used there. Let’s have a look at the output in
the category all.
./bin/rabema_evaluate \
--max-error 8 \
--distance-metric edit \
--benchmark-category all \
--reference rabema-data/data/saccharomyces/genome.fasta \
--in-sam out_default.sam \
--in-gsi gold_standard.gsi
At the end of the output, there will be the following report:
Intervals to find: 10839
Intervals found: 10839
Intervals found [%] 100
Invalid alignments: 312
Additional Hits: 0
Number of reads: 8840
Number of reads with intervals: 7723
Mapped reads: 7723
Mapped reads [% of total]: 87.3643
Mapped reads [% of mappable]: 100
Normalized intervals found: 7723
Normalized intervals found [%]: 100
Found Intervals By Error Rate
ERR #max #found %found norm max norm found norm found [%]
------------------------------------------------------------------------------------
0 2780 2780 100.00 2095.80 2095.80 100.00
1 4221 4221 100.00 3083.12 3083.12 100.00
2 2721 2721 100.00 1849.09 1849.09 100.00
3 1117 1117 100.00 694.99 694.99 100.00
In total, there were 10,839 intervals with up to 3% errors to be found and
RazerS found all of them (the reads are very long so there is fewer room for
ambiguity). The number of invalid alignments is 312. These alignments are
caused by the defaults setting for allowed error rate of RazerS being 8%. The
term “incorrect” is fixed to the given error rate in the benchmark. If we
passed an error rate of 8% to the evaluation then there would be no invalid
alignments. The number of additional hits is 0. This is the number of hits in
the read mapper output with a valid error rate (below 3% in this case) that
are not found in the gold standard. If this number is greater than zero then
an error occurred while building the gold standard or in the evaluation
program. If you get such a number then please contact the Rabema authors.
The total number of reads is 8,840, the number of reads having an alignment
with less than or equal to 3% error is 7,723 which also is the largest number
of “normalized intervals” to be found. Each read contributes at most one point
to the “normalized intervals score” achieved by a read mapper. Each interval
for a read contributes 1/x points where x is the number of alignments for the
read. A total of 7,723 reads could be mapped, which is 100% of all mappable
reads and 87.4% of all reads. The table below gives a further breakdown of
the found intervals and normalized found intervals by error rate of the
alignment. The data is broken down by error rate of the given alignment. For
example, there were 2,780 of 2,780 intervals found with error rate 0, which
amounts to 100% of such reads. There were 2,095.8 normalized such intervals of
which all were found.
A detailed description of how to use Rabema can be found in Rabema Manual. The
programs’ command line interface is documented in Rabema Command Line
Interface and Description of Rabema Reports contains an annotation of the
reports generated by Rabema.
Contact
------
Manuel Holtgrewe <manuel.holtgrewe@fu-berlin.de>
|