File: README

package info (click to toggle)
seqan2 2.5.2-1
links: PTS, VCS
area: main
in suites: forky, sid
size: 228,748 kB
sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (139 lines) | stat: -rw-r--r-- 6,245 bytes
parent folder | download | duplicates (2)
=================================
RABEMA - Read Alignment BEnchMArk
=================================

Rabema is a program that supports a new read mapper benchmark methodology. The
methodology is based on a strict definition of the read mapping problem and
allows the evaluation of arbitrary read mapping programs that create SAM
output.

The paper gives an explanation of the theory behind the benchmark methodology
with an example evaluatio

For installation instruction and online material, see the project website:

  https://seqan.de/projects/rabema.html

For the Impatient (Getting Started)
-----------------------------------

After building the Rabema and Razers 3 programs (currently in “extras”, soon
to go to “core”), you can build the gold standard for one of the example
datasets from the rabema-data.tar.bz file (see above).  We have to do the
following: (1) Create a perfect mapping SAM file using RazerS 3. (2) Prepare
SAM file (by replacing * for duplicative sequences by the actual sequence. (3)
Obtain a copy of the file sorted by coordinate. Sorting is easiest done using
samtools after conversion to BAM. (4) Build the gold standard using Rabema
(rabema_build_gold_standard). Details on the parameters can be found in the
Rabema manual.

  ./bin/razers3 --dont-shrink-alignments --verbose \
    --recognition-rate 100 \
    --percent-identity 92 \
    --max-hits 10000000 \
    --output out_gold.sam \
    rabema-data/data/saccharomyces/genome.fasta \
    rabema-data/data/saccharomyces/reads_454/SRR000853.10k.fastq

  ./bin/rabema_prepare_sam \
    -i out_gold.sam -o out_gold.prep.sam

Note that by default rabema_prepare_sam assumes that the input SAM file is
sorted by query name in the same order as produced by samtools sort -n. If
your file is not sorted in this way then rabema_prepare_sam will stop with a
failing sanity check. In this case, call it with the --dont-check-sorting
parameter like this:

  ./bin/rabema_prepare_sam -i out_gold.sam \
    --dont-check-sorting -o out_gold.prep.sam.
  samtools view -Sb out_gold.prep.sam >out_gold.bam
  samtools sort out_gold.bam out_gold.by_coord

  ./bin/rabema_build_gold_standard \
    --max-error 8 \
    --distance-metric edit \
    --out-gsi gold_standard.gsi \
    --reference rabema-data/data/saccharomyces/genome.fasta \
    --in-bam out_gold.by_coord.bam

Next, compare the results of another read mapper against the gold
standard. Here, we use the RazerS with default parameters. First, run RazerS.

  ./bin/razers3 -vv -of sam -o out_default.sam \
    rabema-data/data/saccharomyces/genome.fasta \
    rabema-data/data/saccharomyces/reads_454/SRR000853.10k.fastq

Then, compare the result with the gold standard. Note that the error rate used
here is independent of the one built in the standard and only has to be less
than or equal to the error rate used there. Let’s have a look at the output in
the category all.

  ./bin/rabema_evaluate \
    --max-error 8 \
    --distance-metric edit \
    --benchmark-category all \
    --reference rabema-data/data/saccharomyces/genome.fasta \
    --in-sam out_default.sam \
    --in-gsi gold_standard.gsi

At the end of the output, there will be the following report:

  Intervals to find:              10839
  Intervals found:                10839
  Intervals found [%]             100
  Invalid alignments:             312
  Additional Hits:                0

  Number of reads:                8840
  Number of reads with intervals: 7723

  Mapped reads:                   7723
  Mapped reads [% of total]:      87.3643
  Mapped reads [% of mappable]:   100

  Normalized intervals found:     7723
  Normalized intervals found [%]: 100

  Found Intervals By Error Rate

    ERR       #max      #found      %found    norm max    norm found  norm found [%]
  ------------------------------------------------------------------------------------
      0       2780        2780      100.00     2095.80       2095.80      100.00
      1       4221        4221      100.00     3083.12       3083.12      100.00
      2       2721        2721      100.00     1849.09       1849.09      100.00
      3       1117        1117      100.00      694.99        694.99      100.00

In total, there were 10,839 intervals with up to 3% errors to be found and
RazerS found all of them (the reads are very long so there is fewer room for
ambiguity). The number of invalid alignments is 312. These alignments are
caused by the defaults setting for allowed error rate of RazerS being 8%. The
term “incorrect” is fixed to the given error rate in the benchmark. If we
passed an error rate of 8% to the evaluation then there would be no invalid
alignments. The number of additional hits is 0. This is the number of hits in
the read mapper output with a valid error rate (below 3% in this case) that
are not found in the gold standard. If this number is greater than zero then
an error occurred while building the gold standard or in the evaluation
program. If you get such a number then please contact the Rabema authors.

The total number of reads is 8,840, the number of reads having an alignment
with less than or equal to 3% error is 7,723 which also is the largest number
of “normalized intervals” to be found. Each read contributes at most one point
to the “normalized intervals score” achieved by a read mapper. Each interval
for a read contributes 1/x points where x is the number of alignments for the
read. A total of 7,723 reads could be mapped, which is 100% of all mappable
reads and 87.4% of all reads.  The table below gives a further breakdown of
the found intervals and normalized found intervals by error rate of the
alignment. The data is broken down by error rate of the given alignment. For
example, there were 2,780 of 2,780 intervals found with error rate 0, which
amounts to 100% of such reads. There were 2,095.8 normalized such intervals of
which all were found.

A detailed description of how to use Rabema can be found in Rabema Manual. The
programs’ command line interface is documented in Rabema Command Line
Interface and Description of Rabema Reports contains an annotation of the
reports generated by Rabema.

Contact
------

  Manuel Holtgrewe <manuel.holtgrewe@fu-berlin.de>