File: README

package info (click to toggle)
seqan2 2.5.2-1
links: PTS, VCS
area: main
in suites: forky, sid
size: 228,748 kB
sloc: cpp: 257,602; ansic: 91,967; python: 8,326; sh: 1,056; xml: 570; makefile: 229; awk: 51; javascript: 21
file content (239 lines) | stat: -rw-r--r-- 9,206 bytes
parent folder | download | duplicates (2)
*** MicroRazerS - Rapid Alignment of Small RNA Reads ***
https://www.seqan.de/apps/microRazers.html

---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
  1.   Overview
  2.   Installation
  3.   Usage
  4.   Output Format
  5.   Contact

---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------

MicroRazerS is a tool for mapping millions of short reads obtained from small
RNA sequencing onto a reference genome. MicroRazerS searches for the longest
perfect prefix match of each read where the minimum prefix match length (the
seed length) can currently be varied between 14 and 22. Optionally, one
mismatch can be tolerated in the seed. MicroRazerS guarantees to find all
matches and reports a configurable maximum number of equally-best matches.
Perfect matches are given preference over matches containing mismatches, even
if this means mapping a shorter prefix.
Similar to RazerS, MicroRazerS uses a k-mer index of all reads and counts
common k-mers of reads and the reference genome in parallelograms. In
MicroRazerS, this index is built over the first seedlength many bases of each
read only. Each parallelogram with a k-mer count above a certain threshold
triggers a verification. On success, the genomic subsequence and the read
number are stored and later written to the output file.

---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------

MicroRazerS is distributed with SeqAn - The C++ Sequence Analysis Library
(see https://www.seqan.de). To build MicroRazerS do the following:

  1)  Download the latest snapshot of SeqAn
  2)  Unzip it to a directory of your choice (e.g. snapshot)
  3)  cd snapshot/apps
  4)  make micro_razers
  5)  cd micro_razers
  6)  ./micro_razers --help

Alternatively you can check out the latest Git version of MicroRazerS and SeqAn
with:

  1)  git clone https://github.com/seqan/seqan.git
  2)  mkdir seqan/buld; cd seqan/build
  3)  cmake .. -DCMAKE_BUILD_TYPE=Release
  4)  make micro_razers
  5)  ./bin/micro_razers --help

On success, an executable file micro_razers was built and a brief usage
description was dumped.

---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------

To get a short usage description of MicroRazerS, you can execute micro_razers -h or
micro_razers --help.

Usage: micro_razers [OPTION]... <GENOME FILE> <READS FILE>

MicroRazerS expects the names of two DNA (multi-)Fasta files. The first contains
a reference genome and the second contains the reads that should be
mapped against the reference. Without any additional parameters MicroRazerS
would map all reads against both strands of the reference genome requiring a perfect
prefix seed match of length >= 16. The up tp 100 equally best (longest) matches
would then be dumped in the default output file, the read file name extended by the
suffix ".result".
The default behaviour can be modified by adding the following options to
the command line:

---------------------------------------------------------------------------
3.1. Main Options
---------------------------------------------------------------------------

  [ -sL NUM ],  [ --seed-length NUM ]

  Seed length parameter. Minimum length of prefix match.

  [ -sE ],  [ --seed-error ]

  Allow for one mismatch in the prefix seed.

  [ -f ],  [ --forward ]

  Only map reads against the positive/forward strand of the genome. By
  default, both strands are scanned.

  [ -r ],  [ --reverse ]

  Only map reads against the negative/reverse-complement strand of the
  genome. By default, both strands are scanned.

  [ -m NUM ],  [ --max-hits NUM ]

  Output at most NUM of the best matches.

  [ -o FILE ],  [ --output FILE ]

  Change the output filename to FILE. By default, this is the read file
  name extended by the suffix ".result".

  [ -mN ],  [ --match-N ]

  By default, 'N' characters in read or genome sequences equal to nothing,
  not even to another 'N'. They are considered as errors. By activating this
  option, 'N' equals to every other character and produces no mismatch in
  the verification process. The filtration is not affected by this option.

  [ -pa ],  [ --purge-ambiguous ]

  Omit reads with more than #max-hits many equally-best matches.

  [ -v ],  [ --verbose ]

  Verbose. Print extra information and running times.

  [ -vv ],  [ --vverbose ]

  Very verbose. Like -v, but also print filtering statistics like true and
  false positives (TP/FP).

  [ -V ],  [ --version ]

  Print version information.

  [ -h ],  [ --help ]

  Print a brief usage summary.

---------------------------------------------------------------------------
3.2. Output Format Options
---------------------------------------------------------------------------

  [ -of NUM ],  [ --outputFormat ]

  Specify the desired output format:
  NUM = 0   => MicroRazerS default format
  NUM = 1   => SAM format

  [ -a ],  [ --alignment ]

  Dump the alignment for each match in the ".result" file, only possible for
  -of 0, i.e. MicroRazerS default output format. The alignment is written
  directly after the match and has the following format:
  #Read:   CAGGAGATAAGCTGGATCTTT
  #Genome: CAGGAGATAAGCTGGATCTTT

  [ -gn NUM ],  [ --genome-naming NUM ]

  Select how genomes are named in the output file. If NUM is 0, the Fasta
  ids of the genome sequences are used (default). If NUM is 1, the genome
  sequences are enumerated beginning with 1.

  [ -rn NUM ],  [ --read-naming NUM ]

  Select how reads are named in the output file. If NUM is 0, the Fasta ids
  of the reads are used (default). If NUM is 1, the reads are enumerated
  beginning with 1. If NUM is 2, the read sequence itself is used.

  [ -so NUM ],  [ --sort-order NUM ]

  Select how matches are ordered in the output file.
  If NUM is 0, matches are sorted primarily by the read number and
  secondarily by their position in the genome sequence (default).
  If NUM is 1, matches are sorted primarily by their position in the genome
  sequence and secondarily by the read number.

  [ -pf NUM ],  [ --position-format NUM ]

  Select how positions are stored in the output file (only relevant for
  MicroRazerS default output format, i.e. -of 0).
  If NUM is 0, the gap space is used, i.e. gaps around characters are
  enumerated beginning with 0 and the beginning and end position is the
  postion of the gap before and after a match (default).
  If NUM is 1, the position space is used, i.e. characters are enumerated
  beginning with 1 and the beginning and end position is the postion of the
  first and last character involved in a match.

  Example: Consider the string CONCAT. The beginning and end positions
  of the substring CAT are (3,6) in gap space and (4,6) in position space.



---------------------------------------------------------------------------
4.1. Default Output Format
---------------------------------------------------------------------------

The default output file is a text file whose lines represent matches. A line
consists of different tab-separated match values. In the following format:

RName 0 RLength GStrand GName GBegin GEnd PercID MatchLen

Match value description:

  RName        Name of the read sequence (see --read-naming)
  RLength      Length of the read
  GStrand      'F'=forward strand or 'R'=reverse strand
  GName        Name of the genome sequence (see --genome-naming)
  GBegin       Beginning position in the genome sequence
  GEnd         End position in the genome sequence
  PercID       Percent sequence identity within matched prefix
  MatchLen     Length of prefix match

For matches on the reverse strand, GBegin and GEnd are positions on the
related forward strand.


---------------------------------------------------------------------------
4.2. SAM Output Format
---------------------------------------------------------------------------

If -of 1 is specified, the resulting matches will be written out in SAM
format. For a full specification of the SAM format please see
https://samtools.sourceforge.net.

MicroRazerS assigns each read with multiple best matches a mapping quality
of "0". If only one best match was found, it will receive a "255" in the
mapping quality column. Column 12 informs about the sequence identity in
the matched part of the read, e.g. AS:i:100 means that the read match has
100% sequence identity, i.e. does not contain any mismatches. Unmapped
read suffixes will be given as soft-clipped sequence in the Cigar string.
For example, a 30bp read with a 22bp prefix match will have the Cigar string
"22M8S", or "8S22M" if it matches the reverse strand.



---------------------------------------------------------------------------
5. Contact
---------------------------------------------------------------------------

For questions or comments, contact:
  Anne-Katrin Emde <emde@inf.fu-berlin.de>