*** MicroRazerS - Rapid Alignment of Small RNA Reads ***
https://www.seqan.de/apps/microRazers.html
---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
1. Overview
2. Installation
3. Usage
4. Output Format
5. Contact
---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------
MicroRazerS is a tool for mapping millions of short reads obtained from small
RNA sequencing onto a reference genome. MicroRazerS searches for the longest
perfect prefix match of each read where the minimum prefix match length (the
seed length) can currently be varied between 14 and 22. Optionally, one
mismatch can be tolerated in the seed. MicroRazerS guarantees to find all
matches and reports a configurable maximum number of equally-best matches.
Perfect matches are given preference over matches containing mismatches, even
if this means mapping a shorter prefix.
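The matching rule described above can be illustrated with a minimal brute-force
sketch. It is not the MicroRazerS implementation: it covers only the perfect
(zero-mismatch) case on the forward strand, and the function names and the toy
sequences are assumptions made for illustration.

```python
def perfect_prefix_length(read: str, genome: str, pos: int) -> int:
    """Count the leading read bases that match the genome exactly at pos."""
    n = 0
    while n < len(read) and pos + n < len(genome) and read[n] == genome[pos + n]:
        n += 1
    return n


def best_prefix_matches(read: str, genome: str, seed_length: int = 16,
                        max_hits: int = 100):
    """Return (match_length, begin_positions) of the longest perfect prefix
    matches that are at least seed_length bases long.

    Hypothetical helper: scans every genome position, which is far slower
    than the filter-and-verify strategy the tool itself uses.
    """
    best_len = 0
    hits = []
    for pos in range(len(genome) - seed_length + 1):
        n = perfect_prefix_length(read, genome, pos)
        if n < seed_length:
            continue  # prefix shorter than the seed: not a match
        if n > best_len:
            best_len, hits = n, [pos]  # strictly longer prefix replaces old hits
        elif n == best_len:
            hits.append(pos)  # equally-best match
    return best_len, hits[:max_hits]


# Toy example with a shortened seed length so the sequences stay readable.
print(best_prefix_matches("ACGTAG", "AAACGTACCCACGTTTT", seed_length=4))
```

In the toy example, the read prefix matches 5 bases at one position and only 4
bases at another, so only the longer match is kept.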
Similar to RazerS, MicroRazerS uses a k-mer index of all reads and counts
common k-mers of reads and the reference genome in parallelograms. In
MicroRazerS, this index is built only over the first seed-length bases of each
read. Each parallelogram with a k-mer count above a certain threshold
triggers a verification. On success, the genomic subsequence and the read
number are stored and later written to the output file.
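The idea of k-mer counting as a filter can be sketched as follows. This is a
simplified illustration only: it counts shared k-mers between a read seed and
a plain genome window instead of a parallelogram, and it uses the standard
q-gram lemma as the threshold. The threshold MicroRazerS actually applies is
not stated in this document, and both function names are hypothetical.

```python
def kmer_threshold(seed_length: int, k: int, errors: int) -> int:
    """q-gram lemma: a seed of seed_length bases with at most `errors`
    mismatches shares at least seed_length - k + 1 - errors * k k-mers
    with its occurrence (never less than 0)."""
    return max(seed_length - k + 1 - errors * k, 0)


def shared_kmers(seed: str, window: str, k: int) -> int:
    """Count the k-mers of the seed that also occur somewhere in the window."""
    window_kmers = {window[i:i + k] for i in range(len(window) - k + 1)}
    return sum(1 for i in range(len(seed) - k + 1) if seed[i:i + k] in window_kmers)


seed = "ACGTACGT"
window = "TTACGTACGTTT"
k = 4
# A window is passed on to verification only if enough k-mers are shared.
needs_verification = shared_kmers(seed, window, k) >= kmer_threshold(len(seed), k, 0)
print(needs_verification)
```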
---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------
MicroRazerS is distributed with SeqAn - The C++ Sequence Analysis Library
(see https://www.seqan.de). To build MicroRazerS do the following:
1) Download the latest snapshot of SeqAn
2) Unzip it to a directory of your choice (e.g. snapshot)
3) cd snapshot/apps
4) make micro_razers
5) cd micro_razers
6) ./micro_razers --help
Alternatively you can check out the latest Git version of MicroRazerS and SeqAn
with:
1) git clone https://github.com/seqan/seqan.git
2) mkdir seqan/build; cd seqan/build
3) cmake .. -DCMAKE_BUILD_TYPE=Release
4) make micro_razers
5) ./bin/micro_razers --help
On success, an executable file micro_razers has been built and a brief usage
description is printed.
---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------
To get a short usage description of MicroRazerS, you can execute micro_razers -h or
micro_razers --help.
Usage: micro_razers [OPTION]... <GENOME FILE> <READS FILE>
MicroRazerS expects the names of two DNA (multi-)Fasta files. The first contains
a reference genome and the second contains the reads that should be
mapped against the reference. Without any additional parameters, MicroRazerS
maps all reads against both strands of the reference genome, requiring a
perfect prefix seed match of length >= 16. Up to 100 equally-best (longest)
matches are then written to the default output file, which is named after the
read file with the suffix ".result" appended.
The default behaviour can be modified by adding the following options to
the command line:
---------------------------------------------------------------------------
3.1. Main Options
---------------------------------------------------------------------------
[ -sL NUM ], [ --seed-length NUM ]
Seed length parameter. Minimum length of prefix match.
[ -sE ], [ --seed-error ]
Allow for one mismatch in the prefix seed.
[ -f ], [ --forward ]
Only map reads against the positive/forward strand of the genome. By
default, both strands are scanned.
[ -r ], [ --reverse ]
Only map reads against the negative/reverse-complement strand of the
genome. By default, both strands are scanned.
[ -m NUM ], [ --max-hits NUM ]
Output at most NUM of the best matches.
[ -o FILE ], [ --output FILE ]
Change the output filename to FILE. By default, this is the read file
name extended by the suffix ".result".
[ -mN ], [ --match-N ]
By default, 'N' characters in read or genome sequences match nothing, not
even another 'N', and are counted as errors. With this option, 'N' matches
every other character and produces no mismatch in the verification process.
The filtration is not affected by this option.
[ -pa ], [ --purge-ambiguous ]
Omit reads with more than --max-hits equally-best matches.
[ -v ], [ --verbose ]
Verbose. Print extra information and running times.
[ -vv ], [ --vverbose ]
Very verbose. Like -v, but also print filtering statistics like true and
false positives (TP/FP).
[ -V ], [ --version ]
Print version information.
[ -h ], [ --help ]
Print a brief usage summary.
---------------------------------------------------------------------------
3.2. Output Format Options
---------------------------------------------------------------------------
[ -of NUM ], [ --outputFormat NUM ]
Specify the desired output format:
NUM = 0 => MicroRazerS default format
NUM = 1 => SAM format
[ -a ], [ --alignment ]
Dump the alignment for each match in the ".result" file. This is only
possible for -of 0, i.e. the MicroRazerS default output format. The alignment
is written directly after the match and has the following format:
#Read: CAGGAGATAAGCTGGATCTTT
#Genome: CAGGAGATAAGCTGGATCTTT
[ -gn NUM ], [ --genome-naming NUM ]
Select how genomes are named in the output file. If NUM is 0, the Fasta
ids of the genome sequences are used (default). If NUM is 1, the genome
sequences are enumerated beginning with 1.
[ -rn NUM ], [ --read-naming NUM ]
Select how reads are named in the output file. If NUM is 0, the Fasta ids
of the reads are used (default). If NUM is 1, the reads are enumerated
beginning with 1. If NUM is 2, the read sequence itself is used.
[ -so NUM ], [ --sort-order NUM ]
Select how matches are ordered in the output file.
If NUM is 0, matches are sorted primarily by the read number and
secondarily by their position in the genome sequence (default).
If NUM is 1, matches are sorted primarily by their position in the genome
sequence and secondarily by the read number.
[ -pf NUM ], [ --position-format NUM ]
Select how positions are stored in the output file (only relevant for
MicroRazerS default output format, i.e. -of 0).
If NUM is 0, the gap space is used, i.e. gaps around characters are
enumerated beginning with 0 and the beginning and end positions are the
positions of the gaps before and after a match (default).
If NUM is 1, the position space is used, i.e. characters are enumerated
beginning with 1 and the beginning and end positions are the positions of the
first and last characters involved in a match.
Example: Consider the string CONCAT. The beginning and end positions
of the substring CAT are (3,6) in gap space and (4,6) in position space.
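The relation between the two coordinate systems can be written down as a small
sketch. The helper names are hypothetical and not part of MicroRazerS. The
code only restates the CONCAT example: the gap-space interval is half-open and
0-based, the position-space interval is closed and 1-based.

```python
def substring_gap_space(text: str, sub: str) -> tuple:
    """Begin and end of the first occurrence of sub in gap space
    (0-based gap before the match, gap after the match)."""
    begin = text.index(sub)
    return begin, begin + len(sub)


def gap_to_position(begin: int, end: int) -> tuple:
    """Gap space -> position space: only the begin coordinate shifts by one."""
    return begin + 1, end


def position_to_gap(begin: int, end: int) -> tuple:
    """Position space -> gap space (inverse of gap_to_position)."""
    return begin - 1, end


gap = substring_gap_space("CONCAT", "CAT")
print(gap, gap_to_position(*gap))
```

In both systems the end coordinate is the same number; the match length is
end - begin in gap space and end - begin + 1 in position space.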
---------------------------------------------------------------------------
4.1. Default Output Format
---------------------------------------------------------------------------
The default output file is a text file whose lines represent matches. A line
consists of tab-separated match values in the following format:
RName 0 RLength GStrand GName GBegin GEnd PercID MatchLen
Match value description:
RName Name of the read sequence (see --read-naming)
RLength Length of the read
GStrand 'F'=forward strand or 'R'=reverse strand
GName Name of the genome sequence (see --genome-naming)
GBegin Beginning position in the genome sequence
GEnd End position in the genome sequence
PercID Percent sequence identity within matched prefix
MatchLen Length of prefix match
For matches on the reverse strand, GBegin and GEnd are positions on the
corresponding forward strand.
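A reader for this format could look like the following sketch. It assumes
exactly the nine tab-separated columns listed above, leaves the second column
(shown as 0 in the layout) uninterpreted, and skips the '#'-prefixed alignment
lines written by --alignment. The Match type, the function name and the sample
line are made up for illustration.

```python
from typing import NamedTuple, Optional


class Match(NamedTuple):
    read_name: str
    read_length: int
    strand: str            # 'F' or 'R'
    genome_name: str
    genome_begin: int
    genome_end: int
    percent_identity: float
    match_length: int


def parse_match_line(line: str) -> Optional[Match]:
    """Parse one line of the default output; return None for blank lines
    and for '#Read:' / '#Genome:' alignment lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    # fields[1] is the second column of the layout; it is not interpreted here.
    return Match(fields[0], int(fields[2]), fields[3], fields[4],
                 int(fields[5]), int(fields[6]), float(fields[7]), int(fields[8]))


# Invented sample line, not real MicroRazerS output.
sample = "read1\t0\t30\tF\tchr1\t100\t122\t100\t22\n"
print(parse_match_line(sample))
```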
---------------------------------------------------------------------------
4.2. SAM Output Format
---------------------------------------------------------------------------
If -of 1 is specified, the resulting matches will be written out in SAM
format. For a full specification of the SAM format please see
https://samtools.sourceforge.net.
MicroRazerS assigns each read with multiple best matches a mapping quality
of "0". If only one best match was found, it will receive a "255" in the
mapping quality column. Column 12 reports the sequence identity in
the matched part of the read, e.g. AS:i:100 means that the read match has
100% sequence identity, i.e. does not contain any mismatches. Unmapped
read suffixes will be given as soft-clipped sequence in the Cigar string.
For example, a 30bp read with a 22bp prefix match will have the Cigar string
"22M8S", or "8S22M" if it matches the reverse strand.
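The Cigar and mapping quality rules above can be restated as a short sketch.
The function names are hypothetical; the code only reproduces the behaviour
described in this section and is not taken from the MicroRazerS source.

```python
def prefix_match_cigar(read_length: int, match_length: int,
                       reverse_strand: bool = False) -> str:
    """Cigar string for a prefix match: matched bases as M, the unmapped
    read suffix as soft clip (S). On the reverse strand the clip comes first."""
    if not 0 < match_length <= read_length:
        raise ValueError("match_length must be between 1 and read_length")
    clipped = read_length - match_length
    if clipped == 0:
        return f"{match_length}M"  # whole read matched, nothing to clip
    if reverse_strand:
        return f"{clipped}S{match_length}M"
    return f"{match_length}M{clipped}S"


def mapping_quality(num_best_matches: int) -> int:
    """255 for a unique best match, 0 if there are several best matches."""
    return 255 if num_best_matches == 1 else 0


print(prefix_match_cigar(30, 22), prefix_match_cigar(30, 22, reverse_strand=True))
```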
---------------------------------------------------------------------------
5. Contact
---------------------------------------------------------------------------
For questions or comments, contact:
Anne-Katrin Emde <emde@inf.fu-berlin.de>