File: README

package info (click to toggle)
dascrubber 0~20160601-2
  • links: PTS, VCS
  • area: main
  • in suites: stretch
  • size: 436 kB
  • sloc: ansic: 8,947; makefile: 66
file content (47 lines) | stat: -rw-r--r-- 2,237 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47



*** PLEASE GO TO THE DAZZLER BLOG (https://dazzlerblog.wordpress.com) FOR TYPESET ***
         DOCUMENTATION, EXAMPLES OF USE, AND DESIGN PHILOSOPHY.



                The Dazzler Scrubbing Suite: DAS SCRUBBER

                            Author:  Gene Myers
                            First:   October 12, 2014
                            Recent:  March 27, 2016

  The goal of scrubbing is to produce a set of edited reads that are guaranteed to
(a) be continuous stretches of the underlying genome (i.e. no unremoved adapters
and not chimers), and (b) have no very low quality stretches (i.e. the error rate
never exceeds some reasonable maximum, 20% or so in the case of Pacbio data).  The
secondary goal of scrubbing is to do so with the minimum removal of data and splitting
of reads.

  The "DAS" suite will consist of a pipeline of several programs that will accomplish
the task of scrubbing.  At this time, we are releasing the first program which assigns
intrinsic quality values to every trace point interval of a read.

1. DASqv [-v] -c<int> <source:db> <overlaps:las>

   DASqv takes as input a database <source> and the local alignments, <overlaps>, for
said database or a block thereof.  Note carefully that <source> must always refer to
the entire DB, only <overlaps> can involve a block number.

   Using the local alignment-pile for each A-read, DASqv produces a QV value for each
complete segment of TRACE_SPACING bases (e.g. 100bp, the -s parameter to daligner).
The quality value of the average percentile of the best 25-5-% alignment matches
covering it depending on the coverage estimate -c.  One must supply the -c parameter
to the expected coverage of the genome in question.  All quality values over 50 are
clipped to 50.
   
   The quality values are written to a .qual track, that can be viewed by calling
DBdump with the -i option set ("i" for "intrinsic QV").

   The -v option prints out a histogram of the segment align matches, and the quality
values produced.  This histgram is usefull in assessing, for a given data set, what
constitutes the threshold -g and -b, to be used by down stream commands, for what is
definitely a good segment and what is definitely a bad segment.

2. DAStrim -- soon !