File: extrinsic.bug.cfg

# extrinsic information configuration file for AUGUSTUS
# 
# protein hints
# include with --extrinsicCfgFile=filename
# date: 16.10.2007
# Mario Stanke (mstanke@gwdg.de)


# source of extrinsic information:
# M manual anchor (required)
# P protein database hit
# E EST/cDNA database hit
# C combined est/protein database hit
# D Dialign
# R retroposed genes
# T transMapped refSeqs
# W wiggle track coverage info from RNA-Seq

[SOURCES]
M RM E W

#
# individual_liability: Only unsatisfiable hints are disregarded. By default this flag is not set
# and the whole hint group is disregarded when one hint in it is unsatisfiable.
# 1group1gene: Try to predict a single gene that covers all hints of a given group. This is relevant for
# hint groups with gaps, e.g. when two ESTs, say 5' and 3', from the same clone align nearby.
#
[SOURCE-PARAMETERS]


#   feature        bonus         malus   gradelevelcolumns
#		r+/r-
#
# the gradelevel columns have the following format for each source
# sourcecharacter numscoreclasses boundary    ...  boundary    gradequot  ...  gradequot
# 

[GENERAL]
      start        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
       stop        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        tss        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        tts        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        ass        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        dss        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
   exonpart        1     .992  M    1  1e+100  RM  1     1    E 1    1    W 1  1.005
       exon        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
 intronpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
     intron        1       .8  M    1  1e+100  RM  1     1    E 1    1000 W 1    1
    CDSpart        1  1 0.985  M    1  1e+100  RM  1     1    E 1    1	  W 1    1
        CDS        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
    UTRpart        1   1 .973  M    1  1e+100  RM  1     1    E 1    1    W 1    1
        UTR        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
     irpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1
nonexonpart        1        1  M    1  1e+100  RM  1     1.01 E 1    1    W 1    1
  genicpart        1        1  M    1  1e+100  RM  1     1    E 1    1    W 1    1

#
# Explanation: 
# 
# The gff/gtf file containing the hints must contain somewhere in the last
# column an entry source=?, where ? is one of the source characters listed in
# the line after [SOURCES] above. You can use different sources when you have
# hints of different reliability of the same type, e.g. exon hints from ESTs
# and exon hints from evolutionary conservation information.
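#
# A sketch of what such hint lines could look like (tab-separated gff columns:
# sequence, program, feature, start, end, score, strand, frame, attributes).
# The sequence name "chr1", the program names and all coordinates are made up
# for illustration:
#   chr1   est_aligner   intron     1201   1500   0   +   .   source=E
#   chr1   rnaseq_cov    exonpart   1501   1620   0   .   .   source=W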
# 
# In the [GENERAL] section the entry in the second column specifies a bonus for
# obeying a hint and the entry in the third column specifies a malus (penalty)
# for predicting a feature that is not supported by any hint. The bonus and the
# malus are factors by which the posterior probability of a gene structure is
# multiplied.
# Example: 
#   CDS     1000  0.7  ....
# means that, when AUGUSTUS is searching for the most likely gene structure,
# every gene structure that has a CDS exactly as given in a hint gets
# a bonus factor of 1000. Also, for every CDS that is not supported the
# probability of the gene structure gets a malus of 0.7. Increase the bonus to
# make AUGUSTUS obey more hints, decrease the malus to make AUGUSTUS predict fewer
# features that are not supported by hints. The malus helps increase
# specificity, e.g. when the exons predicted by AUGUSTUS are suspicious because
# there is no evidence from ESTs, mRNAs, protein databases, sequence
# conservation or transMapped expressed sequences.
# Setting the malus to 1.0 disables those penalties. Setting the bonus to 1.0
# disables the boni. 
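#
# Worked example (a sketch that only follows the description above and ignores
# the source-dependent modifiers explained further below): with
#   CDS     1000  0.7  ....
# a candidate gene structure with two CDS that exactly match hints and one CDS
# without hint support would have its probability multiplied by
#   1000 * 1000 * 0.7 = 7x10^5
# relative to the same structure scored without hints.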
# 
#       start: translation start (start codon), specifies an interval that contains
#              the start codon. The interval can be larger than 3bp, in which case
#              every ATG in the interval gets a bonus. The highest bonus is given
#              to ATGs in the middle of the interval, the bonus fades off towards the ends.
#        stop: translation end  (stop codon), see 'start'
#         tss: transcription start site, see 'start'
#         tts: transcription termination site, see 'start'
#         ass: acceptor (3') splice site, the last intron position
#         dss: donor (5') splice site, the first intron position
#    exonpart: part of an exon in the biological sense. The bonus applies only
#              to exons that contain the interval from the hint. Just
#              overlapping means no bonus at all. The malus applies to every
#              base of an exon, so the malus for an exon is exponential
#              in the length of the exon: malus=exonpartmalus^length.
#              The malus should therefore be close to 1, e.g. 0.99.
#        exon: exon in the biological sense. Only exons that exactly match the
#              hint get a bonus. Exception: The exons that contain the start
#              codon and stop codon. This malus applies to a complete exon
#              independent of its length.
#  intronpart: introns both between coding and non-coding exons. The bonus
#              applies to every intronic base in the interval of the hint.
#      intron: An intron gets the bonus if and only if it is exactly as in the hint.
#     CDSpart: part of the coding part of an exon. (CDS = coding sequence)
#         CDS: coding part of an exon with exact boundaries. For internal exons
#              of a multi exon gene this is identical to the biological
#              boundaries of the exon. For the first and the last coding exon
#              the boundaries are the boundaries of the coding sequence (start, stop).
#         UTR: exact boundaries of a UTR exon or the untranslated part of a
#              partially coding exon.
#     UTRpart: The hint interval must be included in the UTR part of an exon.
#      irpart: The bonus applies to every base of the intergenic region. If UTR
#              prediction is turned on (--UTR=on) then UTR is considered
#              genic. If, contrary to the usual meaning, you set the bonus of
#              irparts to be much smaller than 1 in the configuration file, you
#              can force AUGUSTUS to not predict an intergenic region in the
#              specified interval. This is useful if you want to tell AUGUSTUS
#              that two distant exons belong to the same gene, when AUGUSTUS
#              tends to split that gene into smaller genes.
# nonexonpart: intergenic region or intron. The bonus applies to every non-exon
#              base that overlaps with the interval from the hint. It is
#              geometric in the length of that overlap, so choose it close to
#              1.0. This is useful as a weak kind of masking, e.g. when it is
#              unlikely that a retroposed gene contains a coding region but you
#              do not want to completely forbid exons.
#   genicpart: everything that is not intergenic region, i.e. intron or exon or UTR if
#              applicable. The bonus applies to every genic base that overlaps with the
#              interval from the hint. This can be used in particular to make Augustus
#              predict one gene between positions a and b if a and b are experimentally
#              confirmed to be part of the same gene, e.g. through ESTs from the same clone.
#              alias: nonirpart
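#
# Worked example for the length-dependent factors, using values from the
# [GENERAL] section above (approximate, for illustration only, and assuming
# the source modifier simply multiplies the bonus as described below):
#   exonpart malus .992, exon of 200 bases without exonpart support:
#       .992^200  ~ 0.20
#   exonpart malus .992, exon of 1000 bases without exonpart support:
#       .992^1000 ~ 3.2x10^-4
#   nonexonpart bonus 1 with RM modifier 1.01, 100 overlapping non-exon bases:
#       1.01^100  ~ 2.7
# This is why per-base boni and mali are chosen very close to 1.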
#
# Any hints of types dss, intron, exon, CDS, UTR that (implicitly) suggest a donor splice
# site allow AUGUSTUS to predict a donor splice site that has a GC instead of the much more common GT.
# AUGUSTUS does not predict a GC donor splice site unless there is a hint for one.
# 
# Starting in column number 4 you can tell AUGUSTUS how to modify the bonus 
# depending on the source of the hint and the score of the hint. 
# The score of the hints is specified in the 6th column of the hint gff/gtf.
# If the score is used at all, it is not converted by a formula. Instead, hints
# are sorted into classes of scores, e.g. low score, medium score, high score.
# The format is the following:
# First, you specify the source character, then the number of classes (say n), then you
# specify the score boundaries that separate the classes (n-1 thresholds) and then you specify
# for each score class the multiplicative modifier to the bonus (n factors). 
# 
# Examples:
# 
# M 1 1e+100
# means for the manual hint there is only one score class, the bonus for this
# type of hint is multiplied by 10^100. This practically forces AUGUSTUS to obey
# all manual hints.
# 
# T    2       1.5 1 5e29
# For the transMap hints distinguish 2 classes: those with a score below 1.5 and
# those with a score above 1.5. The bonus of the lower score hints is unchanged and
# the bonus of the higher score hints is multiplied by 5x10^29.
# 
# D    8     1.5  2.5  3.5  4.5  5.5  6.5  7.5  0.58  0.4  0.2  2.9  0.87  0.44 0.31  7.3
# Use 8 score classes for the DIALIGN hints. DIALIGN hints give a score, a strand and
# reading frame information for CDSpart hints. The strand and reading frame are often correct but not
# often enough to rely on them. To account for that I generated hints for all
# 6 combinations of a strand and reading frame and then used 2x2x2=8 different
# score classes:
# {low score, high score} x {DIALIGN strand, opposite strand} x {DIALIGN reading frame, other reading frame}
# This example shows that scores don't have to be monotonic. A higher score
# does not have to mean a higher bonus. They are merely a way of classifying the
# hints into categories as you wish. In particular, you could get the effect of
# having different sources by having just hints of one source and then distinguishing
# more score classes.
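#
# Worked example (a sketch; scores are chosen away from the boundaries because
# the handling of a score exactly on a boundary is not specified here):
# With the D line above, a hint with score 3.0 lies between the boundaries 2.5
# and 3.5, i.e. in the third class, so its bonus is multiplied by 0.2. A hint
# with score 8 lies above 7.5, i.e. in the eighth class, and gets the factor 7.3.
# In the [GENERAL] section of this file, the entry "E 1 1000" in the intron line
# means one score class with factor 1000: the intron bonus of 1 becomes
# 1 * 1000 = 1000 for hints with source=E, regardless of their score.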
# 
# 
# Future plans:
# - Add fuzzy intron hints. Introns get a bonus only when they approximately
# have the same boundaries as in the hint.
# - Make the splice site hints fuzzy also. Allow a hint interval that contains a
# likely splice site, as opposed to only an individual position.
# - Write a program that automatically optimizes the boni and mali given an
# annotated test set of genes and hints for that set of sequences.