# extrinsic information configuration file for AUGUSTUS
#
# protein hints
# include with --extrinsicCfgFile=filename
# date: 16.10.2007
# Mario Stanke (mstanke@gwdg.de)
# source of extrinsic information:
# M manual anchor (required)
# P protein database hit
# E EST/cDNA database hit
# C combined est/protein database hit
# D Dialign
# R retroposed genes
# T transMapped refSeqs
# W wiggle track coverage info from RNA-Seq
[SOURCES]
M RM E W
#
# individual_liability: Only unsatisfiable hints are disregarded. By default this flag is not set
# and the whole hint group is disregarded when one hint in it is unsatisfiable.
# 1group1gene: Try to predict a single gene that covers all hints of a given group. This is relevant for
# hint groups with gaps, e.g. when two ESTs, say 5' and 3', from the same clone align nearby.
#
[SOURCE-PARAMETERS]
# feature bonus malus gradelevelcolumns
# r+/r-
#
# the gradelevel columns have the following format for each source
# sourcecharacter numscoreclasses boundary ... boundary gradequot ... gradequot
#
[GENERAL]
start 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
stop 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
tss 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
tts 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
ass 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
dss 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1.005
exon 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
intronpart 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
intron 1 .8 M 1 1e+100 RM 1 1 E 1 1000 W 1 1
CDSpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1
CDS 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
UTRpart 1 1 .973 M 1 1e+100 RM 1 1 E 1 1 W 1 1
UTR 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
irpart 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
nonexonpart 1 1 M 1 1e+100 RM 1 1.01 E 1 1 W 1 1
genicpart 1 1 M 1 1e+100 RM 1 1 E 1 1 W 1 1
#
# Explanation:
#
# The gff/gtf file containing the hints must contain somewhere in the last
# column an entry source=?, where ? is one of the source characters listed in
# the line after [SOURCES] above. You can use different sources when you have
# hints of different reliability of the same type, e.g. exon hints from ESTs
# and exon hints from evolutionary conservation information.
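#
# For illustration, a hint line in the gff file might look as follows. The
# sequence name, program name and coordinates are made up; only the source=E
# entry in the last column matters here. Columns are tab-separated in a real
# gff file:
#
# chr1 myprog exonpart 1000 1200 0 + . source=E
#
# Here E must be one of the source characters listed after [SOURCES].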
#
# In the [GENERAL] section the entry in the second column specifies a bonus for
# obeying a hint and the entry in the third column specifies a malus (penalty)
# for predicting a feature that is not supported by any hint. The bonus and the
# malus are factors by which the posterior probability of a gene structure is
# multiplied.
# Example:
# CDS 1000 0.7 ....
# means that, when AUGUSTUS is searching for the most likely gene structure,
# every gene structure that has a CDS exactly as given in a hint gets
# a bonus factor of 1000. Also, for every CDS that is not supported the
# probability of the gene structure gets a malus of 0.7. Increase the bonus to
# make AUGUSTUS obey more hints, decrease the malus to make AUGUSTUS predict fewer
# features that are not supported by hints. The malus helps to increase
# specificity, e.g. when the exons predicted by AUGUSTUS are suspicious because
# there is no evidence from ESTs, mRNAs, protein databases, sequence
# conservation or transMapped expressed sequences.
# Setting the malus to 1.0 disables those penalties. Setting the bonus to 1.0
# disables the boni.
#
# start: translation start (start codon), specifies an interval that contains
# the start codon. The interval can be larger than 3bp, in which case
# every ATG in the interval gets a bonus. The highest bonus is given
# to ATGs in the middle of the interval, the bonus fades off towards the ends.
# stop: translation end (stop codon), see 'start'
# tss: transcription start site, see 'start'
# tts: transcription termination site, see 'start'
# ass: acceptor (3') splice site, the last intron position
# dss: donor (5') splice site, the first intron position
# exonpart: part of an exon in the biological sense. The bonus applies only
# to exons that contain the interval from the hint. Just
# overlapping means no bonus at all. The malus applies to every
# base of an exon. Therefore the malus for an exon is exponential
# in the length of an exon: malus=exonpartmalus^length.
# Therefore the malus should be close to 1, e.g. 0.99.
# exon: exon in the biological sense. Only exons that exactly match the
# hint get a bonus. Exception: The exons that contain the start
# codon and stop codon. This malus applies to a complete exon
# independent of its length.
# intronpart: introns both between coding and non-coding exons. The bonus
# applies to every intronic base in the interval of the hint.
# intron: An intron gets the bonus if and only if it is exactly as in the hint.
# CDSpart: part of the coding part of an exon. (CDS = coding sequence)
# CDS: coding part of an exon with exact boundaries. For internal exons
# of a multi exon gene this is identical to the biological
# boundaries of the exon. For the first and the last coding exon
# the boundaries are the boundaries of the coding sequence (start, stop).
# UTR: exact boundaries of a UTR exon or the untranslated part of a
# partially coding exon.
# UTRpart: The hint interval must be included in the UTR part of an exon.
# irpart: The bonus applies to every base of the intergenic region. If UTR
# prediction is turned on (--UTR=on) then UTR is considered
# genic. If, against the usual meaning, you set the bonus of
# irparts to be much smaller than 1 in the configuration file, you
# can force AUGUSTUS to not predict an intergenic region in the
# specified interval. This is useful if you want to tell AUGUSTUS
# that two distant exons belong to the same gene, when AUGUSTUS
# tends to split that gene into smaller genes.
# nonexonpart: intergenic region or intron. The bonus applies to every non-exon
# base that overlaps with the interval from the hint. It is
# geometric in the length of that overlap, so choose it close to
# 1.0. This is useful as a weak kind of masking, e.g. when it is
# unlikely that a retroposed gene contains a coding region but you
# do not want to completely forbid exons.
# genicpart: everything that is not intergenic region, i.e. intron or exon or UTR if
# applicable. The bonus applies to every genic base that overlaps with the
# interval from the hint. This can be used in particular to make Augustus
# predict one gene between positions a and b if a and b are experimentally
# confirmed to be part of the same gene, e.g. through ESTs from the same clone.
# alias: nonirpart
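#
# Worked example for the per-base exonpart malus described above, using
# malus=exonpartmalus^length (values rounded; the exon length of 200 is chosen
# only for illustration):
# exonpartmalus 0.99, length 200: 0.99^200 is approximately 0.13
# exonpartmalus 0.992, length 200: 0.992^200 is approximately 0.20
# So even a malus this close to 1 penalizes an unsupported exon of 200 bases
# by a factor of roughly 5 to 7.5.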
#
# Any hints of types dss, intron, exon, CDS, UTR that (implicitly) suggest a donor splice
# site allow AUGUSTUS to predict a donor splice site that has a GC instead of the much more common GT.
# AUGUSTUS does not predict a GC donor splice site unless there is a hint for one.
#
# Starting in column number 4 you can tell AUGUSTUS how to modify the bonus
# depending on the source of the hint and the score of the hint.
# The score of the hints is specified in the 6th column of the hint gff/gtf.
# If the score is used at all, it is not converted directly through some
# formula. Instead, the hints are divided into classes of scores, e.g. low
# score, medium score, high score. The format is the following:
# First, you specify the source character, then the number of classes (say n), then you
# specify the score boundaries that separate the classes (n-1 thresholds) and then you specify
# for each score class the multiplicative modifier to the bonus (n factors).
#
# Examples:
#
# M 1 1e+100
# means for the manual hint there is only one score class, the bonus for this
# type of hint is multiplied by 10^100. This practically forces AUGUSTUS to obey
# all manual hints.
#
# T 2 1.5 1 5e29
# For the transMap hints distinguish 2 classes. Those with a score below 1.5 and
# with a score above 1.5. The bonus of the lower score hints is unchanged and
# the bonus of the higher score hints is multiplied by 5x10^29.
#
# D 8 1.5 2.5 3.5 4.5 5.5 6.5 7.5 0.58 0.4 0.2 2.9 0.87 0.44 0.31 7.3
# Use 8 score classes for the DIALIGN hints. DIALIGN hints give a score, a strand and
# reading frame information for CDSpart hints. The strand and reading frame are often correct but not
# often enough to rely on them. To account for that I generated hints for all
# 6 combinations of a strand and reading frame and then used 2x2x2=8 different
# score classes:
# {low score, high score} x {DIALIGN strand, opposite strand} x {DIALIGN reading frame, other reading frame}
# This example shows that the modifiers don't have to be monotonic in the score. A higher score
# does not have to mean a higher bonus. They are merely a way of classifying the
# hints into categories as you wish. In particular, you could get the effect of
# having different sources by having just hints of one source and then distinguishing
# more score classes.
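#
# As a sketch, here is how one line of the [GENERAL] section above reads under
# these rules:
# intron 1 .8 M 1 1e+100 RM 1 1 E 1 1000 W 1 1
# The bonus is 1 and the malus is 0.8. Each source has a single score class,
# so the score of the hint makes no difference. For an intron hint with
# source E the bonus is multiplied by 1000 (1 x 1000 = 1000), for one with
# source M by 10^100, and for sources RM and W the bonus stays at 1.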
#
#
# Future plans:
# - Add fuzzy intron hints. Introns get a bonus only when they approximately
# have the same boundaries as in the hint.
# - Make the splice site hints fuzzy also. Allow a hint interval that contains a
# likely splice site, as opposed to only an individual position.
# - Write a program that automatically optimizes the boni and mali given an
# annotated test set of genes and hints for that set of sequences.