1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77
|
#Please insert up references in the next lines (line starts with keyword UP)
UP arb.hlp
UP glossary.hlp
#Please insert subtopic references (line starts with keyword SUB)
SUB pos_var_pars.hlp
# Hypertext links in helptext can be added like this: LINK{ref.hlp|http://add|bla@domain}
#************* Title of helpfile !! and start of real helpfile strunk********
TITLE Estimate Parameters from Column Statistics
OCCURRENCE ARB_DIST
DESCRIPTION In a standard RNA, base frequencies are not equally
distributed. Especially in the archea subclass we find
extremely G+C rich sequences.
This yielded in a couple of new rate corrections, algorithms
and programs which:
- calculate the average G+C content of all/two sequences
- correct the distance.
But further research showed us that the G+C frequencies are
not equally distributed within a sequence. Especially helical
parts have a significant higher G+C content than non
helical parts.
One strait forward algorithm would calculate each frequency
independently for each column.
Especially for small datasets the resulting frequencies would
look like random data, as too few examples are analyzed.
In ARB we implemented a combination of the 2 approaches.
Lets say we want to estimate a Parameter 'P' with
a maximum variance 'maxvar', so we need a minimum
samples 'minsap'.
- All sequence positions are clustered according to
- helical/non helical region
- variability
The size of the cluster is choosen with respect
to the variability of the sequences to get a
minimum of independent events.
- The final parameter estimate for a column is a
weighted sum between the estimate for the
cluster and the estimate for the single position.
You can give your favorite method a higher weight by
controlling the smoothing parameter:
Less smoothing -> independent parameter estimates
Much smoothing -> clustered parameter estimates
To get a good tree we recommend you to try all selections.
NOTES To get parameters from a column statistic you first have
to create one.
Do this with <ARB_NT/SAI/Positional Variability (Parsimony M.)>
WARNINGS Problems may occur when
1. 'independent parameter estimates' is selected and
2. your dataset is quite small (<100 Sequences) and
3. one sequence is bad or badly aligned
or
1. Much smoothing of parameters is selected and
2. you are analyzing ribosomal RNA and
3. 'Use Helix Information' is turned off
BUGS No bugs known
|