File: bayes.html

<!-- header fragment for html documentation -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>

<META NAME="description" CONTENT="Estimation of population parameters using genetic data using a maximum likelihood approach with Metropolis-Hastings Monte Carlo Markov chain importance sampling">
<META NAME="keywords" CONTENT="MCMC, Markov chain, Monte Carlo, Metropolis-Hastings, population, parameters, migration rate, population size, recombination rate, maximum likelihood">

<TITLE>LAMARC Documentation: Bayesian Tutorial</title>
</HEAD>


<BODY BGCOLOR="#FFFFFF">
<!-- coalescent, coalescence, Metropolis-Hastings, Markov chain Monte Carlo
 simulation, migration rate, effective population size, recombination rate,
 maximum likelihood -->


(<A HREF="output.html">Previous</A> | <A HREF="index.html">Contents</A> | <A
HREF="tracer.html">Next</A>)

<H2>Bayesian-LAMARC tutorial:  Why?</H2>

<P>This tutorial is designed to be read from beginning to end, but if you
like you can jump straight to:</P>

<UL>
<LI><A HREF="bayes.html#intro">Introduction to Bayesian analysis</A></LI>
<LI><A HREF="bayes.html#approach">The Bayesian approach in LAMARC</A></LI>
<LI><A HREF="bayes.html#results">Results of a Bayesian run</A></LI>
<LI><A HREF="bayes.html#curvefiles">Curvefile uses</A></LI>
<LI><A HREF="bayes.html#drawbacks">Curvefile drawbacks</A></LI>
<LI><A HREF="bayes.html#surprises">Curvefile surprises</A></LI>
<LI><A HREF="bayes.html#compare">Comparing Bayes-LAMARC to Likelihood-LAMARC</A></LI>
<LI><A HREF="bayes.html#priors">Details about Bayesian priors</A></LI>
<LI><A HREF="bayes.html#final">Final thoughts</A></LI>
</UL>


<h3><A NAME="intro">What's this about?  What's a Bayesian analysis?</h3>

<P>Better minds than mine have devoted themselves to creating whole web
pages explaining what a Bayesian approach is, and why it's important.  See,
for example, Eliezer Yudkowsky's <A
HREF="http://yudkowsky.net/rational/bayes/">An Intuitive Explanation of
Bayes' Theorem</A>.  In brief, the advantages of a Bayesian approach
include:

<UL>
<LI> Any information you have about the parameters before you start the
program can be incorporated into the answer
<LI> The answers are in terms of probability instead of likelihood, and
the answers therefore have credibility intervals instead of support
intervals.
<LI> For LAMARC in particular, the limitations of reliance on 'driving
values' for the likelihood run are no longer an issue because the parameters
vary during the search through tree-space.
</ul>

The biggest disadvantage of a Bayesian approach is:

<ul>
<LI> You <i>must</i> include information about the parameters before you
start the analysis, regardless of your state of ignorance, and that
information is incorporated into the answer, sometimes in non-obvious ways.
</ul>
</P>

<h3><A NAME="approach">How is a Bayesian approach used in LAMARC?</h3>

<P>In Likelihood-LAMARC, each chain is a search through tree-space with a
single set of 'driving values'.  In Bayesian LAMARC, this search through
tree-space is augmented by a search through parameter space as defined by
the priors.  This search serves by proxy as the 'driving values' for the
parallel search through tree-space.
</P>

<P>In practical terms, a Likelihood-LAMARC search looks like:
<OL>
<LI>Generate a tree using a set of driving values.
<LI>Come up with a new tree using that same set of driving values and the
old tree.
<LI>Compare the two trees, and usually keep the one with the better
likelihood.
<LI>Save the tree you liked better.
<LI>Go to step 2 until you're done.
<LI>Analyze the trees you've collected.
<LI>Make a new set of driving values based on your trees, and go back
to step 2 again for a new chain.
</ol>

(The 'usually' is so that the search can sometimes move 'downhill' in hopes
of finding a better peak in a different area.)
</P>
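<P>For readers who think in code, the sketch below shows the shape of this
loop.  It is illustrative only, not LAMARC's actual implementation; the
helpers 'propose_tree' and 'data_log_likelihood' are hypothetical
stand-ins for LAMARC's rearrangement and likelihood machinery.</P>

<PRE>
# Illustrative sketch of a Likelihood-LAMARC chain (not the real code).
# propose_tree() and data_log_likelihood() are hypothetical helpers.
import math, random

def likelihood_chain(data, driving_values, tree, n_steps):
    saved_trees = []
    cur = data_log_likelihood(data, tree, driving_values)
    for _ in range(n_steps):
        # step 2: propose a new tree from the old one, under the same
        # driving values
        new_tree = propose_tree(tree, driving_values)
        new = data_log_likelihood(data, new_tree, driving_values)
        # step 3: usually keep the better tree, but sometimes accept a
        # worse one so the search can move 'downhill'
        if math.log(random.random()) &lt; new - cur:
            tree, cur = new_tree, new
        saved_trees.append(tree)          # step 4: save the tree
    return saved_trees                    # analyzed in step 6
</PRE>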
<P>
A Bayesian-LAMARC search looks like:

<OL>
<LI>Generate a tree using a set of driving values.
<LI>Randomly choose whether to go to step 3 or step 7.
</P>
<P>
<LI>Select a new point in parameter space.
<LI>Compare the two sets of parameters, and usually keep the one with the
better likelihood.
<LI>Save the set of parameters you liked better, and replace the old set of
driving values with those new ones.
<LI>Go to step 2 unless you're done, then go to step 10.
</P>
<P>
<LI>Come up with a new tree using the current set of driving values and the
old tree.
<LI>Compare the two trees, and usually keep the one with the better
likelihood.
<LI>Go to step 2 unless you're done, then go to step 10.
</P>
<P>
<LI>Analyze the parameters you've collected.
</P>
</ol>

Steps 1 and 7-9 are exactly the same as those in Likelihood-LAMARC, except
that the trees are not saved for subsequent analysis, and instead the
<b>parameters</b> are saved and analyzed.
</P>
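<P>Again in code form, here is a minimal sketch of the Bayesian loop
(illustrative only; 'propose_params', 'prior_log_density', 'propose_tree'
and 'data_log_likelihood' are hypothetical helpers, and the step numbers
refer to the list above):</P>

<PRE>
# Illustrative sketch of a Bayesian-LAMARC chain (not the real code).
import math, random

def bayesian_chain(data, params, tree, n_steps):
    saved_params = []
    cur = data_log_likelihood(data, tree, params)
    for _ in range(n_steps):
        if random.random() &lt; 0.5:            # step 2: pick a move type
            # steps 3-5: propose a new point in parameter space and
            # usually keep the more probable one; the winner becomes
            # the new set of driving values
            new_params = propose_params(params)
            new = data_log_likelihood(data, tree, new_params)
            ratio = (new + prior_log_density(new_params)) \
                    - (cur + prior_log_density(params))
            if math.log(random.random()) &lt; ratio:
                params, cur = new_params, new
        else:
            # steps 7-9: rearrange the tree under the current driving values
            new_tree = propose_tree(tree, params)
            new = data_log_likelihood(data, new_tree, params)
            if math.log(random.random()) &lt; new - cur:
                tree, cur = new_tree, new
        saved_params.append(params)   # the parameters, not the trees, are kept
    return saved_params               # step 10: analyze the sampled parameters
</PRE>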

<P>Just as Likelihood-LAMARC samples trees proportionally to their
likelihood, Bayesian-LAMARC samples sets of parameters proportionally to their
probability.  For more detail, and a comparison of simulated data in both a
Likelihood and Bayesian setting, see:</P>

<P><A HREF="http://www.genetics.org/cgi/content/abstract/175/1/155">Kuhner,
M. K. and L. P. Smith, 2007  <i>Comparing Likelihood and Bayesian Coalescent
Estimation of Population Parameters</i> Genetics 175: 155-165.</a></P>


<h3>That was more than I really wanted to know.</h3>

<P>You asked!</P>

<h3><A NAME="results">What I meant to say is, methodology aside, what are
the results of a Bayesian run and what do they mean?</h3>

<P> In the current version of LAMARC, each parameter is analyzed separately,
and a probability density function is created for each.  (A probability
density function is a curve where the area under the curve is one.)  The
highest point on these curves is given as the point estimate for that
parameter.  The area under the curve is used to report credibility intervals
in the output file--the 0.005 percentile is the point below which 0.5% of
the area of the curve falls, and above which 99.5% of the area falls.  The
reverse is true for the 0.995 percentile.  The combination of the 0.005
percentile and the 0.995 percentile gives you your 99% credibility
interval.  (Credibility intervals are like confidence intervals in that
they tell you where the truth is most likely to lie, but are used in a
Bayesian context.  They are sometimes called the 'posterior probability
interval'.)
</P>
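<P>As a concrete illustration of that arithmetic, here is a small sketch
(not LAMARC code) that reads the 0.005 and 0.995 percentiles off a
discretized density curve given as parallel lists of parameter values and
densities:</P>

<PRE>
# Illustrative sketch: find the 99% credibility limits of a curve.
def credibility_limits(values, densities, lo=0.005, hi=0.995):
    total = sum(densities)
    cum = 0.0
    lower = upper = values[0]
    for v, d in zip(values, densities):
        prev = cum
        cum += d / total                 # running fraction of the area
        if prev &lt; lo &lt;= cum:
            lower = v                    # about 0.5% of the area lies below here
        if prev &lt; hi &lt;= cum:
            upper = v                    # about 99.5% of the area lies below here
    return lower, upper                  # together: the 99% credibility interval
</PRE>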

<P> <A NAME="LnLpictures"> Probably of more interest are the curvefiles themselves, produced by
LAMARC as output files.  Each parameter has a curvefile for each
chromosomal region in your data set, plus one curvefile for the
overall estimate (the product of the individual curves).  They are
named [prefix]_[region]_[parameter].txt, where [prefix] is 'curvefile' by
default (this can be changed in the menu or the XML), [region] is 'overall'
or 'reg#' with # the number of the region in question, and [parameter] is
the short name of the parameter, like 'Theta1' for the theta for the first
population, or 'M32' for the migration rate from the third population into
the second population. </P>

<P> Each curvefile contains information about the curve at the top,
followed by a 2-column tab-delimited list of the parameter values and point
likelihoods that define the probability density function.  It can be
imported into a spreadsheet program or Mathematica to create an actual
graph.
</P>
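<P>For example, the following sketch (assuming Python with matplotlib is
available) reads one of these files and plots it; it simply skips the
header lines at the top and keeps any line that parses as two numbers:</P>

<PRE>
# Illustrative sketch: plot a curvefile.  The filename follows the
# naming convention described above, with the default prefix.
import matplotlib.pyplot as plt

def read_curvefile(path):
    values, densities = [], []
    with open(path) as f:
        for line in f:
            fields = line.split("\t")
            try:
                v, d = float(fields[0]), float(fields[1])
            except (ValueError, IndexError):
                continue                 # skip header and non-numeric lines
            values.append(v)
            densities.append(d)
    return values, densities

values, densities = read_curvefile("curvefile_overall_Theta1.txt")
plt.plot(values, densities)
plt.xlabel("Theta1")
plt.ylabel("density")
plt.show()
</PRE>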

<h3><A NAME="curvefiles">So, what do these curvefiles tell me?</h3>

<P> Each curvefile is in essence a product of the likelihood and the
prior: if both inputs are accurate, the result tells you the relative
probability that the parameter is any particular value.  Unlike the
likelihood version, this overview is complete and continuous.

<P> Another advantage of the curvefiles is that you can check them to see
if they're unimodal.  Our research to date indicates that the true
probability density function should be unimodal, with three exceptions:  an
insufficiently long run of LAMARC, insufficient sampling of the data, and
unusual correlations between parameters.  The first gives rise to multiple
peaks in the curvefiles that come from single regions, which will not be
conserved from one region to the next.  The second gives rise to multiple
peaks in the curvefiles that come from the overall estimate over regions. 
The third gives rise to the <b>same</b> multimodal peaks across all
regions. The various possibilities, then, are:

<UL>

  <LI> <b>You have multiple peaks in both the individual regions curvefiles
  and the overall-region curvefiles</b>:  If these multiple peaks are
  different across the different regions, LAMARC was not run long enough. 
  It is probably not possible to tell at this point if the data were also
  undersampled.  If these multiple peaks are conserved across regions and
  show up in the overall-region curvefiles too, it is possible you have
  unusually-correlated parameters--see below for a discussion on this.

  <LI><b>You have multiple peaks in the overall region curvefiles, but not
  in the individual regions curvefiles</b>:  LAMARC was probably run long
  enough, but the data coverage is insufficient.  Your options are either
  to collect new data at a different region, or report the multiple peaks
  and their likely cause in your analysis.

  <LI><b>You have multiple peaks in the individual regions curvefiles, but
  not in the overall-region curvefiles</b>:  LAMARC was probably not run
  long enough for any one region, but good coverage of the data seems to
  have overcome this insufficiency.

  <LI><b>You have no multiple peaks</b>:  LAMARC was run for long enough,
  and your data coverage was good.  Congratulations!

</UL>
</P>


<h3><A NAME="drawbacks">So what won't these curvefiles tell me?</h3>

<P> The first thing they won't tell you is anything about any correlation
between your parameters.  For example, the known correlation between Theta
and growth would not be seen in the curvefiles (or anywhere else in a
Bayesian-LAMARC run).  This is not a shortcoming of the Bayesian method
<i>per se</i>, but rather a shortcoming of LAMARC as currently written: 
each parameter's walk through parameter space is analyzed without regard to
how the other parameters are moving at the same time.  So, say that two
parameters are jointly exploring the following parameter space, illustrated
as a contour map.  We'll start by observing what happens in the uncorrelated
case:</P>

<P>
<center><IMG SRC="images/uncorrelated.gif" ALT="Graph of two uncorrelated
parameters"></center>
</P>

<P> In the Bayesian run, this contour map fills up with sampled points in
proportion to its height.  So, the peak area fills up with a lot of points,
and as you go down, the density of sampled points goes down, too. A
two-dimensional analysis of these points could reproduce the contour
map.  But in a one-dimensional analysis (as in the current version of
LAMARC), each curvefile is the result of projecting each point onto one of
the axes, then smoothing the density of resulting points to get a
probability density function (seen outside each axis).
</P>
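<P>A small sketch of that projection (with made-up sample points, not
LAMARC output) may make the loss of information concrete:</P>

<PRE>
# Illustrative sketch: collapse correlated 2-D samples onto one axis.
import numpy as np

rng = np.random.default_rng(0)
# pretend these are (x, y) parameter pairs sampled during a Bayesian run;
# x and y are strongly correlated here
samples = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)

# the one-dimensional analysis sees only the projection onto the X axis
hist, edges = np.histogram(samples[:, 0], bins=50, density=True)
# 'hist' approximates the smoothed density a curvefile reports; the
# correlation between x and y that was present in 'samples' is gone
</PRE>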

<P>Now let's look at some other examples, where the two parameters are
either correlated (on the left) or inversely-correlated (on the right).  The
contour map of parameter space might then look like:
</P>

<P>
<center>
<IMG src="images/correlated1.gif" alt="Graph of two
correlated parameters">
<IMG src="images/correlated2.gif" alt="Graph of two
inversely correlated parameters">
</center>
</P>

<P> Unfortunately, not only can a one-dimensional analysis not distinguish
between these two cases, it also cannot distinguish between these cases and
the uncorrelated case:  the projected densities onto the axes are the same
in all three cases.
</P>

<P>The problem gets even knottier when the two parameters have some sort of
non-symmetrical correlation.  Consider a correlation whose contour map looks
like this: </P>

<center>
<IMG src="images/variably_correlated.gif" alt="Graph of two variably
correlated parameters">
</center>

<P> In this example, there is a single maximum on the contour map, but its
oddly-distorted shape causes there to be multiple peaks in its projection
onto the X axis.  Given a sufficiently-long 'flat' section of the curve, the
secondary peak could even end up being higher than the peak that corresponds
to the maximum of the contour plot.
</P>

<P> This is, it should be noted, not an <b>incorrect</b> assessment of the
data, but it is <b>limited</b>.  In other words, if the parameter plotted
on the Y axis in the above example is truly equally likely to be anywhere
in the lower half of its allowed range, (which it would be, if the prior on
that parameter was accurate), and all those Y values correspond to a single
X value, there will indeed be a peak at that X value, which may even
overshadow the peak seen in the contour plot.  Conversely, an analysis that
simply reported the peak of the contour plot would miss the fact that there
was a different X value which might be a better choice, were you only
considering possible values of X. </p>

<P> The take-home lesson?  The point estimates of individual parameters you
get in a Bayesian run are only guaranteed to be accurate if considered
independently from one another.  And, of course, if the priors are accurate.
</P>

<h3><A NAME="surprises">Is there any other feature of these curvefiles that
might surprise me if I didn't know the secret?</h3>

<P> Funny you should ask!  As a matter of fact, there's one biggie:  Our
curve-smoothing algorithm does not take into account the prior boundaries. 
This means that if your prior ranges from 0 to 100, you might end up with a
curve that goes negative, even if negative values are absurd for your data. 
This does not mean that LAMARC actually used any negative values when calculating
the trees, but rather that the curve-smoothing algorithm has taken some of
the density at or around zero and smoothed it beyond the boundary.  If
you're desperately curious about our curve-smoothing algorithm, look <A
HREF="curve-smoothing.html">here</A>.
</P>

<P> If you get this behavior, there are two ways to interpret it.  One is
simply to assign all density outside the boundary in question to the
boundary itself.  In other words, if 3% of your curve falls below zero,
simply claim that the lower 3% of your probability density curve is
<b>at</b> zero.  Another possibility (especially if you have accepted the
default priors) is that your priors are wrong.  If, in the same 0 to 100
example, you have a lot of density smoothed beyond 100, and the peak of the
curve is at 99.5, it's fairly clear that your data is pushing the search as
high as you are allowing it to go.  It might be informative to see where it
would take the search when you allow it to go higher still.
</P>
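<P>The first interpretation above can be done mechanically.  Here is a
sketch (not LAMARC code) of reassigning any density smoothed below the
lower boundary back onto the boundary itself:</P>

<PRE>
# Illustrative sketch: pile density smoothed past the lower prior
# boundary back onto the boundary.
def clamp_to_boundary(values, densities, lower_bound=0.0):
    kept = []
    spill = 0.0
    for v, d in zip(values, densities):
        if v &lt; lower_bound:
            spill += d                # density smoothed out of range
        else:
            kept.append([v, d])
    if kept:
        kept[0][1] += spill           # assign it to the boundary point
    return kept
</PRE>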


<h3><A NAME="compare">So, why should I run Bayesian LAMARC vs. Likelihood
LAMARC?</h3>

<P> There are two advantages we have found in a Bayesian run vs. a
likelihood run in LAMARC.  In some difficult cases, our experience indicates that
Bayes-LAMARC does a better job of searching tree space than does
Likelihood-LAMARC.  We attribute this to Likelihood-LAMARC's reliance on driving
values:  its search is akin to looking under a lamp-post because that's
where the light is.  The method of changing the driving values from chain
to chain allows us (to extend the analogy) to move the lamp-post from time
to time, and given a sufficiently long run, the search would find all the
good trees.  But Bayes-LAMARC's methodology searches tree space at the same
time that it searches parameter space, meaning that the driving values are
constantly changing.  The boundaries of that search are immobile--they're
the boundaries of the priors--but if your priors are correct, the search
does a better job of covering the allowable space. </P>

<P> The second advantage of a Bayesian run vs. a likelihood run is in a
sense a subset of the first advantage:  a Bayesian run is better at
determining whether your data support parameter values at or near zero.  In
a likelihood run, a driving value of zero makes it nearly impossible to
search tree space with a non-zero value.  Conversely, having a non-zero
driving value makes it nearly impossible to search tree space where the
value is zero.  One observed failure mode in Likelihood LAMARC is when it
estimates a parameter after one chain to be nearly zero, then estimates
that same parameter to be high after the next chain, near-zero again the
next, and continues to ping-pong back and forth throughout the run.  Our
current theory is that this behavior stems from having single driving
values.  Our Bayesian runs have not shown this same behavior, and instead
settle on a near-zero estimate, complete with seemingly appropriate
credibility intervals. </P>

<P> Finally, if you wish to analyze a case with modern populations diverging
from ancestral populations, you must use a Bayesian analysis as likelihood
is not supported in cases with divergence.  For the curious, the reason for
this is that a likelihood run would prefer to keep the population splitting
times constant throughout a chain and then consider what they suggest about
other possible split times, but it turns out that trees inferred with one
split time are often not useful in deciding about other potential split
times.  A Bayesian analysis avoids this problem as it allows split times
to vary throughout the chain.</P>

<h3><A NAME="priors">So tell me about these priors, then.  They're flat,
right?</h3>

<P> Yes, the current version of LAMARC only has flat priors.  A prior is
supposed to represent your knowledge of the potential answers before you
start, and if you don't know anything about what the potential answer might
be beyond 'it'll be larger than X and less than Y', assigning all values
between X and Y the same probability is a way to express that.
</P>

<P> Even then, you have to decide what kind of <b>density</b> your priors
should have between those boundaries.  LAMARC currently gives you two
options:  a linear prior, and a logarithmic prior (with natural logarithms,
not base-10 logarithms).  This choice can affect the confidence intervals
of the reported parameters, and will definitely affect the search through
parameter space.  </P>

<P> Your choice should be motivated by how you believe the parameter varies.
For example, if you believe that the best value for theta will be around
0.01, and that the 95% confidence interval will be from 0.1 to 0.001, you
should use a logarithmic prior for theta.  If you believe that the best
value for a migration rate will be around 100, and that the 95% confidence
interval will be from 50 to 150, you should use a linear prior for
migration.
</P>
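<P>The difference between the two shapes is easy to see in a small sketch
(illustrative only, not LAMARC code):</P>

<PRE>
# Illustrative sketch: the two prior shapes, written as samplers.
import math, random

def sample_linear_prior(lo, hi):
    # every value between lo and hi is equally probable
    return random.uniform(lo, hi)

def sample_log_prior(lo, hi):
    # every order of magnitude between lo and hi is equally probable,
    # so small values are visited as readily as large ones
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

# A log prior suits a theta expected near 0.01 with an interval like
# 0.001 to 0.1; a linear prior suits a migration rate expected near 100
# with an interval like 50 to 150.
</PRE>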

<P> In the end, if you run LAMARC long enough and you have enough data, 
it won't matter which type of prior you use because the data will overwhelm it.  
But your search will
have to be much longer if you use the wrong type.  If you get it right
beforehand, it will spend an equal amount of time searching the 1st to 50th
percentiles as it will the 50th to 99th percentiles, meaning that both
limits will have the same degree of precision.  In addition, your search
will have been maximally efficient. </P>

<P> Also remember that any flat areas of your curvefiles should raise a red
flag.  For example, let's go back to our graph of variably-correlated
parameters: </P>

<center>
<IMG src="images/variably_correlated.gif" alt="Graph of two variably
correlated parameters">
</center>

<P>One possible cause of the curve on the left (with the long flat section)
is a logarithmic prior for that parameter with a lower bound so low that it
included a wide swath of values, all of which had the same effect on the
data as they would if the parameter value was zero.  For example, if you
are estimating recombination rate, there will be a particular value for
that rate that predicts a single recombination event in the entire history
of your population.  All potential recombination rates below that value will
therefore predict the same thing:  zero recombination events.  Conversely,
any tree with zero recombination events will happily accept any value for
recombination below that cut-off.
</P>

<P> The upshot of all this is that if you get a curvefile with a flat
section, you must realize that section only includes the information you put
into the prior--in other words, the same amount of information you had
before you ran LAMARC at all.  What makes this example particularly worrying
is not that recombination has a long flat section--that's fairly easy to
explain--but that due to a correlation between recombination and some other
parameter, that other parameter now has a secondary peak.  And as we said
before:  if your prior is right, then this secondary peak is also right.
But if the prior was intended to represent more or less complete lack of
information, the program gleaned information from it anyway.
</P>

<P> With some amount of trepidation, we have put in default priors for each
force, though we strongly encourage users to make deliberate choices about
the priors for their particular variables, instead of blindly accepting the
given defaults.  For reference, the default priors are:
<UL>
<LI>Theta:  Logarithmic, 0.00001 - 10.0
<LI>Migration:  Logarithmic, 0.01 - 1000.0
<LI>Recombination:  Logarithmic, 0.00001 - 10.0
<LI>Growth:  Linear, -500.0 - 1000.0
</UL>

<h3><A NAME="final">Uh, suddenly I'm not so sure about doing a Bayesian run
any more.</h3>

<P> You are wise to be cautious, for it is always critically important to
examine your assumptions before undertaking any scientific endeavor, and
the priors represent a large set of assumptions that can affect your
results in sometimes surprising ways.  It might help to realize that your
priors are part of the model for population history you're using--a model
that already includes a variety of assumptions and simplifications.  The
difference, of course, is that the assumptions and simplifications of the
coalescent model have been hashed out in the literature over the course of
several years, and you're going to have to defend the prior you put on your
data for the first time. </P>

<P> On the plus side, the more data you collect and the longer you run
LAMARC, the less your choice of priors will affect the results.  If you do a
sufficient analysis, and understand that any parts of the resulting curves
that are flat came from your assumptions and not from your data, you will be
OK.
</P>

<h3>Well, OK, I'll give it a shot.  What do I do?</h3>

<P> Onward, then, to the <A HREF="bayes_howto.html">Bayesian tutorial</a>.
</P>

(<A HREF="output.html">Previous</A> | <A HREF="index.html">Contents</A> | <A
HREF="tracer.html">Next</A>)

<!--
//$Id: bayes.html,v 1.14 2012/05/16 17:14:01 mkkuhner Exp $
-->
</BODY>
</HTML>