1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394
|
<!-- header fragment for html documentation -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<META NAME="description" CONTENT="Estimation of population parameters using genetic data using a maximum likelihood approach with Metropolis-Hastings Monte Carlo Markov chain importance sampling">
<META NAME="keywords" CONTENT="MCMC, Markov chain, Monte Carlo, Metropolis-Hastings, population, parameters, migration rate, population size, recombination rate, maximum likelihood">
<TITLE>LAMARC Documentation: Evolutionary forces</title>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<!-- coalescent, coalescence, Markov chain Monte Carlo simulation, migration rate, effective
population size, recombination rate, maximum likelihood -->
<P>(<A HREF="gamma.html">Previous</A> | <A HREF="index.html">Contents</A> |
<A HREF="parameters.html">Next</A>)</P>
<H2>Evolutionary Forces</H2>
<P> This article describes, in some technical detail, the way we
model evolutionary forces. You don't need this information to
get the program running, but it can be very helpful in understanding
your results. </P>
The forces supported at this time are:
<UL>
<LI><A HREF="forces.html#coalescence">Coalescence</A></LI>
<LI><A HREF="forces.html#migration">Migration</A></LI>
<LI><A HREF="forces.html#recombination">Recombination</A></LI>
<LI><A HREF="forces.html#growth">Growth</A></LI>
<LI><A HREF="forces.html#gamma">Gamma-distributed relative mutation rate over regions</A></LI>
<LI><A HREF="forces.html#divergence">Divergence</A></LI>
</UL>
<P> For each force, we describe the parameter being estimated, and
list some assumptions made in the analysis. If the assumptions
are violated, the results will not necessarily be wrong, but
should be interpreted cautiously.</P>
<H3><A NAME="coalescence">Coalescence</A></H3>
We estimate the parameter Theta for each population. While most papers
describe Theta as 4<i>N<sub>e</sub></i>mu (where <i>N<sub>e</sub></i> is the
effective number of individuals and mu is the
mutation rate in mutations per generation), this is specific to diploids.
If you put haploid mtDNA into LAMARC, the Theta estimates will be estimates
of 2<i>N<sub>f</sub></i>mu instead, where <i>N<sub>f</sub></i> is the effective
number of females in the population. It is best to think of Theta as "number of heritable
copies in population * 2 * mutation rate," since this definition works no
matter what the ploidy is. (The "2" comes from the fact that two sequences
that have diverged for time <i>t</i> are different by 2 * mu * <i>t</i>
mutations, since both diverging lineages accumulate mutations.)</P>
<P> The units of mu are mutations per site per generation. Please note
that many studies compute a (lower-case) theta whose units involve
mutations per <b>locus</b> per generation. To convert, multiply the per-
site Theta by the number of sites.</P>
<P> If the "multiple rate categories" option of the data model is
used, the mutation rate is the weighted mean across all categories.</P>
<P> While it would be helpful to have separate estimates of population
size and mutation rate, with genetic data from a single time point
there is no way to separate them. If you have outside information
about either population size or mutation rate, for example from
outgroup analysis, you can then estimate the other parameter
directly.</P>
<P> A Frequently Asked Question is how to interpret the parameter
Theta when analyzing mitochondrial DNA or Y-chromosome DNA.
In these cases Theta is proportional to the mtDNA or Y-chromosome
effective population size. For example, when given mtDNA, the program
estimates Theta=2<i>N<sub>f</sub></i>mu, and when given Y-chromosome DNA, the
program estimates Theta=2<i>N<sub>m</sub></i>mu (where "f" and "m" refer to the
female and male effective population sizes).</P>
<P> To combine estimates of Theta from regions with different Thetas, such
as mtDNA and nuclear DNA, you can set the relative population sizes in the
'<A HREF="menu.html#data">Effective Population Size</a>' menu, or set these
values directly in the XML input. Reported multi-region Theta estimates
will be scaled accordingly.</P>
<H4> Assumptions:</H4>
<UL> <LI><P>(unless growth rates are being estimated) The population(s) have been the
same size for a very long time (though they can be different from one
another).</P>
<LI><P>All markers being considered are neutral, or at least the variation
seen is neutral (a gene which undergoes purifying selection against harmful
variants will work, but a gene under balancing selection will give
distorted results).</P>
<LI><P>The markers are also not tightly linked to any segments undergoing
directional or balancing selection.</P>
<LI><P>The population is significantly larger than the sample drawn from it.
If not, the basic coalescent equations break down. Results will also be
questionable if the population size is truly tiny--less than a few dozen.
A rule of thumb is that the sample size should be no greater than the
square root of the population size, so a population of 100 should have a
sample no greater than 10.</P>
<LI><P>Each population is free of internal subdivisions which would impede
gene flow.</P>
<LI><P>All regions and segments in the analysis either reflect the same
underlying Theta (they do not vary in mutation rate or population size),
have been correctly annotated with their relative mutation rates and
population sizes, or have mutation rates drawn from a gamma distribution
(if explicitly estimated).</P> </UL>
<H3><A NAME="migration">Migration</A></H3>
<P> We estimate the immigration rate into each population from each
other population (moving forward in time), so that a three-population case will estimate
six rates. The immigration parameter <i>M</i> is expressed as
<i>m</i>/mu where <i>m</i>
is the immigration rate per generation and mu is the neutral
mutation rate per site per generation. (As usual, if multiple
mutation categories are used, mu is the weighted average.)
</P>
<P> If you would prefer to consider migration in terms of 4<i>Nm</i>,
multiply the given value by the recipient population's value of Theta.
Please be careful of the distinction between immigration and
emigration. LAMARC always estimates and reports immigration, so 'M12'
indicates the rate at which individuals from population 1 have moved to
population 2.
</P>
<P> In the current version we cannot meaningfully combine genetic
regions with different migration rates in the same analysis. We hope
to provide this capability later.</P>
<P> Peter Beerli has argued that it may be useful to include a population in your analysis
even if you have not sampled any individuals from it. This can
guard against biases produced by unacknowledged populations.
For example, if you believe that your real-life situation involves
three populations exchanging migrants, but you have been able to
sample only two of them, you might add the third population
even with no individuals. The parameter estimates involving
this population will be weak, but the estimates involving the
other two populations may be more accurate than if the unsampled
population were omitted entirely. </P>
<P> However, LAMARC is not very stable with unsampled
populations, because the likelihood surface for the parameters of
the unsampled population may be flat or multi-peaked. For this
reason we have not allowed unsampled populations in v2. If
you feel this capability would help you, consider using MIGRATE,
and also please let us know. We will work on stabilizing
estimates for unsampled populations if this is of general interest.</P>
<H4> Assumptions:</H4>
<UL>
<LI>
<P>The current migration structure has been stable for a long time,
and the populations have existed for a long time. (If one population
is a recent offshoot of another, you will see spurious evidence
of migration between them.)</P>
<LI> <P> Since the migration situation is assumed to be stable, there must
be some way for individuals to have reached all of the populations. Thus,
all populations must be connected at least through one path, and at most one
population can have no immigrants. If your data do contain several
completely isolated populations, you will estimate small, but non-zero,
migration rates into them. This is a result of assuming a common ancestor
for all populations.</P>
</UL>
<P> We do not assume that migration is symmetrical, though you can impose
this assumption if you wish using <A
HREF="menu.html#constraints">constraints</a>.</P>
<H3><A NAME="recombination">Recombination</A></H3>
<P> We estimate the recombination rate <i>r</i> = <i>C</i>/mu,
where <i>C</i> is
the recombination rate per inter-site link per generation and
mu is the neutral mutation rate per site per generation. As
usual, if multiple rate categories are in effect, mu is the
weighted average of the categories.</P>
<P> In the current version we cannot meaningfully combine regions or
populations with different recombination rates. We hope to provide this
capability later.</P>
<P> We do not assume that all recombinations are visible. An advantage
of LAMARC and its predecessor RECOMBINE over pairwise methods
is that their recombination estimates are not dragged downwards
if the data are nearly invariant (though the error bars, of course,
increase greatly).</P>
<H4> Assumptions:</H4>
<UL>
<LI><P>The recombination rate is constant across the region and has not
changed for a long time.</P>
<LI><P>Recombination frequency is not affected by sequence divergence;
highly similar and highly dissimilar sequences are equally likely to
recombine.</P>
<LI><P>All recombination is homologous.</P>
<LI><P>There is no gene conversion. While double recombinants can produce
events that resemble conversions, they have the dynamics of two independent
events, not a single event like a true conversion.</P>
<LI><P>There is no interference among adjacent recombinations (this follows
from the coalescent assumption that no two events happen at the same
instant).</P>
<LI><P>Recombination events are selectively neutral.</P>
</UL>
<H3><A NAME="growth">Growth</A></H3>
<P> We estimate the exponential growth rate <i>g</i> as defined in the
following equation, where <i>t</i> is a time before the present:</P>
<P><center>Theta<sub><i>t</i></sub> = Theta<sub>present time</sub> exp(-<i>gt</i>)</center></P>
<P>This means that positive values of <i>g</i> indicate a population which is
growing, and negative values a population which is shrinking. They
are not symmetrical in magnitude, however; <i>g</i> = 10 indicates rather slow
growth, but <i>g</i> = -10 indicates significant shrinkage for most values of
Theta.</P>
<P> When migration is also in effect, growth rates are computed
for each population independently.</P>
<P> In the presence of growth, the reported value of Theta is the
present-day Theta (at the time when the data were sampled). The equation
above can be used to determine values of Theta for past times. Just
remember that the time parameter <i>t</i> is measured in units of mutations
(i.e. 1 <i>t</i> is the average number of generations it takes one site to
accumulate one mutation),
and <i>g</i> is measured in the inverse units of <i>t</i>.
If mu is known, divide generations by mu to
get units of <i>t</i> (conversely, <i>t</i>*mu is a number of generations).
<H4>Assumptions:</H4>
<UL>
<LI><P> The population has been growing or shrinking at the same
exponential rate for a long time. (This is particularly questionable when
the population is shrinking; except for microorganisms, most biological
populations do not survive long periods of exponential shrinkage.)</P>
<LI><P> If migration is in effect, immigration rate does not depend on
population size.</P>
<LI><P> The growth force cannot be used in combination with the gamma "force."
Please see the comment <A HREF="forces.html#gamma">there</A> for an explanation.
</UL>
<P> Simulation studies show that estimation of growth tends to be biased
upwards; this bias is reduced by having multiple unlinked regions and/or, in
cases with recombination, long sequences. Be cautious in interpreting
results from one or a few regions. The profile likelihoods are more reliable
than the maxima but can also show the bias.</P>
<H3><A NAME="gamma">Gamma-distributed relative mutation rate over regions</A></H3>
<P> This isn't exactly a "force," but because LAMARC optionally allows you to estimate
the parameter (α) of the gamma distribution which best fits data sets composed of
multiple unlinked genomic regions, it's in the menu for evolutionary forces. More
information about this is available <A HREF="gamma.html">here</A>.</P>
<P> Because there need to be multiple genomic regions in order for there to
be relative mutation rates to distribute among them, this "force" will not
appear in the evolutionary forces menu if the data is annotated as coming from a single
genomic region.</P>
<P> Due to a mathematical detail of how this feature is implemented in LAMARC,
it can only be applied to maximum-likelihood analyses. If you wish to perform
a Bayesian analysis, your best bet is to estimate the relative single-region
mutation rates by some other means, then supply these to LAMARC as
constants.</P>
<H4>Assumptions:</H4>
<UL>
<LI><P>Each population being analyzed is constant in size. This is because
a certain integrand containing the gamma distribution arises when the populations
are assumed to be growing or shrinking exponentially, and this integrand
cannot be integrated analytically. Hence LAMARC does not allow the gamma
"force" to be used in combination with growth.</P>
</UL>
<H3><A NAME="divergence">Divergence</A></H3>
<p>
We estimate the time at which pairs of populations split from their
common ancestor. Thus, a case with three populations will estimate two
divergence times. For example, a human/chimp/gorilla case might
estimate the human versus chimp divergence time and the human+chimp versus
gorilla divergence time.
</p>
<p>
LAMARC does not infer the population tree. The information that we should
be computing human+chimp and not chimp+gorilla as the first split must be
provided by the user. [For a method that attempts to infer population trees,
see the "*BEAST" program from Drummond's group.] Divergence
requires not only the topology of the population tree, but its order of
events. If we have a tree with an A+B split and a C+D split we must
specify which split happened most recently.
</p>
<p>
Divergence times are scaled by the mutation rate. A time of 1.0 for the
divergence of two populations means that it occured long enough ago that
each character has an expectation of 1 mutation since then.
</p>
<p>
If migration is additionally turned on, migration is allowed between populations that exist at the same time. This leads to inference of more migration rates
than a non-divergence case. For example, in the human/chimp/gorilla
example the program could maximally infer bidirectional migration rates
between human and chimp, human and gorilla, chimp and gorilla, and
between gorilla and the human/chimp ancestor. Users should beware of
inferring too many migration rates for the available amount of data.
Unwanted rates can be fixed at zero. For example, in this case fixing
the rates between human and gorilla to zero would seem reasonable.
</p>
<p>
An important caveat about models of divergence is that in some cases the
data will say nothing about one or more parameters (the jargon term is
"not identifiable"). For example, if a population split very recently
there will be no power to infer the sizes of the two descendants or the
migration rate between them. If
they split very long ago there will be no power to infer the
size of the ancestor or its migration rate. And if the migration rate between two populations is very high there will
be no power to infer their divergence time. It is important to look at
confidence intervals and, in Bayesian runs, the shape of the posterior
distribution to detect cases in which some parameters are not identifiable.
</p>
<p> Quick calculators for starting values (FST, Watterson) are not available in
cases with divergence. We don't know how to use them for unsampled ancestral
populations, so cravenly don't use them at all.</p>
<H4>Assumptions:</H4>
<UL>
<LI>
<P>The population tree given by the user is correct. Grossly distorted
estimates should be expected if the population tree is wrong, and there
is no way for the program to detect this problem.
</P>
</LI>
<LI>
<P>The populations have been the same size (unless growth rates are being
estimated) throughout their existance. No relationship is assumed between
pre-split and post-split sizes (the size of the human/chimp ancestor has
no fixed relationship with the sizes of the human and chimp populations).
</P>
</LI>
<LI>
<P>The migration rates are stable throughout the lifespan of a population.
</P>
</LI>
</UL>
<P>(<A HREF="gamma.html">Previous</A> | <A HREF="index.html">Contents</A> |
<A HREF="parameters.html">Next</A>)</P>
<!--
//$Id: forces.html,v 1.26 2013/11/07 22:46:06 mkkuhner Exp $
-->
</BODY>
</HTML>
|