File: forces.html

package info (click to toggle)
lamarc 2.1.10.1%2Bdfsg-3
  • links: PTS, VCS
  • area: main
  • in suites: buster
  • size: 77,052 kB
  • sloc: cpp: 112,339; xml: 16,769; sh: 3,528; makefile: 1,219; python: 420; perl: 260; ansic: 40
file content (394 lines) | stat: -rw-r--r-- 17,215 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
<!-- header fragment for html documentation -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>

<META NAME="description" CONTENT="Estimation of population parameters using genetic data using a maximum likelihood approach with Metropolis-Hastings Monte Carlo Markov chain importance sampling">
<META NAME="keywords" CONTENT="MCMC, Markov chain, Monte Carlo, Metropolis-Hastings, population, parameters, migration rate, population size, recombination rate, maximum likelihood">

<TITLE>LAMARC Documentation: Evolutionary forces</title>
</HEAD>


<BODY BGCOLOR="#FFFFFF">
<!-- coalescent, coalescence, Markov chain Monte Carlo simulation, migration rate, effective
 population size, recombination rate, maximum likelihood -->


<P>(<A HREF="gamma.html">Previous</A> | <A HREF="index.html">Contents</A> |
<A HREF="parameters.html">Next</A>)</P>

<H2>Evolutionary Forces</H2>
<P> This article describes, in some technical detail, the way we
model evolutionary forces.  You don't need this information to
get the program running, but it can be very helpful in understanding
your results. </P>

The forces supported at this time are:

<UL>
<LI><A HREF="forces.html#coalescence">Coalescence</A></LI>
<LI><A HREF="forces.html#migration">Migration</A></LI>
<LI><A HREF="forces.html#recombination">Recombination</A></LI>
<LI><A HREF="forces.html#growth">Growth</A></LI>
<LI><A HREF="forces.html#gamma">Gamma-distributed relative mutation rate over regions</A></LI>
<LI><A HREF="forces.html#divergence">Divergence</A></LI>
</UL>

<P> For each force, we describe the parameter being estimated, and
list some assumptions made in the analysis.  If the assumptions
are violated, the results will not necessarily be wrong, but
should be interpreted cautiously.</P>

<H3><A NAME="coalescence">Coalescence</A></H3>

We estimate the parameter Theta for each population.  While most papers
describe Theta as 4<i>N<sub>e</sub></i>mu (where <i>N<sub>e</sub></i> is the
effective number of individuals and mu is the
mutation rate in mutations per generation), this is specific to diploids.
If you put haploid mtDNA into LAMARC, the Theta estimates will be estimates
of 2<i>N<sub>f</sub></i>mu instead, where <i>N<sub>f</sub></i> is the effective 
number of females in the population.  It is best to think of Theta as "number of heritable
copies in population * 2 * mutation rate," since this definition works no
matter what the ploidy is.  (The "2" comes from the fact that two sequences
that have diverged for time <i>t</i> are different by 2 * mu * <i>t</i> 
mutations, since both diverging lineages accumulate mutations.)</P>

<P> The units of mu are mutations per site per generation.  Please note
that many studies compute a (lower-case) theta whose units involve
mutations per <b>locus</b> per generation.  To convert, multiply the per-
site Theta by the number of sites.</P>

<P> If the "multiple rate categories" option of the data model is
used, the mutation rate is the weighted mean across all categories.</P>

<P> While it would be helpful to have separate estimates of population 
size and mutation rate, with genetic data from a single time point
there is no way to separate them.  If you have outside information
about either population size or mutation rate, for example from
outgroup analysis, you can then estimate the other parameter
directly.</P>

<P> A Frequently Asked Question is how to interpret the parameter
Theta when analyzing mitochondrial DNA or Y-chromosome DNA.
In these cases Theta is proportional to the mtDNA or Y-chromosome
effective population size.  For example, when given mtDNA, the program
estimates Theta=2<i>N<sub>f</sub></i>mu, and when given Y-chromosome DNA, the
program estimates Theta=2<i>N<sub>m</sub></i>mu (where "f" and "m" refer to the
female and male effective population sizes).</P>

<P> To combine estimates of Theta from regions with different Thetas, such
as mtDNA and nuclear DNA, you can set the relative population sizes in the
'<A HREF="menu.html#data">Effective Population Size</a>' menu, or set these
values directly in the XML input.  Reported multi-region Theta estimates
will be scaled accordingly.</P>

<H4> Assumptions:</H4>

<UL> <LI><P>(unless growth rates are being estimated) The population(s) have been the
same size for a very long time (though they can be different from one
another).</P>

<LI><P>All markers being considered are neutral, or at least the variation 
seen is neutral (a gene which undergoes purifying selection against harmful
variants will work, but a gene under balancing selection will give
distorted results).</P>

<LI><P>The markers are also not tightly linked  to any segments undergoing
directional or balancing selection.</P>

<LI><P>The population is significantly larger than the sample drawn from it. 
If not, the basic coalescent equations break down.  Results will also be
questionable if the population size is truly tiny--less than a few dozen. 
A rule of thumb is that the sample size should be no greater than the
square root of the population size, so a population of 100 should have a
sample no greater than 10.</P>

<LI><P>Each population is free of internal subdivisions which would impede
gene flow.</P>

<LI><P>All regions and segments in the analysis either reflect the same
underlying Theta (they do not vary in mutation rate or population size),
have been correctly annotated with their relative mutation rates and
population sizes, or have mutation rates drawn from a gamma distribution
(if explicitly estimated).</P> </UL>


<H3><A NAME="migration">Migration</A></H3>

<P> We estimate the immigration rate into each population from each
other population (moving forward in time), so that a three-population case will estimate
six rates.  The immigration parameter <i>M</i> is expressed as 
<i>m</i>/mu where <i>m</i>
is the immigration rate per generation and mu is the neutral
mutation rate per site per generation.  (As usual, if multiple
mutation categories are used, mu is the weighted average.)
</P>

<P> If you would prefer to consider migration in terms of 4<i>Nm</i>, 
multiply the given value by the recipient population's value of Theta.
Please be careful of the distinction between immigration and
emigration.  LAMARC always estimates and reports immigration, so 'M12'
indicates the rate at which individuals from population 1 have moved to 
population 2.
</P>

<P> In the current version we cannot meaningfully combine genetic
regions with different migration rates in the same analysis.  We hope 
to provide this capability later.</P>

<P> Peter Beerli has argued that it may be useful to include a population in your analysis
even if you have not sampled any individuals from it.  This can
guard against biases produced by unacknowledged populations.
For example, if you believe that your real-life situation involves
three populations exchanging migrants, but you have been able to
sample only two of them, you might add the third population
even with no individuals.  The parameter estimates involving
this population will be weak, but the estimates involving the
other two populations may be more accurate than if the unsampled
population were omitted entirely.  </P>

<P> However, LAMARC is not very stable with unsampled
populations, because the likelihood surface for the parameters of
the unsampled population may be flat or multi-peaked. For this
reason we have not allowed unsampled populations in v2.  If
you feel this capability would help you, consider using MIGRATE,
and also please let us know.  We will work on stabilizing
estimates for unsampled populations if this is of general interest.</P>

<H4> Assumptions:</H4>

<UL>
<LI>
<P>The current migration structure has been stable for a long time,
and the populations have existed for a long time.  (If one population
is a recent offshoot of another, you will see spurious evidence
of migration between them.)</P>

<LI> <P> Since the migration situation is assumed to be stable, there must
be some way for individuals to have reached all of the populations. Thus,
all populations must be connected at least through one path, and at most one
population can have no immigrants.   If your data do contain several
completely isolated populations, you will estimate small, but non-zero,
migration rates into them. This is a result of assuming a common ancestor
for all populations.</P>

</UL>

<P> We do not assume that migration is symmetrical, though you can impose
this assumption if you wish using <A
HREF="menu.html#constraints">constraints</a>.</P>

<H3><A NAME="recombination">Recombination</A></H3>

<P> We estimate the recombination rate <i>r</i> = <i>C</i>/mu, 
where <i>C</i> is
the recombination rate per inter-site link per generation and
mu is the neutral mutation rate per site per generation.  As
usual, if multiple rate categories are in effect, mu is the
weighted average of the categories.</P>

<P> In the current version we cannot meaningfully combine regions or
populations with different recombination rates.  We hope to provide this 
capability later.</P>

<P> We do not assume that all recombinations are visible.  An advantage
of LAMARC and its predecessor RECOMBINE over pairwise methods
is that their recombination estimates are not dragged downwards
if the data are nearly invariant (though the error bars, of course,
increase greatly).</P>

<H4> Assumptions:</H4>

<UL>

<LI><P>The recombination rate is constant across the region and has not
changed for a long time.</P>

<LI><P>Recombination frequency is not affected by sequence divergence;
highly similar and highly dissimilar sequences are equally likely to
recombine.</P>

<LI><P>All recombination is homologous.</P>

<LI><P>There is no gene conversion.  While double recombinants can produce
events that resemble conversions, they have the dynamics of two independent
events, not a single event like a true conversion.</P>

<LI><P>There is no interference among adjacent recombinations (this follows
from the coalescent assumption that no two events happen at the same
instant).</P>

<LI><P>Recombination events are selectively neutral.</P>

</UL>

<H3><A NAME="growth">Growth</A></H3>

<P> We estimate the exponential growth rate <i>g</i> as defined in the
following equation, where <i>t</i> is a time before the present:</P>

<P><center>Theta<sub><i>t</i></sub> = Theta<sub>present time</sub> exp(-<i>gt</i>)</center></P>

<P>This means that positive values of <i>g</i> indicate a population which is
growing, and negative values a population which is shrinking.  They
are not symmetrical in magnitude, however; <i>g</i> = 10 indicates rather slow
growth, but <i>g</i> = -10 indicates significant shrinkage for most values of
Theta.</P>

<P> When migration is also in effect, growth rates are computed
for each population independently.</P>

<P> In the presence of growth, the reported value of Theta is the
present-day Theta (at the time when the data were sampled). The equation
above can be used to determine values of Theta for past times.  Just
remember that the time parameter <i>t</i> is measured in units of mutations 
(i.e. 1 <i>t</i> is the average number of generations it takes one site to 
accumulate one mutation),
and <i>g</i> is measured in the inverse units of <i>t</i>.
If mu is known, divide generations by mu to
get units of <i>t</i> (conversely, <i>t</i>*mu is a number of generations).

<H4>Assumptions:</H4>

<UL>

<LI><P> The population has been growing or shrinking at the same
exponential rate for a long time.  (This is particularly questionable when
the population is shrinking; except for microorganisms, most biological
populations do not survive long periods of exponential shrinkage.)</P>

<LI><P> If migration is in effect, immigration rate does not depend on
population size.</P>

<LI><P> The growth force cannot be used in combination with the gamma "force."
Please see the comment <A HREF="forces.html#gamma">there</A> for an explanation.

</UL>

<P> Simulation studies show that estimation of growth tends to be biased
upwards; this bias is reduced by having multiple unlinked regions and/or, in
cases with recombination, long sequences.  Be cautious in interpreting
results from one or a few regions.  The profile likelihoods are more reliable
than the maxima but can also show the bias.</P>


<H3><A NAME="gamma">Gamma-distributed relative mutation rate over regions</A></H3>

<P> This isn't exactly a "force," but because LAMARC optionally allows you to estimate
the parameter (&alpha;) of the gamma distribution which best fits data sets composed of
multiple unlinked genomic regions, it's in the menu for evolutionary forces. More
information about this is available <A HREF="gamma.html">here</A>.</P>

<P> Because there need to be multiple genomic regions in order for there to
be relative mutation rates to distribute among them, this "force" will not
appear in the evolutionary forces menu if the data is annotated as coming from a single
genomic region.</P>

<P> Due to a mathematical detail of how this feature is implemented in LAMARC,
it can only be applied to maximum-likelihood analyses.  If you wish to perform
a Bayesian analysis, your best bet is to estimate the relative single-region
mutation rates by some other means, then supply these to LAMARC as
constants.</P>

<H4>Assumptions:</H4>

<UL>

<LI><P>Each population being analyzed is constant in size. This is because
a certain integrand containing the gamma distribution arises when the populations
are assumed to be growing or shrinking exponentially, and this integrand
cannot be integrated analytically.  Hence LAMARC does not allow the gamma
"force" to be used in combination with growth.</P>

</UL>

<H3><A NAME="divergence">Divergence</A></H3>

<p>
We estimate the time at which pairs of populations split from their
common ancestor.  Thus, a case with three populations will estimate two
divergence times.  For example, a human/chimp/gorilla case might
estimate the human versus chimp divergence time and the human+chimp versus
gorilla divergence time.
</p>

<p>
LAMARC does not infer the population tree.  The information that we should
be computing human+chimp and not chimp+gorilla as the first split must be
provided by the user.  [For a method that attempts to infer population trees,
see the &quot;*BEAST&quot; program from Drummond's group.] Divergence
requires not only the topology of the population tree, but its order of
events.  If we have a tree with an A+B split and a C+D split we must
specify which split happened most recently.
</p>

<p>
Divergence times are scaled by the mutation rate.  A time of 1.0 for the
divergence of two populations means that it occured long enough ago that
each character has an expectation of 1 mutation since then.
</p>

<p>
If migration is additionally turned on, migration is allowed between populations that exist at the same time.  This leads to inference of more migration rates
than a non-divergence case.  For example, in the human/chimp/gorilla
example the program could maximally infer bidirectional migration rates
between human and chimp, human and gorilla, chimp and gorilla, and
between gorilla and the human/chimp ancestor.  Users should beware of
inferring too many migration rates for the available amount of data.
Unwanted rates can be fixed at zero.  For example, in this case fixing
the rates between human and gorilla to zero would seem reasonable.
</p>

<p>
An important caveat about models of divergence is that in some cases the
data will say nothing about one or more parameters (the jargon term is
"not identifiable").  For example, if a population split very recently
there will be no power to infer the sizes of the two descendants or the
migration rate between them.  If
they split very long ago there will be no power to infer the
size of the ancestor or its migration rate.  And if the migration rate between two populations is very high there will
be no power to infer their divergence time.  It is important to look at
confidence intervals and, in Bayesian runs, the shape of the posterior
distribution to detect cases in which some parameters are not identifiable. 
</p>

<p> Quick calculators for starting values (FST, Watterson) are not available in
cases with divergence.  We don't know how to use them for unsampled ancestral
populations, so cravenly don't use them at all.</p>

<H4>Assumptions:</H4>

<UL>

<LI>
<P>The population tree given by the user is correct.  Grossly distorted
estimates should be expected if the population tree is wrong, and there
is no way for the program to detect this problem.
</P>
</LI>

<LI>
<P>The populations have been the same size (unless growth rates are being
estimated) throughout their existance.  No relationship is assumed between
pre-split and post-split sizes (the size of the human/chimp ancestor has
no fixed relationship with the sizes of the human and chimp populations).
</P>
</LI>

<LI>
<P>The migration rates are stable throughout the lifespan of a population.
</P>
</LI>

</UL>

<P>(<A HREF="gamma.html">Previous</A> | <A HREF="index.html">Contents</A> |
<A HREF="parameters.html">Next</A>)</P>

<!--
//$Id: forces.html,v 1.26 2013/11/07 22:46:06 mkkuhner Exp $
-->
</BODY>
</HTML>