File: data_required.html

package info (click to toggle)
lamarc 2.1.10.1%2Bdfsg-3
links: PTS, VCS
area: main
in suites: buster
size: 77,052 kB
sloc: cpp: 112,339; xml: 16,769; sh: 3,528; makefile: 1,219; python: 420; perl: 260; ansic: 40
file content (135 lines) | stat: -rw-r--r-- 8,065 bytes
parent folder | download | duplicates (3)
<!-- header fragment for html documentation -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>

<META NAME="description" CONTENT="Estimation of population parameters using genetic data using a maximum likelihood approach with Metropolis-Hastings Monte Carlo Markov chain importance sampling">
<META NAME="keywords" CONTENT="MCMC, Markov chain, Monte Carlo, Metropolis-Hastings, population, parameters, migration rate, population size, recombination rate, maximum likelihood">

<TITLE>LAMARC Documentation: Suitable data for LAMARC</title>
</HEAD>


<BODY BGCOLOR="#FFFFFF">
<!-- coalescent, coalescence, Markov chain Monte Carlo simulation, migration rate, effective
 population size, recombination rate, maximum likelihood -->


<P>(<A HREF="glossary.html">Previous</A> | <A HREF="index.html">Contents</A> | <A HREF="tutorial.html">Next</A>)</P>
<H2>Suitable data for LAMARC </H2>

<P> This article gives our best available information on the type and amount
of data needed to use LAMARC successfully.  Remember that LAMARC is an experimental tool, not an oracle. You know your data better than we do. The following guidelines will help, but we don't guarantee success even
if you meet these criteria. Every data set is different. But if you have
much less data than described here, you will probably not be happy with your results.</P>

<LI><A HREF="data_required.html#regions">How many unlinked regions?</A></LI>
<LI><A HREF="data_required.html#samples">How many samples?</A></LI>
<LI><A HREF="data_required.html#sites">How many linked sites?</A></LI>
<LI><A HREF="data_required.html#kinds">What kinds of samples?</A></LI>
<LI><A HREF="data_required.html#pops">What kinds of populations?</A></LI>


<H3> <A NAME="regions">How many unlinked regions?</H3>

<P> Best rule we've found is count the number of forces you wish to recover and add 1. So for estimation of Theta and
migration rate it is possible to get results with one or two regions but the returned values improve markedly with three. This is because each independent region tracks a different evolutionary pathway. Each gives you an independent measure of the forces of interest. Averaging the estimate for each force from each region together gives you a much better estimate than any individual region can provide.</p>
<p> There are a few exceptions:</p>
<p>Estimation of growth rate of a single population is very poor with less
than 3 unlinked regions and particularly benefits from having more. </p> <p>Recombination is best estimated from a single lengthy region (more discussion of this below in "How many linked sites?"</P>

<P> Note that runtime increases approximately linearly with the number of regions.  But it's hard to have too many, unless you exceed your computer's memory capacity (or your patience).</P>

<H3> <A NAME="samples">How many samples?</H3>

<P> Surprisingly, Felsenstein has shown that for estimation of Theta the
optimal number of samples is very low, around 8.  (Reference:  <A
HREF="http://mbe.oxfordjournals.org/cgi/content/full/23/3/691?ijkey=BKStzV1zbTncUJJ&keytype=ref">
Felsenstein 2006</A>.) If you can obtain another unlinked region you will
gain much more than you would gain by increasing the sample size for the current
region beyond 8.  We generally recommend aiming for 10-15 sequences per population. Again, recombination is an exception; a recombination analysis probably needs 15-20
sequences.</P>

<P> If you cannot get multiple genes, multiple sequences are some help, but
we do not recommend going above 30 sequences/subpopulation as the added
difficulty of the analysis more than outweighs the gain in information. 
Remember that the more sequences you have, the longer you will have to
search to find good trees, and the longer each tree will take to process. 
Runtime goes up with the log of the number of sequences, but as you will also
have to perform more steps (perhaps many more steps) in the final outcome,
adding sequences causes a worse-than-linear increase in runtime.</P>

<H3> <A NAME="sites">How many linked sites?</H3>

<P> This depends on the expected level of polymorphism.  For DNA or SNP
data, a region should ideally be long enough to see 10 or more variable
sites.  If that's not possible you will definitely need multiple unlinked
regions.  Above 100 or so variable sites, there is little additional
information to be gained outside of estimating recombination.</P>

<P> Runtime goes up less than linearly with number of sites; how much less
depends on how polymorphic your data are.  Nearly invariant data are very
quick, so if you have long sequences, by all means use them (assuming you
have the memory).</P>

<P> Recombination is an exception.  You would like your sequenced area to
include tens of recombinations, but not hundreds.  Unfortunately it is hard
to know this in advance.  A recombinational analysis will definitely need
20-30 variable sites or more to have much accuracy.  In general, sequences
for recombination inference should be long, but watch out for signs that you
are in the hundreds-of-recombinations zone as such runs will take a very
long time.  (The estimates will be good, but you won't want to wait for
them.) </p>
<p>The best approach if you do not know how much recombinaion your have is to start small and experiment. For example, run a very small (2-4) sample subset of your data and see how long it takes and how much recombination is finds. If you find you have high levels of recombination you may then want to shorten your sequences. </P>

<H3> <A NAME="kinds">What kinds of samples?</H3>

<P> LAMARC <B>requires</B> random samples from each subpopulation.  Therefore
<ul>
<li><b>DO NOT</b> cherrypick the most divergent or most interesting sequences, 
<li><b>DO NOT</b> pick one from each major lineage, 
<li><b>DO NOT</b> discard identical sequences, and
<li><b>DO NOT</b> do anything of this sort! 
</ul>
Such  data will give grossly biased estimates.</P>

<P> It has been stated in print (not by us!) that mutually incompatible
sites in the data set must be removed.  <B>This is totally untrue.</B>  It
is true for programs which assume an infinite-sites model, but LAMARC does
not.  Remove sites only if you cannot be sure they are correctly aligned. 
It is best to remove all alignment columns for which the alignment is
doubtful. </P>

<P> Be sure that all of your samples are of the same homolog; throwing in a
few sequences from a paralog will ruin your analysis.</P>

<P> You do not need to sample from multiple subpopulations evenly, nor in
proportion to their population sizes.  However, try not to let the sample
size from any subpopulation go below 2 diploid or 4 haploid individuals as
the estimates are likely to be unstable.</P>

<H3><A NAME="pops"> What kinds of populations?</H3>

<P> LAMARC works well for subpopulations which show definite signs of 
geographic structure.  You may want to use the program STRUCTURE to test
for this before beginning.  If STRUCTURE does not see any structure in
your populations, the migration rate is probably too high and you should
pool those populations together.</P>

<p> One warning here. If STRUCTURE finds clusters it's tempting to use them to redefine your populations. DON"T DO THAT!!! You've just returned all the migrants to their original populations. If you find structure, keep your samples in their original spacial groupings.</p>

<P> At the other extreme, if your populations are totally isolated you
will not, of course, get a good estimate of the migration rate, though
other parameters can be well estimated.  If your populations have
probably not exchanged migrants in the last 4N generations (for a diploid,
or 2N for a haploid) you should analyze each one separately, or use a divergence
model; it is not appropriate to try to estimate non-divergence migration rates. </P>
<P>

<P>(<A HREF="glossary.html">Previous</A> | <A HREF="index.html">Contents</A> | <A HREF="tutorial.html">Next</A>)</P>
<!--
//$Id: data_required.html,v 1.9 2012/05/16 17:14:01 mkkuhner Exp $
-->
</BODY>
</HTML>