<!-- header fragment for html documentation -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>

<META NAME="description" CONTENT="Estimation of population parameters using genetic data using a maximum likelihood approach with Metropolis-Hastings Monte Carlo Markov chain importance sampling">
<META NAME="keywords" CONTENT="MCMC, Markov chain, Monte Carlo, Metropolis-Hastings, population, parameters, migration rate, population size, recombination rate, maximum likelihood">

<TITLE>LAMARC Documentation: Parallelizing</title>
</HEAD>


<BODY BGCOLOR="#FFFFFF">
<!-- coalescent, coalescence, Metropolis-Hastings, Markov chain Monte Carlo
 simulation, migration rate, effective population size, recombination rate,
 maximum likelihood -->

(<A HREF="tutorial2.html">Previous</A> | <A HREF="index.html">Contents</A> | 
<A HREF="mapping.html">Next</A>)
<H2>Parallelizing LAMARC</H2>

<P>LAMARC does not come in a parallel version.  However, if you wish, you
can do 'poor man's parallelization' by running subsets of your analysis
on different machines and then consolidating the results.  This is done by
making separate input files, each asking LAMARC to perform a different
set of independent calculations and to write out a summary file. 
Those summary files can then be concatenated for a final LAMARC
run that will perform the overall analysis.
</P>

<P> You can parallelize LAMARC by spreading out the calculations for
different <A HREF="glossary.html#region">genetic regions</a> over different
computers (one per genetic region), and/or by spreading the calculation of
different 'replicates' over different computers.</P>

<p> This process is complex enough that we highly recommend keeping good notes, since it is easy to get confused. 
Checklists and careful naming conventions help, but keeping track of many asynchronous 
jobs on multiple machines is inherently difficult.</p> 

<p>There are three steps to the process:</p>
<UL>
<LI><A HREF="parallel.html#divide">Dividing up your data</A></LI>
<LI> <A HREF="parallel.html#running">Running the programs and collecting the data</a></LI>
<LI> <A HREF="parallel.html#combine">Concatenating your data and running the joint analysis</a></LI>
</UL>

<h3><A NAME="pyParallel">Scripts to run &quot;Poor Man's Parallelization&quot;</a></h3>

<p>
We have released a set of Python scripts in the <tt>scripts/pyParallel</tt> directory of
the LAMARC distribution. They can be used to perform the steps below. However, since
this is a difficult topic and the scripts are not thoroughly tested, you may wish to
read the information in the sections below.
</p>

<p>
To run the scripts, do the following steps from a terminal/command window:
<ul>
<li>prepare a single LAMARC file describing all regions and setting the full 
number of replicates you want. We'll assume that file is called <tt>infile.xml</tt>
<li> put <tt>infile.xml</tt> in its own, otherwise empty directory
<li> from that directory, issue the command: <tt>python &lt;path/to/&gt;pyParallel/divide_data.py -l infile.xml -o divdir</tt>
</ul>
The program will produce output telling you which files to run LAMARC on next and 
which Python program to run after the LAMARC runs are completed.
Depending on whether you have both multiple regions and multiple replicates, there
may be a second such step of multiple LAMARC runs followed by a Python command.
The last step is always a single LAMARC run to produce a final file.
</p>

<h3><A NAME="divide">Dividing up your data</A></h3>

<h4>Regions:</h4>
<P> If you have more than one genetic region, the first step is to divide
up your data into its component regions.  You can do this by modifying your
original data files and feeding them into the converter one set at a time,
or you can, within the converter, delete the extra regions.  In the end, you
will want one LAMARC input file for each of your regions.
</P>

<P> Do note that different genetic regions can take different amounts of
time for LAMARC to analyze, depending on the amount of data in each.  If
one of your regions is a 100 kbase stretch of DNA, and another region is a
single microsatellite segment, the DNA is going to take longer to analyze
than the microsat.  You should almost always be running preliminary LAMARC
runs on your data anyway so you know what type and amount of analysis best
suits your data, and these amounts will be different for every region.  In
any event, plan to put the regions that will take a long time on your
faster computers, and your less-data-rich regions on the slower computers.
</P>

<P> You must also be sure that the same number of chains is used for
each analysis (e.g. 10 initial chains and 2 final chains).  </P>

<h4>Replicates:</h4>
<P> When running multiple replicates on different machines, the same input
files can be used with two VERY IMPORTANT caveats.
<UL>

<LI>First, if your parallel machines have shared file space, runs started
from the same directory will overwrite each other's output and summary files.
Sharing a directory is a bad idea anyway, because you will sometimes want to
investigate runs from months ago.  It's best to give every run its own
directory, for example <tt>snail/p2/s3/recomb</tt> for the run of snail data
with 2 populations, SNP segment 3, and recombination turned on.  It's a bit of a
hassle to type, but it pays off when you discover six months from now that
something very strange happened in that run.
<LI>Second, and this is particularly important: you must be <b>sure</b> that
your parallel runs have different random number seeds.  There are two ways
of setting the random number seed, and both have their pitfalls:
<UL>

<LI> <b>If you set the random number seed with the &lt;seed&gt; tag in the
input file, you must use different input files with different seeds for
each replicate.  What's more, the numbers must be more than four apart from
each other.</b>  LAMARC uses as a seed the closest integer of the form 4n+1
where n is an integer, so, for example, 1000, 1001, 1002, and 1003 will all
be 'rounded' to 1001 before being used as a seed.

<LI> Conversely, if there is no &lt;seed&gt; tag in the input file, LAMARC
will use the system clock, so you can use the same input file for each
replicate.  However,  <b>runs must be started at least 5 seconds apart from
each other if you do this</b>.  LAMARC will use the current time to the
second to get an integer, then again 'round' that integer to the closest
integer of the form 4n+1.  Effectively, that means that every four-second
window of time gets a single random number seed.  So, if you have a script
that starts LAMARC runs on different computers, any runs it starts within the
same 4-second window will be computationally identical to each other, and
therefore useless as separate analyses.  Occasionally, different
architectures will produce different results with LAMARC even with the same
seed, but it is, shall we say, unwise to rely on this behavior.

</ul>
</ul>
</P>
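<P>The seed 'rounding' described above can be sketched in a few lines of
Python.  This is only a model that reproduces the example given here
(1000 through 1003 all becoming 1001); it is an assumption, not LAMARC's
actual code, and the function names <tt>effective_seed</tt> and
<tt>spaced_seeds</tt> are hypothetical.  It may be useful for checking a list
of seeds before you start your runs.</P>

```python
def effective_seed(seed):
    """Model of the 4n+1 'rounding' described in the text (an assumption)."""
    return 4 * (seed // 4) + 1


def spaced_seeds(first_seed, count, step=8):
    """Return `count` seeds whose effective values are all distinct."""
    if step <= 4:
        raise ValueError("step must be more than four")
    seeds = [first_seed + i * step for i in range(count)]
    # Sanity check: no two replicates may share an effective seed.
    assert len({effective_seed(s) for s in seeds}) == count
    return seeds


# The four seeds from the example above all collapse to one value.
print([effective_seed(s) for s in (1000, 1001, 1002, 1003)])
# Seeds for four replicates, spaced well apart.
print(spaced_seeds(1001, 4))
```

<P>The same model applies to clock-based seeds: treating the start time in
whole seconds as the seed shows why runs started within the same four-second
window would end up identical.</P>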

<h3><A NAME="running">Running the programs and collecting the data</A></h3>

<P> In order to get LAMARC to do a joint analysis over all replicates and/or
regions, it must be given a set of trees and the numbers used to obtain
those trees.  Each run, therefore, must write this information out by using
summary files.
</P>

<P> Summary file writing can either be turned on from the menu (I for
Input and Output related tasks, then W for Writing of tree summary file), or
from the XML, using the &lt;use-out-summary&gt; tag (set to true) and the
&lt;out-summary-file&gt; tag (for the name of the file) (see the <A
href="menu.html#io">menu</A> section of the documentation).  Be sure to save
each summary file with a different name, or at least within a different
directory, so you can tell where it came from later.  Nothing within the
summary files will enable you to distinguish them later, apart from the
different values they contain.
</P>
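<P>If you prepare many input files, setting the two summary tags by hand gets
tedious.  The following is a minimal Python sketch, assuming both tags are
already present somewhere in the input file (it searches the whole document
rather than assuming where they live).  The function name
<tt>set_summary_output</tt> is hypothetical.  Note that Python's ElementTree
discards XML comments when it re-serializes, so inspect the result before
using it.</P>

```python
import xml.etree.ElementTree as ET


def set_summary_output(input_xml, summary_name):
    """Return input_xml with summary writing enabled and the file name set."""
    root = ET.fromstring(input_xml)
    flags = list(root.iter("use-out-summary"))
    names = list(root.iter("out-summary-file"))
    if not flags or not names:
        # Assumption: we only edit existing tags, never invent their location.
        raise ValueError("expected summary tags not found in input file")
    for flag in flags:
        flag.text = "true"
    for name in names:
        name.text = summary_name
    return ET.tostring(root, encoding="unicode")
```

<P>Calling this once per run with a distinct <tt>summary_name</tt> gives each
summary file a name that records where it came from.</P>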

<P> Don't forget that if you are doing multiple replicates, each region
needs the same number of replicates. </P>

<P> Also, since the purpose of this exercise is speed, be sure to turn off
profiling for all your parameters for these individual runs.  You can turn
on profiling again in your final LAMARC run, but having it on here just
wastes time--this information is not saved to the output summary files.
</P>

<P>When you have run these analyses, you should have summary files for all
of your LAMARC runs.</P>

<h3><A NAME="combine">Concatenating your data and running the joint analysis</A></h3>

<P> You now need to engage in a bit of chicanery.  You must fool LAMARC into
thinking that it wants to run one huge analysis with all of your regions and
replicates, then create a summary file that pretends to be the result of
that analysis.  If you feed that summary file to LAMARC, it will then see
that the only analysis that is left to be done is the joint analysis, and
skip to that step.
</P>

<h4> Combining over replicates </h4>

<P> First, if you have used multiple replicates from separate runs, you must
create a new input file for each region that claims to want to run the
number of replicates you have already created.  This is as simple as copying
any of the input files used to create the individual replicates, and
changing the value of the &lt;replicates&gt; tag (the random number seed
does not matter for this step).  Then you must create one summary file out
of your collected set of summary files.  This requires a bit of editing, but
not much.   </P>

<P> Each summary file will have one &lt;number&gt; tag for each chain in
that analysis, numbered 0 0 0, 0 0 1, 0 0 2, etc.  The first number is for
the region, the second for the replicate, and the third for the chain.  For
now, regardless of which genomic region you are dealing with, just change
the replicate number.  So, the first summary file you can leave as-is, the
second summary file you change to 0 1 0, 0 1 1, 0 1 2, etc., the third you
change to 0 2 0, etc., and so on.  In addition, there will be a
&lt;reg_rep&gt; tag at the beginning of the &lt;chainsum&gt; section, which
will read '0 0' at first.  Again, the first number is for the region, and
the second number is for the replicate.  You should change the second number
(corresponding to the replicate number) to match.  Finally, delete the
&lt;XML-summary-file&gt; tag from the beginning of all files but the first,
the &lt;/XML-summary-file&gt; tag from the end of all files but the last,
and the &lt;end-region&gt; &lt;/end-region&gt; pair of tags from the end of
all files. </P>

<P>Then, concatenate your replicate summary files, in order.  Nothing need
be interleaved; each file should simply follow the previous one, in the order
of the replicate numbers you assigned.
</P>
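<P>The editing and concatenation steps for replicates can be sketched in
Python as follows.  This is an illustration under stated assumptions, not a
tested tool: it assumes the tags appear in the text as whitespace-separated
integers (for example <tt>&lt;number&gt; 0 0 1 &lt;/number&gt;</tt>), treats
the summary files as plain text, and uses hypothetical function names.</P>

```python
import re

# Assumed layout: whitespace-separated integers inside each tag.
NUMBER_RE = re.compile(r"(<number>\s*)(\d+)(\s+)(\d+)(\s+)(\d+)(\s*</number>)")
REG_REP_RE = re.compile(r"(<reg_rep>\s*)(\d+)(\s+)(\d+)(\s*</reg_rep>)")
END_REGION_RE = re.compile(r"<end-region>\s*</end-region>\s*")


def renumber_replicate(text, replicate):
    """Set the replicate index (second number) in number and reg_rep tags."""
    rep = str(replicate)
    text = NUMBER_RE.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3) + rep
        + m.group(5) + m.group(6) + m.group(7), text)
    text = REG_REP_RE.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3) + rep + m.group(5),
        text)
    return text


def combine_replicates(summaries):
    """Concatenate per-replicate summary texts, in the order given."""
    parts = []
    last = len(summaries) - 1
    for i, text in enumerate(summaries):
        text = renumber_replicate(text, i)
        text = END_REGION_RE.sub("", text)  # removed from every file
        if i != 0:
            text = text.replace("<XML-summary-file>", "", 1)
        if i != last:
            text = text.replace("</XML-summary-file>", "", 1)
        parts.append(text.strip() + "\n")
    return "".join(parts)
```

<P>The list passed to <tt>combine_replicates</tt> holds the contents of the
summary files for one region, and its order determines the replicate numbers
assigned.</P>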

<P>When you're done, you should have one large summary file which should
match the format of <A HREF="insumfile.3rep.html"> insumfile.3rep.xml</a>
(actual xml is <A HREF="insumfile.3rep.xml">here</a>), though it should be
much, much larger (the example file has only two trees per replicate).</p>

<P>Now you need to run LAMARC again.  Use your new input file, telling it
to use the summary file you have created.  If you are doing this step on
your way to concatenating your data from multiple genomic regions, you must
again turn on summary file <B>writing</B>.  LAMARC will read in your data, and
should skip right to the multi-replicate analysis.  If it does not,
something has gone awry, and you should check your input summary file.  One
trick is to compare the new output summary file to your input summary file,
and see where they diverge--that point is presumably the source of your
error.</P>
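<P>The comparison trick mentioned above can be done with any diff tool; the
following Python sketch shows the idea.  It assumes that differences in
leading and trailing whitespace are not meaningful, and the function name
<tt>first_divergence</tt> is hypothetical.</P>

```python
from itertools import zip_longest


def first_divergence(input_lines, output_lines):
    """Return (line number, input line, output line) at the first mismatch.

    Line numbers are 1-based; a missing line is reported as None.
    Returns None when the two sequences agree completely.
    """
    pairs = zip_longest(input_lines, output_lines)
    for lineno, (a, b) in enumerate(pairs, start=1):
        if a is None or b is None or a.strip() != b.strip():
            return lineno, a, b
    return None
```

<P>You would pass it the lines of your input and output summary files, for
example as read with <tt>readlines()</tt>.  Keep in mind that even a
successful run differs from its input at the point where LAMARC adds its
replicate summary, so a mismatch there is expected.</P>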

<P>If all goes well, you should now have a new output summary file that is
identical to your input summary file with the exception that it now has a
&lt;replicate-summary&gt; tag in it containing the estimates of your
various parameters, along with the reported maximum likelihood.  These
differences can be seen in <a HREF="outsumfile.3rep.html">
outsumfile.3rep.xml</a> (actual xml is <A HREF="outsumfile.3rep.xml">here</a>).  If you had more than one region, you must then
repeat this process for each of your regions. </P>

<h4>Combining over regions</h4>

<P>Now you need to repeat essentially the same process to combine your
regions together to get one overall parameter estimate.  First, you need to
go create one input file that wants to do a full multi-region and
multi-replicate (if necessary) analysis.  You'll need to go back to the
converter for this, and feed it all of your data this time, instead of just
the data from one genomic region at a time.
</P>

<P>Next, take the resulting LAMARC input file and, either in the XML or in
the menu, set the same number of chains you used for
your individual analyses, as well as the correct number of replicates.  You
also need to tell it to read from a summary file, which we now need to
create.  Finally, if you want error analysis, turn on profiling in this step
(and not before, since the profiling information is not saved).
</P>

<P>To combine your region-based summary files, you'll need to again change
the &lt;number&gt; tags to correctly match the region number in your data. 
Each file should start with a set of &lt;number&gt; tags containing 0 0 0,
then 0 0 1, 0 0 2, etc.  If you have multiple replicates, much further down
you will find a new set of tags containing 0 1 0, 0 1 1, 0 1 2, etc.  Again,
the first number is the region, the second number is the replicate, and the
third number is the chain.  Leave these tags alone for your first region
(the numbering starts at zero), but for each subsequent region, you'll need
to increase the first number:  1 0 0, 1 0 1, 1 0 2, etc. (with 1 1 0, 1 1
1, 1 1 2, etc. following if you have multiple replicates).  Also, each
replicate will have a &lt;reg_rep&gt; tag, containing 0 0, 0 1, 0 2, etc.
(If you didn't do replicates, that still counts as 'one replicate', numbered
0 0.)  The first number in that tag (corresponding to the region) also needs
to be changed to the appropriate region number: 1 0, 1 1, 1 2, etc. for the
summary file for your second region if it has multiple replicates, or just
the first "1 0" if there are none. </P>

<P>Finally, you need to delete the initial &lt;XML-summary-file&gt; tag
from all files but the first, and the final &lt;/XML-summary-file&gt; from
all files but the last.  Do <b>not</b> delete any &lt;end-region&gt;
&lt;/end-region&gt; tag pairs you might have at this point (which you will
have if you did not use replicates), and certainly do not delete the
information in the &lt;replicate-summary&gt; tags. </P>

<P>Now, concatenate all your files, in order.  In other words, make one big
file that starts off with region '0', has region '1' next, and so on.  Save
that as a unique input summary file, and tell your LAMARC input file to
read from that.
</p>
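<P>The region-combining edits can be sketched the same way.  As before, this
is a hedged illustration: it assumes the region index is the first integer
inside each &lt;number&gt; and &lt;reg_rep&gt; tag, treats the files as plain
text, and uses hypothetical function names.  Unlike the replicate step, it
leaves the &lt;end-region&gt; and &lt;replicate-summary&gt; tags untouched.</P>

```python
import re

# Assumption: the first integer inside a number or reg_rep tag is the region.
REGION_RE = re.compile(r"(<(?:number|reg_rep)>\s*)\d+")


def renumber_region(text, region):
    """Set the region index (first number) in number and reg_rep tags."""
    return REGION_RE.sub(lambda m: m.group(1) + str(region), text)


def combine_regions(summaries):
    """Concatenate per-region summary texts; list order sets region numbers."""
    parts = []
    last = len(summaries) - 1
    for region, text in enumerate(summaries):
        text = renumber_region(text, region)
        if region != 0:
            # Drop the opening wrapper tag from all files but the first.
            text = text.replace("<XML-summary-file>", "", 1)
        if region != last:
            # Drop only the final closing wrapper tag from all but the last.
            head, _sep, tail = text.rpartition("</XML-summary-file>")
            text = head + tail
        parts.append(text.strip() + "\n")
    return "".join(parts)
```

<P>The result is the single input summary file for the final run, with
region '0' first, region '1' next, and so on.</P>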

<P>Once again, you're ready to run LAMARC, this time for the last time. 
If all goes well, it will cruise through the data, pausing at the end of
each replicate to recalculate some internal modifiers (the Geyer weights,
if you know what those are) as well as the profiles (if you have turned
those on again), and then at the very end, it will calculate a summary over
regions.  Again, if you have LAMARC write to a summary file, that file
should be identical to your cobbled-together version, with the exception
that at the very end, you will get a &lt;region-summary&gt; tag with your
final estimates therein.  And you're done! </P>

(<A HREF="tutorial2.html">Previous</A> | <A HREF="index.html">Contents</A> | 
<A HREF="mapping.html">Next</A>)
<!--
//$Id: parallel.html,v 1.17 2011/08/18 22:02:19 ewalkup Exp $
-->
</BODY>
</HTML>