File: lesson_paral_gspw.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en"><head><title>Tutorial GSPW</title>


<meta http-equiv="CONTENT-TYPE" content="text/html; charset=utf-8"><meta name="CREATED" content="20100407;16500000"><meta name="CHANGEDBY" content="Martin Stankovski"><meta name="CHANGED" content="20100407;17391100"><style>
<!--
@page { size: 8.5in 11in; margin: 0.79in }
P { margin-bottom: 0.08in }
-->
</style></head><body style="background-color: rgb(255, 255, 255);"><hr><h1>ABINIT,
lesson PARAL-GSPW:</h1>
<h2>Explore the various features of the KGB parallelization scheme<br>
</h2>
<hr><p>This tutorial shows how to perform ground-state (GS) calculations on hundreds of processors using ABINIT.
</p><p>You will learn how to use some keywords related to the "KGB"
parallelization scheme. "K" stands for "k-point", "G" refers to the wavevector of a planewave, and "B" stands for a "band".
On the one hand, you will see how to improve the
speed-up of your calculations and, on the other hand, how to increase
the convergence speed of a self-consistent field calculation. <br>
<br>This lesson should take about 1.5 hours and requires
      access to a parallel computer with at least 200 CPU cores. </p>
<p>You are expected to already know some basics of parallelism in ABINIT, explained in the tutorial 
<a href="lesson_basepar.html">A first introduction to ABINIT in parallel</a>.
</p><p>Conversely, the present tutorial is not a prerequisite for the other tutorials about parallelism in ABINIT.</p><p>
<h5>Copyright (C) 2005-2014 ABINIT group (FB)
<br>This file is distributed under the terms of the GNU General
Public
License, see
~abinit/COPYING or <a href="http://www.gnu.org/copyleft/gpl.txt">
http://www.gnu.org/copyleft/gpl.txt </a>.
<br>For the initials of contributors, see
~abinit/doc/developers/contributors.txt .
</h5>

<script type="text/javascript" src="list_internal_links.js"> </script>

<h3><b>Contents of lesson PARAL-GSPW :</b></h3>

<ul>

<li><a href="#0">0.</a> Introduction<br>
</li>
  <li><a href="#1">1.</a> A simple way to begin...<br>
  </li>

<li><a href="#2">2.</a> ... which is coherent with a more sophisticated way <br>
</li>
<li><a href="#3">3.</a> Meaning of <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>: part 1<br>
 </li>
<li><a href="#4">4.</a> Meaning of <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>: part 2  </li>
  <li><a href="#5">5.</a> The KGB parallelization</li></ul>


<hr>
<h3>Notice:&nbsp;</h3>

<p>
<i>Before continuing, you might want to work in a different subdirectory, as for the other lessons. 
Why not "work_paral_gspw"? All the input files can be found in the ~abinit/tests/tutoparal/Input directory. 
You might have to adapt them to the path of the directory in which you have decided to perform your runs. 
You can compare your results with reference output files located in ~abinit/tests/tutoparal/Refs.
</i></p><p><i>In the following, when "run ABINIT
      over nn CPU cores" appears, you have
      to use a specific command line, according to the operating system
      and architecture of the computer you are using. This can be, for
      instance,</i></p>
<pre><i>mpirun -n nn abinit &lt; abinit.files <br></i></pre>
<p><i>or the use of a specific
      submission file. Some scripts are given as examples in the directory ~abinit/doc/tutorial/lesson_paral_gspw/. 
      You can adapt them to your own calculations.<br>
</i>
</p>
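<p><i>For instance, on a cluster managed by a batch scheduler such as SLURM, a minimal
submission script might look like the following sketch. The job name, node counts,
time limit and module name are placeholders to be adapted to your machine; the actual
example scripts in ~abinit/doc/tutorial/lesson_paral_gspw/ may differ.</i></p>
<pre>#!/bin/bash
#SBATCH --job-name=paral_gspw      # hypothetical job name
#SBATCH --nodes=9                  # e.g. 9 nodes x 12 cores = 108 MPI tasks
#SBATCH --ntasks-per-node=12
#SBATCH --time=01:00:00            # adapt to your queue limits

module load abinit                 # or however ABINIT is provided on your machine

# the input file and the files file are assumed to be in the submission directory
mpirun -n ${SLURM_NTASKS} abinit &lt; abinit.files &gt; log 2&gt;&amp;1
</pre>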
<hr><a name="0">&nbsp;</a>
<p>
</p>


<h3><b>0. Introduction</b></h3>

<p>When the size of the system increases up to 100 or 1000 atoms, it is
usually impossible to perform ab initio calculations on a single computing core. This is
because the basis sets used to solve the problem (planewaves, bands, ...)
grow accordingly (linearly, quadratically, or even exponentially...).
The trouble has two origins: (i) the memory, i.e. the amount of data
stored in RAM, and (ii) the computation time, with strong
bottlenecks such as the eigensolver and the 3-dimensional FFTs.</p>
<p>Therefore, it is generally mandatory to adopt a parallelization
strategy. That is to say: (i) to distribute the data, in the MPI sense, over
a large number of processors, and (ii) to parallelize the routines
responsible for the increase of the computation time.</p>
<p>In this tutorial, we will go beyond the simple k-point and spin
parallelism explained in the tutorial "<a href="lesson_basepar.html">A first introduction to ABINIT in parallel</a>" 
and show: <br>
</p>
<ul>
  <li>how to improve performance by using a large number of processors, even when the number of k-points is not large,</li>
  <li>how to decrease the computation time for a given number of processors. The aim is twofold:<br>
  </li>
</ul>
<div style="margin-left: 80px;">
<ol>
  <li>reduce the time needed to perform one electronic iteration (to improve the efficiency)<br>
  </li>
  <li>reduce the number of electronic iterations (to improve the convergence)&nbsp;</li>
</ol>
</div>
<ul>
  <li>how to use other keywords or features related to the KGB parallelization scheme.<br>
 </li>
</ul>
<span style="font-style: italic;"></span>
<p>The tests are performed on a system with 108 atoms of gold;
this is a benchmark used for a long time during the
development and the implementation of the KGB parallelization.
With respect to the input file used for the benchmark, the cutoff energy is strongly
reduced in this tutorial, for practical reasons. For real tests, you can
see the results (in particular the scaling) in: </p>

<ul>
  <li>the publication concerning the KGB parallelization: <cite>F. Bottin, S. Leroux, A. Knyazev, G. Zerah,
    Comput. Mat. Science 42, 329 (2008), "Large scale ab initio calculations
    based on three levels of parallelization"
    (<a href="http://arxiv.org/abs/0707.3405">available on arXiv.org</a>)</cite>.
  </li>
  <li>the Abinit paper: <em>X. Gonze, B. Amadon, P.-M. Anglade, J.-M.
Beuken, F. Bottin, P. Boulanger, F. Bruneval, D. Caliste, R. Caracas,
M. Cote, T. Deutsch, L. Genovese, Ph. Ghosez, M. Giantomassi, S.
Goedecker, D.R. Hamann, P. Hermet, F. Jollet, G. Jomard, S. Leroux, M.
Mancini, S. Mazevet, M.J.T. Oliveira, G. Onida, Y. Pouillon, T. Rangel,
G.-M. Rignanese, D. Sangalli, R. Shaltaf, M. Torrent, M.J. Verstraete,
G. Zerah, J.W. Zwanziger</em><cite>,
    Computer Phys. Commun. 180, 2582-2615 (2009).
    "ABINIT : First-principles approach of materials and nanosystem properties."</cite></li><li><font id="text">the presentation at the <a href="http://www.abinit.org/community/events/program3rd">ABINIT workshop 2007</a></font></li>
</ul>
<font id="text">You are strongly suggested to read these documents before beginning this tutorial. You might learn a lot of useful things. <br>
<br>
However, </font>even if you scan them with attention<font id="text">,&nbsp;
you won't learn the answer to the most frequently asked question. Why
this parallelization is named KGB? We don't know. Some people say that the reason comes from the <span style="font-weight: bold;">K</span>-point, the plane waves <span style="font-weight: bold;">G</span> and the <span style="font-weight: bold;">B</span>ands, but you can imagine everything you want.<br>
</font>
<p><br>
</p>


<hr><a name="1">&nbsp;</a>
<h3><b>1. A simple way to begin... <br>
</b></h3>
<p>One of the simplest ways to activate the KGB parallelization in ABINIT is to add
just one input variable to the sequential input file. 
This is <a href="../input_variables/varpar.html#paral_kgb" target="kwimg">paral_kgb</a>, which controls everything concerning the KGB parallelization: 
the use of the LOBPCG eigensolver (<a href="../input_variables/vardev.html#wfoptalg" target="kwimg">wfoptalg</a>=4 or 14) <font id="text">of A. Knyazev</font>, the parallel 3-dimensional FFT (<a href="../input_variables/vardev.html#fftalg" target="kwimg">fftalg</a>=401) <font id="text">written by S. Goedecker</font>,
and some other tricks... At the time of writing (but not for much longer), you
still have to define the number of processors needed on each level of
the KGB parallelization: <a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>, <a href="../input_variables/varpar.html#npfft" target="kwimg">npfft</a> and <a href="../input_variables/varpar.html#npkpt" target="kwimg">npkpt</a>. <br>
</p>
<p>Even though ABINIT is not yet able to choose, by itself, an efficient distribution
of processors for a given number of cores, you can already use
this parallelization as a black box. It is possible to get a good
estimation of the most efficient processor distribution by performing
a simple sequential run. This is done by simply adding to the input file
(in addition to <a href="../input_variables/varpar.html#paral_kgb" target="kwimg">paral_kgb</a>=1)
<a href="../input_variables/varpar.html#autoparal" target="kwimg">autoparal</a>=1 and
<a href="../input_variables/varpar.html#max_ncpus" target="kwimg">max_ncpus</a>="the maximum number of processors you want".<br>
<b>Note</b> that max_ncpus is a new variable introduced in Abinit version 7.6.2. Prior to this version, one had to use
<a href="../input_variables/varpar.html#paral_kgb" target="kwimg">paral_kgb</a>=-"the maximum number of processors you want"
to specify the maximum number of processors.

</p>

<p> In order to do that, 
copy the file tests/tutoparal/Input/tgspw_01.in and the related tgspw_01.files file into your working directory. 
Then run ABINIT on 1 CPU core.<br>
At the end of the log file tgspw_01.log, you will see:<br>
</p>
<p></p>
<pre>     npimage|       npkpt|    npspinor|       npfft|      npband|     bandpp |       nproc|      weight|<br>   1 -&gt;    1|   1 -&gt;    1|   1 -&gt;    1|   1 -&gt;   22|   1 -&gt;  108|   1 -&gt;   65|   8 -&gt;  108|   1 -&gt;  108|<br>           1|           1|           1|          12|           9|           1|         108|      55.13 |<br>           1|           1|           1|          10|           9|           1|          90|      54.98 |<br>           1|           1|           1|           9|           9|           1|          81|      54.85 |<br>           1|           1|           1|           9|          12|           1|         108|      54.37 |<br>           1|           1|           1|          12|           9|           2|         108|      52.86 |<br>  ......................<br>
</pre>
A weight is assigned to each distribution of processors. As indicated in the documentation, you are
advised to select a processor distribution with as high a weight as possible. If we just focus on <a href="../input_variables/varpar.html#npband" target="kwimg">npband</a> and <a href="../input_variables/varpar.html#npfft" target="kwimg">npfft</a>, we see
that for 108 processors the recommended distribution is (12x9). <br>
<br>
In a second step you can launch ABINIT in parallel on 108 processors by changing your input file as follows:<br>
<span style="margin-left: 10px; font-family: monospace;"><br>
&nbsp;- paral_kgb 1  autoparal 1  max_ncpus 108<br>
&nbsp;+ paral_kgb 1  npband 12 npfft 9</span><br>
<br>
You can now perform your calculations using the KGB parallelization
in ABINIT. But somehow, you did it without understanding how you got the result... <br>
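<p><i>For reference, the whole sequence of this section could be scripted as in the sketch below.
The paths, the grep pattern and the sed cleanup are assumptions: they suppose that the autoparal
and max_ncpus keywords sit on their own lines of tgspw_01.in, and that you adapt the way parallel
jobs are launched to your machine.</i></p>
<pre># sequential pre-run: let ABINIT propose weighted processor distributions
cp ~abinit/tests/tutoparal/Input/tgspw_01.in .
cp ~abinit/tests/tutoparal/Input/tgspw_01.files .
mpirun -n 1 abinit &lt; tgspw_01.files &gt; tgspw_01.log 2&gt;&amp;1
grep -A 8 "npimage" tgspw_01.log      # show the table of suggested distributions

# parallel run with the selected distribution
sed -i -e '/autoparal/d' -e '/max_ncpus/d' tgspw_01.in
echo "npband 12 npfft 9" &gt;&gt; tgspw_01.in
# you may need to remove or rename tgspw_01.out* between runs
mpirun -n 108 abinit &lt; tgspw_01.files &gt; tgspw_01_108.log 2&gt;&amp;1
</pre>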
<br>
<hr><a name="2">&nbsp;</a>
<h3><b>2. ... which is however coherent with a more sophisticated method<br>
</b></h3>


In this part we will try to recover the previous result, but with a
more comprehensive understanding of what is happening. As shown above, the
couple (<a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>x<a href="../input_variables/varpar.html#npfft" target="kwimg">npfft</a>) of
input variables can have various values: (108x1), (54x2), (36x3),
(27x4), (18x6) and (12x9). But also (9x12) ... which is not indicated.
In order to perform these seven calculations you can use the input file
tgspw_02.in (and tgspw_02.files) and change the line corresponding to the processor distribution. A first calculation with:<br>
<p>
<span style="margin-left: 10px; font-family: monospace;">+ npband 108 npfft 1 </span>
<br>
<br>
A second one with another distribution:<br>
<span style="margin-left: 10px; font-family: monospace;"><br>
&nbsp;- </span><span style="margin-left: 10px; font-family: monospace;">npband 108 npfft 1 </span><span style="margin-left: 10px; font-family: monospace;"><br>
&nbsp;+ </span><span style="margin-left: 10px; font-family: monospace;">npband&nbsp; 54 npfft 2 </span><br>
&nbsp;
<br>
And so on... Alternatively, this can be performed using a shell script including: <span style="font-family: monospace;"><br>
</span></p>
<pre><span style="font-family: monospace;">&gt;&gt; </span>cp tgspw_02.in tmp.file<span style="font-family: monospace;"><br></span>&gt;&gt; echo "npband 108 npfft 1" &gt;&gt; tgspw_02.in<br>&gt;&gt; mpirun -n 108 abinit &lt; tgspw_02.files<br>&gt;&gt; cp tgspw_02.out tgspw_02.108-01.out<br>&gt;&gt; cp tmp.file tgspw_02.in<br>&gt;&gt; echo "npband 54 npfft 2" &gt;&gt; tgspw_02.in<br>&gt;&gt; ... <br></pre>
By reference to the couple (<a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>x<a href="../input_variables/varpar.html#npfft" target="kwimg">npfft</a>),
all these results are named: tgspw_02.108-01.out, tgspw_02.054-02.out,
tgspw_02.036-03.out, tgspw_02.027-04.out, tgspw_02.018-06.out,
tgspw_02.012-09.out and tgspw_02.009-12.out. The timing of each
calculation can be retrieved using the shell command:<span style="font-family: monospace;"><br>
</span>
<pre><span style="font-family: monospace;">&gt;&gt; </span>grep Proc *out<br>&gt;&gt; tgspw_02.009-12.out:- Proc.   0 individual time (sec): cpu=         88.3  wall=         88.3<br>&gt;&gt; tgspw_02.012-09.out:- Proc.   0 individual time (sec): cpu=         75.2  wall=         75.2<br>&gt;&gt; tgspw_02.018-06.out:- Proc.   0 individual time (sec): cpu=         63.7  wall=         63.7<br>&gt;&gt; tgspw_02.027-04.out:- Proc.   0 individual time (sec): cpu=         69.9  wall=         69.9<br>&gt;&gt; tgspw_02.036-03.out:- Proc.   0 individual time (sec): cpu=        116.0  wall=        116.0<br>&gt;&gt; tgspw_02.054-02.out:- Proc.   0 individual time (sec): cpu=        104.7  wall=        104.7<br>&gt;&gt; tgspw_02.108-01.out:- Proc.   0 individual time (sec): cpu=        141.5  wall=        141.5<br></pre>


<p>As far as the timing is concerned, the best distributions are then the ones proposed above in section <a href="lesson_paral_gspw.html#1">1.</a>: that is to say the couples (18x6) and (27x4). So the prediction was pretty good.<br>
</p>Up to now, we have not learned much more than in section <a href="lesson_paral_gspw.html#1">1.</a>. We have so far only considered the timing (the efficiency) of one
electronic step, or rather of the 10 electronic steps to which the input
file is limited. However, when the <a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>
value is modified, the size of the block in LOBPCG changes, and
consequently the solutions delivered by this block eigensolver are also affected. In other
words, we have so far ignored that the convergence behaviour of these calculations also matters
greatly. One calculation can
be the quickest per step but the slowest to reach
convergence because it needs many more steps. 
In order to see this without performing
any additional calculations, we can have a look at the degree of convergence
reached at the end of the calculations we already have. The last iterations of the
SCF loop give:<br>
<pre>&gt;&gt; grep "ETOT 10" *.out<br>&gt;&gt; tgspw_02.009-12.out: ETOT 10&nbsp; -3754.4454784191&nbsp;&nbsp;&nbsp; -1.549E-03 7.222E-05 1.394E+00<br>&gt;&gt; tgspw_02.012-09.out: ETOT 10&nbsp; -3754.4458434046&nbsp;&nbsp;&nbsp; -7.875E-04 6.680E-05 2.596E-01<br>&gt;&gt; tgspw_02.018-06.out: ETOT 10&nbsp; -3754.4457793663&nbsp;&nbsp;&nbsp; -1.319E-03 1.230E-04 6.962E-01<br>&gt;&gt; tgspw_02.027-04.out: ETOT 10&nbsp; -3754.4459048995&nbsp;&nbsp;&nbsp; -1.127E-03 1.191E-04 5.701E-01<br>&gt;&gt; tgspw_02.036-03.out: ETOT 10&nbsp; -3754.4460493339&nbsp;&nbsp;&nbsp; -1.529E-03 7.121E-05 3.144E-01<br>&gt;&gt; tgspw_02.054-02.out: ETOT 10&nbsp; -3754.4460393029&nbsp;&nbsp;&nbsp; -1.646E-03 7.096E-05 7.284E-01<br>&gt;&gt; tgspw_02.108-01.out: ETOT 10&nbsp; -3754.4464631635&nbsp;&nbsp;&nbsp; -6.162E-05 2.151E-05 7.457E-02</pre>
The last column indicates the convergence of the density (or potential)
residual. You can see that this quantity is smallest when <a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>
is highest. This result is expected: the convergence is better when the
size of the block is as large as possible, since the need for re-orthogonalization
among band processor pools is then minimized. But the best convergence is obtained
for the (108x1) distribution... which also has the worst timing. <br>
<br>
<span style="font-style: italic;"><span style="font-weight: bold;">So, you face a dilemma. The calculation with the
smallest number of iterations (the best convergence) is not the best
concerning the timing of one iteration (the best efficiency), and
vice versa... So you have to check both of these features for all the
processor distributions. On one hand, the timing of one iteration and,
on the other hand, the number of iterations needed to converge. The
best choice is a compromise between them, not necessarly the independent optima.<br>
</span></span><br>
In the following we will choose the (27x4) couple, because it
offers better guarantees concerning both the convergence and the timing... even if the (18x6) one is
slightly quicker per electronic step.<br>
 <br>
Note: you can verify that the convergence is not changed when the
<a href="../input_variables/varpar.html#npfft" target="kwimg">npfft</a> value is modified. The same results will be obtained, step by step.<br>
<hr><a name="3">&nbsp;</a>
<h3><b>3. Meaning of </b><a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a><b>: part 1 (convergence)<br>
</b></h3>We have seen in the previous section that the best convergence
is obtained when the size of the block is the largest. This size is
related to the <a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>
input variable, but not to it alone. It is possible to increase the size of
the block without drastically increasing the number of band processors.
This means that it is possible to decrease the number of electronic
steps without strongly increasing the timing of one electronic step.
For systems with difficult convergence, when some trouble drives the
calculation to diverge, this is not just useful but indispensable to converge at all.
<br>
<br>
The input variable enabling an increase of the block size without increasing
the number of band processors is named <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>. The
block size is then defined as:
<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>x<a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>.
In the following, we keep the same input file as previously and add:
<br> 
<pre>- nstep 10<br>+ nstep 20<br>+ npband 27 npfft 4<br>
</pre>
You can copy the input file
tgspw_03.in, then run ABINIT over 108 cores with <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=1 and then with <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=2. 
The output files will be named tgspw_03.bandpp1.out and tgspw_03.bandpp2.out,
respectively. A comparison between these two files shows that the
convergence is better in the second case. The convergence is even
achieved before the input file limit of 20 electronic iterations. Quod Erat Demonstrandum:<br>
<br>
<span style="font-weight: bold; font-style: italic;">For a given
number of processors, it is possible to improve the convergence by increasing bandpp.<br>
<br>
</span>We can also compare the result obtained for the (27x4) distribution and <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=2 with the (54x2) one and <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=1. Use the same input file and add:<br>
<pre>- npband 27 npfft 4<br>+ npband 54 npfft 2 bandpp 1<br>
</pre>
Then run ABINIT over 108 cores and copy the output file to
tgspw_03.054-02.out. Perform a
diff (with vimdiff for example) between the two output files
tgspw_03.bandpp2.out
and tgspw_03.054-02.out.
You can convince yourself that the two calculations, (54x2)
with <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=1 and (27x4)
with <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=2,
give exactly the same convergence. This result is expected, since the block
sizes are equal (to 54) and the number of FFT processors <a href="../input_variables/varpar.html#npfft" target="kwimg">npfft</a>
does not affect the convergence. <br>
<p><span style="font-weight: bold; font-style: italic;">It is possible to modify the distribution of processors,
without changing the convergence, by reducing npband and increasing bandpp proportionally. <br>
</span></p>
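<p><i>For completeness, the three runs of this section can be chained in the same spirit as before
(a sketch; it assumes the distribution keywords are appended at the end of tgspw_03.in, so adapt it
if the provided file already sets them, and it removes the previous output between runs):</i></p>
<pre>cp tgspw_03.in tmp.file
# (27x4) with bandpp=1, then with bandpp=2
for bpp in 1 2 ; do
  cp tmp.file tgspw_03.in
  echo "npband 27 npfft 4 bandpp $bpp" &gt;&gt; tgspw_03.in
  rm -f tgspw_03.out
  mpirun -n 108 abinit &lt; tgspw_03.files &gt; log 2&gt;&amp;1
  cp tgspw_03.out tgspw_03.bandpp$bpp.out
done
# (54x2) with bandpp=1: same block size (54) as (27x4) with bandpp=2
cp tmp.file tgspw_03.in
echo "npband 54 npfft 2 bandpp 1" &gt;&gt; tgspw_03.in
rm -f tgspw_03.out
mpirun -n 108 abinit &lt; tgspw_03.files &gt; log 2&gt;&amp;1
cp tgspw_03.out tgspw_03.054-02.out
# compare the SCF histories (the ETOT lines should be identical step by step)
diff tgspw_03.bandpp2.out tgspw_03.054-02.out
</pre>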

<hr><a name="4">&nbsp;</a>
<h3><b>4. Meaning of&nbsp;</b><a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a><b>: part 2 (efficiency)<br>
</b></h3>
<p>In the previous section, we showed that the convergence doesn't change if
<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a> and
<a href="../input_variables/varpar.html#npband" target="kwimg">npband</a> change in inverse proportions.
What about the influence of&nbsp;<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a> 
if&nbsp; you fix the distribution? Two cases have to be treated separately.</p>
<p>You can see, in the previous calculations of section <a href="lesson_paral_gspw.html#3">3.</a>,
that the timing increases when <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a> increases:</p>
<pre>&gt;&gt; grep Proc tgspw_03.bandpp1.out tgspw_03.bandpp2.out<br>&gt;&gt; tgspw_03.bandpp1.out:- Proc.   0 individual time (sec): cpu=        121.4  wall=        121.4<br>&gt;&gt; tgspw_03.bandpp2.out:- Proc.   0 individual time (sec): cpu=        150.7  wall=        150.7<br></pre>
<p>while there are fewer electronic iterations for
<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=2 (19) than for
<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=1 (20). If you
perform a diff between these two files, you will see that the
increase in time is essentially due to the section "zheegv-dsyegv".</p>
<pre>&gt;&gt; grep "zheegv-dsyegv" tgspw_03.bandpp1.out tgspw_03.bandpp2.out<br>&gt;&gt; tgspw_03.bandpp1.out:- zheegv-dsyegv&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1321.797&nbsp; 10.1&nbsp;&nbsp; 1323.215&nbsp; 10.1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 513216<br>&gt;&gt; tgspw_03.bandpp2.out:- zheegv-dsyegv&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 5166.002&nbsp; 31.8&nbsp;&nbsp; 5164.574&nbsp; 31.8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 244944<br></pre>
<p>The "zheegv-dsyegv" is a part of the LOBPCG algorithm which is performed in sequential, so the same calculation is
done on each processor. In the second calculation, the size
of the block being larger (27x2=54) than in the first (27), the
computational time of this diagonalization is more expensive. To sum up, the timing of a single electronic
step increases by increasing <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>, but the convergence improves.<br>


</p>

<br>
<span style="font-weight: bold; font-style: italic;">Do not increase too
much the bandpp value, unless you decrease proportionally <a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>
or if you want to improve the convergence whatever the cost in total timing.</span><br>
<br>
The only exception is when <a href="../input_variables/vardev.html#istwfk" target="kwimg">istwfk</a>=2,
i.e. when real wavefunctions are employed. This occurs when the Gamma
point alone is used to sample the Brillouin zone. You can use the input
file tgspw_04.in to check this. The input is modified with respect to the previous
input files in order to be more realistic and use only the Gamma point.
Add
<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=1, 2, 4 or 6 in the input file
tgspw_04.in and run ABINIT in each case over 108 cores.
You will obtain four output files named tgspw_04.bandpp1.out,
tgspw_04.bandpp2.out,
tgspw_04.bandpp4.out and
tgspw_04.bandpp6.out in
reference to&nbsp;<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>.
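<p><i>These four runs can again be scripted, for instance as in the sketch below (same assumptions
as in the previous sections about where the bandpp keyword is appended and about cleaning up the
previous output file):</i></p>
<pre>cp tgspw_04.in tmp.file
for bpp in 1 2 4 6 ; do
  cp tmp.file tgspw_04.in
  echo "bandpp $bpp" &gt;&gt; tgspw_04.in
  rm -f tgspw_04.out
  mpirun -n 108 abinit &lt; tgspw_04.files &gt; log 2&gt;&amp;1
  cp tgspw_04.out tgspw_04.bandpp$bpp.out
done
</pre>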
If you compare the outputs of these calculations:<span style="font-family: monospace;"><br>
</span>
<pre><span style="font-family: monospace;"></span></pre>

<pre>&gt;&gt; grep Proc *out<br>&gt;&gt; tgspw_04.bandpp1.out:- Proc.&nbsp;&nbsp; 0 individual time (sec): cpu=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 61.4&nbsp; wall=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 61.4<br>&gt;&gt; tgspw_04.bandpp2.out:- Proc.&nbsp;&nbsp; 0 individual time (sec): cpu=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 49.0&nbsp; wall=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 49.0<br>&gt;&gt; tgspw_04.bandpp4.out:- Proc.&nbsp;&nbsp; 0 individual time (sec): cpu=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 62.5&nbsp; wall=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 62.5<br>&gt;&gt; tgspw_04.bandpp6.out:- Proc.&nbsp;&nbsp; 0 individual time (sec): cpu=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 75.3&nbsp; wall=&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 75.3</pre>
you can see that the timing decreases for&nbsp;<a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=2
and increases thereafter. This behaviour comes from the FFTs. For <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=2,
the real wavefunctions are associated in pairs in the complex FFTs,
leading to a reduction by a factor of 2 of the timing in this part (you can see this reduction by a diff of the output files).
Above <a href="../input_variables/varpar.html#bandpp" target="kwimg">bandpp</a>=2,
there is no longer any gain in the FFTs, whereas significant losses in
computational time appear in the "zheegv-dsyegv" section.<br>
<br>

<span style="font-weight: bold; font-style: italic;">When calculations are
performed at the Gamma point, you are strongly encouraged to use bandpp=2...
or more </span><span style="font-weight: bold; font-style: italic;">if you need
to improve the convergence whatever the timing</span><span style="font-weight: bold; font-style: italic;">.</span><br>
<br>
<hr><a name="5">&nbsp;</a>
<h3><b>5. The KGB parallelization.</b></h3>
<p>Up to now, we have only performed a GB parallelization. This implies
a parallelization over two levels: over plane waves, or over bands and FFTs,
depending on the section of the code (see the <a href="http://arxiv.org/abs/0707.3405">paper</a>
or <a href="http://www.abinit.org/community/events/program3rd">presentation</a>). If the system
has more than 1 <span style="font-weight: bold;">k</span>-point,
one can add a third level of parallelization and perform a real KGB
parallelization. There is no additional difficulty in adding 
processors on this level. In order to explain the procedure, we
restart with the same input file that was used in section <a href="lesson_paral_gspw.html#2">2.</a>
and add a denser Monkhorst-Pack k-point grid (see the input
file tgspw_05.in). In
this case, the system has 4 <span style="font-weight: bold;">k</span>-points in
the IBZ so the calculation can be parallelized over (at most) 4 <span style="font-weight: bold;">k</span>-point processors. This is done using the
<a href="../input_variables/varpar.html#npkpt" target="kwimg">npkpt</a> input
variable:<br> </p>
<pre> + npkpt 4<br></pre>
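<p><i>In practice, this run could be launched as in the sketch below; whether npkpt (and the band/FFT
distribution) is already set in the provided tgspw_05.in depends on the version of the input files,
so adapt the appended line accordingly.</i></p>
<pre>cp ~abinit/tests/tutoparal/Input/tgspw_05.in .
cp ~abinit/tests/tutoparal/Input/tgspw_05.files .
echo "npkpt 4 npband 27 npfft 4" &gt;&gt; tgspw_05.in
mpirun -n 432 abinit &lt; tgspw_05.files &gt; log 2&gt;&amp;1
</pre>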
<p>This implies we use four times more processors than before, so run ABINIT
      over 432 CPU cores. The timing obtained in the output file tgspw_05.out:<br>
</p>
<pre>&gt;&gt; grep Proc tgspw_05.out<br>&gt;&gt; - Proc.   0 individual time (sec): cpu=         87.3  wall=         87.3</pre>
<p>is quasi-identical to the one obtained for 1 <span style="font-weight: bold;">k</span>-point
(69.9 sec, see the output file tgspw_02.027-04.out).
This means that a calculation 4 times larger (due to the increased number of <span style="font-weight: bold;">k</span>-points)
takes approximately the same wall-clock time if you parallelize over all the <span style="font-weight: bold;">k</span>-points. 
You have just re-derived a well-established result: the scaling (the speedup) is quasi-linear at the
<span style="font-weight: bold;">k</span>-point level.<br>
</p>
<p><span style="font-weight: bold; font-style: italic;">When you want
to parallelize a calculation, begin by the k-point level, then follow
with the band level (up to <a href="../input_variables/varpar.html#npband" target="kwimg">npband</a>=50
typically) then finish by the FFT level.<br>
</span></p>
<p>Here, the timing obtained for the output tgspw_05.out leads
to a hypothetical speedup of 346 (roughly 432 x 69.9/87.3), which is good, but not the 432 one would
expect if the scaling were linear in the number of <span style="font-weight: bold;">k</span>-point
processors. Indeed, to be complete, we have to mention
that the timing obtained in this output is slightly longer (17 sec.
more) than the one obtained in tgspw_02.027-04.out.
Let us compare the time spent in the various routines. A first clue
comes from the timing of everything below the "vtowfk level", which contains
99.xyz% of the sequential processor time: <br>
</p>
<pre>&gt;&gt; grep "vtowfk   " tgspw_05.out tgspw_02.027-04.out<br>&gt;&gt; tgspw_05.out:- vtowfk                     26409.565  70.7  26409.553  70.7           4320<br>&gt;&gt; tgspw_05.out:- vtowfk                     26409.565  70.7  26409.553  70.7           4320<br>&gt;&gt; tgspw_02.027-04.out:- vtowfk                      6372.940  84.8   6372.958  84.8           1080<br>&gt;&gt; tgspw_02.027-04.out:- vtowfk                      6372.940  84.8   6372.958  84.8           1080<br><br></pre>We
see that the KGB parallelization performs really well, since
the wall-clock time spent within vtowfk per core is approximately equal in the two runs: 26409/432
~ 61 sec versus 6372/108 ~ 59 sec. So, the speedup is quasi-linear below vtowfk. The problem
comes from the parts above vtowfk,
which are not parallelized and are responsible for the seemingly negligible
(100-99.xyz)% of time spent sequentially. These parts are no longer
negligible when you parallelize over hundreds of processors. <br>
<br>
You can also see this from the percentages rather than from the absolute
overall times (shown above): the time spent in vtowfk corresponds to 84.8% of
the overall time when you don't parallelize over <span style="font-weight: bold;">k</span>-points, and only 70.7% when you do. This means you
lose time above vtowfk in this case.


<p>This behaviour is related to
<a href="http://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a>:
"<span style="font-style: italic;">The speedup of a program using multiple processors in parallel
computing is limited by the time needed for the sequential fraction of
the program. For example, if a program needs 20 hours using a single
processor core, and a particular portion of 1 hour cannot be
parallelized, while the remaining portion of 19 hours (95%)
can be parallelized, then regardless of how many processors we devote
to a parallelized execution of this program, the minimum execution time
cannot be less than that critical 1 hour. Hence the speedup is limited
to at most 20.</span>"</p>
<p><span style="font-weight: bold; font-style: italic;">In our case, the part above the loop over k-point
in not parallelized by the KGB parallelization. Even if this part is
very small, less than 1%, when the biggest part below is strongly
(perfectly) parallelized, this small part determines an upper bound for the speedup. </span><br>
</p>
<p><br>

</p>

<hr><br>
To do in the future:<span style="font-weight: bold; font-style: italic;"></span>
<ul>
<li><a href="lesson_paral_gspw.html">6.</a> What else about the convergency: wfoptalg, nline 
  </li><li><a href="lesson_paral_gspw.html">7.</a> MPI-IO is direct </li><li><a href="lesson_paral_gspw.html">8.</a> Future: only one keyword, then ... none.</li></ul>

<script type="text/javascript" src="list_internal_links.js"> </script>


</body></html>