<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!--Converted with LaTeX2HTML 2019.2 (Released June 5, 2019) -->
<HTML lang="EN">
<HEAD>
<TITLE>4.4 Parallelization issues</TITLE>
<META NAME="description" CONTENT="4.4 Parallelization issues">
<META NAME="keywords" CONTENT="user_guide">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="viewport" CONTENT="width=device-width, initial-scale=1.0">
<META NAME="Generator" CONTENT="LaTeX2HTML v2019.2">
<LINK REL="STYLESHEET" HREF="user_guide.css">
<LINK REL="next" HREF="node19.html">
<LINK REL="previous" HREF="node17.html">
<LINK REL="next" HREF="node19.html">
</HEAD>
<BODY >
<!--Navigation Panel-->
<A
HREF="node19.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A>
<A
HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A>
<A
HREF="node17.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A>
<A ID="tex2html199"
HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>
<BR>
<B> Next:</B> <A
HREF="node19.html">4.5 Understanding the time</A>
<B> Up:</B> <A
HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
HREF="node17.html">4.3 File space requirements</A>
<B> <A ID="tex2html200"
HREF="node1.html">Contents</A></B>
<BR>
<BR>
<!--End of Navigation Panel-->
<H2><A ID="SECTION00054000000000000000"></A>
<A ID="SubSec:badpara"></A>
<BR>
4.4 Parallelization issues
</H2>
<P>
<TT>pw.x</TT> can in principle run on any number of processors.
The effectiveness of parallelization is ultimately judged by the
"scaling", i.e. how the time needed to perform a job scales
with the number of processors, and depends upon:
<UL>
<LI>the size and type of the system under study;
</LI>
<LI>the judicious choice of the various levels of parallelization
(detailed in Sec.<A HREF="#SubSec:badpara">4.4</A>);
</LI>
<LI>the availability of fast interprocess communications (or lack of it).
</LI>
</UL>
Ideally one would like to have linear scaling, i.e. <!-- MATH
$T \sim T_0/N_p$
-->
<I>T</I>∼<I>T</I><SUB>0</SUB>/<I>N</I><SUB>p</SUB> for
<I>N</I><SUB>p</SUB> processors, where <I>T</I><SUB>0</SUB> is the estimated time for serial execution.
In addition, one would like to have linear scaling of
the RAM per processor: <!-- MATH
$O_N \sim O_0/N_p$
-->
<I>O</I><SUB>N</SUB>∼<I>O</I><SUB>0</SUB>/<I>N</I><SUB>p</SUB>, so that large-memory systems
fit into the RAM of each processor.
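<P>
In practice, the parallelization levels discussed in this section are
selected at run time through command-line options of <TT>pw.x</TT>:
<TT>-nk</TT> (k-point pools), <TT>-ntg</TT> (FFT task groups) and
<TT>-nd</TT> (linear-algebra group). The following is only an illustrative
sketch: the launcher name, the number of processes and the file names are
assumptions to be adapted to your machine.
<PRE>
# 16 MPI processes: 4 k-point pools (4 processes each), 2 task groups
# per pool, a 4-process linear-algebra group (-nd must be a square integer)
mpirun -np 16 pw.x -nk 4 -ntg 2 -nd 4 -input myjob.in > myjob.out
</PRE>
How to choose each of these numbers is discussed below.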
<P>
Parallelization on k-points:
<UL>
<LI>guarantees (almost) linear scaling if the number of k-points
is a multiple of the number of pools (see the example after this list);
</LI>
<LI>requires little communication (suitable even for Ethernet connections);
</LI>
<LI>reduces the required memory per processor by distributing wavefunctions
(but not other quantities like the charge density), unless you set
<TT>disk_io='high'</TT>.
</LI>
</UL>
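<P>
As a sketch of k-point parallelization only (the number of k-points, the
process count and the file names here are hypothetical): a calculation with
8 k-points could be run on 16 processors divided into 8 pools of 2 processors each,
<PRE>
# 8 pools, 2 processes per pool; scaling over pools is (almost) linear
# because the number of k-points (8) is a multiple of the number of pools
mpirun -np 16 pw.x -nk 8 -input myjob.in > myjob.out
</PRE>
<P>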
Parallelization on PWs:
<UL>
<LI>yields good to very good scaling, especially if the number of processors
in a pool is a divisor of <I>N</I><SUB>3</SUB> and <I>N</I><SUB>r3</SUB> (the dimensions along the z-axis
of the FFT grids, <TT>nr3</TT> and <TT>nr3s</TT>, which coincide for NCPPs), as in the example after this list;
</LI>
<LI>requires heavy communication (Gigabit Ethernet is adequate for at most
4 to 8 CPUs; specialized communication hardware is needed for 8 or more
processors);
</LI>
<LI>yields almost linear reduction of memory per processor with the number
of processors in the pool.
</LI>
</UL>
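<P>
As an illustrative sketch only (the FFT dimension, process counts and file
names are assumptions): if <TT>nr3</TT> = 96, pools of 8, 12, 16, 24, 32, 48 or 96
processors divide the FFT planes evenly, so on 32 processors one may try
<PRE>
# 4 pools of 8 processes each: 8 is a divisor of nr3=96, so every
# process in a pool holds the same number of z-planes of the FFT grid
mpirun -np 32 pw.x -nk 4 -input myjob.in > myjob.out
</PRE>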
<P>
A note on scaling: optimal serial performance is achieved when the data are
kept in the cache as much as possible. As a side effect, PW
parallelization may yield superlinear (better than linear) scaling,
thanks to the increase in serial speed that comes from the reduction of data size
(making it easier for the machine to keep data in the cache).
<P>
VERY IMPORTANT: for each system there is an optimal range of numbers of processors on which to
run the job. Too large a number of processors will yield performance
degradation. The size of pools is especially delicate: <I>N</I><SUB>p</SUB> should not
exceed <I>N</I><SUB>3</SUB> and <I>N</I><SUB>r3</SUB>, and should ideally be no larger than
one half to one quarter of <I>N</I><SUB>3</SUB> and/or <I>N</I><SUB>r3</SUB>. In order to increase scalability,
it is often convenient to
further subdivide a pool of processors into "task groups":
when the number of processors exceeds the number of FFT planes,
data can be redistributed to task groups so that each group
can process several wavefunctions at the same time.
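<P>
As a hypothetical sketch (the number of FFT planes, the process count and
the file names are assumptions): if a pool contains more processors than there
are FFT planes along z, say 128 processors for <TT>nr3</TT> = 64, task groups can be enabled with
<PRE>
# -ntg 2 splits the 128 processes of the pool into 2 task groups of 64;
# each task group FFT-transforms a different subset of wavefunctions
mpirun -np 128 pw.x -ntg 2 -input myjob.in > myjob.out
</PRE>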
<P>
The optimal number of processors for "linear-algebra"
parallelization, which takes care of the multiplication and diagonalization
of <I>M</I>×<I>M</I> matrices, should be determined by observing the
performance of <TT>cdiagh/rdiagh</TT> (<TT>pw.x</TT>) or <TT>ortho</TT> (<TT>cp.x</TT>)
for different numbers of processors in the linear-algebra group
(which must be a square integer).
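<P>
A possible way to perform such a test (the process count and file names are
assumptions; the values given to <TT>-nd</TT> must be square integers):
<PRE>
# run the same job with different linear-algebra group sizes and compare
# the times reported for the routines mentioned above in the final report
mpirun -np 64 pw.x -nd 4  -input myjob.in > job_nd4.out
mpirun -np 64 pw.x -nd 9  -input myjob.in > job_nd9.out
mpirun -np 64 pw.x -nd 16 -input myjob.in > job_nd16.out
</PRE>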
<P>
Actual parallel performance will also depend on the available software
(MPI libraries) and on the available communication hardware. For
PC clusters, OpenMPI (<TT>http://www.openmpi.org/</TT>) seems to yield better
performance than other implementations (info by Konstantin Kudin).
Note however that you need decent communication hardware (at least
Gigabit Ethernet) in order to obtain acceptable performance with
PW parallelization. Do not expect good scaling with cheap hardware:
PW calculations are by no means an "embarrassingly parallel" problem.
<P>
Also note that multiprocessor motherboards for Intel Pentium CPUs typically
have just one memory bus for all processors. This dramatically
slows down any code that makes heavy use of memory (as most codes
in the Q<SMALL>UANTUM </SMALL>ESPRESSO distribution do) when it runs on several processors of the same
motherboard.
<P>
<HR>
<!--Navigation Panel-->
<A
HREF="node19.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A>
<A
HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A>
<A
HREF="node17.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A>
<A ID="tex2html199"
HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>
<BR>
<B> Next:</B> <A
HREF="node19.html">4.5 Understanding the time</A>
<B> Up:</B> <A
HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
HREF="node17.html">4.3 File space requirements</A>
<B> <A ID="tex2html200"
HREF="node1.html">Contents</A></B>
<!--End of Navigation Panel-->
</BODY>
</HTML>