File: node18.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<!--Converted with LaTeX2HTML 2019.2 (Released June 5, 2019) -->
<HTML lang="EN">
<HEAD>
<TITLE>4.4 Parallelization issues</TITLE>
<META NAME="description" CONTENT="4.4 Parallelization issues">
<META NAME="keywords" CONTENT="user_guide">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="viewport" CONTENT="width=device-width, initial-scale=1.0">
<META NAME="Generator" CONTENT="LaTeX2HTML v2019.2">

<LINK REL="STYLESHEET" HREF="user_guide.css">

<LINK REL="next" HREF="node19.html">
<LINK REL="previous" HREF="node17.html">
<LINK REL="next" HREF="node19.html">
</HEAD>

<BODY >
<!--Navigation Panel-->
<A
 HREF="node19.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A> 
<A
 HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A> 
<A
 HREF="node17.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A> 
<A ID="tex2html199"
  HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>  
<BR>
<B> Next:</B> <A
 HREF="node19.html">4.5 Understanding the time</A>
<B> Up:</B> <A
 HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
 HREF="node17.html">4.3 File space requirements</A>
 &nbsp; <B>  <A ID="tex2html200"
  HREF="node1.html">Contents</A></B> 
<BR>
<BR>
<!--End of Navigation Panel-->

<H2><A ID="SECTION00054000000000000000"></A>
<A ID="SubSec:badpara"></A>
<BR>
4.4 Parallelization issues
</H2>

<P>
<TT>pw.x</TT> can in principle run on any number of processors.
The effectiveness of parallelization is ultimately judged by the
"scaling", i.e. how the time needed to perform a job scales
with the number of processors, and depends upon:

<UL>
<LI>the size and type of the system under study;
</LI>
<LI>the judicious choice of the various levels of parallelization 
(detailed in Sec.<A HREF="#SubSec:badpara">4.4</A>);
</LI>
<LI>the availability of fast interprocess communications (or lack of it).
</LI>
</UL>
Ideally one would like to have linear scaling, i.e. <!-- MATH
 $T \sim T_0/N_p$
 -->
<I>T</I>∼<I>T</I><SUB>0</SUB>/<I>N</I><SUB>p</SUB> for 
<I>N</I><SUB>p</SUB> processors, where <I>T</I><SUB>0</SUB> is the estimated time for serial execution.
 In addition, one would like to have linear scaling of
the RAM per processor: <!-- MATH
 $O_N \sim O_0/N_p$
 -->
<I>O</I><SUB>N</SUB>∼<I>O</I><SUB>0</SUB>/<I>N</I><SUB>p</SUB>, so that large-memory systems
fit into the RAM of each processor.
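
<P>
As a purely illustrative example (the numbers are hypothetical, not measured):
if a serial run takes <I>T</I><SUB>0</SUB> = 1000 s, ideal linear scaling on
<I>N</I><SUB>p</SUB> = 16 processors would give <I>T</I> = 1000/16 = 62.5 s;
a measured wall time of 100 s would then correspond to a parallel efficiency of
<I>T</I><SUB>0</SUB>/(<I>N</I><SUB>p</SUB><I>T</I>) = 1000/1600 = 62.5%.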

<P>
Parallelization on k-points (an example command line is shown after this list):

<UL>
<LI>guarantees (almost) linear scaling if the number of k-points
is a multiple of the number of pools;
</LI>
<LI>requires little communication (suitable even for plain Ethernet connections);
</LI>
<LI>reduces the required memory per processor by distributing wavefunctions
(but not other quantities like the charge density), unless you set 
<TT>disk_io='high'</TT>.
</LI>
</UL>
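<P>
For the k-point (pool) parallelization described above, a typical command line
might look as follows (a minimal sketch; the input and output file names are
hypothetical, and <TT>-nk</TT> is the command-line option selecting the number
of pools):
<PRE>
# 8 MPI processes split into 4 pools of 2 processes each:
# a sensible choice if the number of k-points is a multiple of 4
mpirun -np 8 pw.x -nk 4 -in pw.scf.in &gt; pw.scf.out
</PRE>

<P>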
Parallelization on PWs (see the example after this list):

<UL>
<LI>yields good to very good scaling, especially if the number of processors
in a pool is a divisor of <I>N</I><SUB>3</SUB> and <I>N</I><SUB>r3</SUB> (the dimensions along the z-axis 
of the FFT grids, <TT>nr3</TT> and <TT>nr3s</TT>, which coincide for NCPPs);
</LI>
<LI>requires heavy communication (Gigabit Ethernet is adequate for
4 to 8 CPUs at most; specialized communication hardware is needed for 8 or more
processors);
</LI>
<LI>yields almost linear reduction of memory per processor with the number
of processors in the pool.
</LI>
</UL>
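
<P>
As a concrete (hypothetical) illustration of the divisor rule above: suppose
<TT>nr3</TT> = <TT>nr3s</TT> = 96. Choosing the number of processors per pool
among the divisors of 96 keeps the FFT planes evenly distributed:
<PRE>
# 48 MPI processes as 2 pools of 24 each; 24 divides nr3 = 96,
# so each process in a pool holds exactly 4 FFT planes
mpirun -np 48 pw.x -nk 2 -in pw.scf.in &gt; pw.scf.out
</PRE>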

<P>
A note on scaling: optimal serial performance is achieved when the data are
kept in the cache as much as possible. As a side effect, PW
parallelization may yield superlinear (better than linear) scaling,
thanks to the increase in serial speed that comes from the reduction of the data size
per processor (which makes it easier for the machine to keep data in the cache).

<P>
VERY IMPORTANT: For each system there is an optimal range of numbers of processors on which to
run the job. Too large a number of processors will yield performance
degradation. The size of the pools is especially delicate: <I>N</I><SUB>p</SUB> should not
exceed <I>N</I><SUB>3</SUB> and <I>N</I><SUB>r3</SUB>, and should ideally be no larger than
one half to one quarter of <I>N</I><SUB>3</SUB> and/or <I>N</I><SUB>r3</SUB>. In order to increase scalability,
it is often convenient to
further subdivide a pool of processors into "task groups".
When the number of processors exceeds the number of FFT planes,
data can be redistributed to "task groups" so that each group
can process several wavefunctions at the same time.
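
<P>
A sketch of how task groups might be requested (the numbers are hypothetical;
<TT>-nt</TT>, also spelled <TT>-ntg</TT>, is the command-line option setting the
number of task groups):
<PRE>
# 256 MPI processes in a single pool, but only nr3 = 128 FFT planes:
# 2 task groups let each group of 128 processes work on a different
# set of wavefunctions at the same time
mpirun -np 256 pw.x -nt 2 -in pw.scf.in &gt; pw.scf.out
</PRE>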

<P>
The optimal number of processors for "linear-algebra"
parallelization, which takes care of the multiplication and diagonalization
of <I>M</I>×<I>M</I> matrices, should be determined by observing the
performance of <TT>cdiagh/rdiagh</TT> (<TT>pw.x</TT>) or <TT>ortho</TT> (<TT>cp.x</TT>)
for different numbers of processors in the linear-algebra group
(which must be a perfect square).
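
<P>
One possible way to carry out such a test (a minimal sketch; the process count
and file names are hypothetical, and <TT>-nd</TT>, also spelled <TT>-ndiag</TT>,
sets the size of the linear-algebra group) is to repeat the same job with
different square values and compare the times reported for the diagonalization
routines in the final timing summary:
<PRE>
# scan the linear-algebra group size over square integers
for nd in 1 4 9 16; do
  mpirun -np 64 pw.x -nd $nd -in pw.scf.in &gt; pw.scf.nd$nd.out
done
grep -E 'cdiagh|rdiagh' pw.scf.nd*.out   # compare the reported CPU/wall times
</PRE>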

<P>
Actual parallel performance will also depend on the available software
(MPI libraries) and on the available communication hardware. For
PC clusters, OpenMPI (<TT>http://www.openmpi.org/</TT>) seems to yield better
performance than other implementations (info by Konstantin Kudin).
Note however that you need decent communication hardware (at least
Gigabit Ethernet) in order to obtain acceptable performance with
PW parallelization. Do not expect good scaling with cheap hardware:
PW calculations are by no means an "embarrassingly parallel" problem.

<P>
Also note that multiprocessor motherboards for Intel Pentium CPUs typically
have just one memory bus for all processors. This dramatically
slows down any code that accesses memory heavily (as most codes
in the Q<SMALL>UANTUM </SMALL>ESPRESSO distribution do) when it runs on processors of the same
motherboard.

<P>
<HR>
<!--Navigation Panel-->
<A
 HREF="node19.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A> 
<A
 HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A> 
<A
 HREF="node17.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A> 
<A ID="tex2html199"
  HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>  
<BR>
<B> Next:</B> <A
 HREF="node19.html">4.5 Understanding the time</A>
<B> Up:</B> <A
 HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
 HREF="node17.html">4.3 File space requirements</A>
 &nbsp; <B>  <A ID="tex2html200"
  HREF="node1.html">Contents</A></B> 
<!--End of Navigation Panel-->

</BODY>
</HTML>