File: node19.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">

<!--Converted with LaTeX2HTML 2019.2 (Released June 5, 2019) -->
<HTML lang="EN">
<HEAD>
<TITLE>4.5 Understanding the time report</TITLE>
<META NAME="description" CONTENT="4.5 Understanding the time report">
<META NAME="keywords" CONTENT="user_guide">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="viewport" CONTENT="width=device-width, initial-scale=1.0">
<META NAME="Generator" CONTENT="LaTeX2HTML v2019.2">

<LINK REL="STYLESHEET" HREF="user_guide.css">

<LINK REL="next" HREF="node20.html">
<LINK REL="previous" HREF="node18.html">
<LINK REL="next" HREF="node20.html">
</HEAD>

<BODY >
<!--Navigation Panel-->
<A
 HREF="node20.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A> 
<A
 HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A> 
<A
 HREF="node18.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A> 
<A ID="tex2html201"
  HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>  
<BR>
<B> Next:</B> <A
 HREF="node20.html">4.6 Restarting</A>
<B> Up:</B> <A
 HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
 HREF="node18.html">4.4 Parallelization issues</A>
 &nbsp; <B>  <A ID="tex2html202"
  HREF="node1.html">Contents</A></B> 
<BR>
<BR>
<!--End of Navigation Panel-->
<!--Table of Child-Links-->
<A ID="CHILD_LINKS"><STRONG>Subsections</STRONG></A>

<UL>
<LI><A ID="tex2html203"
  HREF="node19.html#SECTION00055100000000000000">4.5.1 Serial execution</A>
<LI><A ID="tex2html204"
  HREF="node19.html#SECTION00055200000000000000">4.5.2 Parallel execution</A>
<UL>
<LI><A ID="tex2html205"
  HREF="node19.html#SECTION00055210000000000000">4.5.2.1  Quick estimate of parallelization parameters</A>
<LI><A ID="tex2html206"
  HREF="node19.html#SECTION00055220000000000000">4.5.2.2 Typical symptoms of bad/inadequate parallelization</A>
</UL></UL>
<!--End of Table of Child-Links-->
<HR>

<H2><A ID="SECTION00055000000000000000">
4.5 Understanding the time report</A>
</H2>

<P>
The time report printed at the end of a <TT>pw.x</TT> run contains much useful 
information that can be used to identify bottlenecks and improve 
performance.

<P>

<H3><A ID="SECTION00055100000000000000">
4.5.1 Serial execution</A>
</H3>
The following applies to calculations taking a sizable amount of time
(at least minutes): for short calculations (seconds), the time spent in 
the various initializations dominates. Any significant deviation from the
following picture signals some anomaly.

<P>

<UL>
<LI>For a typical job with norm-conserving PPs, the total (wall) time is mostly 
  spent in routine "electrons", calculating the self-consistent solution. 
</LI>
<LI>Most of the time spent in "electrons" is used by routine "c_bands", 
  calculating Kohn-Sham states. "sum_band" (calculating the charge density),
  "v_of_rho" (calculating the potential), "mix_rho" (charge density mixing)
  should take a small fraction of the time.
</LI>
<LI>Most of the time spent in "c_bands" is used by routines "cegterg" (k-points)
  or "regterg" (Gamma-point only), performing iterative diagonalization of
  the Kohn-Sham Hamiltonian in the PW basis set. 
</LI>
<LI>Most of the time spent in "*egterg" is used by routine "h_psi",
  calculating <I>Hψ</I> products. "cdiaghg" (k-points) or "rdiaghg" (Gamma-only), 
  performing subspace diagonalization, should take only a small fraction.
</LI>
<LI>Among the "general routines", most of the time is spent in FFT on Kohn-Sham
  states: "fftw", and to a smaller extent in other FFTs, "fft" and "ffts", 
  and in "calbec", calculating <!-- MATH
 $\langle\psi|\beta\rangle$
 -->
〈<I>ψ</I>| <I>β</I>〉 products. 
</LI>
<LI>Forces and stresses typically take of the order of 10 to 20%
  of the total time.
</LI>
</UL>
For PAW and Ultrasoft PPs, you will see a larger contribution from "sum_band" 
and a non-negligible "newd" contribution to the time spent in "electrons", 
but the overall picture is unchanged. You may drastically reduce the
overhead of Ultrasoft PPs by using input option "tqr=.true.".
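<P>
As a concrete illustration (a minimal sketch, not part of the <TT>Quantum ESPRESSO</TT>
distribution), the following Python script extracts the wall time of the routines
discussed above from a <TT>pw.x</TT> output file and prints their fraction of the total.
It assumes the usual "name : X.XXs CPU Y.YYs WALL" layout of the time report and that
the grand total is reported under the label "PWSCF"; adapt the pattern if your output
differs.
<PRE>
#!/usr/bin/env python3
# Hypothetical helper, not shipped with Quantum ESPRESSO: print the wall-time
# fraction of selected routines from a pw.x output file.
import re
import sys

# One time-report line, e.g. "  electrons  :  12.34s CPU   13.45s WALL ( 1 calls)"
PATTERN = re.compile(r"^\s*(\S+)\s*:\s*([\d.]+)s CPU\s*([\d.]+)s WALL")

def wall_times(path):
    """Return a dictionary {routine name: wall-clock seconds}."""
    times = {}
    with open(path) as f:
        for line in f:
            m = PATTERN.match(line)
            if m:
                times[m.group(1)] = float(m.group(3))
    return times

if __name__ == "__main__":
    t = wall_times(sys.argv[1])
    total = t.get("PWSCF", 0.0)        # assumed label of the grand total
    routines = ("electrons", "c_bands", "sum_band", "v_of_rho", "mix_rho",
                "cegterg", "regterg", "h_psi", "cdiaghg", "rdiaghg",
                "fftw", "fft", "ffts", "calbec", "forces", "stress")
    for name in routines:
        if name in t and total > 0.0:
            print("%-10s %10.1f s  (%5.1f%% of total)"
                  % (name, t[name], 100.0 * t[name] / total))
</PRE>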

<P>

<H3><A ID="SECTION00055200000000000000">
4.5.2 Parallel execution</A>
</H3>

<P>
The various parallelization levels should be used wisely in order to 
achieve good results. Let us summarize their effects on CPU time:

<UL>
<LI>Parallelization on FFT speeds up (with varying efficiency) almost 
  all routines, with the notable exception of "cdiaghg" and "rdiaghg".
</LI>
<LI>Parallelization on k-points speeds up (almost linearly) "c_bands" and 
  the routines it calls; partially speeds up "sum_band"; does not speed up
  "v_of_rho", "newd", "mix_rho" at all.
</LI>
<LI>Linear-algebra parallelization speeds up (though not always) "cdiaghg" and "rdiaghg".
</LI>
<LI>"task-group" parallelization speeds up "fftw".
</LI>
<LI>OpenMP parallelization speeds up "fftw", plus selected parts of the 
  calculation, plus (depending on the availability of OpenMP-aware
  libraries) some linear-algebra operations.
</LI>
</UL>
and on RAM:

<UL>
<LI>Parallelization on FFT distributes most arrays across processors
  (i.e. all G-space and R-space arrays) but not all of them (in
  particular, not the subspace Hamiltonian and overlap matrices).
</LI>
<LI>Linear-algebra parallelization also distributes the subspace Hamiltonian
  and overlap matrices.
</LI>
<LI>No other parallelization level distributes any memory.
</LI>
</UL>
In an ideally parallelized run, you should observe the following:

<UL>
<LI>CPU and wall time do not differ by much.
</LI>
<LI>Time usage is still dominated by the same routines as in the serial run.
</LI>
<LI>Routine "fft_scatter" (called by parallel FFT) takes a sizable part of
  the time spent in FFTs but does not dominate it.
</LI>
</UL>
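<P>
These checks can be automated. The following short Python sketch (hypothetical,
not part of the <TT>Quantum ESPRESSO</TT> distribution; the thresholds are arbitrary)
flags routines whose wall time greatly exceeds their CPU time and reports how large
a share of the FFT time goes into "fft_scatter", again assuming the usual
"name : X.XXs CPU Y.YYs WALL" layout of the time report.
<PRE>
#!/usr/bin/env python3
# Hypothetical check, not shipped with Quantum ESPRESSO: look for the two
# warning signs listed above in the time report of a parallel pw.x run.
import re
import sys

PATTERN = re.compile(r"^\s*(\S+)\s*:\s*([\d.]+)s CPU\s*([\d.]+)s WALL")

def cpu_wall(path):
    """Return {routine name: (CPU seconds, wall seconds)}."""
    times = {}
    with open(path) as f:
        for line in f:
            m = PATTERN.match(line)
            if m:
                times[m.group(1)] = (float(m.group(2)), float(m.group(3)))
    return times

if __name__ == "__main__":
    t = cpu_wall(sys.argv[1])
    # 1. CPU and wall time should not differ by much (threshold is arbitrary).
    for name, (cpu, wall) in sorted(t.items()):
        if wall > 1.5 * cpu and wall > 1.0:
            print("large CPU/wall gap in %-12s %.1fs CPU vs %.1fs WALL"
                  % (name, cpu, wall))
    # 2. "fft_scatter" should take a sizable, but not dominant, share of FFTs.
    fft_wall = sum(t[n][1] for n in ("fft", "ffts", "fftw") if n in t)
    if "fft_scatter" in t and fft_wall > 0.0:
        print("fft_scatter / FFT wall time = %.2f"
              % (t["fft_scatter"][1] / fft_wall))
</PRE>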

<P>

<H4><A ID="SECTION00055210000000000000">
4.5.2.1  Quick estimate of parallelization parameters</A>
</H4>

<P>
You need to know:

<UL>
<LI>the number of k-points, <I>N</I><SUB>k</SUB>
</LI>
<LI>the third dimension of the (smooth) FFT grid, <I>N</I><SUB>3</SUB>
</LI>
<LI>the number of Kohn-Sham states, <I>M</I>
</LI>
</UL>
These data allow you to set bounds on parallelization:

<UL>
<LI>k-point parallelization is limited to <I>N</I><SUB>k</SUB> processor pools: 
  <TT>-nk Nk</TT>
</LI>
<LI>FFT parallelization shouldn't exceed <I>N</I><SUB>3</SUB> processors, i.e. if you
  run with <TT>-nk Nk</TT>, use <!-- MATH
 $N=N_k\times N_3$
 -->
<I>N</I> = <I>N</I><SUB>k</SUB>×<I>N</I><SUB>3</SUB> MPI processes at most (<TT>mpirun -np N ...</TT>)
</LI>
<LI>Unless <I>M</I> is a few hundred or more, don't bother using linear-algebra
  parallelization
</LI>
</UL>
You will need to experiment a bit to find the best compromise. In order
to have good load balancing among MPI processes, the number of k-point
pools should be an integer divisor of <I>N</I><SUB>k</SUB>; the number of processors for
FFT parallelization should be an integer divisor of <I>N</I><SUB>3</SUB>. 
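<P>
As a worked example (a minimal sketch with made-up numbers, not part of the
<TT>Quantum ESPRESSO</TT> distribution; the function name, the 300-state threshold
and the sample values are invented for illustration), the following Python snippet
applies the bounds and divisibility rules above to propose a command line:
<PRE>
#!/usr/bin/env python3
# Hypothetical back-of-the-envelope helper, not part of Quantum ESPRESSO:
# given Nk, N3, M (defined above) and the number of MPI processes you plan
# to use, print a possible choice of parallelization parameters.
import math

def suggest(nk, n3, m, nproc):
    """nk: number of k-points; n3: third dimension of the smooth FFT grid;
    m: number of Kohn-Sham states; nproc: MPI processes available."""
    # Number of k-point pools: the largest value dividing both nk and nproc.
    pools = math.gcd(nk, nproc)
    per_pool = nproc // pools          # processes left for FFT parallelization
    print("suggested: mpirun -np %d pw.x -nk %d ..." % (nproc, pools))
    if per_pool > n3:
        print("warning: %d processes per pool exceed N3 = %d" % (per_pool, n3))
    elif n3 % per_pool != 0:
        print("note: N3 = %d is not a multiple of %d; FFT load balancing"
              " will be imperfect" % (n3, per_pool))
    if m >= 300:                       # "a few hundred or more"
        print("M = %d: linear-algebra parallelization may be worth trying" % m)
    else:
        print("M = %d: skip linear-algebra parallelization" % m)

# Example with made-up numbers: 8 k-points, 96 FFT planes, 200 states, 64 cores.
suggest(8, 96, 200, 64)
</PRE>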

<P>

<H4><A ID="SECTION00055220000000000000">
4.5.2.2 Typical symptoms of bad/inadequate parallelization</A>
</H4>

<P>

<UL>
<LI><EM>a large fraction of time is spent in "v_of_rho", "newd", "mix_rho"</EM>, or
<BR><EM>the time doesn't scale well or doesn't scale at all when increasing the 
  number of processors for k-point parallelization.</EM>  Solution:

<UL>
<LI>use (also) FFT parallelization if possible
</LI>
</UL>
</LI>
<LI><EM>a disproportionate time is spent in "cdiaghg"/"rdiaghg".</EM> Solutions:

<UL>
<LI>use (also) k-point parallelization if possible
</LI>
<LI>use linear-algebra parallelization, with ScaLAPACK if possible.
</LI>
</UL>
</LI>
<LI><EM>a disproportionate time is spent in "fft_scatter"</EM>, or
<EM>in "fft_scatter" the difference between CPU and wall time is large.</EM> Solutions:

<UL>
<LI>if you do not have fast (better than Gigabit Ethernet) communication
    hardware, do not try FFT parallelization on more than 4 or 8 processors.
</LI>
<LI>use (also) k-point parallelization if possible
</LI>
</UL>
</LI>
<LI><EM>the time doesn't scale well or doesn't scale at all when increasing the 
  number of processors for FFT parallelization.</EM>
    Solutions:

<UL>
<LI>use "task groups": try command-line option <TT>-ntg 4</TT> or
    <TT>-ntg 8</TT>. This may improve your scaling.
</LI>
</UL>
</LI>
</UL>

<P>
<HR>
<!--Navigation Panel-->
<A
 HREF="node20.html">
<IMG WIDTH="37" HEIGHT="24" ALT="next" SRC="next.png"></A> 
<A
 HREF="node14.html">
<IMG WIDTH="26" HEIGHT="24" ALT="up" SRC="up.png"></A> 
<A
 HREF="node18.html">
<IMG WIDTH="63" HEIGHT="24" ALT="previous" SRC="prev.png"></A> 
<A ID="tex2html201"
  HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALT="contents" SRC="contents.png"></A>  
<BR>
<B> Next:</B> <A
 HREF="node20.html">4.6 Restarting</A>
<B> Up:</B> <A
 HREF="node14.html">4 Performances</A>
<B> Previous:</B> <A
 HREF="node18.html">4.4 Parallelization issues</A>
 &nbsp; <B>  <A ID="tex2html202"
  HREF="node1.html">Contents</A></B> 
<!--End of Navigation Panel-->

</BODY>
</HTML>