<html>
<head>
<title>
Tutorial on Many-Body calculations in parallel</title>
</head>
<body bgcolor="#ffffff">
<hr>
<h1>ABINIT, tutorial on Many-Body calculations in parallel:</h1>
<h2>G<sub>0</sub>W<sub>0</sub> corrections in &alpha;-quartz SiO<sub>2</sub>. </h2>
<hr>
<p>This lesson aims at showing how to perform parallel calculations with the GW part of ABINIT. 
   We will discuss the approaches used to parallelize the different steps of a typical 
   G<sub>0</sub>W<sub>0</sub> calculation, and how to set up the parameters of the run in order to achieve a good speedup.
   &alpha;-quartz SiO<sub>2</sub> is used as the test case.
<!--
   To obtain the following physical properties of SiO<sub>2</sub>:
  <ul>
    <li>the RPA polarizability with the Adler-Wiser expression
    <li>the RPA polarizability with the Hilbert transform method
    <li>the quasi-particle (QP) corrections within the one-shot G<sub>0</sub>W<sub>0</sub> approximation
  </ul>
-->
  You are expected to have some knowledge of UNIX/Linux and to know how to
  submit MPI jobs.</p>
<p>
This lesson should take about 1.5 hours and requires access to a parallel computer with at least 200 CPU cores.
<p>You should also be familiar with the basics of parallelism in ABINIT, explained in the tutorial
<a href="lesson_basepar.html">A first introduction to ABINIT in parallel</a>.
<p>
In the following, when "run ABINIT over nn CPU cores" appears, you have to use a specific command line according 
to the operating system and architecture of the computer you are using. 
This can be for instance: 

<pre>
mpirun -n nn abinit < abinit.files 
</pre>

or the use of a specific submission file.
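<p>
As an illustration only, on a cluster managed with the Slurm batch system the submission file might look like
the minimal sketch below (the job name, walltime, module name and number of tasks are placeholders to be
adapted to your machine and to the run at hand):

<pre>
#!/bin/bash
#SBATCH --job-name=abinit_mbt
#SBATCH --ntasks=9            # number of MPI processes ("nn")
#SBATCH --time=01:00:00

module load abinit            # or set the PATH to your own ABINIT build
mpirun -n ${SLURM_NTASKS} abinit < abinit.files > abinit.log 2>&1
</pre>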

<!--
<a href="../input_variables/vargs.html#optdriver" target="kwimg">optdriver</a>,
<a href="../input_variables/vargw.html#gwcalctyp" target="kwimg">gwcalctyp</a>=12.
<a href="../input_variables/vargw.html#ecutsigx" target="kwimg">ecutsigx</a>,
<a href="../input_variables/vargw.html#ecutwfn" target="kwimg">ecutwfn</a>,
<a href="../input_variables/vargw.html#nfreqim" target="kwimg">nfreqim</a>,
<a href="../input_variables/vargw.html#nfreqre" target="kwimg">nfreqre</a>,
<a href="../input_variables/vargw.html#freqremax" target="kwimg">freqremax</a>,
<a href="../input_variables/vargs.html#optdriver" target="kwimg">optdriver</a>,
<a href="../input_variables/vargw.html#gwcalctyp" target="kwimg">gwcalctyp</a>=12.
-->

<h5>Copyright (C) 2000-2014 ABINIT group (MG,MS)
<br> This file is distributed under the terms of the GNU General Public License, see
~abinit/COPYING or <a href="http://www.gnu.org/copyleft/gpl.txt">
http://www.gnu.org/copyleft/gpl.txt </a>.
<br> For the initials of contributors, see ~abinit/doc/developers/contributors.txt .
</h5>

<script type="text/javascript" src="list_internal_links.js"> </script>

<h3><b>Contents of the lesson:</b></h3>

<ul>
  <li><a href="lesson_paral_mbt.html#KSS_gen"      >1</a> Generating the KSS file in parallel.
  <li><a href="lesson_paral_mbt.html#SCR_AW"       >2</a> Computing the screening in parallel using the Adler-Wiser expression
  <li><a href="lesson_paral_mbt.html#SCR_Hilbert"  >3</a> Computing the screening in parallel using the Hilbert transform method 
  <li><a href="lesson_paral_mbt.html#G0W0"         >4</a> Computing the one-shot GW corrections in parallel
  <li><a href="lesson_paral_mbt.html#Rules_for_MPI">5</a> Basic rules for efficient parallel calculations
</ul>
<hr>

<p>
The input files necessary to run the examples related to this tutorial are located
in the directory ~abinit/tests/tutoparal/Input.
<p>
Before beginning, you should create a working directory  whose name might be "Work_mbt" (so ~abinit/tests/tutoparal/Input/Work_mbt).
<p>
We will do most of the actions of this tutorial in this working directory.

<p><a name="KSS_gen"></a><br>
<h3><b>1. Generating the KSS file in parallel</b></h3>

<p>
In the <a href="lesson_gw1.html">first lesson</a> of the GW tutorial, we have learned how to 
generate the Kohn-Sham Structure file (KSS file) with the sequential version of the code.
Now we will perform a similar calculation taking advantage of the k-point parallelism
implemented in the ground-state part.
<!--
Si0<sub>2</sub> in the alpha-quartz phase will be used as test case.
-->

<p>
First of all, copy the files file tmbt_1.files into the working directory Work_mbt:
<pre>
$ cd Work_mbt
$ cp ../tmbt_1.files . 
</pre>
<p>

The abinit files file is described in 
<a href="../users/abinit_help.html#intro1" target="helpsimg">section 1.1</a> of the abinit_help file. 
Please read it now if you haven't done so yet!
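<p>
For reference, a files file has the usual structure: the main input file, the main output file, the prefixes
for input, output and temporary files, and finally one pseudopotential per atom type. Schematically, for this
run it looks like the sketch below (the pseudopotential names and paths are only illustrative; use the ones
listed in the tmbt_1.files provided with the tests):

<pre>
tmbt_1.in
tmbt_1.out
tmbt_1i
tmbt_1o
tmbt_1
../../../Psps_for_tests/14si.pspnc
../../../Psps_for_tests/8o.pspnc
</pre>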

<p>
Now open the input file ~abinit/tests/tutoparal/Input/tmbt_1.in in your preferred editor, and look
at its structure.
<p>
The first dataset performs a rather standard SCF calculation to obtain the ground-state density.
The second dataset reads the density file and calculates the Kohn-Sham band structure including many
empty states:

<pre>
# DATASET 2 : KSS generation
iscf2      -2      # NSCF
getden2    -1      # Read previous density
tolwfr2    1d-12   # Stopping criterion for the NSCF cycle.
kssform2    3      # Conjugate-gradient algorithm (recommended option for large systems)
nband2      160    # Number of (occ and empty) bands computed in the NSCF cycle.
nbdbuf2     10     # A large buffer helps to reduce the number of NSCF steps.
nbandkss2   150    # Number of bands stored in the KSS file (only the converged states are written).
</pre>

<p>
We have already encountered these variables in the <a href="lesson_gw1.html">first lesson</a>
of the GW tutorial so their meaning should be familiar to you.
<p>
The only thing worth stressing is that this calculation solves the NSCF cycle
with the conjugate-gradient method
(<a href="../input_variables/varfil.html#kssform" target="kwimg">kssform</a>=3),
instead of employing the direct diagonalization of the KS Hamiltonian
(<a href="../input_variables/varfil.html#kssform" target="kwimg">kssform</a>=1, default value).
We opt for the conjugate-gradient method since the parallel implementation of
the direct diagonalization is not efficient
and, besides, it is also much more memory demanding than the CG approach.
<!--
is not
efficient
This algorithm, however, is not suited for large system due to the large 
memory requirements (allocation of the KS Hamiltonian) and the bad scaling with 
<a href="../input_variables/varfil.html#kssform" target="kwimg">kssform</a>=1).
This algorithm, however, is not suited for large system due to the large 
memory requirements (allocation of the KS Hamiltonian) and the bad scaling with 
the size of problem. 
For this reason, we strongly suggest using 
<a href="../input_variables/varfil.html#kssform" target="kwimg">kssform</a>=3
for calculations in large systems.
<a href="../input_variables/vargs.html#nbdbuf" target="kwimg">nbdbuf</a> that is set
to a value much larger than the default one (<b>nbdbuf</b>=2).
Such large value allows us to exclude the problematic high-energy states from
the convergence tests thus reducing the number of iterations. 
High energy states are indeed difficult to coverge and the NSCF algorithm
might require many iterations to converge all the bands within <a href="varbas.html#tolwfr">tolwfr</a>, although
the lowest states are already very well-converged after few steps.
-after which the algorithm will stop.
is very important in the case of NSCF calculations involving many empty states, as it
helps reduce the number of iterations required to converge.
Note that the KSS file contains <a href="../input_variables/vargw.html#nbandkss" target="kwimg">nbandkss</a>=100,
hence only the states that are converged within <a href="../input_variables/varbas.html#tolwfr">tolwfr</a>
are saved in the final KSS file
<p>
-->

<!--
<a href="../input_variables/varbas.html#nband" target="kwimg">nband</a>
<a href="../input_variables/vargw.html#nbandkss" target="kwimg">nbandkss</a>
is the key variable to create a _KSS file. 
both <a href="../input_variables/varfil.html#kssform" target="kwimg">kssform</a>=1 and 
<a href="../input_variables/varfil.html#kssform" target="kwimg">kssform</a>=3 when you start calculations beyond the tutorial).
-->
<p>
The NSCF cycle is executed in parallel using the standard parallelism over k-points and spin in which the 
(<a href="../input_variables/varbas.html#nkpt" target="kwimg">nkpt</a> x
 <a href="../input_variables/varbas.html#nsppol" target="kwimg">nsppol</a>) blocks of bands are 
distributed among the nodes.
This test uses an unshifted 4x4x3 grid (48 k-points in the full Brillouin zone, folding to 9
k-points in the irreducible wedge), hence the theoretical maximum speedup is 9.

<p>
Now run ABINIT over nn CPU cores using

<pre>
(mpirun ...) abinit < tmbt_1.files >& tmbt_1.log &
</pre>

but keep in mind that, to avoid idle processors, the number of CPUs should divide 9.
At the end of the run, the code will produce the file tmbt_1o_DS2_KSS needed for the subsequent GW calculations.
<p>
With three nodes, the wall clock time is around 1.5 minutes.

<pre>
$ tail tmbt_1.out

-
- Proc.   0 individual time (sec): cpu=        209.0  wall=        209.0

================================================================================

 Calculation completed.
.Delivered    0 WARNINGs and   5 COMMENTs to log file.
+Overall time at end (sec) : cpu=        626.9  wall=        626.9
</pre>

<p>
A reference output file is given in ~abinit/tests/tutoparal/Refs, under the name tmbt_1.out.
<p>
Note that 150 bands are not enough to obtain converged GW results; you might increase the number 
of bands in proportion to your computing resources.

<hr>
<p><a name="SCR_AW"></a><br>
<h3><b>2. Computing the screening in parallel using the Adler-Wiser expression</b></h3>

<p>
In this part of the tutorial, we will compute the RPA polarizability with the Adler-Wiser approach.
The basic equations are discussed in this <a href="theory_mbt.html#RPA_Fourier_space">section</a> of the GW notes.
<p>
First copy the files file tmbt_2.files into the working directory, 
then create a symbolic link pointing to the KSS file we generated in the previous step:

<pre>
$ ln -s tmbt_1o_DS2_KSS tmbt_2i_KSS 
</pre>

<!--
Before running the code in parallel, it is worth spending some time to analyze the input file ~abinit/tests/tutoparal/Input/tmbt_2.in.
-->
Now open the input file ~abinit/tests/tutoparal/Input/tmbt_2.in so that we can discuss its structure.
<p>
The set of parameters controlling the screening computation is summarized below:

<pre>
optdriver   3   # Screening run
irdkss      1   # Read input KSS file
symchi      1   # Use symmetries to speedup the BZ integration
awtr        1   # Take advantage of time-reversal. Mandatory when gwpara=2 is used.
gwpara      2   # Parallelization over bands
ecutwfn     24  # Cutoff for the wavefunctions.
ecuteps     8   # Cutoff for the polarizability.
nband       50  # Number of bands in the RPA expression (24 occupied bands)
inclvkb     2   # Correct treatment of the optical limit.
</pre>

<p>
Most of the variables have been already discussed in the <a href="lesson_gw1.html">first</a> lesson
of the GW tutorial. The only variables that deserve some additional explanation are
<a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a> and 
<a href="../input_variables/vargw.html#awtr" target="kwimg">awtr</a>.
<!--
The screening is computed for two frequencies (the static limit and an purely imaginary frequency)
hence this SCR file can only be used for performing GW calculations within the plasmon-pole approximation.
The number of planewaves in the screning 
(<a href="../input_variables/vargw.html#ecuteps" target="kwimg">ecuteps</a>)
is already reasonable whereas the number of bands should be subject to converge tests (left to the user)
-->
<!--
<li><a href="../input_variables/vargw.html#symchi" target="kwimg">symchi</a></li>
-->
<!--
<p>
The value <a href="../input_variables/vargw.html#symchi" target="kwimg">symchi</a>=1 
tells the program that we want to take advantage of the symmetry properties of the
oscillator matrix elements to reduce the number of k-points 
that have to be explicitly considered in the integrals over the BZ.
The set of independent k-points is defined  by the operations of the so-called little group of q, 
namely the symmetry operations whose rotational part leaves q unchanged within a reciprocal lattice vector.
As a consequence the computational gain depends on the symmetries of the external q-point. 
-->

<!--
Note that <b>awtr</b>=2 is the mandatory prerequisite for using the MPI algorithm corresponding to
<a href="../input_variables/vargw.html#gwpara" target="kwimg">gwpara</a>=2, algorithm that is
documented
<p>
Here we just want to discuss some important technical details concerning
the implementation of the parallel algorithms used in the screening part
(<a href="../input_variables/vargs.html#optdriver" target="kwimg">optdriver</a>=3).
<h4><a name="2a"></a></h4>
<h4><b> 2.a Adler-Wiser formula for the RPA polarizability.</b></h4>
<p>
Screening calculations are very CPU and memory demanding due to the large number of bande required to converge.
For this reason the code implements two different parallel algorithms, selected with the input variable 
<a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>, 
that allow  a substancial descrease of the computational effort. 
Each method presents advantages and drawbacks that are extensively discussed in the documentation of the variable.
-->
<p>
<a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a> selects the parallel algorithm used to compute the screening.
Two different approaches are implemented:
<ul>
<li>
  <b>gwpara</b>=1 &rarr; Trivial parallelization over the k-points in the full Brillouin zone
</li>
<br>
<li>
<b>gwpara</b>=2 &rarr; Parallelization over bands with memory distribution
</li>
</ul>

Each method has advantages and drawbacks that are discussed in the documentation of the variable.
In this tutorial, we focus on <b>gwpara</b>=2 since this is the algorithm with 
the best MPI scalability and, most importantly, it is the only one that allows for a significant reduction 
of the memory requirements.
<p>
The option <a href="../input_variables/vargw.html#awtr" target="kwimg">awtr</a>=1 
specifies that the system has time-reversal symmetry, so that it is possible
to halve the number of transitions that have to be calculated explicitly 
(only resonant transitions are needed).
Note that <a href="../input_variables/vargw.html#awtr" target="kwimg">awtr</a>=1
is MANDATORY when <a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2 is used.
<p>
Before running the calculation in parallel, it is worth discussing some important technical details of the implementation. 
For our purposes, it suffices to say that, 
when <a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2 is used in the
screening part,  the code distributes the wavefunctions such that each processing unit 
owns the FULL set of occupied bands while the empty states are distributed among the nodes.
The parallel computation of the inverse dielectric matrix is done in three different steps that can 
be schematically described as follows:

<ol>
<li>
Each node computes the partial contribution to the RPA polarizability:

<p align="center"><img src=./lesson_paral_mbt/gwpara2_chi0.png></p>

</li>
<br>
<li>
The partial results are collected on each node. 
</li>
<br>
<li>
The master node performs the matrix inversion to obtain the inverse dielectric matrix and writes the final result on file.
</li>
</ol>

<p>
Both the first and the second step of the algorithm are expected to scale well with the number of processors.
Step 3, on the contrary, is performed sequentially, thus it has a detrimental effect 
on the overall scaling, especially in the case of large screening matrices
(large <a href="../input_variables/vargw.html#npweps" target="kwimg">npweps</a> or a large number of frequency points &omega;).
<p>
Note that the maximum number of CPUs that can be used is dictated by the number of empty states
used to compute the polarizability.
Most importantly, a balanced distribution of the computing time is obtained when the number of
processors divides the number of conduction states.
<!--
i.e. the difference between <a href="../input_variables/varbas.html#nband" target="kwimg">nband</a> and the number of valence bands.
-->
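<p>
As a concrete example for the input above (<b>nband</b>=50 with 24 occupied bands, i.e. 26 conduction states), 
a balanced distribution is obtained with e.g. 2, 13 or 26 MPI processes:

<pre>
nband = 50, occupied bands = 24  =>  conduction states = 50 - 24 = 26
balanced MPI process counts: 2, 13, 26   (divisors of 26)
</pre>

With, say, 8 processes, two nodes would treat 4 conduction bands while the others treat only 3, so part of the
machine would sit idle during the band summation.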
<p>
The main limitation of the present implementation is the storage of the polarizability.
This matrix is not distributed, hence each node must have enough memory to store 
a table whose size is given by (<b>npweps</b><sup>2</sup> x <b>nomega</b> x 16 bytes), where
<b>nomega</b> is the total number of frequencies computed.
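<p>
To get a feeling for the orders of magnitude involved, the formula above gives (the values of <b>npweps</b>
and <b>nomega</b> below are purely illustrative):

<pre>
npweps =  300, nomega =  2  :   300^2 *  2 * 16 bytes  ~  2.9 MB per node
npweps = 2000, nomega = 50  :  2000^2 * 50 * 16 bytes  ~  3.2 GB per node
</pre>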

<p>
Tests performed at the Barcelona Supercomputing Center (see the figures
below) show that the first and the second part of the MPI algorithm scale very well.
The routines cchi0 and cchi0q0, where the RPA expression is computed (steps 1 and 2), scale 
almost linearly up to 512 processors.
The degradation of the total speedup observed for a large number of processors is mainly due to the
portions of the computation that are not parallelized, namely the reading of the KSS file (rdkss)
and the matrix inversion (qloop).

<p align="center"><img src=./lesson_paral_mbt/screening_speedup.png></p>

<p align="center"><img src=./lesson_paral_mbt/screening_rel_times.png></p>

<!--
<p align="center"><img src=./lesson_paral_mbt/screening_rel_times_500bands.png></p>
<p align="center"><img src=./lesson_paral_mbt/screening_rel_times_1000bands.png></p>
-->

<p>
At this point, the most important technical details of the implementation have been covered,
and we can finally run ABINIT over nn CPU cores using 

<pre>
(mpirun ...) abinit < tmbt_2.files >& tmbt_2.log &
</pre>

<p>
Run the input file tmbt_2.in with different numbers of processors and keep track of the time for each run
so that we can test the scalability of the implementation.
The performance analysis reported in the figures above was obtained with PAW using ZnO as test case, 
but you should observe a similar behavior also in SiO<sub>2</sub>.
<p>
Now let's have a look at the output results.
Since this tutorial mainly focuses on how to run efficient MPI computations,
we won't perform any convergence study for SiO<sub>2</sub>. 
Most of the parameters used in the input files are already close to convergence; only the k-point sampling 
and the number of empty states should be increased.
You might modify the input files to perform the standard convergence tests
following the procedure described in the <a href="lesson_gw1.html">first lesson</a> of the GW tutorial.
<!--
calculate different SCR files with different values 
of <a href="../input_variables/varbas.html#nband" target="kwimg">nband</a>.
The SCR files can then be used to check the convergence of the QP corrections in the self-energy part.
We strongly suggest to use different input files since the calculations can be run independently 
and moreover one can easily tune the number of processors for each calculation in order to 
have a well balanced calculation.
this is left as useful excercise for the user.
-->
<p>
In the main output file, there is a section reporting how the bands are distributed among the nodes.
For a sequential calculation, we have

<p>
<pre>
 screening : taking advantage of time-reversal symmetry
 Maximum band index for partially occupied states nbvw =    24
 Remaining bands to be divided among processors   nbcw =    26
 Number of bands treated by each node ~   26
</pre>

The value reported in the last line will decrease when the computation is done with more processors.

<p>
The memory allocated for the wavefunctions scales with the number of processors.
You can use the grep utility to extract this information from the log file.
For a sequential calculation, we have:
<pre>
$ grep "Memory needed" tmbt_2.log

  Memory needed for storing ug=         29.5 [Mb]
  Memory needed for storing ur=        180.2 [Mb]
</pre>

<i>ug</i> denotes the internal buffer used to store the Fourier components of the orbitals, whose size
scales linearly with <a href="../input_variables/vargw.html#npwwfn"  target="kwimg">npwwfn</a>.
<i>ur</i> is the array storing the orbitals on the real-space FFT mesh. 
Keep in mind that the size of <i>ur</i> scales linearly with the total number of points in the FFT box,
a number that is usually much larger than the number of planewaves (<b>npwwfn</b>).
The number of FFT divisions used in the GW code can be extracted from the main output file using
<pre>
$ grep setmesh tmbt_2.out  -A 1
 setmesh: FFT mesh size selected  =  27x 27x 36
          total number of points  =    26244
</pre>

As discussed in this <a href="theory_mbt.html#oscillator_notes">section</a> of the GW notes,
the Fast Fourier Transform represents one of the most CPU-intensive parts of the execution.
For this reason the code provides 
the input variable <a href="../input_variables/vargw.html#fftgw" target="kwimg">fftgw</a>, which 
can be used to decrease the number of FFT points for better efficiency.
The second digit of the input variable <a href="../input_variables/vargw.html#gwmem" target="kwimg">gwmem</a>, 
instead, governs the storage of the real-space orbitals and can be used to avoid storing 
the costly array <i>ur</i>, at the price of an increase in computational time.
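<p>
As a rough illustration (the values below are just an example; see the documentation of the two variables
for the exact meaning of each digit):

<pre>
fftgw  11   # select a smaller FFT mesh for better efficiency (check the effect on accuracy)
gwmem  10   # second digit = 0: do not keep the real-space orbitals ur in memory
            # (saves the ~180 MB reported above, at the price of extra computational time)
</pre>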
<p>

<!--
TODO: Graph with gwpara=1,2
efficiency of the algorithms as a function of the number of processing units.
The number of transitions treated by each node is reported in the log files. 
We can use the grep utility to extract this info. For example on four nodes ...
In the log file
<pre>
 Using spectral method for the imaginary part =  0
 Using symmetries to sum only over the IBZ_q  =  1
 Using faster algorithm based on time reversal symmetry.
 Will sum 5616 (b,b',k,s) states in chi0q0.
</pre>
<p>
-->

<h4><a name="Mrgscr"></a></h4>
<h4><b> 2.d Manual parallelization over q-points.</b></h4>

<p>
The computational effort required by the screening computation scales linearly with the number of q-points.
As explained in this <a href="theory_mbt.html#RPA_Fourier_space">section</a> of the GW notes,
the code exploits the symmetries of the screening function so that only the irreducible Brillouin
zone (IBZ) has to be calculated explicitly.
On the other hand, a large number of q-points might be needed to achieve converged results.
Typical examples are GW calculations in metals or optical properties within the Bethe-Salpeter formalism.
<p>
If enough processing units are available, the linear factor due to the q-point sampling can be trivially 
absorbed by splitting the calculation of the q-points into several independent runs using the variables
<a href="../input_variables/vargw.html#nqptdm" target="kwimg">nqptdm</a> and
<a href="../input_variables/vargw.html#qptdm" target="kwimg">qptdm</a>.
The results can then be gathered in a single binary file by means of the <b>mrgscr</b> utility 
(see also the automatic tests v3/t87, v3/t88 and v3/t89).
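<p>
Schematically, the splitting could be organized as follows (the q-point coordinates are only illustrative;
the actual list of q-points in the IBZ is printed at the beginning of the screening output):

<pre>
# run 1: compute only the first q-point
nqptdm   1
qptdm    0.00  0.00  0.00

# run 2: compute only the second q-point
nqptdm   1
qptdm    0.25  0.00  0.00

# ... one independent run per q-point, then merge the partial SCR files
# with the mrgscr utility (see the automatic tests quoted above for working examples).
</pre>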
<p>
<!--
TODO 
@Martin: Could you describe the new features you've recently added?
It is also possible to merge different frequencies 
<p>
-->
<hr>

<a name="SCR_Hilbert"></a>
<h3><b>3. Computing the screening in parallel using the Hilbert transform method</b></h3>


As discussed in the <a href="theory_mbt.html#RPA_Fourier_space">GW notes</a>,
the algorithm based on the Adler-Wiser expression is not optimal when many frequencies are wanted. 
In this paragraph, we therefore discuss how to use the Hilbert transform method
to calculate the RPA polarizability on a dense frequency mesh.
The equations implemented in the code are documented in  <a href="theory_mbt.html#hilbert_transform">this section</a> of the GW notes.
<p>
As usual, copy the files file tmbt_3.files into the working directory, and
then create a symbolic link pointing to the KSS file:

<pre>
$ ln -s tmbt_1o_DS2_KSS tmbt_3i_KSS 
</pre>

The input file is ~abinit/tests/tutoparal/Input/tmbt_3.in.

Open it so that we can have a look at its structure.
<p>
A snapshot of the most important parameters governing the algorithm is reported below.

<pre>
gwcalctyp   2    # Contour-deformation technique.
spmeth      1    # Enable the spectral method.
nomegasf  100    # Number of points for the spectral function. 
gwpara      2    # Parallelization over bands
awtr        1    # Take advantage of time-reversal. Mandatory when gwpara=2 is used.
freqremax  40 eV # Frequency mesh for the polarizability
nfreqre    20
nfreqim     5
</pre>
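<p>
Note that with this setup the inverse dielectric matrix is computed for 
<b>nfreqre</b>+<b>nfreqim</b> = 25 frequencies (20 points on the real axis up to <b>freqremax</b>=40 eV plus
5 points along the imaginary axis), so the memory estimate (<b>npweps</b><sup>2</sup> x <b>nomega</b> x 16 bytes)
given in the previous section now involves <b>nomega</b>=25, instead of the two frequencies needed by a
plasmon-pole run.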

<!--
The main difference is in the value used for
<a href="../input_variables/vargw.html#gwcalctyp" target="kwimg">gwcalctyp</a>, value that instructs the 
code to calculate the inverse dielectric matric for many frequencies both along the real and the imaginary axis.
The real frequency mesh is defined by
<a href="../input_variables/vargw.html#nfreqre" target="kwimg">nfreqre</a>, and
<a href="../input_variables/vargw.html#freqremax" target="kwimg">freqremax</a>,
while the sampling along the imaginary axis is specified by
<a href="../input_variables/vargw.html#nfreqim" target="kwimg">nfreqim</a>
(a more detailed discussion can be found in the <a href="lesson_gw2.html">second lesson</a> of the GW tutorial).
-->
<p>
The input file is similar to the one we used for the Adler-Wiser calculation.
The input variable <a href="../input_variables/vargw.html#spmeth" target="kwimg">spmeth</a> enables the spectral method, while
<a href="../input_variables/vargw.html#nomegasf" target="kwimg">nomegasf</a> 
defines the number of &omega;&prime; points in the linear mesh used to sample the spectral function, i.e. the number 
of &omega;&prime; values appearing in the 
<a href="theory_mbt.html#hilbert_transform">equation</a> for the spectral function.
<!--
The mesh is automatically calculated inside the code so that the entire set of resonant transitions is covered.

Note that <a href="../input_variables/vargw.html#nomegasf" target="kwimg">nomegasf</a> 
should be subject to an accurate converge test since a dense sampling is required for an 
accurate representation of the delta function appearing in the expression for the spectral function.
-->
<p>
As discussed in the <a href="theory_mbt.html#hilbert_transform">GW notes</a>,
the Hilbert transform method is much more memory demanding than the Adler-Wiser approach,
mainly because of the large value of 
<a href="../input_variables/vargw.html#nomegasf" target="kwimg">nomegasf</a> that is usually needed to converge the results.
Fortunately, the particular distribution of the data employed in <a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2
turns out to be well suited for the calculation of the spectral function since each processor has to store and treat 
only a subset of the entire range of transition energies. 
The algorithm therefore presents good MPI-scalability since the number of &omega;&prime; frequencies that have to be stored 
and considered in the Hilbert transform decreases with the number of processors.
<!--
<p>
Choosing <a href="../input_variables/vargw.html#nomegasf" target="kwimg">nomegasf</a> so that 
it is a multiple of the number of processors will facilitate the optimization of the 
Hilbert transform.
-->
<!--
An optimal distribution of both computing load and memory is achieved if 
<a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2
is used since, by virtue of the particular distribution of the data, 
for example for calculating QP corrections 
with the countor deformation method or for obtaining the absorption spectrum withing the RPA.
-->
<p>
Now run ABINIT over nn CPU cores using

<pre>
(mpirun ...) abinit < tmbt_3.files >& tmbt_3.log &
</pre>

and test the scaling by varying the number of processors.

Keep in mind that, also in this case, the distribution of the computing work is well balanced when the number of CPUs divides 
the number of conduction states.

<p>
The memory needed to store the spectral function is reported in the log file:

<pre>
$ grep "sf_chi0q0" tmbt_3.log
 memory required by sf_chi0q0:           1.0036 [Gb]
</pre>

Note how the size of this array decreases when more processors are used.
<!--
Using spectral method for the imaginary part =  1
Using symmetries to sum only over the IBZ_q  =  1
Using faster algorithm based on time reversal symmetry.
Will sum 5616 (b,b',k,s) states in chi0q0.
Calculating Im chi0(q=(0,0,0),omega,G,G")
                                                            
Total number of transitions =      5616
min resonant     =    5.749 [eV]
Max resonant     =   39.969 [eV]
                                                            
=== Info on the real frequency mesh for spectral method ===
 maximum frequency =   39.969 [eV]
 nomegasf =   300
 domegasf =  0.11522 [eV]
Using linear mesh for Im chi0
my_wl and my_wr:     1   300

The first line gives the memory required to store a single &omega;&prime; point 
of the spectral function, the second line is the total memory allocated for
the spectral representation.
TODO: Modify the code, Add a few more lines discussing the log.

The second line gives the memory needed to store the spectral function on this particular node.
Note, indeed, that the present implementation distributes bands and not transitions thus
each node has to treat a different number of &omega;&prime; points during the computation of the
spectral function.
<p>
-->

<!--
<p>
<p align="center"><img src=./lesson_paral_mbt/comp-AW-spect-elements.png></p>
Left panel: Comparison between selected matrix elements of $\tee_{\GG_1\GG_2}^{-1}(\qq,\ww)$ 
of $\aa$-quartz SiO$_2$ calculated using the standard Adler-Wiser expression given in \myequation{eq:chi0_for_semicond},
and the method based on the Hilbert transform (\Eq{eq:im_part_chi0} and \Eq{eq:chi0_Hilbert_transform}).
The inset shows the imaginary part of an off-diagonal element of $\tee_{\GG_1\GG_2}^{-1}(\qq\rarr 0,\ww)$.
-->
<p>
The figure below shows the electron energy loss function (EELF) of SiO<sub>2</sub> calculated using the
Adler-Wiser and the Hilbert transform methods. You might try to reproduce these results
(the EELF is reported in the file tmbt_3o_EELF; a much denser k-sampling is
required to achieve convergence).

<p align="center"><img src=./lesson_paral_mbt/comp-AW-spect.png></p>
<!--
<p>
Now there is an important question we should address:
how do we converge the number of bands used for evaluating the spectral function? Is this value
equal to the one that produces converged results when the standard implementation based on the
Adler-Wiser formula is used?
Strictly specking the answer is NO! The spectral method requires more bands to converge than the Adler-Wiser method. 
The reason is that the polarizability is obtained in terms of the Hilbert 
transform of the spectral function where the frequency integration extends up to infinity.
Due to the truncation in the number of bands, the tail of the spectral function is never evaluated exactly 
and this is equivalent to introducing a frequency cutoff in the Hilbert transform.
<p>
To check whether this frequency cutoff introduces important errors in the Hilber transform, we
can compare the QP corrections obtained with the spectral method versus analogus results obtained with 
a well converged Adler-Wiser calculation done on a reasonable k-mesh.
A less stringent tests would be to compare the macroscopic dielectric function obtained with the two methods 
assuming that also the other matrix elements of the dielectric matrix have the same convergence
behavior as the head (the macroscopic dielectric function).
-->
<hr>
  
<p><a name="G0W0"></a><br>
<h3><b>4. Computing the one-shot GW corrections in parallel</b></h3>

<p>
In this last paragraph, we discuss how to calculate G<sub>0</sub>W<sub>0</sub> corrections in parallel with 
<a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2.
The basic equations used to compute the self-energy matrix elements are discussed in 
<a href="theory_mbt.html#evaluation_gw_sigma">this part</a> of the GW notes.
<p>
Before running the calculation, copy the files file tmbt_4.files into the working directory.
Then create two symbolic links for the SCR and the KSS file:

<pre>
ln -s tmbt_1o_DS2_KSS tmbt_4i_KSS 
ln -s tmbt_2o_SCR     tmbt_4i_SCR
</pre>

Now open the input file ~abinit/tests/tutoparal/Input/tmbt_4.in.
<p>
The most important parameters of the calculation are reported below:

<pre>
optdriver   4            # Sigma run.
irdkss      1  
irdscr      1
gwcalctyp   0 ppmodel 1  # G0W0 calculation with the plasmon-pole approximation.
#gwcalctyp  2            # Uncomment this line to use the contour-deformation technique but remember to change the SCR file!
gwpara      2            # Parallelization over bands.
symsigma    1            # To enable the symmetrization of the self-energy matrix elements.
ecutwfn    24            # Cutoff for the wavefunctions.
ecuteps     8            # Cutoff in the correlation part.
ecutsigx   20            # Cutoff in the exchange part.
nband       50           # Number of bands for the correlation part.
</pre>
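<p>
The states for which the QP corrections are computed are selected with 
<a href="../input_variables/vargw.html#kptgw" target="kwimg">kptgw</a> and 
<a href="../input_variables/vargw.html#bdgw" target="kwimg">bdgw</a>. 
A minimal setting to get the direct gap at &Gamma; could look like the sketch below (the band indices are
illustrative; check the values actually used in tmbt_4.in):

<pre>
nkptgw  1
kptgw   0.000  0.000  0.000   # compute Sigma at the Gamma point
bdgw    23  26                # a few bands around the gap (SiO2 has 24 occupied bands)
</pre>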

For our purposes, it suffices to say that this input file defines a standard one-shot calculation 
with the plasmon-pole model approximation.
We refer to the  documentation and to the <a href="lesson_gw1.html">first lesson</a>
of the GW tutorial for a more complete description of the meaning of these variables.
<!--
<p>
The only parameter that is not treated in the GW tutorial and deserves some clarification  is 
<a href="vargw.html#symsigma">symsigma</a>.
The option <a href="vargw.html#symsigma">symsigma</a>=1 is used to enable 
the symmetrization of the self-energy  matrix elements.
In this case, the BZ integration defining the self-energy matrix elements is 
reduced to an appropriate irreducible wedge defined by the point group of the wave-vector k
specified in the <a href="vargw.html#kptgw">kptgw</a> list.
Similarly to the screening part, also here the computational gain depends on the symmetries of the external k-point. 
-->
<!--
Note that </b>symsigma</b>=1 is not compatible with self-consistent GW calculations.
Moreover, as discussed in the documentaion of the variable, this option should be 
used with particular care in the presence of accidental degeneracies.
-->
<!--
<p>
Here we are mainly interested in understading how to take advantage of the MPI algorithm
for our calculations.
The symmetrized expression leads to a considerable speedup of the run but, unfortunately,
this option is not yet compatible with self-consistent GW calculations 
(see <a href="vargw.html#gwcalctyp">gwcalctyp</a>).
<p>
The algorithm implemented in <a href="vargw.html#symsigma">symsigma</a>=1
constructs a symmetric invariant for the diagonal matrix elements of the self-energy 
by simply averaging the GW results within the degenerate subspace.
Therefore particular care has to be taken in the presence of accidental degeneracies.
since GW calculations performed with <a href="vargw.html#symsigma">symsigma</a>=1 won't be able to remove
the initial accidental degeneracy.
-->
<p>
Also in this case, we use <a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2
to perform the calculation in parallel.
Note, however, that the distribution of the orbitals employed in the self-energy part 
significantly differs from the one used to compute the screening.
In what follows, we briefly describe the two-step procedure used to distribute the wavefunctions:
<p>

<ol>
<li>
Each node reads and stores in memory the states where the QP corrections are computed 
(the list of states specified by <a href="../input_variables/vargw.html#kptgw" target="kwimg">kptgw</a> and 
<a href="../input_variables/vargw.html#bdgw" target="kwimg">bdgw</a>). 
</li>
<br>
<li>
The <a href="../input_variables/varbas.html#nband" target="kwimg">nband</a> bands
are distributed using the following partition scheme:
<p align="center"><img src=./lesson_paral_mbt/band_distribution_sigma.png></p>

where we have assumed a calculation done with four nodes (the index in the box denotes the MPI rank).
</li>
<br>
</ol>

<p>
By virtue of the particular distribution adopted,
the computation of the correlation part is expected to scale well with the number of CPUs.
The maximum number of processors that can be used is limited by
<a href="../input_variables/varbas.html#nband" target="kwimg">nband</a>.

Note, however, that only a subset of processors will receive the occupied states when
the bands are distributed in step 2. 
As a consequence, the theoretical maximum speedup that can be obtained in the exchange part
is limited by the availability of the occupied states on the different MPI nodes involved in the run.
<p>
The best-case scenario is when the QP corrections are wanted for all the occupied states.
In this case, indeed, each node can compute part of the self-energy and almost linear scaling should be reached.
The worst-case scenario is when the quasiparticle corrections are wanted only for a few states (e.g.
band gap calculations) and N<sub>CPU</sub> &gt;&gt; N<sub>valence</sub>.
In this case, indeed, only N<sub>valence</sub> processors will participate in the calculation of the exchange part.
<!--
As a consequence 
the computation of the matrix elements of the exchange part 
won't scale anymore when the number of processors exceeds the number of occupied bands.
Fortunately this part represents a minor fraction of the overall CPU time.

this meand that the computation of the exchange part 
<p>
<p>
First the code calculates the matrix elements of the exchange part then it proceeds to calculated
the more expected correlation part.
Each node owns the states for which the QP corrections will be calculated, then the nband bands 
are distributed.
We use alternating planes of bands to facilitate the calculation of the exchange part,
It's worth stressing that <a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2
is also memory distributed since each node will keep in memory only the states 
needed to calculate the partial contribution to the final matrix element. 
-->
<p>
To summarize:
The MPI computation of the correlation part is efficient when the number of processors divides <b>nband</b>.
Optimal scaling in the exchange part is obtained only when each node possesses
the full set of occupied states.
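<p>
For the input above (<b>nband</b>=50 and 24 occupied states), this means that 2, 5, 10, 25 or 50 MPI processes
give a balanced correlation part, whereas, in the worst-case scenario described above, the exchange part stops
scaling once the number of processes exceeds the number of occupied states (24).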
<!--
stored when the list of states specified by 
<a href="../input_variables/vargw.html#kptgw" target="kwimg">kptgw</a> and
<a href="../input_variables/vargw.html#bdgw" target="kwimg">bdgw</a>
-->

<p>
The figure below shows the speedup of the sigma part as a function of the number of processors.
The self-energy is calculated for 5 quasiparticle states using nband=1024 (205 occupied states).
Note that this setup is close to the worst-case scenario.
The computation of the self-energy matrix elements (csigme) scales well up to 64 processors.
For a large number of CPUs, the scaling departs from the linear behavior
due to the unbalanced distribution of the occupied bands.
The non-scalable parts of the implementation (init1, rdkss) limit the total speedup, in accordance with Amdahl's law.
<!--
<p align="center"><img src=./lesson_paral_mbt/sigma_speedup.png></p>
<p align="center"><img src=./lesson_paral_mbt/sigma_time_relative.png></p>
-->

<p align="center"><img src=./lesson_paral_mbt/sigma_analysis.png></p>

<p>
<!--
The overall scaling obtained with 512 processors is not satisfactory.
Of course the situation can be improved by using more complicate parallel algorithms 
and partitioning of the orbitals, but you should not blame the developers if the code
does not scale linearly up to thousand processors.
Before using a complicated MPI code, you should learn about the philosophy followed 
to parallelize the computation as well as its limitation.
Now you know that the number of processors in your MPI run have to be chosen carefully
according to the dimension of the proble.
The speedup curve reported in the graphic below tells you also that 
using more processors does not necessarily means getting results in less time.
-->
<p>
The implementation presents good memory scalability since the largest arrays are distributed. 
Only the size of the screening does not scale with the number of nodes.
By default each CPU stores in memory the entire screening matrix for all the q-points and frequencies 
in order to optimize the computation.
In the case of large matrices, however, it is possible to opt for an out-of-core solution in which only a single
q-point is stored in memory and the data is read from the external SCR file (slower but less memory demanding). 
This option is controlled by the first digit of <a href="../input_variables/vargw.html#gwmem" target="kwimg">gwmem</a>.
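<p>
For instance (illustrative value; see the documentation of 
<a href="../input_variables/vargw.html#gwmem" target="kwimg">gwmem</a> for the complete list of options):

<pre>
gwmem   1   # i.e. "01": keep only one q-point slice of the screening in memory
            # and read the other q-points from the SCR file when needed (out-of-core)
</pre>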
<!--
Only the memory for the occupied states 
used for the exchange part won't scale with the number of processors.
This limitation is even more severe in the sigma part due to the additional factor <b>nqpt</b>
introduced by the BZ sampling.
-->
<p>
Now that we know how to distribute the load efficiently, we can finally run the calculation using

<pre>
(mpirun ...) abinit < tmbt_4.files >& tmbt_4.log &
</pre>

Keep track of the time for each processor number so that we can test the scalability of the self-energy part.
<p>
Please note that the results of these tests are not converged.
A well converged calculation would require a 6x6x6 k-mesh to sample the full BZ, and a cutoff energy of 10 Ha 
for the screening matrix.
The QP results converge extremely slowly with respect to the number of empty states.
To converge the QP gaps within 0.1 eV accuracy, we had to include
1200 bands in the screening  and 800 states in the calculation of the self-energy.
<p>

The comparison between the LDA band structure and the G<sub>0</sub>W<sub>0</sub> energy bands of 
&alpha;-quartz SiO<sub>2</sub> is reported in the figure below.
The direct gap at &Gamma; is opened up significantly from the LDA value of 6.4 eV to about 9.5 eV when 
the one-shot G<sub>0</sub>W<sub>0</sub> method is used.
You are invited to reproduce this result 
(take into account that this calculation has been performed at the theoretical LDA parameters,
while the experimental structure is used in all the input files of this tutorial).

<p align="center"><img src=./lesson_paral_mbt/SiO2_KSG0W0_PPM1.png></p>


<!--
<a href="../input_variables/vargw.html#ppmodel" target="kwimg">ppmodel</a>=1 is used.
You are invited to reproduce this result.
-->

<hr>

<a name="Rules_for_MPI"></a>
<h3><b>5. Basic rules for efficient parallel calculations:</b></h3>

<ol>
<li>
Remember that "Anything that can possibly go wrong, does" so, when writing your input file, try to "Keep It Short and Simple".
</li>
<br>
<li>
Do one thing and do it well:
<br>
Avoid using different values of <a href="../input_variables/vargs.html#optdriver" target="kwimg">optdriver</a>
in the same input file.
Each runlevel employs different approaches to distribute memory and CPU time, hence
it is almost impossible to find the number of processors that will produce a balanced run in each dataset.

</li>
<br>
<li>
Prime number theorem:
<br>
Convergence studies should be executed in parallel only when the parameters that are tested do not interfere with the MPI algorithm.
For example, the convergence study on the number of bands in the screening should be 
done in separate input files when  <a href="../input_variables/varpar.html#gwpara" target="kwimg">gwpara</a>=2 is used.
</li>
<br>
<li>
Less is more:
<br>
Split big calculations into smaller runs whenever possible. 
For example, screening calculations can be split over q-points.
The calculation of the self-energy can be easily split over
<a href="../input_variables/vargw.html#kptgw">kptgw</a>  and 
<a href="../input_variables/vargw.html#bdgw">bdgw</a>.
</li>
<br>
<li>
Look before you leap:
<br>
Use the convergence tests to estimate how the CPU time and the memory requirements depend on the
parameter that is tested. 
Having an estimate of the computing resources is very helpful when one has to launch the final
calculation with converged parameters.
</li>
</ol>

<script type="text/javascript" src="list_internal_links.js"> </script>

</body>
</html>