1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158
|
\documentclass[12pt]{article}
\setlength{\textwidth}{6.5in}
\setlength{\textheight}{9in}
\setlength{\topmargin}{-.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
%\usepackage{graphics}
%\usepackage{amsfonts}
%\usepackage{amssymb}
%\usepackage{epsfig}
%\linespread{1.3}
\begin{document}
\title{SIAL Programmer Guide}
\author{
ACES III Documentation \\
\\
\\
{\bf Victor Lotrich} \\
{\bf Mark Ponton} \\
{\bf Norbert Flocke} \\
{\bf Ajith Perera} \\
{\bf Erik Deumens} \\
{\bf Rodney Bartlett} \\
{\em Quantum Theory Project} \\
{\em University of Florida} \\
{\em Gainesville, FL 32605} \\
\\
}
\maketitle
\newpage
\section{\bf Overview}
\noindent
This guide explains how to write programs for the Super Instruction Processor.
The full specification of the Super Instruction Assembly Language or SIAL, pronounced
"sail", is given in Sect. \ref{sect:sial}. Example programs are given
in Sect. \ref{sect:examples}.\\
\\
The language was created to easily write complex algorithms for the super instruction
processor or SIP. Rather than an interpreter that executes every command when
entered, this processor is more like a hardware processor, such as a modern micro chip.
The instructions are called super instructions because they operate not on individual
numbers, but on data blocks, typically 10,000 numbers.\\
\\
The language is more like an assembly language and is called the super instruction
assembly language (SIAL). The SIP calls the compiler to read the SIAL program in its
entirety and then the processor executes the binary code produced. A standalone compiler
is also available and it writes a .sio file that can be read and executed by the SIP.\\
\\
The super instruction architecture (SIA) provides an interface for execution optimization
of very specialized code that requires a lot of processing and a lot of data communication.
All operations are performed in relatively large blocks, compared to the basic unit of a
64 bit computer word, and the operations are performed asynchronously allowing for multiple
instructions being executed in parallel in each of the tasks in the SIP.\\
\\
The goal of the SIA is to allow efficient parallel processing where latency plays less
of a role and enough work is being given to all components so that there is sufficient
time to hide the latency of any operation. However, it is still necessary to ensure
that the bandwidth of all operations matches so that a steady state exists.
This is where some, automated, tuning of problem to hardware comes in. The problem will
be divided up into such pieces that the given hardware can sustain a steady state where all latency is hidden.\\
\section{\bf Programming guidelines}
Programs must be viewed as having only one global scope. There are procedures, but
these should be viewed as tools to organize code and logic, there are no local variables.
All variables must be declared at the beginning of the program, even index loops.
This is in agreement with the idea that every data element is a heavy object: It is a
block of floating point numbers of considerable size, and any operation on it is
expensive and must be considered with care. Therefore, creating temporary copies for
passing them as arguments is too expensive.\\
\\
Temporary variables, blocks, do exist and should be used as much as possible. They
are allocated when first needed and the SIP will regularly check whether they are
still needed and if not, mark the blocks as available for use.\\
\section{\bf Memory management}
SIAL is a language used to perform mathematical operations in parallel on arrays.
In order to accomplish this, each index of an array is subdivided into segments, which
in turn imposes a breakdown of the data into blocks. These blocks vary in size due to
the types of indices comprising the array and the different segment sizes of those indices.\\
\\
All memory used by the SIP is pre-allocated onto one of several different block stacks.
Each block stack contains blocks of the same size. When a SIAL program is configured
for execution, a pre-pass phase is executed, in which an estimate is made of the number
of blocks needed on each stack as any instruction in the program is executed. Any
remaining memory is allocated equally among the various stacks. As blocks are needed
by the SIP instructions, they are allocated from the stack whose block size best fits
the required block size. If all blocks on the desired stack are in use, the SIP tries
to allocate a block from the stack with the next largest block size, and so forth.\\
\\
Large data structures should be declared as DISTRIBUTED arrays, so that the blocks are
spread over the memory of all tasks. All required communication to write and read blocks
of distributed arrays is implicitly executed by the SIP. However, if a data structure
does not have to be known to all tasks, then it is better to store just a part of the
entire structure in an array declared as LOCAL. This way no communication is implied.\\
\\
The SIP does all input/output. For compatibility with the rest of ACES II, the JOBARC
files is read in the beginning and updated at the end of any SIAL execution; similarly
the LISTS file(s) are read and updated to allow serial modules to intermix with parallel
modules controlled by SIAL. In the SIAL language, there are no explicit read or write
operations defined. The SERVED arrays are to be used for very large arrays and they can
be assumed to be written in an efficient way in parallel by the SIP.\\
\section{Execution Management}
The PARDO construct should be used to execute a chunk of code on the local parts of
a distributed array. All indices to be looped over in this way must be listed on the
PARDO statement, because PARDO constructs cannot be nested.
The master assigns which ranges of the block indices will be executed by each worker.\\
%Each task knows which
%blocks, i.e. which block index values, are local and will only loop over those local
%values.\\
\\
The ALSO PARDO allows the programmer to construct multiple blocks of code that each
by themselves allow distributed processing and also are independent so that they code
blocks can be executed on different sets of processors. If the number of processors is
too small, the ALSO PARDO code blocks will be executed in serial, one after the
other. But the construct allows the SIP to schedule work more efficiently if resources
are available.\\
\\
The regular DO construct should be used to execute a chunk of code for the complete
range of the index specified in the DO statement. The DO construct can be arbitrarily
nested.\\
\\
A few special instructions have been defined. They are a simple mechanism in the SIAL
language to process a small number of types of operations for which it is not worthwhile
to develop a fully defined syntax.
\section{Super Instruction Assembly Language Definition}
\label{sect:sial}
\subsection{General requirements}
\begin{enumerate}
\item Input is free form. The lines must be less than 256 characters. There are no
continuation lines
\item Keywords and variable names are case insensitive.
\item All text after the pound sign (\#) is considered a comment and is ignored.
Blank lines and lines with only comments are ignored.
\item The language may in the future include data layout directives.
\item Every line is meaningful by itself.
\item Every name can be up to 128 characters long and consists of alphanumeric characters
and underscores, the first character must be alphabetic.
\item Reserved words cannot be used as names.
\end{enumerate}
\subsection{Predefined constants}
Some names are defined from other input sources, such as the JOBARC file. All these
constants count in segments, not in individual orbitals.
\subsubsection{Universal constants}
\begin{enumerate}
\item norb: total number of atomic orbital segments equal to the total number of
molecular orbital segments and therefore norb can also be used in declarations of
"no spin", "alpha", "beta" MO indices.
\item scfenerg: SCF energy read in from JOBARC.
\item totenerg: Total energy read in from JOBARC.
\item damp: value of DAMPSCF from ZMAT.
\item scf\_iter: value of SCF\_MAXCYC from ZMAT.
\item scf\_hist: value of SCF\_EXPORDE from ZMAT.
\item scf\_beg: value of SCF\_EXPSTAR from ZMAT.
\item scf\_conv: value of SCF\_CONV from ZMAT.
\item cc\_iter: value of CC\_MAXCYC from ZMAT.
\item cc\_hist: value of CC\_MAXORDER from ZMAT.
\item cc\_beg: Constant equal to 2, there is no ZMAT parameter corresponding to
this constant.
\item cc\_conv: value of CC\_CONV from ZMAT.
\item natoms: the number of atoms from ZMAT.
\item itrips: Starting occupied orbital to process in the triples codes.
\item itripe: Ending occupied orbital to process in the triples codes.
\item ihess1: Beginning Hessian component to process.
\item ihess2: Ending Hessian component to process.
\item jhess1: Beginning Hessian component to process.
\item jhess2: Ending Hessian component to process.
\item subb: Beginning sub\_index range used in triples codes.
\item sube: Ending sub\_index range used in triples codes.
\end{enumerate}
\subsubsection{MO constants:} The constants of type "no spin molecular orbital", "alpha
molecular orbital", "beta molecular orbital" cannot be compared, they are of
different types that correspond to the declarations (defined below) MOINDEX,
MOAINDEX, MOBINDEX, respectively.
\begin{enumerate}
\item nocc,naocc,nbocc: number of occupied molecular orbital
segments (no spin, alpha, beta)
\item nvirt,navirt,nbvirt: number of unoccupied or virtual orbital segments
(no spin, alpha, beta)
\item bocc,baocc,bbocc: begin of occupied orbital segment range (no spin, alpha, beta)
\item eocc,eaocc,ebocc: end occupied orbital segment range (no spin, alpha, beta)
\item bvirt,bavirt,bbvirt: begin of virtual orbital segment range (no spin, alpha, beta)
\item evirt,eavirt,ebvirt: end of virtual orbital segment range (no spin, alpha, beta)
\item static c(mu,p): Restricted spin orbital transformation matrix from the SCF, read in from JOBARC. p is moindex 1:norb and mu is aoindex 1:norb.
\item static ca(mu,pa): Alpha spin orbital transformation matrix from the SCF, read in
from JOBARC. pa is moaindex 1:norb and mu is aoindex 1:norb.
\item static cb(mu,pb): Restricted spin orbital transformation matrix from the SCF, read
in from JOBARC. pb is mobindex 1:norb and mu is aoindex 1:norb.
\item static e(p): Restricted spin orbital energies from the SCF, read in from JOBARC.
p is moindex 1:norb.
\item static ea(pa): Alpha spin orbital energies matrix from the SCF, read in from
JOBARC.
pa is moaindex 1:norb.
\item static eb(pb): Restricted spin orbital energies matrix from the SCF, read in
from JOBARC. pb is mobindex 1:norb.
\end{enumerate}
\subsubsection{Ordering table of predefined constants}
The following tests return true:
\begin{enumerate}
\item i $\le$ norb \\
where i is any predefined constant \\
\item i $\ge$ 1 \\
where i is bocc, baocc, bbocc, eocc, eaocc, ebocc, bvirt, bavirt, bbvirt, evirt,
eavirt, ebvirt \\
\item i $\ge$ 0 \\
where i is nocc, naocc, nbocc, nvirt, navirt, nbvirt
\item The 6=4x3/2 relations between the 4 "no spin MO" constants are \\
eocc $\ge$ bocc \\
bvirt $>$ eocc \\
bvirt $>$ bocc \\
evirt $\ge$ bvirt \\
evirt $>$ eocc \\
evirt $>$ bocc
\item The 6=4x3/2 relations between the 4 "alpha MO" constants are \\
eaocc $\ge$ baocc \\
bavirt $>$ eaocc \\
bavirt $>$ baocc \\
eavirt $\ge$ bavirt \\
eavirt $>$ eaocc \\
eavirt $>$ baocc \\
\item The 6 relations between the 4 "beta MO" constants are \\
ebocc $\ge$ bbocc \\
bbvirt $>$ ebocc \\
bbvirt $>$ bbocc \\
ebvirt $\ge$ bbvirt \\
ebvirt $>$ ebocc \\
ebvirt $>$ bbocc
All tests that cannot be obtained from these by the logical operation of reversing the
elements and the operator, are not defined.
\end{enumerate}
\subsection{Predefined special instructions}
Special instructions are defined in the SIP execution environment so that the language
does not have to change to add new instructions, which are calls to special subroutines.
Available instructions are listed in Appendix 2. Some are listed here with some example
use.
\begin{enumerate}
\item energy\_denominator v(a,i,b,j)\\
In coupled-cluster calculations it is necessary to divide\\
a quantity, say Told(a,i,b,j) by [e(i)+e(j)-e(a)-e(b)] where\\
e(k) are the orbital eigenvalues. The instruction\\
energy\_denominator array(a,i,b,j)\\
divides each element of array(a,i,b,j) within the block by [e(i)+e(j)-e(a)-e(b)] and
locally replaces that block array(a,i,b,j) by its 'scaled' value.\\
Example:\\
The distributed array Told(a,i,b,j) is divided by [e(i)+e(j)-e(a)-e(b)] and the result
put into the distributed array T2new(a,i,b,j).\\
\begin{verbatim}
PARDO a, b, i, j
GET Told(a,i,b,j)
execute energy_denominator Told(a,i,b,j)
PUT T2new(a,i,b,j) = Told(a,i,b,j)
ENDPARDO a, b, i, j
\end{verbatim}
\item blocks\_to\_list X\\
Write the blocks of array X to a list file. This makes the data available in
succeeding SIAL programs. Note: The current implementation of this does not write an
actual ACES II list file. It writes a simple FORTRAN sequential unformatted binary
file containing all data blocks, called BLOCKDATA, as well as an index file
called BLOCK\_INDEX, used to determine the data format for reading the data blocks
in the second SIAL program.There must be a sip\_barrier after each blocks\_to\_list
execution.
\item list\_to\_blocks X\\
Read the blocks of array X from a list file. The data must have been written by a
{\bf blocks\_to\_list} execution in a preceding SIAL program. Note: The current
implementation of this does not read an actual ACES II list file. It reads the BLOCKDATA
and BLOCK\_INDEX files described in the section on blocks\_to\_list. The order of
each list\_to\_blocks execution must be the same as that of the blocks\_to\_list
statements from the first job. Also, there should be a sip\_barrier executed after
each list\_to\_blocks.
\item fmult v(a,i,b,j)\\
The instruction fmult is very much like the instruction energy\_denominator in that
the elements within an array block are effectively scaled. In the case of fmult the
instruction\\
execute fmult v(a,i,b,j)\\
multiplies each element of the array block v(a,i,b,j) by the quantity
[e(i)+e(j)-e(a)-e(b)] where in coupled-cluster applications the quantities e(i) are
the orbital eigenvalues coming from a Hartree-Fock calculation.\\
Example:\\
In the following example the distributed array v1(a,i,b,j) is copied into the
temporary array temp(a,i,b,j). Each element of the array temp(a,i,b,j) is scaled
by [e(i)+e(j)-e(a)-e(b)] and the result put in the distributed array v2(a,i,b,j).
The array V1(a,i,b,j) has not been altered (even locally).\\
\begin{verbatim}
PARDO a, i, b, j
GET V1(a,i,b,j)
temp(a,i,b,j) = V1(a,i,b,j)
execute fmult temp(a,i,b,j)
PUT v2(a,i,b,j) = temp(a,i,b,j)
ENDPARDO a, i, b, j
\end{verbatim}
\end{enumerate}
\subsection{Declarations}
\begin{enumerate}
\item aoindex mu=1,norb\\
Define the AO block index mu with range 1 through norb. Note that these indices count
blocks, not individual orbitals. Ranges must be defined using predefined constants and
the number 1, all other values generate an assembly error.
\item moindex p=1,nocc\\
defines the MO block index p with range 1 through nocc. Note that these indices count
blocks, not individual orbitals. Ranges must be defined using predefined constants
and the number 1, all other values generate an assembly error.
\item moaindex pa=1,naocc\\
defines the MO alpha block index pa with range 1 through naocc. Note that these indices
count blocks, not individual orbitals. Ranges must be defined using predefined constants
and the number 1, all other values generate an assembly error.
\item mobindex pb=bavirt,ebvirt\\
defines the MO beta block index pb with range bbvirt through ebvirt. Note that these
indices count blocks, not individual orbitals. Ranges must be defined using predefined
constants and the number 1, all other values generate an assembly error.
\item index i=1,10\\
defines a simple index i with range 1 through 10 to be used in DO loops e.g. for an
iteration.
\item laindex l=1,23 defines an index l with range 1 through 23 that has no
association with atomic or molecular orbitals, but can be used to declare a dimension
of an array. With this type of index, arrays can be created that have a mixture of
dimension indices: some dimensions can be specified with the range of AOINDEX or
MOINDEX and other dimensions with LAINDEX. The convention is that the dimensions with
type LAINDEX must come after all dimensions with type AOINDEX or MOINDEX.
\item sub\_index sub = subb, sube\\
defines a subrange of data which ranges over the pre-defined constants subb to sube.
The values subb and sube are obtained via user input at run-time.
\item scalar fac\\
defines the scalar variable fac of type real (integers are treated as real). This
value is local to each task.
\item static c(mu,p)\\
defines an array stored locally in the task allocated with a separate malloc. All
predefined arrays are of this kind.
\item temp v1(p,mu,lambda,sigma)\\
defines an array block with one MO and three AO indices that only exists locally in
the form of a single block allocated on the block stack.
\item local c(mu,p)\\
defines an array stored locally in the task and allocated on the block stack.
\item distributed v4(p,q,r,s)\\
defines an array distributed over many tasks and allocated on the block stack.
The way it is distributed is determined outside the SIAL.
\item served v(mu,nu,lambda,sigma)\\
defines the array {\bf v} with four AO indices, which must have been defined before and
specifies that the array is distributed and values will be delivered by a server on
request. The task number of the server is not specified in the language, but by the
environment and is subject to optimization. The SI processor has the option, not
controlled by the SIAL program but by the parallel environment directives, to deliver
the integrals in one of the following ways:\\
\begin{itemize}
\item compute the block in the calling task,\\
\item request the block from an integral worker task that will then compute and deliver it,\\
\item request the block from an IO manager, that will deliver it from distributed RAM,\\
\item request the block from an IO manager, that will deliver it after reading distributed
files on local disks,\\
\item request the block from an IO manager, that will deliver it after reading a parallel
file on a global parallel file system\\
\item request the block from an IO manager, that will deliver after obtaining it with
GETREC, PUTREC from a serial ACES II style file stored on some local disk or some
global disk.\\
\end{itemize}
The declaration associates the stated names of the array and the indices with the
2-electron integral matrices that are currently supported. The SI processor decides from
the combination of type of indices supplied (aoindex or moindex) and their
ranges which set of 2-electron indices are meant: AO integrals or partially of completely
transformed MO integrals. Currently these are the only arrays that can be declared served.
1-electron integrals for hamlitonian and for properties may be added to this list at a
later time. Because served arrays can overflow to disk, they can be larger than
distributed arrays.
\item temp v2(p$<$q,mu,nu)\\
specifies that the indices p and q are symmetric and that packed storage is allowed; this
syntax is allowed on all declarations.
\end{enumerate}
\subsection{Control statements}
\begin{enumerate}
\item Program
\begin{itemize}
\item sial myprog\\
start of the SIAL program called myprog. With this control line the program can be embedded
in any file and all text preceeding this line is ignored by the assembler. The line must start
with white space or the reserved word sial.
\item endsial myprog\\
marks the end of a SIAL program. Everything in the file after that is ignored by the assembler.
\end{itemize}
\item Procedures
\begin{itemize}
\item proc mywork\\
start of a procedure called mywork. Procedures are only a tool to organize executable
code, they are not to be compared to functions in C or subroutines in Fortran, if anthing
they are like internal procedures in Fortran 90. They operate in the one global scope of
the SIAL program.
\begin{itemize}
\item The proc end proc code block is inside the body of the sial end sial program definition.
\item Declarations are not allowed inside the proc body. The procedure can use temp
arrays, but they must be declared in the scope of the main program.
\item All indices and arrays defined in the program are visible inside all procedures.
\item Procedures are like declarations and must be located after other declarations and
before any executable statements.
\end{itemize}
\item endproc mywork\\
end of the procedure called mywork. No other procedure can be defined inside it. All other
control structures must be closed before this line, except end sial and that one may not
be closed, i.e. the procedure must be inside the program definition.
\item return\\
exits the running procedure and returns execution to the statement after the call to the
procedure.
\item call mywork\\
calls the previously defined procedure mywork; at the end of the procedure or at execution
of a return statement control returns to the line after the call to the procedure.
\end{itemize}
\item Distribution\\
\begin{itemize}
\item pardo mu,nu,lambda,sigma\\
starts a distributed loop over the indices mu,nu,lambda,sigma. The work inside the loop is
performed by every task only for those values of the listed block indices that have been
assigned by the master to that task as controlled by the parallel environment directives.
\item endpardo mu,nu,lambda,sigma\\
ends the distributed loop with variables mu,nu,lambda,sigma; the variables must be specified.
pardo structures cannot be nested, improperly nested loops generate an assembly error.
Two consecutive distributed loops are allowed and can use the same index variables or
different variables without error. Use of different names will not change the ranges assigned
to each task.
\end{itemize}
\item Iteration\\
\begin{itemize}
\item do mu\\
starts a loop over the index mu. The do statement is operationally equivalent to incrementing
the loop index.
\item enddo mu\\
ends the loop with variable mu; the variable must be specified; improperly nested loops
generate an assembly error. Two consecutive loops can use the same index variable without error.
\item cycle mu\\
makes control in the loop jump to the next iteration of the loop on the variable named on
the cycle statement; a cycle statement without a variable name generates an assembly error.
\item exit mu\\
makes control in the loop jump to the statement after the end do with the matching variable
named on the exit statement; a exit statement without a variable name generates an assembly error.
\end{itemize}
\item Conditions
\begin{itemize}
\item if a$<$3\\
starts an if-block with test on the scalar or index variables or expression on scalar or
index variables. If the expression contains at least one scalar variable all computations
in the expression are evaluated as C double or Fortran double precision, if the expression
contains only index variables and integers, the expression is evaluated as C int or Fortran
integer. Constants such as 3 or 3. are treated as integers and floats, respectively. The
code inside the block is executed if the expression value is non-zero (true).
\item endif\\
ends the if-block; improperly nested if-blocks generate an assembly error.
\item else\\
starts the alternative code block in the if-block; improperly nested if-blocks generate an
assembly error.
\end{itemize}
\end{enumerate}
\subsection{Operation statements}
\begin{enumerate}
%VFL\item 1. operations: +, -, *, \^, \=\=, \<, \>, \<\=, \>\=, \&\& (and), || (or), \! (not)
\item operations: +, -, *, \^\, ==, $<$, $>$, $<$=, $>$=, \&\& (and),
\vline\hspace{1mm}\vline \hspace{1mm} (or), ! (not)
\item operation-assignments: +=, -=, *=
\item allocate v3(mu,*,lambda,*)\\
allocates all blocks for arrays declared as local; the blocks with the matching indices are
allocated and are then available for processing; the allocate statement allows the user to
specify partial allocation by listing the index explicitly, implying that only blocks
with the (segment) value of the index at the time the allocate statement is executed will
be allocated; specifying an index as the widlcard "*" allocates blocks for all values of the
matcing index as defined in the declaration of the local array; an allocate on any other
array is an error.
\item deallocate v3\\
deallocates all blocks for arrays declared as local; if no allocate has been executed for the
local array when deallocate is executed, the deallocate is an error.
\item create v3\\
allocates all blocks for arrays declared as distributed; in each task the blocks with the
correct indices, e.g. as assigned by the master task to each task, are allocated and are
then available for local processing; a create on any other array is an error.
\item delete v3\\
deallocates all blocks for arrays declared as distributed; if no create has been executed
for the distributed array when delete is executed, the delete is an error.
\item array reference indexing any operation statement can include one or more valid
array references, this means that
\begin{enumerate}
\item the array has been declared
\item each index has been declared
\item the type of each index used is the same as the type of the matching index in the
declaration of the array
\item the range of each index used is a subrange of the range of the matching index in
the declaration of the array
\end{enumerate}
Any array reference that violates these conditions generates an assembly error. The
assembler uses the predefined relationships between predefined constants to determine
whether the range of an index is a subrange of the range of another index.
\item v3(p,q,r,s) = v2(p,q,r,mu) * c(mu,s)\\
is an assignment and a contraction. If the shape of the arrays does not match the contraction
an assembly error will result.
\item v3(p,q,r,s) = x(p,q) \^ y(r,s)\\
is an assignment and a tensor product. If the shape of the arrays does not match the
contraction an assembly error will result.
\item v3(p,q,r,s) += a * v1(p,q,r,s)\\
multiply v1 by a scalar a and add the result to v3.
\item v3(p,q,r,s) = a * v1(p,q,r,s)\\
multiply v1 by a scalar a. v1 and v3 can be the same array.
\item v3(p,q,r,s) *= a\\
multiply v3 by a scalar a.
\item put v3(p,q,r,s) = v2(p,q,r,s)\\
sends the local block of v2 to the owner task of the indicated block of the distributed
array v3 to replace the existing block of v3. The shape and sizes of the blocks must match.
\item put v3(p,q,r,s) += v2(p,q,r,s)\\
sends the local block of v2 to the owner task of the indicated block of the distributed
array v3 to be accumulated there to the existing block of v3. The shape and sizes of the
blocks must match.
\item get v3(p,q,r,s)\\
gets the indicated block of a distributed array v3 from the owner task. The shape and sizes
of the blocks must match.
\item prepare v4(p,q,r,s) = v2(p,q,r,s)\\
deliver a block of v2 to the server to replace the block of the served array v4 for future
requests.
\item prepare v4(p,q,r,s) += v2(pqp,r,s)\\
deliver a block of v2 to the server to be added to the block of served array v4 for future
requests.
\item request v(mu,nu,lambda,sigma) sigma\\
for served array request the block with indicated indices and indicate that the next request
will be for the listed index incremented by one, i.e. sigma+1
\item request t(mu,nu,I,j) = v(mu,nu,a,b)\\
Partial request. The array v must have been previously prepared. Then the prequest instruction
will retrieve a partial block of data (mu,nu,i,j) from the full block of (mu,nu,a,b).
Indices i and j must be declared as "index". Indices a and b can be any index type.
Care must be taken to insure that i and j will take on values that are a sub range of
indices a and b.
\item collective a += b\\
collective operation to add the local variable b from every task into the local variable a.
\item execute specinstr arg1 arg2 arg3\\
executes the predefined special instruction specinstr with arguments arg1 arg2 arg3.
See Appendix 2 for a complete list.
\item 22. execute trace\_on|off\\
turn on or off tracing features listed by their keyword.
\end{enumerate}
\section{List of special instructions}
\label{sect:list}
Several special instructions have already been developed and tested. They can be used in any
SIAL program.
\begin{enumerate}
\item return\_h1\\
syntax: execute return\_h1 h1\\
function: Computes the one-electron integrals of type kinetic and nuclear attraction, sums them
and returns them as h1.\\
restrictions: h1 must be a two-dimensional array.
\item copy\_ff\\
syntax: execute copy\_ff array1 array2\\
function: Copies the array2 into the array1 without regard to index type\\
restrictions: array1 and array2 must be two-dimensional static arrays.
\item copy\_ab\\
syntax: execute copy\_ab array1 array2\\
function: The array1 is assumed to be of type alpha-alpha (VaD) or (SD). The elements of
the array1 are put into array2 with the FIRST index offset by the number of singly
occupied orbitals. Array2 is of type beta-beta (VbD)\\
restrictions: array1 and array2 must be two-dimensional static arrays. It is assumed
that n\_alpha > n\_beta as is the case in the ACES II program.
\item copy\_ba\\
syntax: execute copy\_ba array1 array2\\
function: The array1, whichmust be of type (beta,beta) spin, is copied into array2, which
must be of type alpha, with an offset of nalpha\_occupied - nbeta\_occupied\\
restrictions: array1 and array2 must be two-dimensional static arrays. It is assumed
that n\_alpha > n\_beta as is the case in the ACES II program.
\item fmult\\
syntax: execute fmult array1\\
function: Each element of the two-dimesional array1(i,j) is scaled by the fock matrix(i,i).
larray1(i,j) = array1(i,j)*fock(i,i).\\
restrictions: array1 must be a two-dimensional array.
\item set\_index\\
syntax: execute set\_index array1\\
function: Sets the indices of a 4-d array in common block values. These indices are stored
in the SINDEX common block.\\
restrictions: array1 must be a 4-dimension array with simple indeces.
\item read\_grad\\
syntax: execute read\_grad array1\\
function: The array1(i,j) is read in and summed into the gradient which is in a common block.\\
restrictions: array must be declared with simple indices. i and j should range from 1-natoms
and 1-3, but simple indices have segement sizes of 1.
\item energy\_denominator\\
syntax: execute energy\_denominator array1\\
function: divides each element of array1(a,i,b,j,...) by the denominator
fock(i,i)+fock(j,j)+..-fock(a,a)-fock(b,b)-....\\
restrictions: array1 must be two, four, or six dimensional.\\
The indeces of array1 should have the correct spin type. i.e. (a,i) $\rightarrow$ (alpha,alpha),
(b,j) $\rightarrow$ (beta,beta), etc.. Although the instruction would execute properly even if this
were not the case but care should be taken using this instruction in the manner.
\item energy\_adenominator\\
syntax: execute energy\_adenominator array1\\
function: divides each element of array1(a,i) by the denominator
fock\_alpha(i,i) -fock\_alpha(a,a).\\
restrictions: array1 must be two dimensional.
\item energy\_bdenominator\\
syntax: execute energy\_bdenominator array1\\
function: divides each element of array1(a,i) by the denominator
fock\_beta(i,i) - fock\_beta(a,a).\\
restrictions: array1 must be two dimensional.
\item energy\_abdenominator\\
syntax: execute energy\_abdenominator array1\\
function: divides each element of array1(a,i) by the denominator
fock\_alpha(i,i) + fock\_beta(i,i) - fock\_alpha(a,a) - fock\_beta(a,a).\\
restrictions: array1 must be two dimensional.
\item eigen\_nonsymm\_calc\\
syntax: execute eigen\_nonsymm\_calc array1 array2\\
function: Calculates the eigenvalues and eigenvectors of a 2-d square matrix. The matrix
does NOT have to be symmetric. The matrix is also diagonalized on output. Array1 is the
diagonalized matrix and array2 is the matrix whose columns are the eigenvectors of Array1.\\
restrictions: array1 and array2 must be two-dimensional static arrays.
\item check\_dconf\\
syntax: check\_dconf array1 scalar1\\
function: The largest(absolute value) element of array1 is found and output as scalar1\\
restrictions: array1 must be two-dimensional and scalar1 must be declared as a scalar in
the sial program.
\item return\_diagonal4\\
syntax: execute return\_diagonal4 array1 array2\\
function: The diagonal elements of the array array1 are removed and the resulting diagonal
array is output as array2. array1 is not modified by the instruction.\\
restrictions: Both array1 and array2 must be four-dimensional.
\item return\_diagonal\\
syntax: execute return\_diagonal array1 array2\\
function: The diagonal elements of the array array1 are removed and the resulting diagonal
array is output as array2. array1 is not modified by the instruction.\\
restrictions: Both array1 and array2 must be declared as static arrays in the sial program, and
be two-dimensional.
\item return\_sval\\
syntax: execute array1 scalar1\\
function: The scalar scalar1 is set equal to the value of the array array1. The overall
purpose is to pull out the (p,q) element of the array1 and set scalar1 equal to its value.\\
restrictions: array1 must be two dimensional and scalar1 must be declared scalar in the sial
program.
\item place\_sval\\
syntax: execute place\_sval array1 scalar1\\
function: The (p,q) element of array1 is set equal to scalar1.\\
restrictions: array1 must be two-dimensional. scalar1 must be defined as scalar in the
sial program.
\item square\_root\\
syntax: execute square\_root scalar1 scalar2\\
function: scalar1 is raised to the power scalar2. scalar1 = scalar1**scalar2\\
restrictions: scalar1 and scalar2 must be declared as scalars in the sial program.
\item apply\_den2\\
syntax: execute apply\_den2 source target\\
function: each element of the array source(p,q) is divided by the corresponding element
of the array target(i,j). The array source contains the output.\\
restrictions: the arrays source and target must be two dimensional arrays.
\item apply\_den4\\
syntax: execute apply\_den4 source target\\
function: each element of the array source(p,q,r,s) is divided by the corresponding
element of the array target(i,j,k,l). The array source contains the output.\\
restrictions: the arrays source and target must be four dimensional arrays.
\item read\_hess\\
syntax: execute read\_hess array1\\
function: The elements of the four dimensional array array1 are summed into the Hessian
which is in a common block. Note that the summation is only performed on processor 0.\\
restrictions: array1 must be a four dimensional array with simple index types. It must be
dimensioned as (natoms,3,natoms,3).
\item remove\_diagonal\\
syntax: execute remove\_diagonal array1 array2\\
function: The diagonal elements of the array1 are removed and the resulting array is array2.\\
restrictions: array1 and array2 must be two-dimensional static arrays.
\item fock\_denominator\\
syntax: execute fock\_denominator array1\\
function: The elements of the array1(a,i,b,j) are divided by
fock(i,i) + fock(j,j) - fock(a,a) - fock(b,b). Note that if the denominator is zero
that element of the array is set to zero.\\
restrictions: array1 must be 2 or 4 dimensional.
\item set\_flags\\
syntax: execute set\_flags array1\\
function: Sets the indices of a 3-d array in common block values.\\
Example:
The first index is assumed to be the atom, the second is the component
index (i. e. x,y, or z), and the 3rd is the center.\\
restrictions: array1 must be three-dimensional with simple indeces.
\item set\_flags2\\
syntax: execute set\_flags2 array1\\
function: Sets the indices of a 2-d array in common block values.\\
Example: The first index is assumed to be the atom, the second is the component
index (i. e. x,y, or z).\\
restrictions: array1 must be two-dimensional with simple indeces.\\
\item der2\_comp\\
syntax: execute der2\_comp array1(m,n,r,s)\\
function: The derivative integrals for the block (m,n,r,s) are computed and returned in arrays1.
Note that set\_flags2 must have been used to define which degree of freedom to take the
derivative with respect to. i.e. atom and component.\\
restrictions: array1 must be a 4-dimesional array with AO indices and the perturbation must
have been set, by set\_flags2 probably.\\
\item fock\_der\\
syntax: execute fock\_der array1(mu,nu)\\
function: Computes the derivative of the fock matrix from only one-particle
contributions T+NAI and returns it as array1. The degree of freedom to take the derivative
with respect to, i.e. atom and component, must have been previously set, probably by set\_flags2.\\
restrictions: Array1 must be two-dimensional array with AO indices. The perturbation must have
been set before fock\_der is called.
\item overlap\_der\\
syntax: execute fock\_der array1(mu,nu)\\
function: Computes the derivative of the overlap matrix. The degree of freedom to take the
derivative with respect to, i.e. atom and component, must have been previously set, probably
by set\_flags2.\\
restrictions: Array1 must betwo-dimensional array with AO indices. The perturbation must
have been set before fock\_der is called.
\item scontxy\\
syntax: execute scontxy array1\\
function: The second derivative 1-electron overlap integrals are computed and contracted
with the array1. Note that array1 is perturbation independent and that all perturbations
are considered inside the instruction. The hessian is updated internally as well.\\
restrictions: array1 must be a two-dimensional array with AO indices.
\item hcontxy\\
syntax: execute hcontxy array1\\
function: The second derivative 1-electron kinetic and nuclear attraction
integrals(i.e. fock matrix contributions) are computed and contracted with the array1.
Note that array1 is perturbation independent and that all perturbations are considered
inside the instruction. The hessian is updated internally as well.\\
restrictions: array1 must be a two-dimensional array with AO indices.
\item compute\_sderivative\_integrals\\
syntax: execute compute\_sderivative\_integrals array1(m,n,r,s)\\
function: The second derivative of the two-electron integrals is computed and contracted
with array1. The perturbations (atom,component,jatom,jcomponent) defining the derivative are
looped over internally and the hessian is updated internally.\\
restrictions: array1 must be a 4-dimensional array with AO indeces.
\item removevv\_dd\\
syntax: execute removevv\_dd array1 array2\\
function: removes all doubly occupied indeces from the array1 with array2 being the result
of the array with the all doubly occupied indeces removed. Applicable if
array1 = array1(b,b1), b = virtual beta index.\\
restrictions: array1 and array2 must be two-dimensional arrays and they must have
beta\_virtual indices. nalpha\_occ > nbeta\_occ is required. Only used for ROHF codes.
\item removeoo\_dd\\
syntax: execute removeoo\_dd array1 array2\\
function: removes all doubly occupied indeces from the array1 with array2 being the result
of the array with the all doubly occupied indeces removed. Applicable if
array1 = array1(i,i1), i = occupied alpha index.\\
restrictions: array1 and array2 must be two-dimensional arrays and they must have
alpha\_occupied indices. nalpha\_occ > nbeta\_occ is required. Only used for ROHF codes.
\item remove\_xs\\
syntax: execute remove\_xs array1 array2\\
function: Removes the singly occupied components of the array1 which must be of type
array1(a,i) which $\rightarrow$ array1(a,i\_nosingles)\\
restrictions: array1 and array2 must be two-dimensional arrays and they must have (a,i)
indeces. a/i $\rightarrow$ alpha\_virtual/alpha\_occupied. Only used for ROHF codes
\item remove\_xd\\
syntax: execute remove\_xd array1 array2\\
function: Removes the doubly occupied components of the array1 which must be of type
array1(a,i) which $\rightarrow$ array1(a,i\_nodoubles)\\
restrictions: array1 and array2 must be two-dimensional arrays and they must have (a,i)
indices. a/i $\rightarrow$ alpha\_virtual/alpha\_occupied. Only used for ROHF codes
\item remove\_ds\\
syntax: execute remove\_ds array1 array2\\
function: Truncates the array1(i,i) to array1(i\_nodoubles,i\_nosingles)\\
restrictions: array1 and array2 must be two-dimensional arrays and they must have (i,i)
indices. i $\rightarrow$ alpha\_occupied. Only used for ROHF codes
\item remove\_ss\\
syntax: execute remove\_ss array1 array2\\
function: Truncates the array1(b,b) to array1(b\_nosingles,i\_nosingles)\\
restrictions: array1 and array2 must be two-dimensional arrays and they must have (b,b)
indeces. b $\rightarrow$ beta\_virtual OR (i,i), i $\rightarrow$ alpha\_occupied.
Only used for ROHF codes
\item comp\_ovl3c\\
syntax: execute comp\_ovl3c array1\\
function: Computes the three center overlap integrals and returns them in array1.\\
restrictions: array must be a three-dimensional array with AO indices.
\item udenominator\\
syntax: execute udenominator array1\\
function: The array1 is divided by an energy denominator just as in energy\_denominator.
udenominator does not require that the denominator not go to zero as small elements or
zero denominators are eliminated.\\
restrictions: array1 can only be a 2 or a 4 dimensional array.
\item copy\_fock\\
syntax: execute copy\_fock array1 fock\\
function: Copies array1 into the fock array and copies the diagonal elements into the
corresponding eigenvalue array which is predetermined.\\
restrictions: The fock array is predetermined so the name must be correct, fock\_a or fock\$b.
array1 must be a 2-dimension array with the same indeces as the fock array.
\item sip\_barrier\\
syntax: execute sip\_barrier\\
function: causes the worker processors to synchronize. Must be used after distributed
arrays are create, before distributed arrays are deleted, and in general whenever distributed
arrays are used a barrier must be placed before the distributed array can be used.\\
restrictions: none
\item print\_scalar\\
syntax: execute print\_scalar scalar1\\
function: prints the value of the scalar1 to standard output.\\
restrictions: scalar1 must be declared as a sclara in the sial program.
\item dump\_block\\
syntax: execute dump\_block array1(p,q,r,s)\\
function: Writes out information about the block af array1(p,q,r,s).
The first,last,maximim,and minimum values of the block are written out and the sum of
squares of all elements in the block.\\
restrictions: array1 must be of dimension 6 or less.
\item array\_copy\\
syntax: execute array\_copy array1 array2\\
function: To copy the array1 into array2 COMPLETELY.\\
restrictions: array1 and array2 must have the same dimesionality and index types.
\item server\_barrier\\
syntax: execute server\_barrier\\
function: causes the server processors to synchronize. Used in a manner similar to the
way the sip\_barrier is used when using distributed arrays except is relevant when served
arrays are being used.\\
restrictions: none
\item blocks\_to\_list/write\_blocks\_to\_list\\
syntax: execute blocks\_to\_list array(p,q,r,s)\\
syntax: execute write\_blocks\_to\_list\\
function: To write all blocks in an array to a file. To use this instruction properly you
must do the following.
\begin{itemize}
\item execute sip/server\_barrier
\item execute blocks\_to\_list array\_k for all arrays being written out
\item execute write\_blocks\_to\_list
\item execute sip/server\_barrier
\end{itemize}
restrictions: none
\item list\_to\_blocks/read\_list\_to\_blocks\\
syntax: execute list\_to\_blocks array(p,q,r,s)\\
syntax: execute read\_list\_to\_blocks\\
function: To read all files(lists) and put them into arrays(blocked) . To use this
instruction properly you must do the following.
\begin{itemize}
\item execute sip/server\_barrier
\item execute list\_to\_blocks array\_k for all arrays being read in.
\item execute execute read\_list\_to\_blocks
\item execute sip/server\_barrier
\end{itemize}
restrictions: The data being read in must match up perfectly with the data in the
blocks\_to\_list/write\_blocks\_to\_list from the previous sial program.
\end{enumerate}
\section{Example Programs}
\label{sect:examples}
\subsection{Using a procedure, a served array and a
distributed array.}
\begin{verbatim}
SIAL example1
aoindex lambda=1,norb
aoindex sigma=1,norb
aoindex mu=1,norb
aoindex nu=1,norb
moindex p=bocc,eocc
moindex q=bocc,eocc
moindex r=bvirt,evirt
moindex s=bvirt,evirt
served v(mu,nu,lambda,sigma) # the SIP knows how to distribute
# the integral requests, it is not
# specified in the language since it
# can change every run
temp v1(p,nu,lambda,sigma)
temp v2(p,q,lambda,sigma)
temp v3(p,q,r,sigma)
distributed v4(p,q,r,s)
local c(mu,p)
PROC update
# start new accumulate and checks on all outstanding ones
# to make the SIP work efficiently several v4 = v3 * c must be allowed
# to start so that of all accumulates in progress at least one
# is ready everytime accumulate is executed by the SIP
put v4(p,q,r,s) += v4(p,q,r,s)
return
ENDPROC update
create v4
pardo mu, nu
do lambda
do sigma
request v(mu,nu,lambda,sigma) sigma
# ask for an integral block
# the first call initiates a request
# subsequent calls check that at
# least one of the outstanding
# requests completed and
# makes a new request
# because this fetch happens outside a 4-fold loop most likely
# one outstanding request is sufficient
do p
v1(p,nu,lambda,sigma) = v(mu,nu,lambda,sigma) * c(mu,p)
do q
v2(p,q,lambda,sigma) = v1(p,nu,lambda,sigma) * c(nu,q)
do r
v3(p,q,r,sigma) = v2(p,q,lambda,sigma) * c(lambda,r)
do s
v4(p,q,r,s) = v3(p,q,r,sigma) * c(sigma,s)
call update
enddo s
enddo r
enddo q
enddo p
enddo sigma
enddo lambda
endpardo mu, nu
delete v4
ENDSIAL example1
\end{verbatim}
\subsection{Preparing a served array.}
\begin{verbatim}
SIAL example2
aoindex lambda=1,norb
aoindex sigma=1,norb
aoindex mu=1,norb
aoindex nu=1,norb
moindex p=bocc,eocc
moindex q=bocc,eocc
moindex r=bvirt,evirt
moindex s=bvirt,evirt
served v(mu,nu,lambda,sigma)
temp v1(p,nu,lambda,sigma)
temp v2(p,q,lambda,sigma)
temp v3(p,q,r,sigma)
temp v4tmp(p,q,r,s)
served v4(p,q,r,s)
local c(mu,p)
pardo mu, nu
do lambda
do sigma
request v(mu,nu,lambda,sigma) sigma
do p
v1(p,nu,lambda,sigma) = v(mu,nu,lambda,sigma) * c(mu,p)
do q
v2(p,q,lambda,sigma) = v1(p,nu,lambda,sigma) * c(nu,q)
do r
v3(p,q,r,sigma) = v2(p,q,lambda,sigma) * c(lambda,r)
do s
v4tmp(p,q,r,s) = v3(p,q,r,sigma) * c(sigma,s)
prepare v4(p,q,r,s) += v4tmp(p,q,r,s)
enddo s
enddo r
enddo q
enddo p
enddo sigma
enddo lambda
endpardo mu, nu
# Now the program can use v4 with request v4.
ENDSIAL example2
\end{verbatim}
\subsection{Using served arrays efficiently.}
Consider a parallelization scheme for integral transformation that
Victor Lotrich has implemented in the UHF transformation code. This
scheme basically narrows the range of the PARDO while at the same time
contracting out an entire index on one processor, thereby making it
possible to replace prepare +='s with simple prepares.
Old style:
\begin{verbatim}
PARDO mu, nu, a, i
#
REQUEST Vxxai(mu,nu,a,i) i
#
DO a1
#
Txaai(mu,a1,a,i) = Vxxai(mu,nu,a,i)*ca(nu,a1)
PREPARE Vxaai(mu,a1,a,i) += Txaai(mu,a1,a,i)
#
ENDDO a1
#
ENDPARDO mu, nu, a, i
\end{verbatim}
This loop distributes the parallelization over mu,nu,a,i in an effort
to avoid re-reading the data in the REQUEST. However, this code is
forced to use PREPARE +=, which is deadly on performance.
New style:
\begin{verbatim}
PARDO mu, a, i
#
ALLOCATE Lxaai(mu,*,a,i)
DO nu
REQUEST Vxxai(mu,nu,a,i) i
#
DO a1
#
T1xaai(mu,a1,a,i) = Vxxai(mu,nu,a,i)*ca(nu,a1)
Lxaai(mu,a1,a,i) += T1xaai(mu,a1,a,i)
#
ENDDO a1
ENDDO nu
#
DO a1
PREPARE Vxaai(mu,a1,a,i) = Lxaai(mu,a1,a,i)
ENDDO a1
DEALLOCATE Lxaai(mu,*,a,i)
ENDPARDO mu, a, i
\end{verbatim}
This loop reduced the PARDO range to mu,a,i, but a complete
contraction of the nu index is performed for each (mu,a,i)
combination. Thus we can do a PREPARE instead of PREPARE +=. Note
that we still are reading the entire set of input only once. There is
some wait time associated with the DEALLOCATE instruction until the
PREPAREs are complete, but this is much smaller than going the PREPARE
+= route. There are 3 loops in this code that can be restructured with
this same approach.
\end{document}
|