1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439
|
\chapter{The parallel version}
\label{parallel}
%--#[ Introduction :
\FORM\ has two versions that can make use of several processors
simultaneously. Which version can be used profitably depends very much on
the architecture of the computer one is using. Each version has its own
control commands which are ignored by the other version and the sequential
version of \FORM. The parallel versions are:
\begin{itemize}
\item \ParFORM\index{ParFORM}: This version runs on processors that have
their own memory and preferably their own disk. Each processor gets a copy
of the complete program and MPI\index{MPI} is used for the
communication\index{communication}. When the network connections are very
fast one can also use \ParFORM\ on computer clusters. \ParFORM\ was
developed at the university of Karlsruhe\index{Karlsruhe}.
\item \TFORM\index{TFORM}: This version uses POSIX threads and runs on computers
which have several processors with a shared memory. Data is kept as common
data as much as possible and only when a worker thread gets a task a
minimal amount of data is copied to its private buffers. Currently it seems
to perform best on computers with two or four processors.
\end{itemize}
Both \ParFORM\ and \TFORM\ suffer from the same bottlenecks\index{bottleneck}.
At the beginning of a module there is a single expression, managed by a
master process which then has to distribute the terms over the workers. At
the end of the module the sorted results of the workers have to be gathered
in by the master\index{master} and merged into a single expression again.
Efficiency depends critically on how fast the terms can be given to the
workers\index{workers}, how well the load for the workers is balanced and
how much time the master has to spend in the final stages of the sorting.
Another factor is the complexity of the operations inside the module. If
the module has very few and simple statements, the gain in performance will
be much less than when the module has much work to do for each term.
The \ParFORM\ and \TFORM\ specific code is internally completely separated.
This offers the possibility that sooner or later the two can be combined to
allow efficient running on clusters of dual or quad processor machines.
Whether this would give significant extra benefits needs to be
investigated. When this project will be undertaken depends very much on the
availability of such computers.
Because \ParFORM{} uses MPI\index{MPI} and because different MPI
environments are normally not binary compatible, the port to a new machine
requires a recompilation of the source code and a relinking to the MPI
library. Hence we do not have executables in the distribution site.
One needs to build \ParFORM{} on one's computer.
For \TFORM\ the situation is much more favorable. Its treatment of the
parallelization follows the standard for POSIX\index{POSIX} threads (or
PThreads) for which the libraries are implemented on almost any
UNIX\index{UNIX} system and many other systems.
The ideal of a parallel version of \FORM\ is that it should execute nearly
any regular \FORM\ program, whether it was written for parallelization or
not. And it should execute much faster on several processors than the
sequential version on a single processor. The performance is given by the
improvement factor which is the execution time of the sequential version
divided by the execution time of the parallel version as measured in real
time (not CPU time) on a computer that has no other major tasks. The ideal
would of course be that a computer with N processors would give an
improvement factor of N. It should be easy to see that this ideal cannot be
reached, due to the bottlenecks described above. Also the compilation takes
place on a single processor and the instructions of the preprocessor are
typically also tasks for a single thread/processor. Yet for small numbers
of processors one can do rather well. Many old calculations, when repeated
with \TFORM\ would give improvement\index{improvement factor} factors above
1.7 on a dual pentium\index{pentium} machine and around 3 or a bit higher
on a quad opteron\index{opteron} machine. This was without modifying even a
single statement in the programs. Of course these numbers depend very much
on the type of the problem and the programming style used. As of yet there
is very little experience with parallel versions of \FORM. Hence people will
have to discover what are good ways of getting the most out of their
computer. It is expected that there will be much progress in the coming
years.
First we will now discuss the running of the two versions. After that we
will describe some common syntactic problems.
%--#] Introduction :
%--#[ TFORM :
\section{TFORM}
\label{tform}
Let us assume that the executable of \TFORM\index{TFORM} is called tform. It
is used exactly the same way as the sequential version of \FORM\ (named form)
is used with the exception of the possibility to specify the number of
worker\index{worker} threads with the -w option. The command
\begin{verbatim}
tform -w4 calcdia
\end{verbatim}
would execute the program in the file calcdia.frm, using 4 worker threads,
in addition to the one master thread. When the -w option is not given or
when only one worker thread is asked for, tform will run the whole program
inside the master\index{master} thread. Because tform always has some
overhead this is usually a little bit slower than using form. Strange
enough there are exceptions although this may have to do with the fact that
measuring the time of a program doesn't always give the same numbers.
It is also possible to specify the number of worker threads in the setup
file, using the line
\begin{verbatim}
Threads 4
\end{verbatim}
for 4 threads. And as with all setup parameters one can pass this
information also via the environment variable FORM\_threads or with the line
\begin{verbatim}
#: Threads 4
\end{verbatim}
at the beginning of the program file.
When the master passes terms to the workers, it has to signal\index{signal}
the workers that there is some data. In their turn, each worker has to send
the master a signal when it has completed its task and it is ready for
more. Such signals cost time. Hence it is usually best to send terms in
groups, called buckets\index{bucket}. The optimal number of terms in a
bucket depends very much on the problem and the size of the expression.
Bigger buckets mean less overhead in signals. If the buckets are too big
the workers may have to wait too much. Values between 100 and 1000 are
usually rather good. There is a default bucket size which is typically
around 500. The user can change this value in two ways: The first is with
the ThreadBucketSize\index{threadbucketsize} setup parameter in the
form.set file (or at the startup of the program file, or with the
FORM\_threadbucketsize environment variable) and the second is with the
ThreadBucketSize statement (see \ref{substathreadbucketsize}) which is a
declaration like Symbol or Dimension. The first terms in an expression will
be sent in smaller buckets to get the workers something to do as soon as
possible.
Usually the bigger buckets give a better performance, but they suffer from
a nasty side-effect. Complicated terms that need much execution time have a
tendency to stick together. Hence there can be one bucket with most of the
difficult terms and at the end of the module all workers and the master
have to wait for one worker to finish. This can be improved with a
load\index{load balancing} balancing mechanism. The current version will
take terms from the buckets of workers that take more time than the others.
By default this mechanism is on, but it can be switched on or off with the
`on ThreadLoadBalancing\index{threadloadbalancing};' and `off
ThreadLoadBalancing;' statements. It can also be set as one of the setup
parameters in the form.set file with
\begin{verbatim}
ThreadLoadBalancing OFF
\end{verbatim}
or
\begin{verbatim}
ThreadLoadBalancing ON
\end{verbatim}
or at the start of the program or in the environment.
The LINUX\index{LINUX} operating system tries to cache\index{cache} files
that are to be written to disk. Somehow, when several big files have to be
written it gets all confused (it is not known in what way). This means that
if tform produces 4 large sort files\index{file!sort} eventually the system
becomes intolerably slow. At one time a test program was 4.5 times slower
with 4 worker processors than with just the master running, even though the
master had a single even bigger sort file. This has been improved by having
the file-to-file sort of the threads changed into a
file-to-masterbuffers-to-combined-output. Yet the writing and subsequent
merging of the 4 files at the same time can be disastrous. Work is done to
improve this, but it may not be easy to circumvent facilities of the
operating system. Apparently the quality of the drivers is crucial here.
One can switch the parallel processing on or off (for the complete module)
at any moment in the program with the
statements\index{on!threads}\index{off!threads}
\begin{verbatim}
On Threads;
Off Threads;
\end{verbatim}
or using the moduleoption statement (\ref{substamoduleoption}) that
affects \TFORM{}'s behaviour for just the current module:
\begin{verbatim}
ModuleOption Parallel;
ModuleOption NoParallel;
\end{verbatim}
Additionally one can switch the statistics per thread on or off with
\begin{verbatim}
On ThreadStats;
Off ThreadStats;
\end{verbatim}
When the thread\index{on!threadstats}\index{on!threadstats} statistics are
switched off only the statistics of the master thread are printed which is
usually only the final statistics for each of the expressions.
The timing information in the statistics is the CPU\index{CPU time} time
spent by the thread that prints the statistics. Hence the total CPU time
spent is the sum of the time of all workers and the time of the master. In
good running the time of the master should be the smallest number. When the
statistics per thread are switched off, only the statistics of the master
process will be printed with this `small' number. Hence it may look like
the program isn't progressing very much.
For debugging purposes the term by term print\index{print} statement (see
\ref{substaprint}) is equipped with the \verb:%W: and \verb:%w: format
strings. The first will cause the printing of the number of the current
thread and the CPU-time used thus far in that thread. The second will only
print the number of the current thread. The thread with the number zero is
the master thread. Putting a statement like
\begin{verbatim}
Print +f "<%W> %t";
\end{verbatim}
would show which thread is processing which term and when.
These are all the commands that specifically concern \TFORM. When more
experience is gained using \TFORM, more parameters and commands may become
available.
The fact that the threads need private\index{private} data makes that \TFORM\
will use more memory than \FORM. Most of the buffers are not very large, but
of course there are some buffers which need to be large, like the sort
buffers and the scratch input\index{input}/hide\index{hide} buffers. The
sizes that the user specifies for these buffers are for the corresponding
buffers of the master. The workers get each 1/N times the size for these
buffers, when there are N workers. In the case that makes these buffers too
small because of for instance MaxTermSize, the buffers may become larger.
%--#] TFORM :
%--#[ ParFORM :
\section{ParFORM}
\label{parform}
Let us call the executable of \ParFORM\index{ParFORM} parform.
The user must execute parform as an MPI\index{MPI} application.
In many MPI implementations, this is done by using the mpirun\index{mpirun}
command:
\begin{verbatim}
mpirun -np 4 parform calcdia
\end{verbatim}
This example executes the program in the file calcdia.frm, using 4
processes,in which one process is the master process and the other 3
processes are the worker processes.
One has to keep in mind that in some MPI implementations environment
variables will not be passed to an MPI application. Alternatively extra
options are needed for passing them.
If one wants to run \ParFORM{} under a job scheduler on a computer cluster
environment, one may need to write a job script, which depends to a great
extent on the environment.
\ParFORM{} uses MPI for communications between the master and workers.
Actually terms are distributed by using point-to-point send/receive
operations of MPI. Since there is some latency for establishing a
connection between processes, especially between those running on different
computers, it is best to send terms in groups, like buckets in \TFORM{}.
The default number of terms in a bucket is currently 1000 in \ParFORM{}. It
can be changed with the ProcessBucketSize statement
(\ref{substaprocessbucketsize}\index{processbucketsize}) if this is deemed
necessary. It can also be changed for the current module with the statement
(\ref{substamoduleoption}\index{moduleoption!processbucketsize}).
\begin{verbatim}
ModuleOption ProcessBucketSize number;
\end{verbatim}
And finally it can also be changed in the setup, using the
ProcessBucketSize (\ref{setupprocessbucketsize}) setup parameter.
The first terms in an expression will be sent in smaller buckets to get the
workers something to do as soon as possible.
One can switch the parallel processing on or off (for the complete module)
at any moment in the program with the statements\index{on!parallel}%
\index{off!parallel}
\begin{verbatim}
On Parallel;
Off Parallel;
\end{verbatim}
or using the moduleoption statement (\ref{substamoduleoption}) that
affects \ParFORM{}'s behaviour for just the current module:
\begin{verbatim}
ModuleOption Parallel;
ModuleOption NoParallel;
\end{verbatim}
Additionally one can switch the statistics per process on or off with
\begin{verbatim}
On ProcessStats;
Off ProcessStats;
\end{verbatim}
When the process\index{on!processstats}\index{on!processstats} statistics
are switched off only the statistics of the master process are printed
which are usually only the final statistics for each of the expressions.
As in \TFORM{}, \verb:%W: and \verb:%w: in the term by term
print\index{print} statement (see \ref{substaprint}) are available in
\ParFORM{}. They print the number of the current process and the
CPU-time used thus far in that process.
In principle one can run all \FORM{} or \TFORM{} programs with \ParFORM{}.
In practice \ParFORM{} is not so efficient for some problems, in which
more data have to be synchronized between the master and the workers.
The cases for which \ParFORM{} needs to send data via MPI include:
\begin{itemize}
\item The redefine statements, which modify preprocessor variables
on the workers.
\item Modifying \$-variables in regular statements with a moduleoption
statement (see \ref{pardollars}, \ref{substamoduleoption}
and~\ref{dollars-in-parallel}).
\item Expression names appearing in right hand sides of definition or
substitution statements.
\end{itemize}
The last case may need more explanation.
Consider the following code:
\begin{verbatim}
Local G = F;
id a = F;
\end{verbatim}
where the expression F is supposed to be already defined. The point is that
these substitutions of the expression F are performed on the workers. The
workers, however, do not know the contents of the expression F because it
is stored on the master. Therefore, before executing this module \ParFORM{}
needs to make the master broadcast the expression F to the workers. This
may be quite time-consuming because the expression could be very large.
%--#] ParFORM :
%--#[ Some problems :
\section{Some problems}
\label{dollars-in-parallel}
Both parallel versions share a number of problems which are inherent to
running in an environment in which the order\index{order of terms} in which
terms are processed isn't deterministic\index{deterministic}. Most of these
problems concern \verb:$:-variables. They present a mix between private and
common information. Consider the code
\begin{verbatim}
id f(x?$xvar) = g(x);
id ......
id a^n? = b^n*h($var);
\end{verbatim}
Of course one could do this simple example differently, but we are
discussing the principle. What we have here is that each term that passes
the first statement will acquire its own value of \verb:$var:, to be used a
bit later. It is clear that if we have a common administration of
\verb:$:-variables we would have to `lock'\index{lock} the value for a
considerable amount of time, thereby spoiling much of the gains of parallel
processing. Hence in this case it would be best that each worker maintains
its own local value of \verb:$var:. But in the following example we have
the opposite:
\begin{verbatim}
#$xmax = -1;
if ( count(x,1) > $xmax ) $xmax = count_(x,1);
\end{verbatim}
Here we collect a maximum power in the variable \verb:$xmax:. If each
worker would have a local value of \verb:$xmax:, the question is what to do
with all these local values at the end of the module. A human will see that
here we are collecting a maximum, but the computer cannot and should not
see this. Hence the general rule in parallel processing is that when there
are \verb:$:-variables\index{\$-variable} obtaining a value during the
algebraic phase of a module the entire module is run sequentially, unless
\FORM\ has been helped with a moduleoption statement for each of the
variables involved. Hence in the last example
\begin{verbatim}
ModuleOption Maximum $xmax;
\end{verbatim}
would tell \FORM\ how to combine the local values in \ParFORM\ (\ParFORM\
maintains local values of all \verb:$:-variables). In \TFORM\ it
would put the value directly into the central administration, provided it
is bigger than the previous value. Only during the update the variable
would have to be locked.
There are several options in the moduleoption statement:
\begin{itemize}
\item Maximum\index{moduleoption!maximum}: The variable must have a
numerical value and the maximum is collected.
\item Minimum\index{moduleoption!minimum}: The variable must have a
numerical value and the minimum is collected.
\item Sum\index{moduleoption!sum}: The variable must have a numerical value
and the sum is collected.
\item Local\index{moduleoption!local}: The value will be kept privately and
no attempt is made to put it in the central administration, neither during
the execution of the module, nor at the end. If there was already a
variable by this name in the central administration it will keep the value
it had before the module started execution. At the end of the module, all
private values will be forgotten.
\end{itemize}
The redefine statement is a major inefficiency in a parallel environment.
It redefines a preprocessor variable and there is only a single bookkeeping
for such variables. This means that the variable has to be sent to the
master process (\ParFORM) or that a lock has to be placed to prevent other
workers to write to the same storage simultaneously (\TFORM). In addition
the final value in the preprocessor variable will be determined by the last
term processed in any of the workers. This may not be the same term in
different runs. It is up to the user to write programs that still give
correct results under such conditions. The best way around the inefficiency
is using \verb:$:-variables and preprocessor instructions. We show this in
an example in which we construct the equivalent of a conditional repeat
that includes a .sort instruction.
\begin{verbatim}
#do i = 1,1
statements
if ( count(x,1) > 0 ) redefine i "0";
.sort
#enddo
\end{verbatim}
To run this in parallel, it is better to use the following code.
\begin{verbatim}
#do i = 1,1
#$i = 1;
statements
if ( count(x,1) > 0 ) $i = 0;
ModuleOption minimum $i;
.sort
#redefine i "`$i'"
#enddo
\end{verbatim}
In this program the centrally stored value of \verb:$i: is updated at most
once. Admitedly it isn't as simple as the redefine statement, but it
works in all versions of \FORM\ starting with version 3.0.
It should be noted that when a new expression is defined in its defining
module it starts out as a single term. Hence it cannot benefit from
parallelization in that module. Therefore the code
\begin{verbatim}
#define MAX "200"
Symbols x0,...,x10;
Local F = (x0+...+x`MAX')^3;
id x1 = -x2-...-x`MAX';
.end
\end{verbatim}
will execute inside a single worker while
\begin{verbatim}
#define MAX "200"
Symbols x0,...,x10;
Local F = (x0+...+x`MAX')^3;
.sort
id x1 = -x2-...-x`MAX';
.end
\end{verbatim}
will make the first expansion inside a single worker and the more costly
substitution can be made in parallel. A better load\index{load balancing}
balancing algorithm in which at any node in the expansion tree tasks can be
given to idle workers would solve this problem, but due to some
complications this has not yet been implemented. The structure of \FORM\ will
however allow such an implementation.
%\footnote{In the year 1991 version 1 of FORM was parallelized on a
%computer at FNAL along these lines. It was however rather primitive and
%lack of access to suitable computers stopped further development at that
%moment.}
%--#] Some problems :
|