<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>New Sources of Error in Parallel Numerical Computations</TITLE>
<META NAME="description" CONTENT="New Sources of Error in Parallel Numerical Computations">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
<A NAME="tex2html3890" HREF="node135.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3888" HREF="node132.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3882" HREF="node133.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3892" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3893" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3891" HREF="node135.html">How to Measure Errors</A>
<B>Up:</B> <A NAME="tex2html3889" HREF="node132.html">Accuracy and Stability</A>
<B> Previous:</B> <A NAME="tex2html3883" HREF="node133.html">Sources of Error in </A>
<BR> <P>
<H1><A NAME="SECTION04620000000000000000">New Sources of Error in Parallel Numerical Computations</A></H1>
<P>
<A NAME="sec_Hetero"> </A>
<P>
An important difference between ScaLAPACK and LAPACK is that
a parallel computing environment, possibly consisting of
a heterogeneous collection of processors,
introduces new sources of possible errors not found in the
serial environment in which LAPACK runs.
These errors could indeed afflict any parallel algorithm that uses
floating-point arithmetic.
For example, consider the following pseudocode, executed in parallel by
several processors:
<P>
<PRE><TT>
<I>s</I> = global_sum(<I>x</I>)    ... each processor receives the sum <I>s</I> of global array <I>x</I>
if <I>s</I> &lt; <I>thresh</I> then
    return my part of answer 1
else
    do more computations
    return my part of answer 2
end if
</TT></PRE>
<P>
It is possible for the value of <I>s</I> to differ from processor to processor;
we call this <EM>incoherence</EM>.<A NAME="4496"> </A>
This can happen if the floating-point arithmetic varies
from processor to processor (we call this <EM>heterogeneity</EM>),<A NAME="4498"> </A> since
processors may not even share the same set of floating-point numbers.
The value of <I>s</I> can also vary if global_sum accumulates the sum
in different orders on different processors,
since floating-point addition is not associative.
In either case, the
test <I>s</I> &lt; <I>thresh</I> may be true on one processor but not another, so
that the program may inconsistently return answer 1 on some processors
and answer 2 on others. If the ``more computations'' include
communication with synchronization, even deadlock could result.<A NAME="4499"> </A>
<A NAME="4500"> </A>
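<P>
The effect of summation order on the branch can be illustrated with a minimal sketch in ordinary IEEE double precision. The values in <I>x</I>, the threshold, and the names <TT>s_proc0</TT>, <TT>s_proc1</TT>, and <TT>branch</TT> are hypothetical and chosen only for illustration; no real communication takes place.

```python
# Two hypothetical processors add the same three values in different orders.
x = [1.0e16, -1.0e16, 1.0]
thresh = 0.5  # hypothetical threshold

# Processor 0 groups the sum as (x0 + x1) + x2.
s_proc0 = (x[0] + x[1]) + x[2]
# Processor 1 groups the sum as x0 + (x1 + x2); the 1.0 is lost to rounding
# because adjacent doubles near 1e16 are 2 apart.
s_proc1 = x[0] + (x[1] + x[2])


def branch(s, thresh):
    """Mimic the test in the pseudocode: which answer does this processor return?"""
    return "answer 1" if s < thresh else "answer 2"


print(s_proc0, branch(s_proc0, thresh))
print(s_proc1, branch(s_proc1, thresh))
```

In this sketch the two groupings give different sums, so the two simulated processors take different branches.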
<P>
Deadlock can also result if the floating-point numbers communicated from
one processor to another cause fatal floating-point errors on the receiving
processor. For example, if an IBM RS/6000, running in its default
mode, sends a message containing a denormalized number
[<A HREF="node189.html#ieee754">7</A>, <A HREF="node189.html#ieee854">8</A>] <A NAME="4502"> </A>
to a DEC Alpha running in its default mode,
then the DEC Alpha aborts [<A HREF="node189.html#lawn112">19</A>].<A NAME="tex2html1139" HREF="footnode.html#8008"><IMG ALIGN=BOTTOM ALT="gif" SRC="http://www.netlib.org/utk/icons/foot_motif.gif"></A>
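<P>
As a rough illustration of what a denormalized number is, the following sketch scans a buffer of IEEE doubles for values that are nonzero but smaller in magnitude than the smallest normalized double. The helper names <TT>is_denormal</TT> and <TT>find_denormals</TT> are hypothetical; such a check is not part of ScaLAPACK or the BLACS, and it is shown only as something a sender could do before transmitting data.

```python
import sys


def is_denormal(v):
    """True if v is nonzero and smaller in magnitude than the smallest normal double."""
    return 0.0 < abs(v) < sys.float_info.min


def find_denormals(buf):
    """Return the indices of denormalized entries in a list of floats."""
    return [i for i, v in enumerate(buf) if is_denormal(v)]


# A hypothetical message: a normal value, zero, the smallest normal double,
# half of that (denormalized), and the smallest positive denormalized double.
message = [1.0, 0.0, sys.float_info.min, sys.float_info.min / 2.0, 5e-324]
print(find_denormals(message))
```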
<P>
It is also possible for global_sum to compute the same <I>s</I> on all processors
but compute a different <I>s</I> from run to run of the program,
for example, if global_sum computes the sum in a nondeterministic order
on one processor and broadcasts the result to all processors.
We call this <EM>nonrepeatability</EM>.<A NAME="4506"> </A>
If this happens, debugging the overall code can be more difficult.
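<P>
The following sketch, under the assumption of IEEE double precision, shows why a nondeterministic summation order leads to nonrepeatability: it enumerates every order in which four hypothetical values could be added and collects the distinct results. Fixing the order (here, by sorting, which is only one possible choice) yields the same result on every run, although not necessarily the most accurate one.

```python
from itertools import permutations


def ordered_sum(values):
    """Add the values left to right, as a single processor would."""
    s = 0.0
    for v in values:
        s += v
    return s


x = [1.0e16, 1.0, -1.0e16, 1.0]  # hypothetical data; the exact sum is 2.0

# Every possible arrival order, and the set of sums those orders can produce.
distinct = {ordered_sum(p) for p in permutations(x)}
print(sorted(distinct))

# A fixed order gives a repeatable (though not necessarily exact) result.
fixed = ordered_sum(sorted(x))
print(fixed)
```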
<P>
Coherence<A NAME="4507"> </A><A NAME="4508"> </A> and
repeatability<A NAME="4509"> </A><A NAME="4510"> </A>
are independent properties of an algorithm.
It is possible in principle for an algorithm running on a particular
platform to be incoherent and repeatable, coherent and nonrepeatable,
or any other combination. On a different platform, the same algorithm
may have different properties.
<P>
Reference [<A HREF="node189.html#lawn112">19</A>]
contains a more extensive discussion of these possible errors.
<P>
A single run of a ScaLAPACK routine is designed to be as reliable<A NAME="4512"> </A> as its LAPACK counterpart,
so that errors due to incoherence cannot occur, provided ScaLAPACK is
executed on a <EM>homogeneous network</EM><A NAME="4514"> </A> of
processors.
A network is homogeneous if the following conditions hold:
<UL>
<LI> The processors are completely identical. This also means that
relevant flags, like those controlling the way overflow and underflow
are handled in IEEE floating-point arithmetic,
<A NAME="4516"> </A>
must be identical.
<LI> The communication library used by the BLACS may only
``copy bits'' and not modify any floating-point numbers (by translation
to a different internal floating-point format, as XDR [<A HREF="node189.html#SunSoft:XDR">111</A>]
may do).<A NAME="4518"> </A><A NAME="4519"> </A>
<LI> The identical ScaLAPACK object code must be executed by each processor.
</UL>
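<P>
The ``copy bits'' condition can be illustrated with a small sketch. Here <TT>copy_bits</TT> sends a double as its eight raw bytes, while <TT>translate_via_single</TT> passes it through a 4-byte IEEE single. Both function names are hypothetical, and the single-precision round trip is only a stand-in for any translation into a format that cannot represent every value exactly; it is not a model of what XDR does.

```python
import struct


def copy_bits(v):
    """Pack a double into its 8 raw bytes and rebuild it on the 'receiver'."""
    return struct.unpack("<d", struct.pack("<d", v))[0]


def translate_via_single(v):
    """Stand-in for a lossy format translation: round trip through an IEEE single."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


v = 0.1
print(copy_bits(v) == v)             # the bits arrive unchanged
print(translate_via_single(v) == v)  # the value was modified in transit
```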
<P>
The above conditions guarantee that a single ScaLAPACK call is as reliable<A NAME="4521"> </A>
as its LAPACK counterpart.
If, in addition, identical answers from one run to another
are desired (i.e., <EM>repeatability</EM>),<A NAME="4523"> </A>
they can be guaranteed at runtime by calling
BLACS_SET<A NAME="4524"> </A> to enforce repeatability<A NAME="4525"> </A> of the BLACS, and hence of the ScaLAPACK
routines that use them, through the use of an appropriate topology
(see the BLACS users guide [<A HREF="node189.html#lawn94">54</A>] for details).
<P>
Maintaining coherence<A NAME="4527"> </A> on a heterogeneous network<A NAME="4528"> </A> is harder, and not always
possible. If floating-point formats differ
(say, between a Cray C90 and an IBM RS/6000, the latter of which uses IEEE arithmetic),
there is no cost-effective way to guarantee coherence.
If floating-point formats are the same, however, operations such as
global sums can accumulate the result on one processor and broadcast it
to guarantee coherence (except for the problem of DEC Alphas and denormalized
numbers mentioned above).
The BLACS do this, except when using the
``bidirectional exchange'' topology. One can avoid using
``bidirectional exchange'' and so guarantee coherence whenever possible,
by calling BLACS_SET<A NAME="4529"> </A> to enforce coherence<A NAME="4530"> </A>
(see the BLACS users guide [<A HREF="node189.html#lawn94">54</A>] for details).
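<P>
The difference between the two strategies can be sketched with simulated processes, assuming a common IEEE double format. In <TT>incoherent_global_sum</TT> each process combines the partial sums in its own order; in <TT>coherent_global_sum</TT> the sum is accumulated once and the result is handed to every process. Both functions and the data are hypothetical and do not reproduce the actual BLACS topologies.

```python
def ordered_sum(values):
    """Add the values left to right."""
    s = 0.0
    for v in values:
        s += v
    return s


def incoherent_global_sum(parts):
    """Each simulated process adds its own partial sum first, then the others by rank."""
    partials = [ordered_sum(p) for p in parts]
    results = []
    for rank in range(len(parts)):
        order = [rank] + [r for r in range(len(parts)) if r != rank]
        results.append(ordered_sum(partials[r] for r in order))
    return results


def coherent_global_sum(parts):
    """Accumulate once on a single process, then 'broadcast' the same value to all."""
    partials = [ordered_sum(p) for p in parts]
    s = ordered_sum(partials)
    return [s] * len(parts)


# Hypothetical local pieces of the global array, one per process.
parts = [[1.0e16], [1.0], [-1.0e16], [1.0]]
print(incoherent_global_sum(parts))
print(coherent_global_sum(parts))
```

In the first case the simulated processes disagree about the sum; in the second they all hold the same value, at the cost of an extra broadcast.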
<P>
Some ScaLAPACK routines (PxGESVD and PxSYEV) are guaranteed to work
only on homogeneous networks. These routines perform
large numbers of redundant calculations on all processors and depend on
the results of these calculations being identical on every processor. There are too many
of these calculations to compute them all on one processor
and broadcast the results cost-effectively.
<P>
The user may wonder why ScaLAPACK and the BLACS are not designed to
guarantee coherence and repeatability in the most general possible situations,
so that calling BLACS_SET would not be necessary.
The reason is that the possible bugs described above are quite rare,
and so ScaLAPACK and the BLACS were designed to maximize performance instead.
Provided the mere sending of floating-point numbers does not cause a
fatal error, these bugs cannot occur at all in most ScaLAPACK routines,
because branches depending on a supposedly identical floating-point value
like <I>s</I> do not occur.
For most other ScaLAPACK routines where such branches do occur,
we have not seen these bugs despite extensive testing, including attempts
to cause them to occur.
Complete understanding and cost-effective
elimination of such possible bugs are future work.
<P>
In the meantime, to get repeatability when running on a homogeneous network,
we recommend calling BLACS_SET<A NAME="4532"> </A> as described above when using the following
ScaLAPACK drivers: PxGESVX, PxPOSVX, PxSYEV, PxSYEVX, PxGESVD, and PxSYGVX.
<A NAME="4533"> </A><A NAME="4534"> </A><A NAME="4535"> </A><A NAME="4536"> </A>
<A NAME="4537"> </A><A NAME="4538"> </A><A NAME="4539"> </A><A NAME="4540"> </A>
<A NAME="4541"> </A><A NAME="4542"> </A>
<A NAME="4543"> </A><A NAME="4544"> </A><A NAME="4545"> </A><A NAME="4546"> </A>
<A NAME="4547"> </A><A NAME="4548"> </A><A NAME="4549"> </A><A NAME="4550"> </A>
<A NAME="4551"> </A><A NAME="4552"> </A>
<P>
<HR>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>