<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>Parallel Efficiency</TITLE>
<META NAME="description" CONTENT="Parallel Efficiency">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
<A NAME="tex2html3613" HREF="node113.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3611" HREF="node111.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3607" HREF="node111.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3615" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3616" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3614" HREF="node113.html">ScaLAPACK Performance</A>
<B>Up:</B> <A NAME="tex2html3612" HREF="node111.html">BLACS as an Efficient</A>
<B> Previous:</B> <A NAME="tex2html3608" HREF="node111.html">BLACS as an Efficient</A>
<BR> <P>
<H3><A NAME="SECTION04523100000000000000">Parallel Efficiency</A></H3>
<A NAME="subsecparalleleff"> </A>
<A NAME="3658"> </A>
<P>
An important performance metric is <EM>parallel efficiency</EM>. Parallel
efficiency, <I>E</I>(<I>N</I>, <I>P</I>),<A NAME="3660"> </A>
for a problem of size <I>N</I> on <I>P</I> nodes is
defined in the usual
way [<A HREF="node189.html#fox88ab">65</A>, <A HREF="node189.html#kumar94a">92</A>] by
<BR><IMG WIDTH=337 HEIGHT=41 ALIGN=BOTTOM ALT="displaymath16290" SRC="img372.gif"><BR>
where <I>T</I>(<I>N</I>,<I>P</I>)<A NAME="3665"> </A> is the runtime of the parallel algorithm,
and <IMG WIDTH=57 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16300" SRC="img373.gif"><A NAME="7983"> </A>
is the runtime of the best sequential algorithm. For dense matrix computations, an
implementation is said to be <EM>scalable</EM><A NAME="3669"> </A> if the parallel
efficiency is an increasing function of
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16304" SRC="img375.gif">,
the problem size per node. The algorithms implemented in the ScaLAPACK library are
scalable in this sense.
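<P>
As an illustration, the definition above can be evaluated directly from measured
runtimes. The sketch below simply applies the usual definition, E(N,P) = T_seq(N) / (P T(N,P));
the timing values in it are made up purely for illustration.
<PRE>
# Hedged sketch: parallel efficiency E(N, P) = T_seq(N) / (P * T(N, P)).
# The timing numbers below are invented for illustration only.

def parallel_efficiency(t_seq, t_par, p):
    """Return E(N, P) given the best sequential time t_seq = T_seq(N),
    the parallel runtime t_par = T(N, P), and the number of nodes p."""
    return t_seq / (p * t_par)

# Example: a problem that takes 100 s with the best sequential algorithm
# and 30 s on 4 nodes has efficiency 100 / (4 * 30) ~= 0.83.
print(parallel_efficiency(100.0, 30.0, 4))
</PRE>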
<P>
Figure <A HREF="node112.html#figiso_pgon">5.1</A> shows the scalability of the ScaLAPACK
implementation of the <I>LU</I> factorization on the Intel XP/S Paragon computer. The
nodes of this computer are general-purpose (GP) or multiprocessor (MP) nodes, based on
the Intel i860 XP RISC processor. Each Intel i860 processor is capable of a peak
performance of 50 Mflop/s. On such a processor, however, the vendor-supplied BLAS
matrix-matrix multiply routine DGEMM can achieve only approximately 45 Mflop/s. The
computer used to obtain the performance results presented in this chapter consisted of
MP nodes configured as follows: each MP node had three Intel i860 XP processors -- two
to execute application code and a third used exclusively as a message
coprocessor<A NAME="3671"> </A><A NAME="3672"> </A>. On such a node, the vendor-supplied
BLAS matrix-matrix multiply routine DGEMM can achieve approximately 90 Mflop/s.
<P><A NAME="3675"> </A><A NAME="figiso_pgon"> </A><IMG WIDTH=378 HEIGHT=280 ALIGN=BOTTOM ALT="figure3673" SRC="img376.gif"><BR>
<STRONG>Figure 5.1:</STRONG> LU Performance per Intel XP/S MP Paragon node<BR>
<P>
<P>
Figure <A HREF="node112.html#figiso_pgon">5.1</A> shows the speed in Mflop/s per node of
the ScaLAPACK <I>LU</I> factorization routine PDGETRF for different computer
configurations. This figure illustrates that when the number of nodes is scaled by a
constant factor, the same efficiency or speed per node is achieved for equidistant
problem sizes on a logarithmic scale. In other words, maintaining a constant memory use
per node allows efficiency to be maintained. (This scalability behavior is also referred
to as <EM>isoefficiency,</EM><A NAME="3680"> </A>
or <EM>isogranularity</EM><A NAME="3682"> </A>.)
In practice, however, a slight degradation is acceptable. The ScaLAPACK driver routines,
in general, feature the same scalability behavior up to a constant factor that depends
on the exact number of floating-point operations and the total volume of data exchanged
during the computation.
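<P>
The isogranularity argument can be made concrete with a short sketch: to keep the memory
use per node constant, the matrix order N must grow like the square root of the number of
nodes P, since the memory per node is proportional to N*N/P for dense matrices. The
starting values in the sketch below are arbitrary and purely illustrative.
<PRE>
# Hedged sketch: scale the problem so that N*N/P (memory per node) stays constant.
# The starting point (n0, p0) is arbitrary; only the square-root scaling rule
# follows from keeping N*N/P fixed.
import math

def isogranular_n(n0, p0, p):
    """Matrix order that keeps N*N/P equal to n0*n0/p0 when run on p nodes."""
    return int(round(n0 * math.sqrt(p / p0)))

for p in (4, 8, 16, 32, 64):
    print(p, isogranular_n(2000, 4, p))
# Doubling P multiplies N by sqrt(2), i.e., the problem sizes are
# equidistant on a logarithmic scale, as in Figure 5.1.
</PRE>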
<P>
In large dense linear algebra computations, the computation cost dominates the
communication cost. In the following, the time to execute one floating-point operation
by one node is denoted by <IMG WIDTH=13 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12202" SRC="img28.gif"><A NAME="3683"> </A>.
The time to communicate a message between two nodes is approximated by a linear function
of the number of items communicated. This function is the sum of the time to prepare the
message for transmission (<IMG WIDTH=18 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12208" SRC="img30.gif">)<A NAME="3684"> </A>
and the time taken by the message to traverse the network to its destination, that is,
the product of its length and the time to transfer one data item
(<IMG WIDTH=12 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12228" SRC="img35.gif">)<A NAME="3685"> </A>.
The quantity <IMG WIDTH=18 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12208" SRC="img30.gif">
is also called the <I>latency,</I><A NAME="3687"> </A><A NAME="3688"> </A> since it is the
time to communicate a message of zero length. On most modern interconnection networks,
the latency ranges from about a microsecond to a millisecond.
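<P>
The linear communication model described above can be written down directly. In the
sketch below, the latency and the time per data item are placeholder values chosen only
to illustrate the model; they are not measurements of any particular network.
<PRE>
# Hedged sketch of the linear point-to-point communication model:
#   time(n_items) = latency + n_items * time_per_item
# where time_per_item is the reciprocal of the bandwidth.
# The constants are illustrative assumptions, not measured values.

T_M = 50e-6   # assumed latency: 50 microseconds
T_V = 1e-7    # assumed time per 8-byte item: roughly 80 MB/s bandwidth

def message_time(n_items, t_m=T_M, t_v=T_V):
    """Approximate time to send a message of n_items data items."""
    return t_m + n_items * t_v

# A zero-length message costs exactly the latency; long messages are
# dominated by the bandwidth term n_items * t_v.
print(message_time(0), message_time(1_000_000))
</PRE>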
<P>
The bandwidth<A NAME="3689"> </A><A NAME="3690"> </A> of
the network is also referred to as
its <I>throughput.</I><A NAME="3692"> </A>
It is proportional to the reciprocal of <IMG WIDTH=12 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12228" SRC="img35.gif">. On modern
networks, the bandwidth is on the order of megabytes per second.
For a scalable algorithm with
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16304" SRC="img375.gif"> held constant, one expects the performance to
be proportional to <I>P</I>. The algorithms implemented in
ScaLAPACK are scalable in this sense. Table <A HREF="node112.html#tabvars">5.1</A>
summarizes the relevant constants used in our scalability
analysis.
<P><A NAME="3695"> </A><A NAME="tabvars"> </A><IMG WIDTH=706 HEIGHT=314 ALIGN=BOTTOM ALT="table3694" SRC="img377.gif"><BR>
<STRONG>Table 5.1:</STRONG> Variable definitions<BR>
<P>
<P>
Using the notation presented in Table <A HREF="node112.html#tabvars">5.1</A>,
the execution time of the ScaLAPACK drivers can be
approximated by
<BR><A NAME="eqntim"> </A><IMG WIDTH=647 HEIGHT=44 ALIGN=BOTTOM ALT="equation3728" SRC="img378.gif"><BR>
<P>
The corresponding parallel efficiency<A NAME="3738"> </A>
can then be approximated by
<BR><A NAME="eqneff"> </A><IMG WIDTH=564 HEIGHT=52 ALIGN=BOTTOM ALT="equation3739" SRC="img379.gif"><BR>
Equation <A HREF="node112.html#eqneff">5.2</A> illustrates, in particular, that
the communication versus computation performance ratio of a
distributed-memory computer significantly affects parallel efficiency. The
ratio of the latency to the time per flop <IMG WIDTH=53 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16374" SRC="img380.gif"> greatly
affects the parallel efficiency of small problems. The ratio
of the network throughput to the flop rate <IMG WIDTH=48 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16376" SRC="img381.gif">
significantly affects the parallel efficiency of medium-sized
problems. For large problems, the node flop rate <IMG WIDTH=43 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16378" SRC="img382.gif">
is the dominant factor contributing to the parallel
efficiency of the parallel algorithms implemented in ScaLAPACK.
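<P>
The qualitative conclusions drawn from equation 5.2 can be checked numerically with a
generic three-term model of the same shape as equation 5.1: a flop term, a bandwidth
term, and a latency term. The constants and machine parameters in the sketch below are
placeholders, not the ScaLAPACK values from Table 5.1; only the structure of the model
is taken from the text.
<PRE>
# Hedged sketch: a generic execution-time model of the shape of equation 5.1,
#   T(N, P) ~ C_f * N**3 * t_f / P + C_v * N**2 * t_v / sqrt(P) + C_m * N * t_m,
# and the resulting efficiency E(N, P) = C_f * N**3 * t_f / (P * T(N, P)).
# All constants are illustrative placeholders, not ScaLAPACK's actual values.
import math

T_F, T_V, T_M = 2e-8, 1e-7, 50e-6   # assumed time per flop, per item, latency
C_F, C_V, C_M = 1.0, 1.0, 1.0       # placeholder algorithmic constants

def model_time(n, p):
    flops = C_F * n**3 * T_F / p              # dominant term for large N
    volume = C_V * n**2 * T_V / math.sqrt(p)  # bandwidth term
    latency = C_M * n * T_M                   # latency term
    return flops + volume + latency

def model_efficiency(n, p):
    return (C_F * n**3 * T_F) / (p * model_time(n, p))

for n in (1000, 5000, 20000):
    print(n, round(model_efficiency(n, 64), 3))
# With these placeholder constants, the latency term dominates the overhead for
# small N and the bandwidth term for larger N, while efficiency approaches 1 as
# the flop term takes over, mirroring the discussion above.
</PRE>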
<P>
<HR>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>