<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>Parallel Efficiency</TITLE>
<META NAME="description" CONTENT="Parallel Efficiency">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
<A NAME="tex2html3613" HREF="node113.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3611" HREF="node111.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3607" HREF="node111.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3615" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3616" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3614" HREF="node113.html">ScaLAPACK Performance</A>
<B>Up:</B> <A NAME="tex2html3612" HREF="node111.html">BLACS as an Efficient</A>
<B> Previous:</B> <A NAME="tex2html3608" HREF="node111.html">BLACS as an Efficient</A>
<BR> <P>
<H3><A NAME="SECTION04523100000000000000">Parallel Efficiency</A></H3>
<A NAME="subsecparalleleff"> </A>
<A NAME="3658"> </A>
<P>
An important performance metric is <EM>parallel efficiency</EM>. Parallel
efficiency, <I>E</I>(<I>N</I>, <I>P</I>),<A NAME="3660"> </A>
for a problem of size <I>N</I> on <I>P</I> nodes is
defined in the usual
way [<A HREF="node189.html#fox88ab">65</A>, <A HREF="node189.html#kumar94a">92</A>] by
<BR><IMG WIDTH=337 HEIGHT=41 ALIGN=BOTTOM ALT="displaymath16290" SRC="img372.gif"><BR>
where <I>T</I>(<I>N</I>,<I>P</I>)<A NAME="3665"> </A> is the runtime of the parallel algorithm,
and <IMG WIDTH=57 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16300" SRC="img373.gif"><A NAME="7983"> </A>
is the runtime of the best sequential algorithm. For dense matrix computations, an
implementation is said to be <EM>scalable</EM><A NAME="3669"> </A> if the parallel
efficiency is an increasing function of
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16304" SRC="img375.gif">,
the problem size per node. The algorithms implemented in the ScaLAPACK library are
scalable in this sense.
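<P>
As an illustration, the definition above can be evaluated directly from measured
runtimes. The sketch below simply applies the usual definition, E(N,P) = T_seq(N) / (P T(N,P));
the timing values in it are made up purely for illustration.
<PRE>
# Hedged sketch: parallel efficiency E(N, P) = T_seq(N) / (P * T(N, P)).
# The timing numbers below are invented for illustration only.

def parallel_efficiency(t_seq, t_par, p):
    """Return E(N, P) given the best sequential time t_seq = T_seq(N),
    the parallel runtime t_par = T(N, P), and the number of nodes p."""
    return t_seq / (p * t_par)

# Example: a problem that takes 100 s with the best sequential algorithm
# and 30 s on 4 nodes has efficiency 100 / (4 * 30) ~= 0.83.
print(parallel_efficiency(100.0, 30.0, 4))
</PRE>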
<P>
Figure <A HREF="node112.html#figiso_pgon">5.1</A> shows the scalability of the ScaLAPACK
implementation of the <I>LU</I> factorization on the Intel XP/S Paragon computer. The
nodes of this computer are general-purpose (GP) or multiprocessor (MP) nodes, based on
the Intel i860 XP RISC processor. Each Intel i860 processor is capable of a peak
performance of 50 Mflop/s. On such a processor, however, the vendor-supplied BLAS
matrix-matrix multiply routine DGEMM can achieve only approximately 45 Mflop/s. The
computer used to obtain the performance results presented in this chapter consisted of
MP nodes configured as follows: each MP node had three Intel i860 XP processors -- two
to execute application code and a third used exclusively as a message
coprocessor<A NAME="3671"> </A><A NAME="3672"> </A>. On such a node, the vendor-supplied
BLAS matrix-matrix multiply routine DGEMM can achieve approximately 90 Mflop/s.
<P><A NAME="3675"> </A><A NAME="figiso_pgon"> </A><IMG WIDTH=378 HEIGHT=280 ALIGN=BOTTOM ALT="figure3673" SRC="img376.gif"><BR>
<STRONG>Figure 5.1:</STRONG> LU Performance per Intel XP/S MP Paragon node<BR>
<P>
<P>
Figure <A HREF="node112.html#figiso_pgon">5.1</A> shows the speed in Mflop/s per node of
the ScaLAPACK <I>LU</I> factorization routine PDGETRF for different computer
configurations. This figure illustrates that when the number of nodes is scaled by a
constant factor, the same efficiency or speed per node is achieved for equidistant
problem sizes on a logarithmic scale. In other words, maintaining a constant memory use
per node allows efficiency to be maintained. (This scalability behavior is also referred
to as <EM>isoefficiency,</EM><A NAME="3680"> </A>
or <EM>isogranularity</EM><A NAME="3682"> </A>.)
In practice, however, a slight degradation is acceptable. The ScaLAPACK driver routines,
in general, feature the same scalability behavior up to a constant factor that depends
on the exact number of floating-point operations and the total volume of data exchanged
during the computation.
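<P>
The isogranularity argument can be made concrete with a short sketch: to keep the memory
use per node constant, the matrix order N must grow like the square root of the number of
nodes P, since the memory per node is proportional to N*N/P for dense matrices. The
starting values in the sketch below are arbitrary and purely illustrative.
<PRE>
# Hedged sketch: scale the problem so that N*N/P (memory per node) stays constant.
# The starting point (n0, p0) is arbitrary; only the square-root scaling rule
# follows from keeping N*N/P fixed.
import math

def isogranular_n(n0, p0, p):
    """Matrix order that keeps N*N/P equal to n0*n0/p0 when run on p nodes."""
    return int(round(n0 * math.sqrt(p / p0)))

for p in (4, 8, 16, 32, 64):
    print(p, isogranular_n(2000, 4, p))
# Doubling P multiplies N by sqrt(2), i.e., the problem sizes are
# equidistant on a logarithmic scale, as in Figure 5.1.
</PRE>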
<P>
In large dense linear algebra computations, the computation cost dominates the
communication cost. In the following, the time to execute one floating-point operation
by one node is denoted by <IMG WIDTH=13 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12202" SRC="img28.gif"><A NAME="3683"> </A>.
The time to communicate a message between two nodes is approximated by a linear function
of the number of items communicated. This function is the sum of the time to prepare the
message for transmission (<IMG WIDTH=18 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12208" SRC="img30.gif">)<A NAME="3684"> </A>
and the time taken by the message to traverse the network to its destination, that is,
the product of its length and the time to transfer one data item
(<IMG WIDTH=12 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12228" SRC="img35.gif">)<A NAME="3685"> </A>.
The quantity <IMG WIDTH=18 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12208" SRC="img30.gif">
is also called the <I>latency,</I><A NAME="3687"> </A><A NAME="3688"> </A> since it is the
time to communicate a message of zero length. On most modern interconnection networks,
the latency ranges from about a microsecond to a millisecond.
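<P>
The linear communication model described above can be written down directly. In the
sketch below, the latency and the time per data item are placeholder values chosen only
to illustrate the model; they are not measurements of any particular network.
<PRE>
# Hedged sketch of the linear point-to-point communication model:
#   time(n_items) = latency + n_items * time_per_item
# where time_per_item is the reciprocal of the bandwidth.
# The constants are illustrative assumptions, not measured values.

T_M = 50e-6   # assumed latency: 50 microseconds
T_V = 1e-7    # assumed time per 8-byte item: roughly 80 MB/s bandwidth

def message_time(n_items, t_m=T_M, t_v=T_V):
    """Approximate time to send a message of n_items data items."""
    return t_m + n_items * t_v

# A zero-length message costs exactly the latency; long messages are
# dominated by the bandwidth term n_items * t_v.
print(message_time(0), message_time(1_000_000))
</PRE>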
<P>
The bandwidth<A NAME="3689"> </A><A NAME="3690"> </A> of
the network is also referred to as
its <I>throughput.</I><A NAME="3692"> </A>
It is proportional to the reciprocal of <IMG WIDTH=12 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12228" SRC="img35.gif">. On modern
networks, the bandwidth is on the order of megabytes per second.
For a scalable algorithm with
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16304" SRC="img375.gif"> held constant, one expects the performance to
be proportional to <I>P</I>. The algorithms implemented in
ScaLAPACK are scalable in this sense. Table <A HREF="node112.html#tabvars">5.1</A>
summarizes the relevant constants used in our scalability
analysis.
<P><A NAME="3695"> </A><A NAME="tabvars"> </A><IMG WIDTH=706 HEIGHT=314 ALIGN=BOTTOM ALT="table3694" SRC="img377.gif"><BR>
<STRONG>Table 5.1:</STRONG> Variable definitions<BR>
<P>
<P>
Using the notation presented in Table <A HREF="node112.html#tabvars">5.1</A>,
the execution time of the ScaLAPACK drivers can be
approximated by
<BR><A NAME="eqntim"> </A><IMG WIDTH=647 HEIGHT=44 ALIGN=BOTTOM ALT="equation3728" SRC="img378.gif"><BR>
<P>
The corresponding parallel efficiency<A NAME="3738"> </A>
can then be approximated by
<BR><A NAME="eqneff"> </A><IMG WIDTH=564 HEIGHT=52 ALIGN=BOTTOM ALT="equation3739" SRC="img379.gif"><BR>
Equation <A HREF="node112.html#eqneff">5.2</A> illustrates, in particular, that
the communication versus computation performance ratio of a
distributed-memory computer significantly affects parallel efficiency. The
ratio of the latency to the time per flop <IMG WIDTH=53 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16374" SRC="img380.gif"> greatly
affects the parallel efficiency of small problems. The ratio
of the network throughput to the flop rate <IMG WIDTH=48 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16376" SRC="img381.gif">
significantly affects the parallel efficiency of medium-sized
problems. For large problems, the node flop rate <IMG WIDTH=43 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16378" SRC="img382.gif">
is the dominant factor contributing to the parallel
efficiency of the parallel algorithms implemented in ScaLAPACK.
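<P>
The qualitative conclusions drawn from equation 5.2 can be checked numerically with a
generic three-term model of the same shape as equation 5.1: a flop term, a bandwidth
term, and a latency term. The constants and machine parameters in the sketch below are
placeholders, not the ScaLAPACK values from Table 5.1; only the structure of the model
is taken from the text.
<PRE>
# Hedged sketch: a generic execution-time model of the shape of equation 5.1,
#   T(N, P) ~ C_f * N**3 * t_f / P + C_v * N**2 * t_v / sqrt(P) + C_m * N * t_m,
# and the resulting efficiency E(N, P) = C_f * N**3 * t_f / (P * T(N, P)).
# All constants are illustrative placeholders, not ScaLAPACK's actual values.
import math

T_F, T_V, T_M = 2e-8, 1e-7, 50e-6   # assumed time per flop, per item, latency
C_F, C_V, C_M = 1.0, 1.0, 1.0       # placeholder algorithmic constants

def model_time(n, p):
    flops = C_F * n**3 * T_F / p              # dominant term for large N
    volume = C_V * n**2 * T_V / math.sqrt(p)  # bandwidth term
    latency = C_M * n * T_M                   # latency term
    return flops + volume + latency

def model_efficiency(n, p):
    return (C_F * n**3 * T_F) / (p * model_time(n, p))

for n in (1000, 5000, 20000):
    print(n, round(model_efficiency(n, 64), 3))
# With these placeholder constants, the latency term dominates the overhead for
# small N and the bandwidth term for larger N, while efficiency approaches 1 as
# the flop term takes over, mirroring the discussion above.
</PRE>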
<P>
<HR>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>