File: node112.html

package info (click to toggle)
scalapack-doc 1.5-11
  • links: PTS
  • area: main
  • in suites: bullseye, buster, stretch
  • size: 10,336 kB
  • ctags: 4,931
  • sloc: makefile: 47; sh: 18
file content (216 lines) | stat: -rw-r--r-- 9,453 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>Parallel Efficiency</TITLE>
<META NAME="description" CONTENT="Parallel Efficiency">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
 <A NAME="tex2html3613" HREF="node113.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3611" HREF="node111.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3607" HREF="node111.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3615" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3616" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3614" HREF="node113.html">ScaLAPACK Performance</A>
<B>Up:</B> <A NAME="tex2html3612" HREF="node111.html">BLACS as an Efficient</A>
<B> Previous:</B> <A NAME="tex2html3608" HREF="node111.html">BLACS as an Efficient</A>
<BR> <P>
<H3><A NAME="SECTION04523100000000000000">Parallel Efficiency</A></H3>
                   <A NAME="subsecparalleleff">&#160;</A>
<A NAME="3658">&#160;</A>
<P>
An important performance metric is <EM>parallel efficiency</EM>. Parallel
efficiency, <I>E</I>(<I>N</I>, <I>P</I>),<A NAME="3660">&#160;</A>
for a problem of size <I>N</I> on <I>P</I> nodes is 
defined in the usual
way [<A HREF="node189.html#fox88ab">65</A>, <A HREF="node189.html#kumar94a">92</A>] by
<BR><IMG WIDTH=337 HEIGHT=41 ALIGN=BOTTOM ALT="displaymath16290" SRC="img372.gif"><BR>
where <I>T</I>(<I>N</I>,<I>P</I>)<A NAME="3665">&#160;</A> is the
runtime of the parallel algorithm, and <IMG WIDTH=57 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16300" SRC="img373.gif"><A NAME="7983">&#160;</A> is the runtime
of the best sequential
algorithm. For dense 
matrix computations,
an implementation is
said to be <EM>scalable</EM><A NAME="3669">&#160;</A>
if the parallel efficiency
is an increasing function
of <IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16304" SRC="img375.gif">, the problem 
size per node. The
algorithms implemented in
the ScaLAPACK library are 
scalable in this sense.
<P>
Figure <A HREF="node112.html#figiso_pgon">5.1</A>
shows the scalability
of the ScaLAPACK 
implementation of
the <I>LU</I> factorization
on the Intel XP/S
Paragon computer.
The nodes of the
Intel XP/S Paragon
computer are 
general-purpose (GP)
or multiprocessor (MP)
nodes, based on the
Intel i860
XP RISC processors.  Each 
Intel i860 processor is
capable of a peak performance
of 50&nbsp;Mflop/s. On such a
processor, however, the 
vendor-supplied BLAS
matrix-matrix multiply
routine DGEMM can 
achieve only approximately
45&nbsp;Mflop/s. The 
computer used for 
obtaining the 
performance results
presented in this
chapter consisted
of MP nodes configured
as follows:  each
MP node had three
Intel i860 XP processors
-- two to execute 
application code
and a third used
exclusively as a 
message coprocessor<A NAME="3671">&#160;</A><A NAME="3672">&#160;</A>.
On such a node, the
vendor-supplied BLAS
matrix-matrix multiply
routine DGEMM can 
achieve approximately
90&nbsp;Mflop/s.
<P><A NAME="3675">&#160;</A><A NAME="figiso_pgon">&#160;</A><IMG WIDTH=378 HEIGHT=280 ALIGN=BOTTOM ALT="figure3673" SRC="img376.gif"><BR>
<STRONG>Figure 5.1:</STRONG> LU Performance per Intel XP/S MP Paragon node<BR>
<P>
<P>
Figure&nbsp;<A HREF="node112.html#figiso_pgon">5.1</A>
shows the speed in
Mflop/s per node
of the ScaLAPACK
<I>LU</I> factorization
routine PDGETRF 
for different
computer configurations.
This figure illustrates
that when the number of
nodes is scaled by a 
constant factor, the
same efficiency or
speed per node is 
achieved for equidistant
problem sizes on a 
logarithmic scale.
In other words,
maintaining a constant
memory use per node
allows efficiency
to be maintained.
(This scalability behavior is also referred to as <EM>isoefficiency,</EM><A NAME="3680">&#160;</A>
or <EM>isogranularity</EM><A NAME="3682">&#160;</A>.)
In practice, however,
a slight degradation
is acceptable.
The ScaLAPACK driver
routines, in general,
feature the same 
scalability behavior
up to a constant
factor that depends
on the exact number
of floating-point
operations and the
total volume of data
exchanged during the
computation.
<P>
In large dense linear
algebra computations,
the computation
cost dominates the
communication cost.
In the following, the
time to execute one
floating-point operation
by one node is denoted
by <IMG WIDTH=13 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12202" SRC="img28.gif"><A NAME="3683">&#160;</A>. The time to
communicate a message
between two nodes is
approximated by a
linear function of
the number of items
communicated. The function
is the sum of the time
to prepare the message
for transmission (<IMG WIDTH=18 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12208" SRC="img30.gif">)<A NAME="3684">&#160;</A>
and the time taken
by the message
to traverse the
network to its
destination, that is,
the product of its
length by the time
to transfer one
data item (<IMG WIDTH=12 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12228" SRC="img35.gif">)<A NAME="3685">&#160;</A>.
Alternatively, <IMG WIDTH=18 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12208" SRC="img30.gif">
is also called the
<I>latency,</I><A NAME="3687">&#160;</A><A NAME="3688">&#160;</A> since
it is the time to
communicate a
message of zero
length. On most
modern interconnection
networks, the order of
magnitude of the latency
varies between a 
microsecond and a 
millisecond.
<P>
The bandwidth<A NAME="3689">&#160;</A><A NAME="3690">&#160;</A> of
the network is also referred to as 
its <I>throughput.</I><A NAME="3692">&#160;</A>
It is proportional to the reciprocal of <IMG WIDTH=12 HEIGHT=23 ALIGN=MIDDLE ALT="tex2html_wrap_inline12228" SRC="img35.gif">. On modern
networks, the order of magnitude of the bandwidth is the 
megabyte per second.  For a scalable algorithm with
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16304" SRC="img375.gif"> held constant, one expects the performance to 
be proportional to <I>P</I>. The algorithms implemented in
ScaLAPACK are scalable in this sense. Table&nbsp;<A HREF="node112.html#tabvars">5.1</A>
summarizes the relevant constants used in our scalability
analysis.
<P><A NAME="3695">&#160;</A><A NAME="tabvars">&#160;</A><IMG WIDTH=706 HEIGHT=314 ALIGN=BOTTOM ALT="table3694" SRC="img377.gif"><BR>
<STRONG>Table 5.1:</STRONG> Variable definitions<BR>
<P>
<P>
Using the notation presented in table&nbsp;<A HREF="node112.html#tabvars">5.1</A>,
the execution time of the ScaLAPACK drivers can be
approximated by
<BR><A NAME="eqntim">&#160;</A><IMG WIDTH=647 HEIGHT=44 ALIGN=BOTTOM ALT="equation3728" SRC="img378.gif"><BR>
<P>
The corresponding parallel efficiency<A NAME="3738">&#160;</A>
can then be approximated by
<BR><A NAME="eqneff">&#160;</A><IMG WIDTH=564 HEIGHT=52 ALIGN=BOTTOM ALT="equation3739" SRC="img379.gif"><BR>
Equation <A HREF="node112.html#eqneff">5.2</A> illustrates, in particular, that
the communication versus computation performance ratio of a
distributed-memory computer significantly affects parallel efficiency. The
ratio of the latency to the time per flop <IMG WIDTH=53 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16374" SRC="img380.gif"> greatly
affects the parallel efficiency of small problems. The ratio
of the network throughput to the flop rate <IMG WIDTH=48 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16376" SRC="img381.gif">
significantly affects the parallel efficiency of medium-sized
problems. For large problems, the node flop rate <IMG WIDTH=43 HEIGHT=28 ALIGN=MIDDLE ALT="tex2html_wrap_inline16378" SRC="img382.gif">
is the dominant factor contributing to the parallel
efficiency of the parallel algorithms implemented in ScaLAPACK.
<P>
<HR><A NAME="tex2html3613" HREF="node113.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3611" HREF="node111.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3607" HREF="node111.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3615" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3616" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3614" HREF="node113.html">ScaLAPACK Performance</A>
<B>Up:</B> <A NAME="tex2html3612" HREF="node111.html">BLACS as an Efficient</A>
<B> Previous:</B> <A NAME="tex2html3608" HREF="node111.html">BLACS as an Efficient</A>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>