File: node109.html

package info (click to toggle)
scalapack-doc 1.5-11
  • links: PTS
  • area: main
  • in suites: bullseye, buster, stretch
  • size: 10,336 kB
  • ctags: 4,931
  • sloc: makefile: 47; sh: 18
file content (282 lines) | stat: -rw-r--r-- 8,963 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>The BLAS as the Key to (Trans)portable Efficiency</TITLE>
<META NAME="description" CONTENT="The BLAS as the Key to (Trans)portable Efficiency">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
 <A NAME="tex2html3578" HREF="node110.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3576" HREF="node108.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3570" HREF="node108.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3580" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3581" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3579" HREF="node110.html">Two-Dimensional Block Cyclic Data </A>
<B>Up:</B> <A NAME="tex2html3577" HREF="node108.html">PerformancePortability and Scalability</A>
<B> Previous:</B> <A NAME="tex2html3571" HREF="node108.html">PerformancePortability and Scalability</A>
<BR> <P>
<H2><A NAME="SECTION04521000000000000000">The BLAS as the Key to (Trans)portable Efficiency</A></H2>
<P>
                                                        <A NAME="subsecblas">&#160;</A>
The total number
of floating-point 
operations performed
by most of the ScaLAPACK 
driver routines for
dense matrices
can be approximated
by the quantity
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline12066" SRC="img3.gif">, where
<IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif"> is a constant
and <I>N</I> is the
order of the
largest matrix
operand.  For
solving linear
equations or 
linear least
squares, <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif"><A NAME="3595">&#160;</A>
is a constant
depending solely
on the selected
algorithm.  The
algorithms used
to find eigenvalues
and singular 
values are
iterative; hence,
for these operations
the constant <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif">
truly depends
on the input
data as well.
It is, however,
customary or
``standard'' to
consider the
values of the
constants <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif">
for a fixed
number of
iterations.
The ``standard''
constants <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif">
range from 1/3
to approximately
18, as shown
in Table&nbsp;<A HREF="node116.html#standardflopcount">5.8</A>.
<P>
The performance
of the ScaLAPACK
drivers is thus
bounded above
by the performance
of a computation
that could be
partitioned into
<I>P</I> independent
chunks of 
<IMG WIDTH=68 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16211" SRC="img365.gif">
floating-point
operations each.
This upper bound,
referred to 
hereafter as
the <I>peak
performance</I><A NAME="3598">&#160;</A>, 
can be computed
as the product
of <IMG WIDTH=68 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16211" SRC="img365.gif">
and the highest 
reachable local
node flop rate.
Hence, for
a given problem
size <I>N</I> and 
assuming a uniform
distribution of the
computational tasks,
the most important 
factors determining
the overall performance
are the number
<I>P</I> of nodes
involved in the
computation and
the local node
flop rate.
<P>
In a serial 
computational
environment,
<I>transportable
efficiency</I><A NAME="3600">&#160;</A><A NAME="3601">&#160;</A> is
the essential
motivation for
developing blocking
strategies and
block-partitioned
algorithms
[<A HREF="node189.html#agarwal94b">2</A>, <A HREF="node189.html#laug">3</A>, <A HREF="node189.html#dayde94a">35</A>, <A HREF="node189.html#kagstrom95b">90</A>]<A NAME="3603">&#160;</A><A NAME="3604">&#160;</A>.
The linear algebra
package (LAPACK)
[<A HREF="node189.html#laug">3</A>]<A NAME="3606">&#160;</A> is 
the archetype of
such a strategy.
The LAPACK software
is constructed as
much as possible
out of calls to
the BLAS. These
kernels confine
the impact of
the computer
architecture
differences
to a small
number of 
routines. The
efficiency and
portability of
the LAPACK 
software are
then achieved
by combining
native and 
efficient BLAS
implementations
with portable
high-level
components.
<P>
The BLAS<A NAME="3607">&#160;</A> are
subdivided
into three 
levels, each
of which 
offers
increased 
scope for
exploiting
parallelism.
This subdivision
corresponds to
three different
types of basic
linear algebra
operations:
<UL>
<LI> Level&nbsp;1 BLAS [<A HREF="node189.html#blas1">93</A>]<A NAME="3610">&#160;</A>:
       for vector operations, such as
       <IMG WIDTH=88 HEIGHT=21 ALIGN=MIDDLE ALT="tex2html_wrap_inline16219" SRC="img366.gif">,
<LI> Level&nbsp;2 BLAS [<A HREF="node189.html#blas2">59</A>, <A HREF="node189.html#blas2alg">58</A>]<A NAME="3612">&#160;</A>:
       for matrix-vector operations, such as
       <IMG WIDTH=111 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16221" SRC="img367.gif">,
<LI> Level&nbsp;3 BLAS [<A HREF="node189.html#blas3">57</A>, <A HREF="node189.html#blas3alg">56</A>]<A NAME="3614">&#160;</A>:
       for matrix-matrix operations, such as
       <IMG WIDTH=123 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16223" SRC="img368.gif">.
</UL>
Here, <I>A</I>, <I>B</I>,
and <I>C</I> are 
matrices, <I>x</I>
and <I>y</I> are 
vectors, and
<IMG WIDTH=10 HEIGHT=8 ALIGN=BOTTOM ALT="tex2html_wrap_inline16235" SRC="img369.gif"> and
<IMG WIDTH=11 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14473" SRC="img243.gif"> are
scalars.
<P>
The performance
potential of
the three levels
of BLAS is 
strongly related
to the ratio of
floating-point
operations to
memory references,
as well as to the
reuse of data when
it is stored in
the higher levels
of the memory
hierarchy.
Consequently,
the Level&nbsp;1
BLAS cannot 
achieve high
efficiency on
most modern
supercomputers.
The Level&nbsp;2
BLAS can achieve
near-peak 
performance
on many vector
processors; on
RISC microprocessors,
however, their 
performance is
limited by the
memory access
bandwidth
bottleneck. The
greatest scope
for exploiting
the highest
levels of the
memory hierarchy
as well as other
forms of parallelism
is offered by the
Level&nbsp;3 BLAS
[<A HREF="node189.html#laug">3</A>].
<P>
The previous
reasoning applies to
distributed-memory
computational
environments in two
ways. First, in order
to achieve overall
high performance,
it is necessary to
express the bulk
of the computation
local to each node
in terms of Level&nbsp;3
BLAS operations.
Second, designing
and developing
a set of parallel
BLAS (PBLAS)<A NAME="3617">&#160;</A> for
distributed-memory
 computers
should lead to an 
efficient and 
straightforward
port of the
LAPACK software.
This is the path
followed by the
ScaLAPACK initiative
[<A HREF="node189.html#lawn95">25</A>, <A HREF="node189.html#dongarra95a">53</A>]
as well as others
[<A HREF="node189.html#aboelaze91a">1</A>, <A HREF="node189.html#brent93a">21</A>, <A HREF="node189.html#chtchelkanova95a">30</A>, <A HREF="node189.html#falgout93a">63</A>].
As part of the
ScaLAPACK project,
a set of PBLAS
has been early
designed and
developed
[<A HREF="node189.html#choi94a">29</A>, <A HREF="node189.html#lawn100">26</A>].
<P>
<HR><A NAME="tex2html3578" HREF="node110.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3576" HREF="node108.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3570" HREF="node108.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3580" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3581" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3579" HREF="node110.html">Two-Dimensional Block Cyclic Data </A>
<B>Up:</B> <A NAME="tex2html3577" HREF="node108.html">PerformancePortability and Scalability</A>
<B> Previous:</B> <A NAME="tex2html3571" HREF="node108.html">PerformancePortability and Scalability</A>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>