1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282
|
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>The BLAS as the Key to (Trans)portable Efficiency</TITLE>
<META NAME="description" CONTENT="The BLAS as the Key to (Trans)portable Efficiency">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
<A NAME="tex2html3578" HREF="node110.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3576" HREF="node108.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3570" HREF="node108.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3580" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3581" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3579" HREF="node110.html">Two-Dimensional Block Cyclic Data </A>
<B>Up:</B> <A NAME="tex2html3577" HREF="node108.html">PerformancePortability and Scalability</A>
<B> Previous:</B> <A NAME="tex2html3571" HREF="node108.html">PerformancePortability and Scalability</A>
<BR> <P>
<H2><A NAME="SECTION04521000000000000000">The BLAS as the Key to (Trans)portable Efficiency</A></H2>
<P>
<A NAME="subsecblas"> </A>
The total number
of floating-point
operations performed
by most of the ScaLAPACK
driver routines for
dense matrices
can be approximated
by the quantity
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline12066" SRC="img3.gif">, where
<IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif"> is a constant
and <I>N</I> is the
order of the
largest matrix
operand. For
solving linear
equations or
linear least
squares, <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif"><A NAME="3595"> </A>
is a constant
depending solely
on the selected
algorithm. The
algorithms used
to find eigenvalues
and singular
values are
iterative; hence,
for these operations
the constant <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif">
truly depends
on the input
data as well.
It is, however,
customary or
``standard'' to
consider the
values of the
constants <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif">
for a fixed
number of
iterations.
The ``standard''
constants <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16191" SRC="img364.gif">
range from 1/3
to approximately
18, as shown
in Table <A HREF="node116.html#standardflopcount">5.8</A>.
<P>
The performance
of the ScaLAPACK
drivers is thus
bounded above
by the performance
of a computation
that could be
partitioned into
<I>P</I> independent
chunks of
<IMG WIDTH=68 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16211" SRC="img365.gif">
floating-point
operations each.
This upper bound,
referred to
hereafter as
the <I>peak
performance</I><A NAME="3598"> </A>,
can be computed
as the product
of <IMG WIDTH=68 HEIGHT=31 ALIGN=MIDDLE ALT="tex2html_wrap_inline16211" SRC="img365.gif">
and the highest
reachable local
node flop rate.
Hence, for
a given problem
size <I>N</I> and
assuming a uniform
distribution of the
computational tasks,
the most important
factors determining
the overall performance
are the number
<I>P</I> of nodes
involved in the
computation and
the local node
flop rate.
<P>
In a serial
computational
environment,
<I>transportable
efficiency</I><A NAME="3600"> </A><A NAME="3601"> </A> is
the essential
motivation for
developing blocking
strategies and
block-partitioned
algorithms
[<A HREF="node189.html#agarwal94b">2</A>, <A HREF="node189.html#laug">3</A>, <A HREF="node189.html#dayde94a">35</A>, <A HREF="node189.html#kagstrom95b">90</A>]<A NAME="3603"> </A><A NAME="3604"> </A>.
The linear algebra
package (LAPACK)
[<A HREF="node189.html#laug">3</A>]<A NAME="3606"> </A> is
the archetype of
such a strategy.
The LAPACK software
is constructed as
much as possible
out of calls to
the BLAS. These
kernels confine
the impact of
the computer
architecture
differences
to a small
number of
routines. The
efficiency and
portability of
the LAPACK
software are
then achieved
by combining
native and
efficient BLAS
implementations
with portable
high-level
components.
<P>
The BLAS<A NAME="3607"> </A> are
subdivided
into three
levels, each
of which
offers
increased
scope for
exploiting
parallelism.
This subdivision
corresponds to
three different
types of basic
linear algebra
operations:
<UL>
<LI> Level 1 BLAS [<A HREF="node189.html#blas1">93</A>]<A NAME="3610"> </A>:
for vector operations, such as
<IMG WIDTH=88 HEIGHT=21 ALIGN=MIDDLE ALT="tex2html_wrap_inline16219" SRC="img366.gif">,
<LI> Level 2 BLAS [<A HREF="node189.html#blas2">59</A>, <A HREF="node189.html#blas2alg">58</A>]<A NAME="3612"> </A>:
for matrix-vector operations, such as
<IMG WIDTH=111 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16221" SRC="img367.gif">,
<LI> Level 3 BLAS [<A HREF="node189.html#blas3">57</A>, <A HREF="node189.html#blas3alg">56</A>]<A NAME="3614"> </A>:
for matrix-matrix operations, such as
<IMG WIDTH=123 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline16223" SRC="img368.gif">.
</UL>
Here, <I>A</I>, <I>B</I>,
and <I>C</I> are
matrices, <I>x</I>
and <I>y</I> are
vectors, and
<IMG WIDTH=10 HEIGHT=8 ALIGN=BOTTOM ALT="tex2html_wrap_inline16235" SRC="img369.gif"> and
<IMG WIDTH=11 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14473" SRC="img243.gif"> are
scalars.
<P>
The performance
potential of
the three levels
of BLAS is
strongly related
to the ratio of
floating-point
operations to
memory references,
as well as to the
reuse of data when
it is stored in
the higher levels
of the memory
hierarchy.
Consequently,
the Level 1
BLAS cannot
achieve high
efficiency on
most modern
supercomputers.
The Level 2
BLAS can achieve
near-peak
performance
on many vector
processors; on
RISC microprocessors,
however, their
performance is
limited by the
memory access
bandwidth
bottleneck. The
greatest scope
for exploiting
the highest
levels of the
memory hierarchy
as well as other
forms of parallelism
is offered by the
Level 3 BLAS
[<A HREF="node189.html#laug">3</A>].
<P>
The previous
reasoning applies to
distributed-memory
computational
environments in two
ways. First, in order
to achieve overall
high performance,
it is necessary to
express the bulk
of the computation
local to each node
in terms of Level 3
BLAS operations.
Second, designing
and developing
a set of parallel
BLAS (PBLAS)<A NAME="3617"> </A> for
distributed-memory
computers
should lead to an
efficient and
straightforward
port of the
LAPACK software.
This is the path
followed by the
ScaLAPACK initiative
[<A HREF="node189.html#lawn95">25</A>, <A HREF="node189.html#dongarra95a">53</A>]
as well as others
[<A HREF="node189.html#aboelaze91a">1</A>, <A HREF="node189.html#brent93a">21</A>, <A HREF="node189.html#chtchelkanova95a">30</A>, <A HREF="node189.html#falgout93a">63</A>].
As part of the
ScaLAPACK project,
a set of PBLAS
has been early
designed and
developed
[<A HREF="node189.html#choi94a">29</A>, <A HREF="node189.html#lawn100">26</A>].
<P>
<HR><A NAME="tex2html3578" HREF="node110.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3576" HREF="node108.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3570" HREF="node108.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3580" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3581" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3579" HREF="node110.html">Two-Dimensional Block Cyclic Data </A>
<B>Up:</B> <A NAME="tex2html3577" HREF="node108.html">PerformancePortability and Scalability</A>
<B> Previous:</B> <A NAME="tex2html3571" HREF="node108.html">PerformancePortability and Scalability</A>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>
|