<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>The BLAS as the Key to (Trans)portable Efficiency</TITLE>
<META NAME="description" CONTENT="The BLAS as the Key to (Trans)portable Efficiency">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
 <A NAME="tex2html3578" HREF="node110.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3576" HREF="node108.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3570" HREF="node108.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3580" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3581" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3579" HREF="node110.html">Two-Dimensional Block Cyclic Data </A>
<B>Up:</B> <A NAME="tex2html3577" HREF="node108.html">Performance, Portability and Scalability</A>
<B> Previous:</B> <A NAME="tex2html3571" HREF="node108.html">Performance, Portability and Scalability</A>
<BR> <P>
<H2><A NAME="SECTION04521000000000000000">The BLAS as the Key to (Trans)portable Efficiency</A></H2>
<P>
                                                        <A NAME="subsecblas">&#160;</A>
The total number
of floating-point 
operations performed
by most of the ScaLAPACK 
driver routines for
dense matrices
can be approximated
by the quantity
<IMG WIDTH=45 HEIGHT=31 ALIGN=MIDDLE ALT="C_f N^3" SRC="img3.gif">, where
<IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="C_f" SRC="img364.gif"> is a constant
and <I>N</I> is the
order of the
largest matrix
operand.  For
solving linear
equations or 
linear least
squares, <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="C_f" SRC="img364.gif"><A NAME="3595">&#160;</A>
is a constant
depending solely
on the selected
algorithm.  The
algorithms used
to find eigenvalues
and singular 
values are
iterative; hence,
for these operations
the constant <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="C_f" SRC="img364.gif">
also depends
on the input
data.
It is, however,
customary or
"standard" to
consider the
values of the
constants <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="C_f" SRC="img364.gif">
for a fixed
number of
iterations.
The "standard"
constants <IMG WIDTH=18 HEIGHT=25 ALIGN=MIDDLE ALT="C_f" SRC="img364.gif">
range from 1/3
to approximately
18, as shown
in Table&nbsp;<A HREF="node116.html#standardflopcount">5.8</A>.
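<P>
As a rough illustration of this estimate, the sketch below evaluates
<TT>C_f * N**3</TT> for two constants. The values 2/3 (an LU-based linear
solve) and 1/3 (a Cholesky-based solve) are the usual leading-order counts
and are assumptions made here for illustration; the function name
<TT>approx_flops</TT> is hypothetical, and lower-order terms are ignored.

```python
from fractions import Fraction


def approx_flops(c_f, n):
    """Leading-order flop estimate C_f * N**3 for a dense driver."""
    return c_f * n ** 3


# Assumed constants: 2/3 for an LU-based solve, 1/3 for a Cholesky-based solve.
lu_flops = approx_flops(Fraction(2, 3), 3000)
cholesky_flops = approx_flops(Fraction(1, 3), 3000)

print(f"LU:       {float(lu_flops):.3e} flops")
print(f"Cholesky: {float(cholesky_flops):.3e} flops")
```

Exact fractions are used only so that the constants are not rounded before
they are multiplied by the cube of the matrix order.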
<P>
The performance
of the ScaLAPACK
drivers is thus
bounded above
by the performance
of a computation
that could be
partitioned into
<I>P</I> independent
chunks of 
<IMG WIDTH=68 HEIGHT=31 ALIGN=MIDDLE ALT="C_f N^3 / P" SRC="img365.gif">
floating-point
operations each.
This upper bound,
referred to 
hereafter as
the <I>peak
performance</I><A NAME="3598">&#160;</A>, 
corresponds to
an aggregate rate
of <I>P</I> times
the highest 
reachable local
node flop rate,
that is, to an
execution time of
<IMG WIDTH=68 HEIGHT=31 ALIGN=MIDDLE ALT="C_f N^3 / P" SRC="img365.gif">
divided by that rate.
Hence, for
a given problem
size <I>N</I> and 
assuming a uniform
distribution of the
computational tasks,
the most important 
factors determining
the overall performance
are the number
<I>P</I> of nodes
involved in the
computation and
the local node
flop rate.
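<P>
One way to read this bound, assuming perfectly balanced work and no
communication cost, is sketched below: the aggregate rate cannot exceed
<I>P</I> times the best local node rate, and the run time cannot drop below
the per-node flop count divided by that local rate. The function names and
the machine parameters (16 nodes at 5.0e8 flop/s each) are hypothetical.

```python
def peak_rate(p, node_rate):
    """Aggregate upper bound in flop/s for P nodes at node_rate flop/s each."""
    return p * node_rate


def min_time(c_f, n, p, node_rate):
    """Lower bound on run time: C_f * N**3 / P flops per node at node_rate."""
    return (c_f * n ** 3 / p) / node_rate


def efficiency(measured_time, c_f, n, p, node_rate):
    """Fraction of the peak performance achieved by a measured run."""
    return min_time(c_f, n, p, node_rate) / measured_time


# Hypothetical machine: 16 nodes, each sustaining 5.0e8 flop/s at best.
P, RATE = 16, 5.0e8
bound = min_time(2.0 / 3.0, 8000, P, RATE)
print(f"peak rate: {peak_rate(P, RATE):.2e} flop/s, time bound: {bound:.1f} s")
```

A measured run that takes twice the time bound would be reported by
<TT>efficiency</TT> as reaching half of the peak performance.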
<P>
In a serial 
computational
environment,
<I>transportable
efficiency</I><A NAME="3600">&#160;</A><A NAME="3601">&#160;</A> is
the essential
motivation for
developing blocking
strategies and
block-partitioned
algorithms
[<A HREF="node189.html#agarwal94b">2</A>, <A HREF="node189.html#laug">3</A>, <A HREF="node189.html#dayde94a">35</A>, <A HREF="node189.html#kagstrom95b">90</A>]<A NAME="3603">&#160;</A><A NAME="3604">&#160;</A>.
The linear algebra
package (LAPACK)
[<A HREF="node189.html#laug">3</A>]<A NAME="3606">&#160;</A> is 
the archetype of
such a strategy.
The LAPACK software
is constructed as
much as possible
out of calls to
the BLAS. These
kernels confine
the impact of
the computer
architecture
differences
to a small
number of 
routines. The
efficiency and
portability of
the LAPACK 
software are
then achieved
by combining
native and 
efficient BLAS
implementations
with portable
high-level
components.
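<P>
The separation between a portable high-level component and a replaceable
kernel can be sketched as follows. This is a minimal pure-Python
illustration for square matrices stored as lists of rows, not LAPACK code:
<TT>blocked_matmul</TT>, <TT>get_block</TT>, and <TT>gemm_kernel</TT> are
hypothetical names, and the block size <TT>nb</TT> is a free parameter.

```python
def gemm_kernel(a, b, c):
    """Reference kernel: C := C + A * B on small dense blocks (lists of rows)."""
    for i in range(len(a)):
        for k in range(len(b)):
            aik = a[i][k]
            for j in range(len(b[0])):
                c[i][j] += aik * b[k][j]


def get_block(m, i0, j0, nb):
    """Copy the block of m (at most nb-by-nb) whose top-left corner is (i0, j0)."""
    return [row[j0:j0 + nb] for row in m[i0:i0 + nb]]


def blocked_matmul(a, b, nb, kernel=gemm_kernel):
    """Portable high-level loop; all arithmetic happens inside `kernel`."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            c_blk = get_block(c, i0, j0, nb)
            for k0 in range(0, n, nb):
                kernel(get_block(a, i0, k0, nb), get_block(b, k0, j0, nb), c_blk)
            # Copy the finished block back into the result matrix.
            for di, row in enumerate(c_blk):
                c[i0 + di][j0:j0 + len(row)] = row
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(blocked_matmul(a, a, nb=2))
```

Only <TT>gemm_kernel</TT> would need to be replaced by a tuned routine on a
given machine; the blocked loop itself stays unchanged.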
<P>
The BLAS<A NAME="3607">&#160;</A> are
subdivided
into three 
levels, each
of which 
offers
increased 
scope for
exploiting
parallelism.
This subdivision
corresponds to
three different
types of basic
linear algebra
operations:
<UL>
<LI> Level&nbsp;1 BLAS [<A HREF="node189.html#blas1">93</A>]<A NAME="3610">&#160;</A>:
       for vector operations, such as
       <IMG WIDTH=88 HEIGHT=21 ALIGN=MIDDLE ALT="y \leftarrow \alpha x + y" SRC="img366.gif">,
<LI> Level&nbsp;2 BLAS [<A HREF="node189.html#blas2">59</A>, <A HREF="node189.html#blas2alg">58</A>]<A NAME="3612">&#160;</A>:
       for matrix-vector operations, such as
       <IMG WIDTH=111 HEIGHT=25 ALIGN=MIDDLE ALT="y \leftarrow \alpha A x + \beta y" SRC="img367.gif">,
<LI> Level&nbsp;3 BLAS [<A HREF="node189.html#blas3">57</A>, <A HREF="node189.html#blas3alg">56</A>]<A NAME="3614">&#160;</A>:
       for matrix-matrix operations, such as
       <IMG WIDTH=123 HEIGHT=25 ALIGN=MIDDLE ALT="C \leftarrow \alpha A B + \beta C" SRC="img368.gif">.
</UL>
Here, <I>A</I>, <I>B</I>,
and <I>C</I> are 
matrices, <I>x</I>
and <I>y</I> are 
vectors, and
<IMG WIDTH=10 HEIGHT=8 ALIGN=BOTTOM ALT="\alpha" SRC="img369.gif"> and
<IMG WIDTH=11 HEIGHT=25 ALIGN=MIDDLE ALT="\beta" SRC="img243.gif"> are
scalars.
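<P>
The three kinds of operation can be written out as plain reference loops.
The sketch below assumes the usual representative updates (a scaled vector
addition, a matrix-vector product with update, and a matrix-matrix product
with update). The function names echo the common BLAS routine roots, but
these pure-Python functions only mirror the arithmetic; they do not follow
the BLAS calling sequences.

```python
def axpy(alpha, x, y):
    """Level 1: y := alpha*x + y for vectors x and y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, a, x, beta, y):
    """Level 2: y := alpha*A*x + beta*y for a matrix A stored as a list of rows."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(a, y)]


def gemm(alpha, a, b, beta, c):
    """Level 3: C := alpha*A*B + beta*C for matrices stored as lists of rows."""
    cols_b = list(zip(*b))  # columns of B, so each entry is a dot product
    return [[alpha * sum(aik * bkj for aik, bkj in zip(row, col)) + beta * cij
             for col, cij in zip(cols_b, c_row)]
            for row, c_row in zip(a, c)]


A = [[1.0, 2.0], [3.0, 4.0]]
print(axpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(gemv(1.0, A, [1.0, 1.0], 0.5, [2.0, 4.0]))
print(gemm(1.0, A, [[0.0, 1.0], [1.0, 0.0]], 1.0, [[1.0, 1.0], [1.0, 1.0]]))
```

For operands of order <I>n</I>, the first function touches each element
once, while the last performs on the order of <I>n</I> operations per matrix
element.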
<P>
The performance
potential of
the three levels
of BLAS is 
strongly related
to the ratio of
floating-point
operations to
memory references,
as well as to the
reuse of data when
it is stored in
the higher levels
of the memory
hierarchy.
Consequently,
the Level&nbsp;1
BLAS cannot 
achieve high
efficiency on
most modern
supercomputers.
The Level&nbsp;2
BLAS can achieve
near-peak 
performance
on many vector
processors; on
RISC microprocessors,
however, their 
performance is
limited by the
memory access
bandwidth
bottleneck. The
greatest scope
for exploiting
the highest
levels of the
memory hierarchy
as well as other
forms of parallelism
is offered by the
Level&nbsp;3 BLAS
[<A HREF="node189.html#laug">3</A>].
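<P>
The ratio of floating-point operations to memory references can be made
concrete with leading-order counts for operands of order <I>n</I>. The
counts used below (2n flops against 3n references for the vector update,
2n^2 against n^2 for the matrix-vector product, 2n^3 against 4n^2 for the
matrix-matrix product) are commonly quoted estimates and are assumptions
here, not figures taken from this section; the function name is
hypothetical.

```python
from fractions import Fraction


def flops_per_memory_reference(n):
    """Leading-order flop and memory-reference ratios for order-n operands."""
    counts = {
        "Level 1 (y := alpha*x + y)": (2 * n, 3 * n),
        "Level 2 (y := alpha*A*x + beta*y)": (2 * n ** 2, n ** 2),
        "Level 3 (C := alpha*A*B + beta*C)": (2 * n ** 3, 4 * n ** 2),
    }
    return {name: Fraction(flops, refs) for name, (flops, refs) in counts.items()}


for name, ratio in flops_per_memory_reference(100).items():
    print(f"{name}: {float(ratio):.2f} flops per memory reference")
```

Only the Level&nbsp;3 ratio grows with <I>n</I>, which is why data brought
into the upper levels of the memory hierarchy can be reused many times
there.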
<P>
The previous
reasoning applies to
distributed-memory
computational
environments in two
ways. First, in order
to achieve overall
high performance,
it is necessary to
express the bulk
of the computation
local to each node
in terms of Level&nbsp;3
BLAS operations.
Second, designing
and developing
a set of parallel
BLAS (PBLAS)<A NAME="3617">&#160;</A> for
distributed-memory
 computers
should lead to an 
efficient and 
straightforward
port of the
LAPACK software.
This is the path
followed by the
ScaLAPACK initiative
[<A HREF="node189.html#lawn95">25</A>, <A HREF="node189.html#dongarra95a">53</A>]
as well as others
[<A HREF="node189.html#aboelaze91a">1</A>, <A HREF="node189.html#brent93a">21</A>, <A HREF="node189.html#chtchelkanova95a">30</A>, <A HREF="node189.html#falgout93a">63</A>].
As part of the
ScaLAPACK project,
a set of PBLAS
was designed
and developed
early on
[<A HREF="node189.html#choi94a">29</A>, <A HREF="node189.html#lawn100">26</A>].
<P>
<HR>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>