1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
|
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>The Two-dimensional Block-Cyclic Distribution</TITLE>
<META NAME="description" CONTENT="The Two-dimensional Block-Cyclic Distribution">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
<A NAME="tex2html3122" HREF="node76.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3120" HREF="node74.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3114" HREF="node74.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3124" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3125" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3123" HREF="node76.html">Local Storage Scheme and </A>
<B>Up:</B> <A NAME="tex2html3121" HREF="node74.html">In-core Dense Matrices</A>
<B> Previous:</B> <A NAME="tex2html3115" HREF="node74.html">In-core Dense Matrices</A>
<BR> <P>
<H2><A NAME="SECTION04431000000000000000">The Two-dimensional Block-Cyclic Distribution</A></H2>
<A NAME="sec2dbcd"> </A>
<A NAME="2356"> </A><A NAME="2357"> </A>
<P>
In this section, we consider the data
layout of dense matrices on distributed-memory
machines, with the goal of making dense
matrix computations as efficient as possible.
We shall discuss a sequence of data layouts,
starting with the most simple, obvious, and
inefficient one and working up to the
complicated but efficient ScaLAPACK
ultimately uses. Even though our
justification is based on Gaussian
elimination, analysis of many other
algorithms has led to the same set
of layouts. As a result, these
layouts have been standardized as
part of the High Performance Fortran
standard [<A HREF="node189.html#hpf">91</A>], with corresponding
data declarations as part of that language.
<P>
The two main issues in choosing a
data layout for dense matrix
computations are
<P>
<UL>
<LI> load balance, or splitting the work reasonably evenly
among the processors throughout the algorithm, and
<LI> use of the Level 3 BLAS during computations on a single
processor, to account for the memory hierarchy on each
processor.
</UL>
<P>
It will help to remember the pictorial
representation of Gaussian elimination
below. As the algorithm proceeds, it
works on successively smaller square
southeast corners of the matrix. In
addition, there is extra Level 2 BLAS
work to factorize the submatrix
<IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14557" SRC="img255.gif">.
<P>
<P><A NAME="2364"> </A><A NAME="figgauss"> </A><IMG WIDTH=250 HEIGHT=249 ALIGN=BOTTOM ALT="figure2362" SRC="img256.gif"><BR>
<STRONG>Figure 4.2:</STRONG> Gaussian elimination using Level 3 BLAS<BR>
<P>
<P>
For convenience we will number the
processes from 0 to <I>P</I>-1, and
matrix columns (or rows) from 1
to <I>N</I>. The following two figures
shows a sequence of data layouts
we will consider. In all cases,
each submatrix is labeled with
the number of the process (from
0 to 3) that contains it. Process
0 owns the shaded submatrices.
<P>
Consider the layout illustrated
on the left of figure <A HREF="node75.html#figbcol">4.3</A>,
the <B>one-dimensional block
column distribution</B>. This distribution
<P><A NAME="2373"> </A><A NAME="figbcol"> </A><IMG WIDTH=448 HEIGHT=160 ALIGN=BOTTOM ALT="figure2369" SRC="img257.gif"><BR>
<STRONG>Figure 4.3:</STRONG> The one-dimensional block and cyclic column distributions<BR>
<P>
assigns a block of contiguous columns
of a matrix to successive processes.
Each process receives only one block
of columns of the matrix. Column <I>k</I>
is stored on process <IMG WIDTH=41 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14569" SRC="img258.gif"> where <IMG WIDTH=86 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14571" SRC="img259.gif"> is the maximum number of
columns stored per process. In the
figure <I>N</I>=16 and <I>P</I>=4. This layout
does not permit good load balancing
for the above Gaussian elimination
algorithm because as soon as the
first <I>tc</I> columns are complete,
process 0 is idle for the rest
of the computation. The transpose
of this layout, the <B>one-dimensional
block row distribution</B>, has a similar
shortfall for dense computations.
<P>
The second layout illustrated on the right of
figure <A HREF="node75.html#figbcol">4.3</A>, the <B>one-dimensional
cyclic column distribution</B>, addressed this
problem by assigning column <I>k</I> to process
(<I>k</I>-1) mod <I>P</I>. In the figure, <I>N</I>=16 and <I>P</I>=4.
With this layout, each process owns approximately
<IMG WIDTH=41 HEIGHT=32 ALIGN=MIDDLE ALT="tex2html_wrap_inline14591" SRC="img260.gif"> of the square southeast corner of the
matrix, so the load balance is good. However,
since single columns (rather than blocks) are
stored, we cannot use the Level 2 BLAS to
factorize <IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14557" SRC="img255.gif"> and may not be
able to use the Level 3 BLAS to update
<IMG WIDTH=95 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14595" SRC="img261.gif">. The transpose of this
layout, the <B>one-dimensional cyclic
row distribution</B>, has a similar shortfall.
<P>
The third layout shown on the left of
figure <A HREF="node75.html#figbcycliccol">4.4</A>, the
<B>one-dimensional block-cyclic
column distribution</B>, is a compromise
between the distribution schemes shown
in figure <A HREF="node75.html#figbcol">4.3</A>. We choose a
block size <I>NB</I>, divide the columns
into groups of size NB, and distribute
these groups in a cyclic manner. This
means column <I>k</I> is stored in process
<IMG WIDTH=158 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14601" SRC="img262.gif">.
In fact, this layout includes the first
two as the special cases, <IMG WIDTH=139 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14603" SRC="img263.gif"> and <I>NB</I>=1, respectively.
In the figure <I>N</I>=16, <I>P</I>=4 and <I>NB</I>=2.
For <I>NB</I> larger than 1, this has a
slightly worse balance than the
one-dimensional cyclic column
distribution, but can use the
Level 2 BLAS and Level 3 BLAS
for the local computations.
For <I>NB</I> less than <I>tc</I>, it
has a better load balance than
the one-dimensional block column
distribution, but can call the BLAS
only on smaller subproblems. Hence,
it takes less advantage of the local
memory hierarchy. Moreover, this
layout has the disadvantage that
the factorization of <IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14557" SRC="img255.gif">
will take place on one process (in the
natural situation where column blocks
in the layout correspond to column
blocks in Gaussian elimination),
thereby representing a serial bottleneck.
<P>
<P><A NAME="2391"> </A><A NAME="figbcycliccol"> </A><IMG WIDTH=448 HEIGHT=159 ALIGN=BOTTOM ALT="figure2387" SRC="img264.gif"><BR>
<STRONG>Figure 4.4:</STRONG> The one-dimensional block-cyclic column- and the two-dimensional
block-cyclic distributions<BR>
<P>
<P>
This serial bottleneck is eased by the
fourth layout shown on the right of
figure <A HREF="node75.html#figbcycliccol">4.4</A>, the <B>
two-dimensional block cyclic distribution</B>.
Here, we think of our <I>P</I> processes arranged
in a <IMG WIDTH=17 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline12172" SRC="img24.gif"> <IMG WIDTH=9 HEIGHT=16 ALIGN=MIDDLE ALT="tex2html_wrap_inline12420" SRC="img42.gif"> <IMG WIDTH=17 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline12162" SRC="img23.gif"> rectangular
array of processes, indexed in a two-dimensional
fashion by <IMG WIDTH=50 HEIGHT=26 ALIGN=MIDDLE ALT="tex2html_wrap_inline14629" SRC="img265.gif">, with <IMG WIDTH=102 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14631" SRC="img266.gif">
and <IMG WIDTH=101 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14633" SRC="img267.gif">. All the processes
<IMG WIDTH=50 HEIGHT=26 ALIGN=MIDDLE ALT="tex2html_wrap_inline14629" SRC="img265.gif"> with a fixed <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14637" SRC="img268.gif"> are referred
to as process column <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14637" SRC="img268.gif">. All the processes
<IMG WIDTH=50 HEIGHT=26 ALIGN=MIDDLE ALT="tex2html_wrap_inline14629" SRC="img265.gif"> with a fixed <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14643" SRC="img269.gif"> are referred
to as process row <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14643" SRC="img269.gif">. Thus, this layout
includes all the previous layouts, and their
transposes, as special cases. In the figure,
<I>N</I>=16, <I>P</I>=4, <IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14651" SRC="img270.gif">, and <I>MB</I>=<I>NB</I>=2.
This layout permits <IMG WIDTH=17 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline12162" SRC="img23.gif">-fold parallelism
in any column, and calls to the Level 2 BLAS
and Level 3 BLAS on local subarrays. Finally,
this layout also features good scalability
properties as shown in [<A HREF="node189.html#dongarra92a">61</A>].
<P>
The two-dimensional block cyclic
distribution scheme is the data layout
that is used in the ScaLAPACK library
for dense matrix computations.
<P>
<HR><A NAME="tex2html3122" HREF="node76.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3120" HREF="node74.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3114" HREF="node74.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3124" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3125" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3123" HREF="node76.html">Local Storage Scheme and </A>
<B>Up:</B> <A NAME="tex2html3121" HREF="node74.html">In-core Dense Matrices</A>
<B> Previous:</B> <A NAME="tex2html3115" HREF="node74.html">In-core Dense Matrices</A>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>
|