File: node75.html

package info (click to toggle)
scalapack-doc 1.5-11
  • links: PTS
  • area: main
  • in suites: bullseye, buster, stretch
  • size: 10,336 kB
  • ctags: 4,931
  • sloc: makefile: 47; sh: 18
file content (192 lines) | stat: -rw-r--r-- 11,040 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<!--Converted with LaTeX2HTML 96.1-h (September 30, 1996) by Nikos Drakos (nikos@cbl.leeds.ac.uk), CBLU, University of Leeds -->
<HTML>
<HEAD>
<TITLE>The Two-dimensional Block-Cyclic Distribution</TITLE>
<META NAME="description" CONTENT="The Two-dimensional Block-Cyclic Distribution">
<META NAME="keywords" CONTENT="slug">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<LINK REL=STYLESHEET HREF="slug.css">
</HEAD>
<BODY LANG="EN" >
 <A NAME="tex2html3122" HREF="node76.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3120" HREF="node74.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3114" HREF="node74.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3124" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3125" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3123" HREF="node76.html">Local Storage Scheme and </A>
<B>Up:</B> <A NAME="tex2html3121" HREF="node74.html">In-core Dense Matrices</A>
<B> Previous:</B> <A NAME="tex2html3115" HREF="node74.html">In-core Dense Matrices</A>
<BR> <P>
<H2><A NAME="SECTION04431000000000000000">The Two-dimensional Block-Cyclic Distribution</A></H2>
 <A NAME="sec2dbcd">&#160;</A>
<A NAME="2356">&#160;</A><A NAME="2357">&#160;</A>
<P>
In this section, we consider the data
layout of dense matrices on distributed-memory 
machines, with the goal of making dense 
matrix computations as efficient as possible.
We shall discuss a sequence of data layouts,
starting with the most simple, obvious, and
inefficient one and working up to the 
complicated but efficient ScaLAPACK 
ultimately uses. Even though our 
justification is based on Gaussian
elimination, analysis of many other
algorithms has led to the same set
of layouts.  As a result, these
layouts have been standardized as
part of the High Performance Fortran
standard&nbsp;[<A HREF="node189.html#hpf">91</A>], with corresponding
data declarations as part of that language.
<P>
The two main issues in choosing a 
data layout for dense matrix 
computations are
<P>
<UL>
<LI> load balance, or splitting the work reasonably evenly
       among the processors throughout the algorithm, and
<LI> use of the Level 3 BLAS during computations on a single
       processor, to account for the memory hierarchy on each
       processor. 
</UL>
<P>
It will help to remember the pictorial 
representation of Gaussian elimination
below. As the algorithm proceeds, it
works on successively smaller square
southeast corners of the matrix. In 
addition, there is extra Level&nbsp;2 BLAS
work to factorize the submatrix
<IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14557" SRC="img255.gif">.
<P>
<P><A NAME="2364">&#160;</A><A NAME="figgauss">&#160;</A><IMG WIDTH=250 HEIGHT=249 ALIGN=BOTTOM ALT="figure2362" SRC="img256.gif"><BR>
<STRONG>Figure 4.2:</STRONG> Gaussian elimination using Level 3 BLAS<BR>
<P>
<P>
For convenience we will number the
processes from 0 to <I>P</I>-1, and 
matrix columns (or rows) from 1
to <I>N</I>. The following two figures
shows a sequence of data layouts 
we will consider. In all cases, 
each submatrix is labeled with 
the number of the process (from
0 to 3) that contains it. Process
0 owns the shaded submatrices.
<P>
Consider the layout illustrated
on the left of figure&nbsp;<A HREF="node75.html#figbcol">4.3</A>,
the <B>one-dimensional block 
column distribution</B>. This distribution
<P><A NAME="2373">&#160;</A><A NAME="figbcol">&#160;</A><IMG WIDTH=448 HEIGHT=160 ALIGN=BOTTOM ALT="figure2369" SRC="img257.gif"><BR>
<STRONG>Figure 4.3:</STRONG> The one-dimensional block and cyclic column distributions<BR>
<P>
assigns a block of contiguous columns
of a matrix to successive processes.
Each process receives only one block
of columns of the matrix. Column <I>k</I>
is stored on process <IMG WIDTH=41 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14569" SRC="img258.gif"> where <IMG WIDTH=86 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14571" SRC="img259.gif"> is the maximum number of 
columns stored per process. In the
figure <I>N</I>=16 and <I>P</I>=4. This layout
does not permit good load balancing 
for the above Gaussian elimination
algorithm because as soon as the 
first <I>tc</I> columns are complete,
process 0 is idle for the rest
of the computation. The transpose
of this layout, the <B>one-dimensional
block row distribution</B>, has a similar
shortfall for dense computations.
<P>
The second layout illustrated on the right of 
figure&nbsp;<A HREF="node75.html#figbcol">4.3</A>, the <B>one-dimensional
cyclic column distribution</B>, addressed this 
problem by assigning column <I>k</I> to process
(<I>k</I>-1) mod <I>P</I>. In the figure, <I>N</I>=16 and <I>P</I>=4.
With this layout, each process owns approximately
<IMG WIDTH=41 HEIGHT=32 ALIGN=MIDDLE ALT="tex2html_wrap_inline14591" SRC="img260.gif"> of the square southeast corner of the
matrix, so the load balance is good. However,
since single columns (rather than blocks) are
stored, we cannot use the Level&nbsp;2 BLAS to 
factorize <IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14557" SRC="img255.gif"> and may not be
able to use the Level&nbsp;3 BLAS to update 
<IMG WIDTH=95 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14595" SRC="img261.gif">. The transpose of this
layout, the <B>one-dimensional cyclic
row distribution</B>, has a similar shortfall.
<P>
The third layout shown on the left of 
figure&nbsp;<A HREF="node75.html#figbcycliccol">4.4</A>, the
<B>one-dimensional block-cyclic
column distribution</B>, is a compromise
between the distribution schemes shown
in figure&nbsp;<A HREF="node75.html#figbcol">4.3</A>. We choose a 
block size <I>NB</I>, divide the columns 
into groups of size NB, and distribute
these groups in a cyclic manner. This
means column <I>k</I> is stored in process
<IMG WIDTH=158 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14601" SRC="img262.gif">.
In fact, this layout includes the first
two as the special cases, <IMG WIDTH=139 HEIGHT=27 ALIGN=MIDDLE ALT="tex2html_wrap_inline14603" SRC="img263.gif"> and <I>NB</I>=1, respectively.
In the figure <I>N</I>=16, <I>P</I>=4 and <I>NB</I>=2.
For <I>NB</I> larger than 1, this has a 
slightly worse balance than the 
one-dimensional cyclic column 
distribution, but can use the
Level&nbsp;2 BLAS and Level&nbsp;3 BLAS
for the local computations.
For <I>NB</I> less than <I>tc</I>, it 
has a better load balance than
the one-dimensional block column
distribution, but can call the BLAS 
only on smaller subproblems. Hence,
it takes less advantage of the local
memory hierarchy.  Moreover, this
layout has the disadvantage that 
the factorization of <IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14557" SRC="img255.gif">
will take place on one process (in the 
natural situation where column blocks
in the layout correspond to column
blocks in Gaussian elimination), 
thereby representing a serial bottleneck.
<P>
<P><A NAME="2391">&#160;</A><A NAME="figbcycliccol">&#160;</A><IMG WIDTH=448 HEIGHT=159 ALIGN=BOTTOM ALT="figure2387" SRC="img264.gif"><BR>
<STRONG>Figure 4.4:</STRONG> The one-dimensional block-cyclic column- and the two-dimensional
          block-cyclic distributions<BR>
<P>
<P>
This serial bottleneck is eased by the 
fourth layout shown on the right of
figure <A HREF="node75.html#figbcycliccol">4.4</A>, the <B>
two-dimensional block cyclic distribution</B>.
Here, we think of our <I>P</I> processes arranged
in a <IMG WIDTH=17 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline12172" SRC="img24.gif"> <IMG WIDTH=9 HEIGHT=16 ALIGN=MIDDLE ALT="tex2html_wrap_inline12420" SRC="img42.gif"> <IMG WIDTH=17 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline12162" SRC="img23.gif"> rectangular
array of processes, indexed in a two-dimensional
fashion by <IMG WIDTH=50 HEIGHT=26 ALIGN=MIDDLE ALT="tex2html_wrap_inline14629" SRC="img265.gif">, with <IMG WIDTH=102 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14631" SRC="img266.gif">
and <IMG WIDTH=101 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14633" SRC="img267.gif">. All the processes
<IMG WIDTH=50 HEIGHT=26 ALIGN=MIDDLE ALT="tex2html_wrap_inline14629" SRC="img265.gif"> with a fixed <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14637" SRC="img268.gif"> are referred
to as process column <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14637" SRC="img268.gif">. All the processes
<IMG WIDTH=50 HEIGHT=26 ALIGN=MIDDLE ALT="tex2html_wrap_inline14629" SRC="img265.gif"> with a fixed <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14643" SRC="img269.gif"> are referred
to as process row <IMG WIDTH=16 HEIGHT=17 ALIGN=MIDDLE ALT="tex2html_wrap_inline14643" SRC="img269.gif">.  Thus, this layout
includes all the previous layouts, and their
transposes, as special cases. In the figure,
<I>N</I>=16, <I>P</I>=4, <IMG WIDTH=91 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline14651" SRC="img270.gif">, and <I>MB</I>=<I>NB</I>=2.
This layout permits <IMG WIDTH=17 HEIGHT=25 ALIGN=MIDDLE ALT="tex2html_wrap_inline12162" SRC="img23.gif">-fold parallelism
in any column, and calls to the Level&nbsp;2 BLAS
and Level&nbsp;3 BLAS on local subarrays. Finally,
this layout also features good scalability
properties as shown in&nbsp;[<A HREF="node189.html#dongarra92a">61</A>].
<P>
The two-dimensional block cyclic
distribution scheme is the data layout
that is used in the ScaLAPACK library
for dense matrix computations.
<P>
<HR><A NAME="tex2html3122" HREF="node76.html"><IMG WIDTH=37 HEIGHT=24 ALIGN=BOTTOM ALT="next" SRC="http://www.netlib.org/utk/icons/next_motif.gif"></A> <A NAME="tex2html3120" HREF="node74.html"><IMG WIDTH=26 HEIGHT=24 ALIGN=BOTTOM ALT="up" SRC="http://www.netlib.org/utk/icons/up_motif.gif"></A> <A NAME="tex2html3114" HREF="node74.html"><IMG WIDTH=63 HEIGHT=24 ALIGN=BOTTOM ALT="previous" SRC="http://www.netlib.org/utk/icons/previous_motif.gif"></A> <A NAME="tex2html3124" HREF="node1.html"><IMG WIDTH=65 HEIGHT=24 ALIGN=BOTTOM ALT="contents" SRC="http://www.netlib.org/utk/icons/contents_motif.gif"></A> <A NAME="tex2html3125" HREF="node190.html"><IMG WIDTH=43 HEIGHT=24 ALIGN=BOTTOM ALT="index" SRC="http://www.netlib.org/utk/icons/index_motif.gif"></A> <BR>
<B> Next:</B> <A NAME="tex2html3123" HREF="node76.html">Local Storage Scheme and </A>
<B>Up:</B> <A NAME="tex2html3121" HREF="node74.html">In-core Dense Matrices</A>
<B> Previous:</B> <A NAME="tex2html3115" HREF="node74.html">In-core Dense Matrices</A>
<P><ADDRESS>
<I>Susan Blackford <BR>
Tue May 13 09:21:01 EDT 1997</I>
</ADDRESS>
</BODY>
</HTML>