<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!--Converted with LaTeX2HTML 98.2 beta6 (August 14th, 1998)
original version by:  Nikos Drakos, CBLU, University of Leeds
* revised and updated by:  Marcus Hennecke, Ross Moore, Herb Swan
* with significant contributions from:
  Jens Lippmann, Marek Rouchal, Martin Wilck and others -->
<HTML>
<HEAD>
<TITLE>Block Algorithms and their
Derivation</TITLE>
<META NAME="description" CONTENT="Block Algorithms and their
Derivation">
<META NAME="keywords" CONTENT="lug_l2h">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<LINK REL="STYLESHEET" HREF="lug_l2h.css">
<LINK REL="next" HREF="node67.html">
<LINK REL="previous" HREF="node65.html">
<LINK REL="up" HREF="node60.html">
</HEAD>
<BODY >
<!--Navigation Panel-->
<A NAME="tex2html5084"
 HREF="node67.html">
<IMG WIDTH="37" HEIGHT="24" ALIGN="BOTTOM" BORDER="0" ALT="next"
 SRC="next_motif.png"></A> 
<A NAME="tex2html5078"
 HREF="node60.html">
<IMG WIDTH="26" HEIGHT="24" ALIGN="BOTTOM" BORDER="0" ALT="up"
 SRC="up_motif.png"></A> 
<A NAME="tex2html5072"
 HREF="node65.html">
<IMG WIDTH="63" HEIGHT="24" ALIGN="BOTTOM" BORDER="0" ALT="previous"
 SRC="previous_motif.png"></A> 
<A NAME="tex2html5080"
 HREF="node1.html">
<IMG WIDTH="65" HEIGHT="24" ALIGN="BOTTOM" BORDER="0" ALT="contents"
 SRC="contents_motif.png"></A> 
<A NAME="tex2html5082"
 HREF="node152.html">
<IMG WIDTH="43" HEIGHT="24" ALIGN="BOTTOM" BORDER="0" ALT="index"
 SRC="index_motif.png"></A> 
<BR>
<B> Next:</B> <A NAME="tex2html5085"
 HREF="node67.html">Examples of Block Algorithms</A>
<B> Up:</B> <A NAME="tex2html5079"
 HREF="node60.html">Performance of LAPACK</A>
<B> Previous:</B> <A NAME="tex2html5073"
 HREF="node65.html">The BLAS as the</A>
 &nbsp; <B>  <A NAME="tex2html5081"
 HREF="node1.html">Contents</A></B> 
 &nbsp; <B>  <A NAME="tex2html5083"
 HREF="node152.html">Index</A></B> 
<BR>
<BR>
<!--End of Navigation Panel-->

<H1><A NAME="SECTION03330000000000000000"></A><A NAME="secblockalg"></A><A NAME="7631"></A>
<BR>
Block Algorithms and their
Derivation
</H1>

<P>
It is comparatively straightforward to recode many of the algorithms in
LINPACK and EISPACK so that they call Level 2 BLAS<A NAME="7632"></A>.
Indeed, in the simplest
cases the same floating-point operations are performed, possibly even in
the same order: it is just a matter of reorganizing the software. To
illustrate this point we derive the Cholesky factorization algorithm that is
used in the LINPACK<A NAME="7633"></A> routine SPOFA<A NAME="7634"></A>, which
factorizes a symmetric positive definite matrix
as <B><I>A</I> = <I>U</I><SUP><I>T</I></SUP> <I>U</I></B>, where <B><I>U</I></B> is upper triangular.
Writing this equation in partitioned form as:

<P>
<BR><P></P>
<DIV ALIGN="CENTER">

<!-- MATH
 \begin{displaymath}
\left( \begin{array}{ccc}
A_{11}   & a_j     & A_{13}       \\
.        & a_{jj}  & \alpha_{j}^T \\
.        & .       & A_{33}       \\
\end{array} \right) =
\left( \begin{array}{ccc}
U_{11}^T & 0       & 0            \\
u_{j}^T  & u_{jj}  & 0            \\
U_{13}^T & \mu_j   & U_{33}^T     \\
\end{array} \right)
\left( \begin{array}{ccc}
U_{11}   & u_j     & U_{13}       \\
0        & u_{jj}  & \mu_{j}^T    \\
0        & 0       & U_{33}   \\
\end{array} \right)
\end{displaymath}
 -->


<IMG
 WIDTH="473" HEIGHT="73" BORDER="0"
 SRC="img214.png"
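 ALT="\begin{displaymath}
\left( \begin{array}{ccc}
A_{11} &amp; a_j &amp; A_{13} \\
. &amp; a_{jj} &amp; \alpha_{j}^T \\
. &amp; . &amp; A_{33} \\
\end{array} \right) =
\left( \begin{array}{ccc}
U_{11}^T &amp; 0 &amp; 0 \\
u_{j}^T &amp; u_{jj} &amp; 0 \\
U_{13}^T &amp; \mu_j &amp; U_{33}^T \\
\end{array} \right)
\left( \begin{array}{ccc}
U_{11} &amp; u_j &amp; U_{13} \\
0 &amp; u_{jj} &amp; \mu_{j}^T \\
0 &amp; 0 &amp; U_{33} \\
\end{array} \right)
\end{displaymath}">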
</DIV>
<BR CLEAR="ALL">
<P></P>

<P>
and equating coefficients of the <B><I>j</I><SUP><I>th</I></SUP></B> column, we obtain:
<BR><P></P>
<DIV ALIGN="CENTER">
<IMG
 WIDTH="154" HEIGHT="60" BORDER="0"
 SRC="img215.png"
 ALT="\begin{eqnarray*}
a_j &amp; = &amp; U_{11}^T u_j \\
a_{jj} &amp; = &amp; u_{j}^T u_j + u_{jj}^2.
\end{eqnarray*}">
</DIV><P></P>
<BR CLEAR="ALL">

<P>
Hence, if <B><I>U</I><SUB>11</SUB></B> has already been computed, we can compute <B><I>u</I><SUB><I>j</I></SUB></B> and
<B><I>u</I><SUB><I>jj</I></SUB></B>
from the equations:
<BR><P></P>
<DIV ALIGN="CENTER">
<IMG
 WIDTH="173" HEIGHT="60" BORDER="0"
 SRC="img216.png"
 ALT="\begin{eqnarray*}
U_{11}^T u_j &amp; = &amp; a_j \\
u_{jj}^2 &amp; = &amp; a_{jj} - u_{j}^T u_j.
\end{eqnarray*}">
</DIV><P></P>
<BR CLEAR="ALL">

<P>
Here is the body of the code of the LINPACK routine SPOFA<A NAME="7674"></A>,
which implements the above method:

<P>
<PRE>
         DO 30 J = 1, N
            INFO = J
            S = 0.0E0
            JM1 = J - 1
            IF (JM1 .LT. 1) GO TO 20
            DO 10 K = 1, JM1
               T = A(K,J) - SDOT(K-1,A(1,K),1,A(1,J),1)
               T = T/A(K,K)
               A(K,J) = T
               S = S + T*T
   10       CONTINUE
   20       CONTINUE
            S = A(J,J) - S
C     ......EXIT
            IF (S .LE. 0.0E0) GO TO 40
            A(J,J) = SQRT(S)
   30    CONTINUE
</PRE>

<P>
And here is the same computation recoded in "LAPACK-style"
to use the Level 2 BLAS<A NAME="7677"></A> routine
STRSV (which solves a triangular system of equations). The call to STRSV
has replaced the loop over K, which made several calls to the
Level 1 BLAS routine SDOT. (For reasons given below, this is not the actual
code used in LAPACK, hence the term "LAPACK-style".)

<P>
<PRE>
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-unit', J-1, A, LDA,
     $               A(1,J), 1 )
         S = A(J,J) - SDOT( J-1, A(1,J), 1, A(1,J), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A(J,J) = SQRT( S )
   10 CONTINUE
</PRE>
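
<P>
The reorganization can also be tried out away from Fortran. The following is
a minimal sketch in Python, not code from LINPACK or LAPACK: it assumes a dense
symmetric positive definite matrix stored as a list of rows, and the function
names are hypothetical. The first routine mirrors the dot-product loop over K,
the second replaces that loop by one triangular solve per column, and both
should produce the same factor <B><I>U</I></B>.

<PRE>
```python
import math


def cholesky_dot_style(a):
    """Dot-product style: each entry of column j comes from one dot product."""
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = 0.0
        for k in range(j):
            # t = a(k,j) - u(0:k,k) . u(0:k,j), then divide by the pivot u(k,k)
            t = a[k][j] - sum(u[i][k] * u[i][j] for i in range(k))
            t = t / u[k][k]
            u[k][j] = t
            s += t * t
        s = a[j][j] - s
        if not s > 0.0:
            raise ValueError("matrix is not positive definite")
        u[j][j] = math.sqrt(s)
    return u


def solve_upper_transposed(u, b, m):
    """Solve U11^T x = b, where U11 is the leading m-by-m block of u."""
    x = list(b[:m])
    for k in range(m):
        # Column-oriented forward substitution on the lower triangle U11^T.
        x[k] = x[k] / u[k][k]
        for i in range(k + 1, m):
            x[i] -= u[k][i] * x[k]
    return x


def cholesky_solve_style(a):
    """Triangular-solve style: one solve per column replaces the loop over k."""
    n = len(a)
    u = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col = solve_upper_transposed(u, [a[i][j] for i in range(j)], j)
        for i in range(j):
            u[i][j] = col[i]
        s = a[j][j] - sum(c * c for c in col)
        if not s > 0.0:
            raise ValueError("matrix is not positive definite")
        u[j][j] = math.sqrt(s)
    return u
```
</PRE>

<P>
For the 3-by-3 matrix with rows (4, 12, -16), (12, 37, -43), (-16, -43, 98),
both routines return the factor with rows (2, 6, -8), (0, 1, 5), (0, 0, 3).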

<P>
This change by itself is sufficient to make big gains in performance
on machines like the CRAY C-90.

<P>
On many machines, however, such as an IBM RISC Sys/6000-550 (using double
precision),
there is virtually no difference in performance between
the LINPACK-style and the LAPACK-style Level 2 BLAS code.
Both styles run at a megaflop rate far below the machine's peak performance for
matrix-matrix multiplication.
To exploit the faster speed of Level 3 BLAS<A NAME="7680"></A>, the
algorithms must undergo a deeper level of restructuring and be recast as a
<B>block algorithm</B>, that is, an algorithm that operates on <B>blocks</B>
or submatrices of the original matrix.

<P>
To derive a block form of Cholesky
factorization<A NAME="7683"></A>, we write the
defining equation in partitioned form thus:
<BR><P></P>
<DIV ALIGN="CENTER">

<!-- MATH
 \begin{displaymath}
\left( \begin{array}{ccc}
A_{11} & A_{12} & A_{13}\\
.      & A_{22} & A_{23}\\
.      & .      & A_{33}\\
\end{array} \right) =
\left( \begin{array}{ccc}
U_{11}^T & 0 & 0\\
U_{12}^T & U_{22}^T & 0\\
U_{13}^T & U_{23}^T & U_{33}^T\\
\end{array} \right)
\left( \begin{array}{ccc}
U_{11} & U_{12} & U_{13}\\
0 & U_{22} & U_{23}\\
0 & 0 & U_{33}\\
\end{array} \right).
\end{displaymath}
 -->


<IMG
 WIDTH="495" HEIGHT="73" BORDER="0"
 SRC="img217.png"
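 ALT="\begin{displaymath}
\left( \begin{array}{ccc}
A_{11} &amp; A_{12} &amp; A_{13}\\
. &amp; A_{22} &amp; A_{23}\\
. &amp; . &amp; A_{33}\\
\end{array} \right) =
\left( \begin{array}{ccc}
U_{11}^T &amp; 0 &amp; 0\\
U_{12}^T &amp; U_{22}^T &amp; 0\\
U_{13}^T &amp; U_{23}^T &amp; U_{33}^T\\
\end{array} \right)
\left( \begin{array}{ccc}
U_{11} &amp; U_{12} &amp; U_{13}\\
0 &amp; U_{22} &amp; U_{23}\\
0 &amp; 0 &amp; U_{33}\\
\end{array} \right).
\end{displaymath}">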
</DIV>
<BR CLEAR="ALL">
<P></P>

<P>
Equating submatrices in the second block of columns, we obtain:
<BR><P></P>
<DIV ALIGN="CENTER">
<IMG
 WIDTH="203" HEIGHT="58" BORDER="0"
 SRC="img218.png"
 ALT="\begin{eqnarray*}
A_{12} &amp; = &amp; U_{11}^T U_{12} \\
A_{22} &amp; = &amp; U_{12}^T U_{12} + U_{22}^T U_{22}.
\end{eqnarray*}">
</DIV><P></P>
<BR CLEAR="ALL">

<P>
Hence, if <B><I>U</I><SUB>11</SUB></B> has already been computed, we can compute <B><I>U</I><SUB>12</SUB></B> as
the solution to the equation
<B>
<I>U</I><SUB>11</SUB><SUP><I>T</I></SUP> <I>U</I><SUB>12</SUB> = <I>A</I><SUB>12</SUB>
</B>
<BR CLEAR="ALL"><P></P>
by a call to the Level 3 BLAS routine STRSM; and then we can compute
<B><I>U</I><SUB>22</SUB></B>
from
<B>
<I>U</I><SUB>22</SUB><SUP><I>T</I></SUP> <I>U</I><SUB>22</SUB> = <I>A</I><SUB>22</SUB> - <I>U</I><SUB>12</SUB><SUP><I>T</I></SUP> <I>U</I><SUB>12</SUB>.
</B>
<BR CLEAR="ALL"><P></P>
This involves first updating the symmetric submatrix <B><I>A</I><SUB>22</SUB></B> by a call to
the
Level 3 BLAS routine SSYRK, and then computing its Cholesky factorization.
Since Fortran 77 does not allow recursion, a separate routine must be called
(using Level 2 BLAS rather than Level 3), named SPOTF2 in the code below.
In this way successive blocks of columns of <B><I>U</I></B> are computed.
Here is LAPACK-style code for the block algorithm. In this code fragment
<TT>NB</TT> denotes the width<A NAME="7738"></A> of the blocks.

<P>
<PRE>
      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-unit', J-1, JB,
     $               ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ), LDA,
     $               ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
</PRE>
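
<P>
To make the three steps of each iteration concrete, here is a hedged sketch of
the same block algorithm in Python. It is an illustration under stated
assumptions, not LAPACK code: the matrix is a list of rows, the function name
<TT>block_cholesky</TT> is hypothetical, and plain loops stand in for the calls
to STRSM, SSYRK and SPOTF2.

<PRE>
```python
import math


def block_cholesky(a, nb):
    """Return upper triangular U with U^T U = a, computed nb columns at a time."""
    n = len(a)
    # Work on a copy of the upper triangle; the lower triangle is not referenced.
    u = [[float(a[i][j]) if j >= i else 0.0 for j in range(n)] for i in range(n)]
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        # STRSM step: solve U11^T U12 = A12, one column of the block at a time.
        for c in range(j, j + jb):
            for k in range(j):
                u[k][c] /= u[k][k]
                for i in range(k + 1, j):
                    u[i][c] -= u[k][i] * u[k][c]
        # SSYRK step: A22 := A22 - U12^T U12 (upper triangle only).
        for r in range(j, j + jb):
            for c in range(r, j + jb):
                u[r][c] -= sum(u[k][r] * u[k][c] for k in range(j))
        # SPOTF2 step: unblocked factorization of the updated diagonal block.
        for c in range(j, j + jb):
            for k in range(j, c):
                t = u[k][c] - sum(u[i][k] * u[i][c] for i in range(j, k))
                u[k][c] = t / u[k][k]
            s = u[c][c] - sum(u[k][c] ** 2 for k in range(j, c))
            if not s > 0.0:
                raise ValueError("matrix is not positive definite")
            u[c][c] = math.sqrt(s)
    return u
```
</PRE>

<P>
In exact arithmetic the factor does not depend on the block width, so calling
the sketch with different values of <TT>nb</TT> (including one larger than the
matrix order, which reduces it to the unblocked method) is a simple consistency
check.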

<P>
But that is not the end of the story, and the code given above is
not the code that is actually used in the LAPACK routine
SPOTRF<A NAME="7741"></A>.
We mentioned in subsection&nbsp;<A HREF="node62.html#subsecvectorize">3.1.1</A> that for many
linear algebra computations there
are several vectorizable variants, often referred to as <B><I>i</I></B>-, <B><I>j</I></B>- and
<B><I>k</I></B>-variants, according to a convention introduced in [<A
 HREF="node151.html#Dongarra84a">45</A>]
and used
in [<A
 HREF="node151.html#GVL2">55</A>]. The same is true of the corresponding block algorithms.

<P>
It turns out that the <B><I>j</I></B>-variant
that was chosen for LINPACK, and used in the above
examples, is not the fastest on many machines, because it is based on
solving triangular
systems of equations, which can be significantly slower than matrix-matrix
multiplication.
The variant actually used in LAPACK is the <B><I>i</I></B>-variant, which does
rely on matrix-matrix multiplication.
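
<P>
As a rough illustration of a row-oriented arrangement, the Python sketch below
computes one block row of <B><I>U</I></B> per step, so that most of the
arithmetic sits in a matrix-matrix product with the rows already computed. It
is an assumption-laden sketch, not the code of SPOTRF: the function name
<TT>block_cholesky_by_rows</TT> and the loop organization are hypothetical, and
the matrix is a list of rows.

<PRE>
```python
import math


def block_cholesky_by_rows(a, nb):
    """Return upper triangular U with U^T U = a, one block row at a time."""
    n = len(a)
    # Work on a copy of the upper triangle; the lower triangle is not referenced.
    u = [[float(a[i][j]) if j >= i else 0.0 for j in range(n)] for i in range(n)]
    for j in range(0, n, nb):
        jb = min(nb, n - j)
        rows = range(j, j + jb)
        # Matrix-matrix product: subtract the contribution of the rows above
        # from the diagonal block and from the whole block row to its right.
        for r in rows:
            for c in range(r, n):
                u[r][c] -= sum(u[k][r] * u[k][c] for k in range(j))
        # Factor the diagonal block and finish the block row; the small
        # triangular solve is folded into the same elimination loop.
        for r in rows:
            if not u[r][r] > 0.0:
                raise ValueError("matrix is not positive definite")
            u[r][r] = math.sqrt(u[r][r])
            for c in range(r + 1, n):
                u[r][c] /= u[r][r]
            for r2 in range(r + 1, j + jb):
                for c in range(r2, n):
                    u[r2][c] -= u[r][r2] * u[r][c]
    return u
```
</PRE>

<P>
Here the only triangular solve involves the small diagonal block, while the
update of the block row is a product of two rectangular blocks.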

<P>
<HR>
<ADDRESS>
<I>Susan Blackford</I>
<BR><I>1999-10-01</I>
</ADDRESS>
</BODY>
</HTML>