File: algorithmDesign.tex

\par
\chapter{Algorithm Design}
\label{chapter:algorithmDesign}
\par
Let us begin with a very quick description of the sparse
factorizations we will use.
We assume that the matrix $A$ has symmetric structure; more
precisely, if $a_{i,j} \ne 0$, then we treat $a_{j,i}$ as nonzero as well.
\par
Let us ignore pivoting for a moment.
The matrix will be factored as $A = (L+I)D(I+U)$.
The rows and columns of $A$ are partitioned into {\it fronts}.
We use the upper case Roman letters $I$ and $J$ to refer to
index sets for a front.
We use $\bnd{J}$ to represent the {\it boundary} of front $J$,
namely those rows and columns $k \notin J$
such that $l_{k,j} \ne 0$ or $u_{j,k} \ne 0$ for some $j \in J$.
\par
There are two steps to compute the entries in a front.
The first is to form the temporary matrix
\begin{equation}
\label{alg:eqn:1}
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & T_{\bnd{J},\bnd{J}}
\end{array} \right \rbrack
=
\left \lbrack \begin{array}{cc}
A_{J,J} & A_{J, \bnd{J}} \\
A_{\bnd{J},J} & 0
\end{array} \right \rbrack
-
\left \lbrack \begin{array}{c}
L_{J,*} \\
L_{\bnd{J},*}
\end{array} \right \rbrack
D_{*,*} 
\left \lbrack \begin{array}{cc}
U_{*,J} & U_{*,\bnd{J}}
\end{array} \right \rbrack.
\end{equation}
The $*$ subscript means all rows and columns that precede $J$.
We think of the temporary matrix on the left
as {\it fully assembled}, i.e.,
the original entries are present plus all updates from rows and
columns that precede the front.
\par
The second step is to factor the front as
\begin{equation}
\label{alg:eqn:2}
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & T_{\bnd{J},\bnd{J}}
\end{array} \right \rbrack
=
\left \lbrack \begin{array}{cc}
L_{J,J} + I & 0 \\
L_{\bnd{J},J} & I
\end{array} \right \rbrack
\left \lbrack \begin{array}{cc}
D_{J,J} & 0 \\
0 & H_{\bnd{J},\bnd{J}}^J
\end{array} \right \rbrack
\left \lbrack \begin{array}{cc}
I + U_{J,J} & U_{J,\bnd{J}} \\
0 & I
\end{array} \right \rbrack.
\end{equation}
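\par
To make the second step concrete, here is a minimal dense sketch of
equation~(\ref{alg:eqn:2}) in Python with NumPy (an illustration
only; {\bf SPOOLES} itself is written in C): given the fully
assembled frontal matrix $T$, it splits off the $L$, $D$ and $U$
pieces and forms the update matrix $H^J_{\bnd{J},\bnd{J}}$.
Pivoting is ignored, and none of these names are {\bf SPOOLES}
routines.
\begin{verbatim}
import numpy as np

def ldu(TJJ):
    """TJJ = (L + I) D (I + U) with unit triangular factors;
    assumes no pivoting is needed (illustration only)."""
    n = TJJ.shape[0]
    A = TJJ.astype(float)   # work on a copy
    L, U, d = np.zeros((n, n)), np.zeros((n, n)), np.zeros(n)
    for k in range(n):
        d[k] = A[k, k]
        L[k+1:, k] = A[k+1:, k] / d[k]
        U[k, k+1:] = A[k, k+1:] / d[k]
        A[k+1:, k+1:] -= d[k] * np.outer(L[k+1:, k], U[k, k+1:])
    return L, d, U

def factor_front(T, nJ):
    """T: assembled frontal matrix, front J first, boundary last."""
    TJJ, TJB = T[:nJ, :nJ], T[:nJ, nJ:]
    TBJ, TBB = T[nJ:, :nJ], T[nJ:, nJ:]
    LJJ, d, UJJ = ldu(TJJ)
    # T_{J,bnd} = (L+I) D U_{J,bnd} and T_{bnd,J} = L_{bnd,J} D (I+U)
    UJB = np.linalg.solve(LJJ + np.eye(nJ), TJB) / d[:, None]
    LBJ = np.linalg.solve((UJJ + np.eye(nJ)).T, TBJ.T).T / d[None, :]
    H = TBB - LBJ @ np.diag(d) @ UJB   # the update matrix H^J
    return LJJ, d, UJJ, LBJ, UJB, H
\end{verbatim}
\par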
\par
In equation~(\ref{alg:eqn:1}) there will likely be many columns $i$ such
that $L_{J,i}$ and $U_{i,J}$ are zero, in which case they need not
take part in the computation.
Since all preceding rows and columns are found in fronts
themselves, we can rewrite equation~(\ref{alg:eqn:1}) as
\begin{equation}
\label{alg:eqn:3}
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & T_{\bnd{J},\bnd{J}}
\end{array} \right \rbrack
=
\left \lbrack \begin{array}{cc}
A_{J,J} & A_{J, \bnd{J}} \\
A_{\bnd{J},J} & 0
\end{array} \right \rbrack
-
\sum_{\bnd{I} \cap J \ne \emptyset}
\left \lbrack \begin{array}{c}
L_{J,I} \\
L_{\bnd{J},I}
\end{array} \right \rbrack
D_{I,I} 
\left \lbrack \begin{array}{cc}
U_{I,J} & U_{I,\bnd{J}}
\end{array} \right \rbrack.
\end{equation}
\par
The fronts can be grouped into a {\it front tree} using a {\it
parent} relation: $par(I) = J$ means that $J$ is the parent of $I$
in the front tree.
There is one special property that relates the boundaries of fronts
and the front tree: if $\bnd{I} \cap J \ne \emptyset$, then $I$ is
a descendent of $J$, or equivalently, $J$ is an ancestor of $I$.
Front $I$ need not update all of its ancestors, nor does front $J$
need to be updated by all of its descendents, but when an update
does occur, it is between two fronts that have an
ancestor-descendent relationship.
The front tree is defined by the parent relation:
the {\it parent} of $I$ is the front that contains the first row
and column in $\bnd{I}$; in other words, it is the closest ancestor
that $I$ supports.
(Of course, if $\bnd{I} = \emptyset$, then its parent does not
exist, and $I$ is a root of the tree, actually a forest if $A$ is
reducible.)
There is one important property that relates the boundary sets of a
front and its parent: 
if $par(I) = J$, then $\bnd{I} \subseteq J \cup \bnd{J}$.
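\par
The parent rule translates directly into code.
Here is a tiny sketch, with an illustrative vertex-to-front map
(not a {\bf SPOOLES} structure): the parent of $I$ is the front
owning the first index in $\bnd{I}$.
\begin{verbatim}
def parent(bndI, frontOf):
    """bndI: boundary indices of front I, in ascending order;
    frontOf: map from vertex to the front containing it."""
    if not bndI:
        return None           # empty boundary: I is a root
    return frontOf[bndI[0]]   # front with the first boundary index

# three fronts {0,1}, {2,3}, {4,5} numbered 0, 1, 2
frontOf = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
print(parent([2, 4], frontOf))   # -> 1
print(parent([], frontOf))       # -> None, a root of the forest
\end{verbatim}
\par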
\par
Using the front tree, let us describe the fronts that are
descendents of a front.
We use the ${\widehat {\ }}$ operator to denote a subtree.
${\widehat J}$, the subtree rooted at $J$,
is the set of fronts that contains $J$ and all of its descendents.
We can write the subtree relation in a recursive form.
$$
{\widehat J} = J \cup \bigcup_{par(I) = J} {\widehat I}
$$
Using these two relations, we can rewrite
equation~(\ref{alg:eqn:3}) as follows.
\begin{eqnarray}
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & T_{\bnd{J},\bnd{J}}
\end{array} \right \rbrack
& = &
\left \lbrack \begin{array}{cc}
A_{J,J} & A_{J, \bnd{J}} \\
A_{\bnd{J},J} & 0
\end{array} \right \rbrack
-
\sum_{par(I) = J}
\left \lbrack \begin{array}{c}
L_{J,{\widehat I}} \\
L_{\bnd{J},{\widehat I}}
\end{array} \right \rbrack
D_{{\widehat I},{\widehat I}}
\left \lbrack \begin{array}{cc}
U_{{\widehat I},J} & U_{{\widehat I},\bnd{J}}
\end{array} \right \rbrack \nonumber \\
\label{alg:eqn:4}
& = &
\left \lbrack \begin{array}{cc}
A_{J,J} & A_{J, \bnd{J}} \\
A_{\bnd{J},J} & 0
\end{array} \right \rbrack
-
\sum_{par(I) = J}
L_{\bnd{I}, {\widehat I}}
D_{{\widehat I}, {\widehat I}}
U_{{\widehat I}, \bnd{I}}
\end{eqnarray}
One more trick is necessary.
Recall equation~(\ref{alg:eqn:2}).
Once we have assembled the temporary matrix, we factor it into the
$L$, $D$ and $U$ portions plus an additional matrix
$H_{\bnd{J},\bnd{J}}^J$, which we call the update matrix.
Using the update matrices of the children, we can rewrite 
equation~(\ref{alg:eqn:4}) as follows.
\begin{equation}
\label{alg:eqn:mf}
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & T_{\bnd{J},\bnd{J}}
\end{array} \right \rbrack
= 
\left \lbrack \begin{array}{cc}
A_{J,J} & A_{J, \bnd{J}} \\
A_{\bnd{J},J} & 0
\end{array} \right \rbrack
- 
\sum_{par(I) = J}
H_{\bnd{I},\bnd{I}}^I
\end{equation}
This is the defining equation for the multifrontal method
\cite{duf83-multifrontal}.
The matrix on the left is the {\it frontal matrix} for front $J$,
and is usually dense and almost always treated as if it were dense.
The first matrix on the right consists of original entries and is
usually sparse.
Each of the $H_{\bnd{I},\bnd{I}}^I$ update matrices is treated as
if it were dense.
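\par
A hedged sketch of the assembly in equation~(\ref{alg:eqn:mf})
follows: load the original entries of $A$ that touch front $J$,
then subtract in each child's update matrix, the so-called
{\it extend-add} operation.
The coordinate storage for $A$ and the index lists are illustrative
stand-ins, not {\bf SPOOLES} data structures.
\begin{verbatim}
import numpy as np

def assemble_front(front_idx, nJ, A_entries, child_updates):
    """front_idx: global indices, J first, then bnd(J);
    A_entries: {(i, j): a_ij} original nonzeros;
    child_updates: list of (bndI_indices, H) from the children."""
    pos = {g: k for k, g in enumerate(front_idx)}
    T = np.zeros((len(front_idx), len(front_idx)))
    for (i, j), a in A_entries.items():      # original entries
        if i in pos and j in pos and (pos[i] < nJ or pos[j] < nJ):
            T[pos[i], pos[j]] += a           # (bnd,bnd) block stays 0
    for idx, H in child_updates:             # extend-add each H^I
        loc = [pos[g] for g in idx]          # bnd(I) inside J u bnd(J)
        T[np.ix_(loc, loc)] -= H             # minus sign, as above
    return T
\end{verbatim}
\par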
\par
There are many advantages to the multifrontal method.
The computations are almost all dense matrix operations.
In a serial environment, the temporary storage for the update
matrices can be handled as a stack, a last-in first-out list
structure.
This makes the transition to an out-of-core implementation very easy.
But the one disadvantage is that the frontal matrices for the
largest fronts, typically found near the top of the tree, can take
up an immense amount of space.
This is the main reason that we do not use this algorithm
in the {\bf SPOOLES} library.
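\par
The stack discipline deserves a sketch of its own.
In the outline below, {\tt nchildren}, {\tt assemble} and
{\tt factor} are stand-ins for the machinery sketched above;
a post-order traversal guarantees that a front's children sit on
top of the stack when the front is visited.
\begin{verbatim}
def multifrontal(post_order, nchildren, assemble, factor):
    stack = []                  # last-in first-out update storage
    for J in post_order:
        kids = [stack.pop() for _ in range(nchildren[J])]
        T = assemble(J, kids)   # original entries minus the kids' H
        H = factor(J, T)        # split off L, D, U; keep H^J
        stack.append(H)         # leave H^J on top for J's parent
    return stack                # roots leave empty update matrices
\end{verbatim}
\par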
\par
Return to equation~(\ref{alg:eqn:3}) for a moment.
A left-looking block general sparse algorithm computes three of the
four submatrices on the left.
\begin{eqnarray*}
T_{J,J} & = & A_{J,J}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap J, I} D_{I, I} U_{I, \bnd{I} \cap J}
\\
T_{\bnd{J},J} & = & A_{\bnd{J},J}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap \bnd{J}, I} D_{I, I} U_{I, \bnd{I} \cap J}
\\
T_{J,\bnd{J}} & = & A_{J,\bnd{J}}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap J, I} D_{I, I} U_{I, \bnd{I} \cap \bnd{J}}
\end{eqnarray*}
and then factors the three as follows.
\begin{eqnarray*}
T_{J,J}       & = & (L_{J,J} + I)D_{J,J}(I + U_{J,J}) \\
T_{\bnd{J},J} & = & L_{\bnd{J},J}D_{J,J}(I + U_{J,J}) \\
T_{J,\bnd{J}} & = & (L_{J,J} + I)D_{J,J}U_{J,\bnd{J}}
\end{eqnarray*}
If the fronts are large, there are good computational kernels to be had.
The disadvantage lies in the irregular access patterns for the
computed factor submatrices; this makes an out-of-core
implementation problematic.
On the other hand, working storage is modest, for there are no
update matrices waiting to be assembled.
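\par
In code, one descendent's contribution to the three stored
submatrices might look like the following sketch, where the caller
supplies the local index maps; again the names are illustrative,
and {\bf SPOOLES} keeps fronts in its own submatrix objects.
\begin{verbatim}
import numpy as np

def update_from(TJJ, TBJ, TJB, L_I, d_I, U_I, mapJ, mapB):
    """L_I: the columns L_{bnd(I),I};  U_I: the rows U_{I,bnd(I)};
    d_I: the diagonal of D_{I,I};
    mapJ: pairs (k, p), local row k of bnd(I) at position p in J;
    mapB: the same for positions in bnd(J)."""
    kJ = [k for k, _ in mapJ]; pJ = [p for _, p in mapJ]
    kB = [k for k, _ in mapB]; pB = [p for _, p in mapB]
    LDJ = L_I[kJ, :] * d_I     # L_{bnd(I) cap J, I} D_{I,I}
    LDB = L_I[kB, :] * d_I     # L_{bnd(I) cap bnd(J), I} D_{I,I}
    TJJ[np.ix_(pJ, pJ)] -= LDJ @ U_I[:, kJ]
    TBJ[np.ix_(pB, pJ)] -= LDB @ U_I[:, kJ]
    TJB[np.ix_(pJ, pB)] -= LDJ @ U_I[:, kB]
\end{verbatim}
\par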
\par \bigskip \par
\noindent {\large\bf Pivoting and delayed rows and columns}
\par \bigskip \par
Let us now turn to pivoting during the factorization.
It is not the actual swapping of rows and columns that gives
problems --- one must merely keep track
of indices and permute rows and columns.
What does complicate matters is when rows and columns cannot be
eliminated from a front for stability reasons.
(The remaining lower right submatrix of $T_{J,J}$ may be zero, or
eliminating rows and columns may result in large entries in
$L_{\bnd{J},J}$ or $U_{J,\bnd{J}}$.)
The delayed rows and columns must be {\it passed up} to the parent
front, enlarging it; there an attempt will be made to eliminate them
along with the parent's own rows and columns.
\par
Let us assume symmetric pivoting for the moment, where the row and
column permutation matrices are identical.
Let the rows and columns in front $I$ be partitioned into two sets,
$I_e$ contains the eliminated rows and columns
and
$I_d$ contains the delayed rows and columns.
Let us look first at the multifrontal method, for it is easier to
understand.
\begin{equation}
\label{alg:eqn:mf-delayed}
\left \lbrack \begin{array}{cc}
T_{I,I} & T_{I, \bnd{I}} \\
T_{\bnd{I},I} & T_{\bnd{I},\bnd{I}}
\end{array} \right \rbrack
=
\left \lbrack \begin{array}{cc}
L_{I_e,I_e} + I & 0 \\
L_{I_d \cup \bnd{I},I_e}  & I
\end{array} \right \rbrack
\left \lbrack \begin{array}{cc}
D_{I_e,I_e} & 0 \\
0           & H_{I_d \cup \bnd{I},I_d \cup \bnd{I}}^I
\end{array} \right \rbrack
\left \lbrack \begin{array}{cc}
I + U_{I_e,I_e} & U_{I_e,I_d \cup \bnd{I}} \\
0 & I \\
\end{array} \right \rbrack.
% \left \lbrack \begin{array}{cc}
% T_{I,I} & T_{I, \bnd{I}} \\
% T_{\bnd{I},I} & T_{\bnd{I},\bnd{J}}
% \end{array} \right \rbrack
% =
% \left \lbrack \begin{array}{ccc}
% L_{I_e,I_e} + I & 0 & 0 \\
% L_{I_d,I_e}     & I & 0 \\
% L_{\bnd{J},I_e}   & 0 & I
% \end{array} \right \rbrack
% \left \lbrack \begin{array}{ccc}
% D_{I_e,I_e} & 0                 & 0 \\
% 0           & H_{I_d,I_d}^I     & H_{I_d,\bnd{I}}^I \\
% 0           & H_{\bnd{I},I_d}^I & H_{\bnd{I},\bnd{I}}^I
% \end{array} \right \rbrack
% \left \lbrack \begin{array}{ccc}
% I + U_{I_e,I_e} & U_{I_e,I_d}, & U_{I_e,\bnd{J}} \\
% 0 & I & 0 \\
% 0 & o & I \\
% \end{array} \right \rbrack.
\end{equation}
\par
The update matrix for $I$ contains the delayed rows and columns
as well as the usual update matrix $H^I_{\bnd{I},\bnd{I}}$
that would normally occur.
$$
H^I_{I_d \cup \bnd{I}, I_d \cup \bnd{I}} 
=
\left \lbrack \begin{array}{cc}
H^I_{I_d,I_d}     & H^I_{I_d, \bnd{I}} \\
H^I_{\bnd{I},I_d} & H^I_{\bnd{I},\bnd{I}}
\end{array} \right \rbrack
$$
Recall that the $I_d$ rows and columns are fully assembled, so they
can be merged into the parent front. Let us define the new front
after merging all delayed rows and columns from the children as
$$
{\widetilde J} = J \cup \bigcup_{par(I) = J } I_d.
$$
We can now write the multifrontal equation with pivoting as follows.
\begin{equation}
\label{alg:eqn:mf-pivoting}
\left \lbrack \begin{array}{cc}
T_{{\widetilde J},{\widetilde J}} & T_{{\widetilde J}, \bnd{J}} \\
T_{\bnd{J},{\widetilde J}} & T_{\bnd{J},\bnd{J}}
\end{array} \right \rbrack
= 
\left \lbrack \begin{array}{cc}
A_{J,J} & A_{J, \bnd{J}} \\
A_{\bnd{J},J} & 0
\end{array} \right \rbrack
- 
\sum_{par(I) = J}
H_{I_d \cup \bnd{I}, I_d \cup \bnd{I}}^I
\end{equation}
Delaying rows and columns from one front to its parent really
doesn't change the multifrontal algorithm very much.
This is the reason that a factorization with pivoting has usually
been implemented by a multifrontal algorithm.
\par
Now let us return to a left-looking general sparse factorization
and try to incorporate pivoting.
The equations that accumulate updates from the descendent fronts
must be modified to include the delayed rows and columns from the
children.
\begin{eqnarray*}
T_{{\widetilde J},{\widetilde J}} & = & A_{J,J}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap J, I_e} 
D_{I_e, I_e} 
U_{I_e, \bnd{I} \cap J}
+ \sum_{par(I) = J} 
\left \lbrack \begin{array}{cc}
H_{I_d, I_d} & H_{I_d, \bnd{I} \cap J} \\
H_{\bnd{I} \cap J, I_d} & 0 \\
\end{array} \right \rbrack
\\
T_{\bnd{J},{\widetilde J}} & = & A_{\bnd{J},J}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap \bnd{J}, I_e} D_{I_e, I_e} U_{I_e, \bnd{I} \cap J}
+ \sum_{par(I) = J} H_{\bnd{I} \cap \bnd{J}, I_d}
\\
T_{{\widetilde J},\bnd{J}} & = & A_{J,\bnd{J}}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap J, I_e} D_{I_e, I_e} U_{I_e, \bnd{I} \cap \bnd{J}}
+ \sum_{par(I) = J} H_{I_d, \bnd{I} \cap \bnd{J}}
\end{eqnarray*}
It appears that since we do not know the structure of ${\widetilde J}$
until the delayed rows and columns of the children of $J$ are known, 
we cannot start the computation until the children of $J$ are finished.
This is not quite true.
Write the above equations as follows, marking the ${\widetilde T}$ 
matrices on the left as distinct from the $T$ matrices on the right.
\begin{eqnarray*}
{\widetilde T}_{{\widetilde J},{\widetilde J}} & = & T_{J,J}
+ \sum_{par(I) = J} 
\left \lbrack \begin{array}{cc}
H_{I_d, I_d} & H_{I_d, \bnd{I} \cap J} \\
H_{\bnd{I} \cap J, I_d} & 0 \\
\end{array} \right \rbrack
\\
{\widetilde T}_{\bnd{J},{\widetilde J}} & = & T_{\bnd{J},J}
+ \sum_{par(I) = J} H_{\bnd{I} \cap \bnd{J}, I_d}
\\
{\widetilde T}_{{\widetilde J},\bnd{J}} & = & T_{J,\bnd{J}}
+ \sum_{par(I) = J} H_{I_d, \bnd{I} \cap \bnd{J}}
\end{eqnarray*}
The $T$ matrices have the following form.
\begin{eqnarray*}
T_{J,J} & = & A_{J,J}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap J, I_e} 
D_{I_e, I_e} 
U_{I_e, \bnd{I} \cap J}
\\
T_{\bnd{J},J} & = & A_{\bnd{J},J}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap \bnd{J}, I_e} D_{I_e, I_e} U_{I_e, \bnd{I} \cap J}
\\
T_{J,\bnd{J}} & = & A_{J,\bnd{J}}
- \sum_{\bnd{I} \cap J \ne \emptyset} 
L_{\bnd{I} \cap J, I_e} D_{I_e, I_e} U_{I_e, \bnd{I} \cap \bnd{J}}
\end{eqnarray*}
The computation of $T_{J,J}$, $T_{\bnd{J},J}$ and $T_{J,\bnd{J}}$
does not depend on any delayed rows and columns from the children,
and so it can begin to execute before the children are complete.
It is key to note that $J$ and $\bnd{J}$ are known before the
factorization starts and do not change due to any delayed rows or
columns.
It is true that we do not know $I_e$ for each descendent of $J$
{\it until} front $I$ is complete, but in some sense that does not
matter.
All that is necessary is to keep track of which descendent fronts $I$ 
have $I_e \ne \emptyset$ and $\bnd{I} \cap J \ne \emptyset$.
The three equations with ${\widetilde T}$ on the left
are simply an assembly of matrices ---
one matrix comes from the updates from the descendents,
and one from the delayed rows and columns of the children.
\par \bigskip \par
\noindent {\large\bf A serial factorization}
\par \bigskip \par
There are five simple steps to a serial factorization:
initialize the front, load original entries, accumulate updates
from descendents, assemble any delayed rows and columns from the
children, then factor the front and update any delayed rows and
columns. 
Here is the algorithm step by step; a small runnable sketch follows
the list.
\par
\noindent Loop over the fronts in a post-order traversal
\begin{enumerate}
\item
Initialize 
$\displaystyle
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & 0
\end{array} \right \rbrack
= 0.
$
\item
Load with original entries 
\par
$\displaystyle
\mbox{\qquad} 
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & 0
\end{array} \right \rbrack
\mbox{\tt +=}
\left \lbrack \begin{array}{cc}
A_{J,J} & A_{J, \bnd{J}} \\
A_{\bnd{J},J} & 0
\end{array} \right \rbrack
$
\item
Accumulate updates from descendents.
\par
For each front $I$ with $I_e \ne \emptyset$ and $\bnd{I} \cap J \ne \emptyset$,
\par
$\mbox{\qquad}\displaystyle
\left \lbrack \begin{array}{cc}
T_{J,J} & T_{J, \bnd{J}} \\
T_{\bnd{J},J} & 0
\end{array} \right \rbrack
\mbox{\tt -=}
\left \lbrack \begin{array}{c}
L_{\bnd{I}\cap J,I_e} \\
L_{\bnd{I}\cap \bnd{J},I_e} 
\end{array} \right \rbrack
D_{I_e,I_e}
\left \lbrack \begin{array}{cc}
U_{I_e, \bnd{I} \cap J} & U_{I_e, \bnd{I}\cap \bnd{J}} 
\end{array} \right \rbrack
$
\item
Assemble postponed rows and columns.
\par ${\widetilde J} = J$.
\par For each child $I$ with $par(I) = J$ and $I_d \ne \emptyset$,
\par
$\mbox{\qquad} {\widetilde J} := {\widetilde J} \cup I_d$
\par
$\mbox{\qquad} \displaystyle
\left \lbrack \begin{array}{cc}
T_{{\widetilde J},{\widetilde J}} & T_{{\widetilde J}, \bnd{{\widetilde J}}} \\
T_{\bnd{{\widetilde J}},{\widetilde J}} & 0
\end{array} \right \rbrack
\mbox{\tt +=}
\left \lbrack \begin{array}{cc}
H_{I_d \cup (\bnd{I} \cap J), I_d \cup (\bnd{I} \cap J)}
& H_{I_d \cup (\bnd{I} \cap J), \bnd{I} \cap \bnd{J}} \\
H_{\bnd{I} \cap \bnd{J}, I_d \cup (\bnd{I} \cap J)} & 0
\end{array} \right \rbrack
$
\item
Factor the front.
\par
$T_{J_e,J_e} 
    =  (L_{J_e,J_e} + I)D_{J_e,J_e} (I + U_{J_e,J_e})$
\par
$T_{J_d \cup \bnd{J},J_e} 
   = L_{J_d \cup \bnd{J},J_e} D_{J_e,J_e} (I + U_{J_e,J_e}) $
\par
$T_{J_e,J_d \cup \bnd{J}} 
   = (L_{J_e,J_e} + I)D_{J_e,J_e} U_{J_e,J_d \cup \bnd{J}}$
\par
$H_{J_d,J_d}^J 
   = T_{J_d,J_d} - L_{J_d, J_e} D_{J_e,J_e} U_{J_e,J_d}$
\par
$H_{J_d,\bnd{J}}^J 
   = T_{J_d,\bnd{J}} - L_{J_d, J_e} D_{J_e,J_e} U_{J_e,\bnd{J}}$
\par
$H_{\bnd{J},J_d}^J 
   = T_{\bnd{J},J_d} - L_{\bnd{J}, J_e} D_{J_e,J_e} U_{J_e,J_d}$
\end{enumerate}
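\par
A runnable toy of this loop appears below, under heavy
simplifications: the matrix is dense, the fronts are consecutive
index blocks (so the tree is a chain), and no pivots are delayed,
so step 4 is empty; the eager boundary updates play the role of
step 3 by keeping $T$ fully assembled.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, fronts = 6, [[0, 1], [2, 3], [4, 5]]   # a chain of three fronts
A = rng.standard_normal((n, n)) + n * np.eye(n)  # safely factorable
L, U, d = np.eye(n), np.eye(n), np.zeros(n)
T = A.copy()               # steps 1-2: T holds the original entries
for J in fronts:
    # step 3 is implicit: earlier boundary updates left T[J, :]
    # and T[:, J] fully assembled by the time we arrive at J
    for k in J:            # step 5: eliminate, update the boundary
        d[k] = T[k, k]
        L[k+1:, k] = T[k+1:, k] / d[k]
        U[k, k+1:] = T[k, k+1:] / d[k]
        T[k+1:, k+1:] -= d[k] * np.outer(L[k+1:, k], U[k, k+1:])
assert np.allclose(L @ np.diag(d) @ U, A)  # (L+I) D (I+U), unit diag
\end{verbatim}
\par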
The last step needs a bit of elaboration. 
The factorization of a front is a complex process.
The matrix $D_{J_e,J_e}$ is diagonal or block diagonal.
Our codes try to detect large block pivots as a way of taking
advantage of the speed of BLAS3 kernels.
\par
Consider symmetric pivoting for now. (Nonsymmetric pivoting is much
the same but the notation gets more complicated.)
\begin{center}
\begin{minipage}{3.5 in}
\begin{tabbing}
XXX\=XXX\=XXX\=XXX\=XXX\kill
$J_e = J_d = \emptyset$ \\
while a pivot block $(\bfj, \bfj)$ can be found \\
\> $J_e := J_e \cup \bfj$,
   ${\widetilde J} := {\widetilde J} \setminus \bfj$ \\
\> $D_{\bfj,\bfj} = T_{\bfj,\bfj}$\\
\> $L_{{\widetilde J}\cup\bnd{J},\bfj} 
      = T_{{\widetilde J}\cup\bnd{J},\bfj} D_{\bfj,\bfj}^{-1}$\\
\> $U_{\bfj,{\widetilde J}\cup\bnd{J}} 
      = D_{\bfj,\bfj}^{-1} T_{\bfj,{\widetilde J}\cup\bnd{J}} $ \\
\> $T_{{\widetilde J}, {\widetilde J}}
   := T_{{\widetilde J}, {\widetilde J}}
    - L_{{\widetilde J}, \bfj} D_{\bfj,\bfj} U_{\bfj,{\widetilde J}}$ \\
\> $T_{\bnd{J}, {\widetilde J}}
   := T_{\bnd{J}, {\widetilde J}}
    - L_{\bnd{J}, \bfj} D_{\bfj,\bfj} U_{\bfj,{\widetilde J}}$ \\
\> $T_{{\widetilde J}, \bnd{J}}
   := T_{{\widetilde J}, \bnd{J}}
    - L_{{\widetilde J}, \bfj} D_{\bfj,\bfj} U_{\bfj,\bnd{J}}$\\
end while \\
$J_d = {\widetilde J}$
\end{tabbing}
\end{minipage}
\end{center}
Note the matrix-matrix multiply with the explicit inverse
of $D_{\bfj,\bfj}$.
We actually compute the inverse
$D_{\bfj,\bfj}^{-1}$ as part of the test for acceptability 
of the pivot block.
(See Section~\ref{section:PivotFinder} for more details.)
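\par
A dense sketch of this loop with $1 \times 1$ pivot blocks follows;
the acceptance test is a crude stand-in for the real pivot search of
Section~\ref{section:PivotFinder}, and {\bf SPOOLES} also uses
$2 \times 2$ blocks.
Note the explicit reciprocal of the pivot, mirroring
$D_{\bfj,\bfj}^{-1}$ above.
\begin{verbatim}
import numpy as np

def eliminate(T, nJ, tau=100.0):
    """T: frontal matrix, widetilde(J) first (nJ rows), bnd(J) last.
    Only the first nJ indices may be eliminated.
    Returns (Je, Jd); L, U and the update matrix remain in T."""
    alive, Je = list(range(T.shape[0])), []
    while True:
        ok = [j for j in alive if j < nJ and T[j, j] != 0.0
              and np.abs(T[alive, j]).max() <= tau * abs(T[j, j])
              and np.abs(T[j, alive]).max() <= tau * abs(T[j, j])]
        if not ok:
            break                     # no acceptable pivot remains
        j = ok[0]
        alive.remove(j); Je.append(j)
        dinv = 1.0 / T[j, j]          # explicit inverse of the pivot
        T[alive, j] *= dinv           # the L column
        T[j, alive] *= dinv           # the U row
        T[np.ix_(alive, alive)] -= \
            T[j, j] * np.outer(T[alive, j], T[j, alive])
    return Je, [j for j in alive if j < nJ]   # Je and the delayed Jd
\end{verbatim}
\par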
\par \bigskip \par
\noindent {\large\bf Parallel factorization}
\par \bigskip \par
We have one major assumption as we move towards a parallel
algorithm. {\it 
Once a front is complete (all updates from descendent fronts
have been made and any postponed rows and columns from the children
have been assembled), its factorization takes place inside
one thread or processor.}
This does not lead to a {\it scalable} algorithm
\cite{sch93-scalability}, but it is important for efficiency to
have the pivot selection process inside one thread of computation.
\par
In light of this constraint, let us look at the five steps of the
factorization.
Steps 1, 2, 4 and 5 must be done by one thread, the owner of the
front $J$.
It is Step 3 that must be parallelized across the processors.
Now for the major question: if front $I$ updates front $J$,
who performs the computation? There are three possibilities.
\begin{itemize}
\item
The owner of front $I$. (The fan-in method 
\cite{ash90-fan-in}.)
\item
The owner of front $J$. (The fan-out method 
\cite{geo87-fan-out}.)
\item
Some other processor. (The fan-both method
\cite{ash93-fan-both}.)
\end{itemize}
We have chosen the fan-in paradigm for this library.
It is a simple method, fairly efficient, and it can take advantage
of subtree-subcube mappings \cite{geo87-fan-out} better than either
the fan-out or fan-both methods.
Any algorithm will distribute the factor entries among the threads,
but there can be only one owner that computes
the factor pivots of a given front.
Furthermore, the fan-in algorithm features no exchange of factor
entries among the threads; rather, it is partial updates, or
{\it aggregate} updates, that are exchanged.
\par
Step 3 needs some modification for thread $q$.
\begin{itemize}
\item[$3^\prime.$]
For each front $I$ owned by thread $q$ with
$I_e \ne \emptyset$
and $\bnd{I} \cap J \ne \emptyset$,
\par
$\mbox{\qquad}\displaystyle
\left \lbrack \begin{array}{cc}
T^q_{J,J} & T^q_{J, \bnd{J}} \\
T^q_{\bnd{J},J} & 0
\end{array} \right \rbrack
\mbox{\tt -=}
\left \lbrack \begin{array}{c}
L_{\bnd{I}\cap J,I_e} \\
L_{\bnd{I}\cap \bnd{J},I_e} 
\end{array} \right \rbrack
D_{I_e,I_e}
\left \lbrack \begin{array}{cc}
U_{I_e, \bnd{I} \cap J} & U_{I_e, \bnd{I}\cap \bnd{J}} 
\end{array} \right \rbrack
$
\begin{tabbing}
XXX\=XXX\=XXX\=XXX\=\kill
If $J$ is owned by thread $q$ then \\
\> for each supporting thread $r$ \\
\>\> $\displaystyle
\left \lbrack \begin{array}{cc}
T^q_{J,J} & T^q_{J, \bnd{J}} \\
T^q_{\bnd{J},J} & 0
\end{array} \right \rbrack
\mbox{\tt +=}
\left \lbrack \begin{array}{cc}
T^r_{J,J} & T^r_{J, \bnd{J}} \\
T^r_{\bnd{J},J} & 0
\end{array} \right \rbrack
$ \\
\> end for \\
% else \\
% \> send
% $\displaystyle
% \left \lbrack \begin{array}{cc}
% T^q_{J,J} & T^q_{J, \bnd{J}} \\
% T^q_{\bnd{J},J} & 0
% \end{array} \right \rbrack
% $ to the owner of $J$ \\
end if
\end{tabbing}
\end{itemize}
Of course we are omitting the rendezvous (in a threaded context)
or send/receive (in a distributed context) logic in the algorithm
description. It should be clear that the aggregate updates
(on the left)
and the delayed rows and columns 
(on the right)
$$
T_J^q =
\left \lbrack \begin{array}{cc}
T^q_{J,J} & T^q_{J, \bnd{J}} \\
T^q_{\bnd{J},J} & 0
\end{array} \right \rbrack
\qquad \qquad
H^I =
\left \lbrack \begin{array}{cc}
H_{I_d \cup (\bnd{I} \cap J), I_d \cup (\bnd{I} \cap J)}
& H_{I_d \cup (\bnd{I} \cap J), \bnd{I} \cap \bnd{J}} \\
H_{\bnd{I} \cap \bnd{J}, I_d \cup (\bnd{I} \cap J)} & 0
\end{array} \right \rbrack
$$
must be made available to the threads 
or communicated to the processors that need them.
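\par
A thread-free sketch of the fan-in exchange of aggregates may help
fix ideas; the ownership map and the {\tt update\_for} routine are
stand-ins, and real code overlaps this loop with the rendezvous or
message-passing logic mentioned above.
\begin{verbatim}
import numpy as np

def fan_in_aggregates(J, owner, supporters, update_for, nfront):
    """owner: front -> thread id; supporters[J]: fronts I with
    I_e nonempty and bnd(I) meeting J; update_for(I, J): the dense
    update from I, padded to the full shape of front J."""
    agg = {}                          # thread q -> aggregate T^q
    for I in supporters[J]:
        q = owner[I]                  # fan-in: I's owner does the work
        Tq = agg.setdefault(q, np.zeros((nfront, nfront)))
        Tq -= update_for(I, J)        # step 3': T^q -= L D U
    total = np.zeros((nfront, nfront))
    for Tq in agg.values():           # the owner of J receives and
        total += Tq                   # sums the aggregates T^r
    return total
\end{verbatim}
\par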
\par
The presence of pivoting, and thus the possibility that a front may
not have any eliminated rows and columns, makes an implementation
somewhat tricky.
This is a key point and needs to be explained further.
It is simple to determine {\it prior to the factorization}
which threads will support which fronts, but it is possible that
none of the fronts owned by a thread will have any eliminated rows
and columns that support some given front $J$.
The aggregate matrix $T_J^q$ would never be created, but the owner
of front $J$ will be expecting it.
In short, the communication patterns of the factorization without
pivoting must be obeyed for the process to terminate satisfactorily.
If an aggregate matrix is indeed empty, meaning that due to delayed
rows and columns the updates that would have been performed will
not be forthcoming, the entries are not actually transferred,
but the notification is still made.
\par \bigskip \par
\noindent {\large\bf Parallel solve}
\par \bigskip \par
Let us now consider the forward and backsolves.
We make one assumption: the data partition of the factors $L$, $D$
and $U$ will be maintained during the solve.
In other words, the entries in $L$, $D$ and $U$
are found in basic submatrices,
$L_{J_d \cup \bnd{J}, J_e}$, 
$D_{J_e, J_e}$ and
$U_{J_e, J_d \cup \bnd{J}}$, 
and these submatrices are not split
up across threads or processors.
There is still the concept of a thread or processor owning a front.
In the threaded code we do allow for different maps from fronts
to owners.\footnote{
For example, the forward and backsolves could be done with fewer
threads than the factorization.
}
\par
We do not actually compute with $A = (L + I)D(I + U)$; instead we
work with $(L + I)(D + {\widehat U})$, where ${\widehat U} = DU$.
Recall, the eliminated rows and columns for a front, $J_e$,
are split into pivot blocks that we denote by $\bfj$.
By convention, let the boundary of $\bfj$ be
$$
\bnd{\bfj} = \{k\ |\ l_{k,j} \ne 0 \mbox{\ for some\ }j \in \bfj\}.
$$
Note, $\bnd{\bfj} \subset J_e \cup J_d \cup \bnd{J}$.
The serial solve $(L+I)(D+{\widehat U})x = b$ is found in
Figure~\ref{fig:serial-solve}.
\begin{figure}
\caption{Serial forward and back solve}
\label{fig:serial-solve}
\begin{center}
\begin{minipage}{3.5 in}
\begin{tabbing}
XXX\=XXX\=XXX\=XXX\=XXX\=\kill
for $J$ in a post-order traversal \\
\> for $\bfj \in J_e$ in ascending order \\
\>\> $y_{\bfj} = b_{\bfj}$ \\
\>\> $b_{\bnd{\bfj}} :=
      b_{\bnd{\bfj}} - L_{\bnd{\bfj},\bfj} y_{\bfj}$ \\
\> end for\\
end for\\
for $J$ in a pre-order traversal \\
\> for $\bfj \in J_e$ in descending order \\
\>\> $y_{\bfj} := y_{\bfj} 
        - {\widehat U}_{\bfj,\bnd{\bfj}} x_{\bnd{\bfj}}$ \\
\>\> $x_{\bfj} = D_{\bfj,\bfj}^{-1} y_{\bfj}$ \\
\> end for\\
end for
\end{tabbing}
\end{minipage}
\end{center}
\end{figure}
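\par
A dense sketch of Figure~\ref{fig:serial-solve} with $1 \times 1$
pivot blocks, plus a toy check, appears below; the helper is an
illustration, not the {\bf SPOOLES} solve.
\begin{verbatim}
import numpy as np

def solve(L, d, Uhat, b):
    """Solve (L+I)(D+Uhat) x = b: L strictly lower, Uhat = DU
    strictly upper, d the diagonal of D."""
    n = len(b)
    y = b.astype(float)               # work on a copy
    for j in range(n):                # forward: ascending order
        y[j+1:] -= L[j+1:, j] * y[j]  # b_bnd -= L_{bnd,j} y_j
    x = np.zeros(n)
    for j in reversed(range(n)):      # backward: descending order
        x[j] = (y[j] - Uhat[j, j+1:] @ x[j+1:]) / d[j]
    return x

# toy check: build A = (L+I) D (I+U) and solve A x = A 1
rng = np.random.default_rng(1)
n = 5
L = np.tril(rng.standard_normal((n, n)), -1)
U = np.triu(rng.standard_normal((n, n)), 1)
d = rng.uniform(1.0, 2.0, n)
A = (L + np.eye(n)) @ np.diag(d) @ (np.eye(n) + U)
x = solve(L, d, np.diag(d) @ U, A @ np.ones(n))
assert np.allclose(x, np.ones(n))
\end{verbatim}
\par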
\par
As we distribute the forward solve across threads or processors,
we see that the right hand side vector $b$ accumulates updates of
the form $L_{\bnd{\bfj},\bfj} y_{\bfj}$ from descendents of $J$.
Since these updates are made in a distributed fashion, each thread
or processor needs a copy of $b$, or at least a copy of the
entries in $b$ that it will interact with.
The same holds in a slightly different sense for the backward solve.
Instead of a vector $b$ that needs to be gathered and accumulated,
it is a solution vector $x$ that needs to be scattered to threads
or processors that need it.
\par
A well designed map of fronts to threads or processors will no
doubt take advantage of locality within the front tree, e.g.,
a subtree-subcube map.
In these cases, the entries that are {\it active} on a thread or
processor will be a fraction of the total entries.
It thus pays to have local vectors $b^q$ and $x^q$ for thread $q$.
These vectors should contain only those entries that are active
on thread $q$.
This is particularly important when we are solving several right
hand sides at once.
\par
Figure~\ref{fig:parallel-solve} describes the
parallel solve $(L+I)(D+{\widehat U})x = b$ as done by thread $q$.
\begin{figure}
\caption{Parallel forward and back solve}
\label{fig:parallel-solve}
\begin{center}
\begin{minipage}{3.5 in}
\begin{tabbing}
XXX\=XXX\=XXX\=XXX\=XXX\=\kill
for $J$ in a post-order traversal \\
\> if  $J$ owned by $q$ then \\
\>\> gather $b_{J_e} = \sum_r b_{J_e}^r$ \\
\>\> for $\bfj \in J_e$ in ascending order \\
\>\>\> $y_{\bfj} = b_{\bfj}$ \\
\>\>\> $b^q_{\bnd{\bfj}} :=
      b^q_{\bnd{\bfj}} - L_{\bnd{\bfj},\bfj} y_{\bfj}$ \\
\>\> end for\\
\> else if $b_{J_e}^q \ne 0$ then \\
\>\> communicate $b_{J_e}^q$ to $b_{J_e}$ somehow \\
\> end if\\
end for\\
for $J$ in a pre-order traversal \\
\> if  $J$ owned by $q$ then \\
\>\> for $\bfj \in J_e$ in descending order \\
\>\>\> $y_{\bfj} := y_{\bfj} 
        - {\widehat U}_{\bfj,\bnd{\bfj}} x^q_{\bnd{\bfj}}$ \\
\>\>\> $x_{\bfj} = D_{\bfj,\bfj}^{-1} y_{\bfj}$ \\
\>\> end for\\
\>\> store $x_{J_e}$ in $x^q_{J_e}$ \\
\>\> communicate $x_{J_e}$ to those who need it \\
\> else if $x_{J_e}$ needed by thread $q$ then \\
\>\> obtain $x_{J_e}$ and store in $x^q_{J_e}$ \\
\> end if \\
end for
\end{tabbing}
\end{minipage}
\end{center}
\end{figure}
\par
The interplay between local and global information can take many
forms.
In our present threaded solves there are global right hand side and
solution vectors that are protected by locks.
In the MPI version, explicit messages will communicate information
among the processors.