% File: LU_MPI.tex

\par
\vfill \eject
\section{MPI Solution of $A X = Y$ using an $LU$ factorization}
\label{section:LU-MPI}
\par
Unlike the serial and multithreaded environments, where the data
structures are global and exist in one address space,
in the MPI environment data is local: each process or processor
has its own distinct address space.
The step-by-step process to solve a linear system with MPI is exactly
the same as in the multithreaded case, with the additional complication
that the data structures are distributed and must be
redistributed as needed.
\par
The ownership of the factor matrices during the factorization and
solves is exactly the same as in the multithreaded case -- the
map from fronts to processors and the map from submatrices to
processors are identical to their counterparts in the multithreaded
program.
What is different is the explicit message passing of data
structures between processors.
Fortunately, most of this is hidden from the user code.
\par
We will now begin to work our way through the program 
found in Section~\ref{section:LU-MPI-driver}
to illustrate the use of {\bf SPOOLES} to solve a system 
of linear equations in the MPI environment.
\par
\subsection{Reading the input parameters}
\label{subsection:MPI:input-data}
\par
This step is identical to the serial code, as described in
Section~\ref{subsection:serial:input-data}, with the exception
that the file names for $A$ and $Y$ are hardcoded in the driver,
and so are not part of the input parameters.
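\par
For concreteness, the code fragment below is a minimal sketch of the
setup that precedes this step: each process initializes MPI and
determines its rank and the number of processes.
The particular parameter list and the per-process message file name
are assumptions patterned on the serial driver, not part of the
library interface.
\begin{verbatim}
/* minimal sketch -- the parameter list is an assumption
   patterned on the serial driver                         */
MPI_Init(&argc, &argv) ;
MPI_Comm_rank(MPI_COMM_WORLD, &myid) ;
MPI_Comm_size(MPI_COMM_WORLD, &nproc) ;
if ( argc != 7 ) {
   fprintf(stdout,
           "usage: %s msglvl msgFile type symmetryflag pivotingflag seed",
           argv[0]) ;
   MPI_Finalize() ;
   return(-1) ;
}
msglvl       = atoi(argv[1]) ;
type         = atoi(argv[3]) ;
symmetryflag = atoi(argv[4]) ;
pivotingflag = atoi(argv[5]) ;
seed         = atoi(argv[6]) ;
sprintf(buffer, "%s.%d", argv[2], myid) ; /* one message file per process */
msgFile = fopen(buffer, "w") ;
\end{verbatim}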
\par
\subsection{Communicating the data for the problem}
\label{subsection:MPI:communicating-data}
\par
This step is similar to the serial code described in
Section~\ref{subsection:serial:communicating-data}.
In the serial and multithreaded codes, the entire matrix $A$ was
read in from one file and placed into one {\tt InpMtx} object.
In the MPI environment, it need not be the case that one
processor holds the entire matrix $A$.
(In fact, $A$ must be distributed across the processors during the
factorization.)
\par
Each processor opens a matrix file, (possibly) reads in matrix
entries, and creates its {\it local} {\tt InpMtx} object that holds
the matrix entries it has read in.
We have hardcoded the file names: processor $q$ reads 
its matrix entries from file {\tt matrix.}$q${\tt .input}
and
its right hand side entries from file {\tt rhs.}$q${\tt .input}.
The file formats are the same as for the serial and multithreaded
drivers.
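\par
The code fragment below is a minimal sketch of how processor {\tt myid}
might read its local matrix entries.
The {\tt InpMtx} input calls are patterned on the serial driver;
the exact initialization arguments shown here are assumptions.
The right hand side file {\tt rhs.}$q${\tt .input} is read analogously
into the local {\tt DenseMtx} object {\tt mtxY}.
\begin{verbatim}
/* sketch : read this processor's piece of A ;
   the InpMtx initialization arguments are assumptions
   patterned on the serial driver                       */
sprintf(buffer, "matrix.%d.input", myid) ;
inputFile = fopen(buffer, "r") ;
fscanf(inputFile, "%d %d %d", &neqns, &ncol, &nent) ;
mtxA = InpMtx_new() ;
InpMtx_init(mtxA, INPMTX_BY_ROWS, type, nent, 0) ;
for ( ient = 0 ; ient < nent ; ient++ ) {
   fscanf(inputFile, "%d %d %le", &irow, &jcol, &value) ;
   InpMtx_inputRealEntry(mtxA, irow, jcol, value) ;
}
fclose(inputFile) ;
\end{verbatim}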
\par
The entries need not be partitioned over the files.
For example, each processor could read in the entries for disjoint sets
of finite elements.
Naturally, some degrees of freedom will have support on elements
that are found on different processors.
When the entries in $A$ and $Y$ are mapped to processors, the
matrix entries are assembled automatically.
\par
It could also be the case that the matrix $A$ and right hand side $Y$
are read in by one processor. (This was the approach we took with
the {\tt LinSol} wrapper objects.)
There must still be input files for the other processors,
with zeroes on their first (and only) line,
to specify that no entries are to be read.
\par
\subsection{Reordering the linear system}
\label{subsection:MPI:reordering}
\par
The first part is very similar to the serial code, as described in
Section~\ref{subsection:serial:reordering}.
\begin{verbatim}
graph = Graph_new() ;
adjIVL = InpMtx_MPI_fullAdjacency(mtxA, stats, msglvl, msgFile, MPI_COMM_WORLD) ;
nedges = IVL_tsize(adjIVL) ;
Graph_init2(graph, 0, neqns, 0, nedges, neqns, nedges, adjIVL, NULL, NULL) ;
frontETree = orderViaMMD(graph, seed + myid, msglvl, msgFile) ;
\end{verbatim}
While the data and computations are distributed across the
processors, the ordering process is not.
Therefore we need a global graph on each processor.
Since the matrix $A$ is distributed across the processors, 
we use the distributed {\tt InpMtx\_MPI\_fullAdjacency()} method 
to construct the {\tt IVL} object of the graph of $A + A^T$.
\par
At this point, each processor has computed its own minimum degree
ordering and created a front tree object.
The orderings will likely differ, because each processor
passes a different random number seed to the ordering method.
Only one ordering can be used for the factorization, so the
processors collectively determine which of the orderings is best;
that ordering is then broadcast to all the processors, as the code
fragment below illustrates.
\begin{verbatim}
opcounts = DVinit(nproc, 0.0) ;
opcounts[myid] = ETree_nFactorOps(frontETree, type, symmetryflag) ;
MPI_Allgather((void *) &opcounts[myid], 1, MPI_DOUBLE,
              (void *) opcounts, 1, MPI_DOUBLE, MPI_COMM_WORLD) ;
minops = DVmin(nproc, opcounts, &root) ;
DVfree(opcounts) ;
frontETree = ETree_MPI_Bcast(frontETree, root, msglvl, msgFile, MPI_COMM_WORLD) ;
\end{verbatim}
\par
\subsection{Non-numeric work}
\label{subsection:MPI:non-numeric}
\par
Once the front tree is replicated across the processors, we obtain
the permutation vectors and permute the vertices in the front tree.
The local matrices for $A$ and $Y$ are also permuted.
These steps are identical to the serial and multithreaded drivers,
except for the fact that the local, rather than global, $A$ and $Y$
matrices are permuted.
\begin{verbatim}
oldToNewIV = ETree_oldToNewVtxPerm(frontETree) ;
newToOldIV = ETree_newToOldVtxPerm(frontETree) ;
ETree_permuteVertices(frontETree, oldToNewIV) ;
InpMtx_permute(mtxA, IV_entries(oldToNewIV), IV_entries(oldToNewIV)) ;
if (  symmetryflag == SPOOLES_SYMMETRIC || symmetryflag == SPOOLES_HERMITIAN ) {
   InpMtx_mapToUpperTriangle(mtxA) ;
}
InpMtx_changeCoordType(mtxA, INPMTX_BY_CHEVRONS) ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
DenseMtx_permuteRows(mtxY, oldToNewIV) ;
\end{verbatim}
\par
The next step is to obtain the map from fronts to processors,
just as was done in the multithreaded driver.
In addition, we need a map from vertices to processors to be able
to distribute the matrix $A$ and right hand side $Y$ as necessary.
Since we have the map from vertices to fronts inside the front tree
object, the vertex map is easy to determine.
\begin{verbatim}
cutoff   = 1./(2*nproc) ;
cumopsDV = DV_new() ;
DV_init(cumopsDV, nproc, NULL) ;
ownersIV = ETree_ddMap(frontETree, type, symmetryflag, cumopsDV, cutoff) ;
DV_free(cumopsDV) ;
vtxmapIV = IV_new() ;
IV_init(vtxmapIV, neqns, NULL) ;
IVgather(neqns, IV_entries(vtxmapIV), IV_entries(ownersIV), ETree_vtxToFront(frontETree)) ;
\end{verbatim}
At this point we are ready to assemble and distribute the entries
of $A$ and $Y$.
\begin{verbatim}
firsttag = 0 ;
newA = InpMtx_MPI_split(mtxA, vtxmapIV, stats, msglvl, msgFile,
                        firsttag, MPI_COMM_WORLD) ;
InpMtx_free(mtxA) ;
mtxA = newA ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
newY = DenseMtx_MPI_splitByRows(mtxY, vtxmapIV, stats, msglvl, 
                                msgFile, firsttag, MPI_COMM_WORLD) ;
DenseMtx_free(mtxY) ;
mtxY = newY ;
\end{verbatim}
The {\tt InpMtx\_MPI\_split()} method assembles and redistributes
the matrix entries by the vectors of the local matrix.
Recall from above that the coordinate type was set to chevrons,
as is needed for the assembly of the entries into the front matrices.
The method returns a new {\tt InpMtx} object that contains the part
of $A$ that is needed by the processor.
The old {\tt InpMtx} object is freed and the new one takes its place.
\par
Now we are ready to compute the symbolic factorization, but it too
must be done in a distributed manner.
\begin{verbatim}
symbfacIVL = SymbFac_MPI_initFromInpMtx(frontETree, ownersIV, mtxA,
                    stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
The {\tt symbfacIVL} object on a particular processor is only a
subset of the global symbolic factorization; it contains only what
that processor needs to know to compute its part of the factorization.
\par
\subsection{The Matrix Factorization}
\label{subsection:MPI:factor}
\par
In contrast to the multithreaded environment, data structures are
local to a processor, and so locks are not needed to manage access
to critical regions of code.
The initialization of the front matrix and submatrix manager
objects is much like the serial case, with one exception.
\par
\begin{verbatim}
mtxmanager = SubMtxManager_new() ;
SubMtxManager_init(mtxmanager, NO_LOCK, 0) ;
frontmtx = FrontMtx_new() ;
FrontMtx_init(frontmtx, frontETree, symbfacIVL, type, symmetryflag,
              FRONTMTX_DENSE_FRONTS, pivotingflag, NO_LOCK, myid,
              ownersIV, mtxmanager, msglvl, msgFile) ;
\end{verbatim}
Note that the ninth and tenth arguments are {\tt myid} and
{\tt ownersIV}, not {\tt 0} and {\tt NULL} as for the serial and
multithreaded drivers.
These arguments tell the front matrix object 
that it needs to initialize only
those parts of the factor matrices that it ``owns'', 
which are given by the map from fronts to processors 
and the processor id.
\par
The numeric factorization is performed by the
{\tt FrontMtx\_MPI\_factorInpMtx()} method.
The code segment from the sample program for the numerical
factorization step is found below.
\begin{verbatim}
chvmanager = ChvManager_new() ;
ChvManager_init(chvmanager, NO_LOCK, 0) ;
rootchv = FrontMtx_MPI_factorInpMtx(frontmtx, mtxA, tau, droptol,
                     chvmanager, ownersIV, lookahead, &error, cpus,
                     stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
ChvManager_free(chvmanager) ;
\end{verbatim}
Note that the {\tt ChvManager} is not locked.
The calling sequence is identical to that of 
the multithreaded factorization except for the addition of the {\tt
firsttag} and MPI communicator at the end.
\par
The post-processing of the factorization is the same in principle
as in the serial code, but differs in that it uses the distributed
data structures.
\begin{verbatim}
FrontMtx_MPI_postProcess(frontmtx, ownersIV, stats, msglvl,
                         msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
After the post-processing step, each local {\tt FrontMtx} object 
contains the $L_{J,I}$, $D_{I,I}$ and $U_{I,J}$ submatrices
for the fronts that were owned by the particular processor.
However, the parallel solve is based on the submatrices being
distributed across the processors, not just the fronts.
\par
We must specify which processor owns each submatrix,
and so performs computations with it.
This is done by constructing a {\it ``solve--map''} object,
as we see below.
\begin{verbatim}
solvemap = SolveMap_new() ;
SolveMap_ddMap(solvemap, symmetryflag, FrontMtx_upperBlockIVL(frontmtx),
               FrontMtx_lowerBlockIVL(frontmtx), nproc, ownersIV,
               FrontMtx_frontTree(frontmtx), seed, msglvl, msgFile) ;
\end{verbatim}
This object also uses a domain decomposition map, the only solve map
presently found in the {\bf SPOOLES} library.
\par
Once the solve map has been created (note that it is identical
across all the processors), we redistribute the submatrices
with the following code fragment.
\begin{verbatim}
FrontMtx_MPI_split(frontmtx, solvemap, stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
At this point,
the submatrices that a processor owns are local to that processor.
\par
\subsection{The Forward and Backsolves}
\label{subsection:MPI:solve}
\par
If pivoting has been performed for numerical stability, then the
rows of $PY$ may not be located on the processor that needs them.
We must perform an additional redistribution of the local
{\tt DenseMtx} objects that hold $PY$, as the code fragment below
illustrates.
\begin{verbatim}
if ( FRONTMTX_IS_PIVOTING(frontmtx) ) {
   IV   *rowmapIV ;
/*
   ----------------------------------------------------------
   pivoting has taken place, redistribute the right hand side
   to match the final rows and columns in the fronts
   ----------------------------------------------------------
*/
   rowmapIV = FrontMtx_MPI_rowmapIV(frontmtx, ownersIV, msglvl,
                                    msgFile, MPI_COMM_WORLD) ;
   newY = DenseMtx_MPI_splitByRows(mtxY, rowmapIV, stats, msglvl,
                                   msgFile, firsttag, MPI_COMM_WORLD) ;
   DenseMtx_free(mtxY) ;
   mtxY = newY ;
   IV_free(rowmapIV) ;
}
\end{verbatim}
\par
Each processor now must create a local {\tt DenseMtx} object 
to hold the rows of $PX$ that it owns.
\begin{verbatim}
ownedColumnsIV = FrontMtx_ownedColumnsIV(frontmtx, myid, ownersIV,
                                         msglvl, msgFile) ;
nmycol = IV_size(ownedColumnsIV) ;
mtxX = DenseMtx_new() ;
if ( nmycol > 0 ) {
   DenseMtx_init(mtxX, type, 0, 0, nmycol, nrhs, 1, nmycol) ;
   DenseMtx_rowIndices(mtxX, &nrow, &rowind) ;
   IVcopy(nmycol, rowind, IV_entries(ownedColumnsIV)) ;
}
\end{verbatim}
If $A$ is symmetric, or if pivoting for stability was not used,
then {\tt mtxX} can just be a pointer to {\tt mtxY}, i.e.,
$PX$ could overwrite $PY$.
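\par
A minimal sketch of that choice, using the same driver variables as
above (the branch shown here is illustrative, not the literal driver
code):
\begin{verbatim}
if ( ! FRONTMTX_IS_PIVOTING(frontmtx) ) {
   mtxX = mtxY ;   /* P X simply overwrites P Y */
} else {
   /* pivoting was used : create a separate local mtxX
      as shown in the fragment above                   */
}
\end{verbatim}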
\par
The parallel solve is remarkably similar to the serial solve,
as we see with the code fragment below.
\begin{verbatim}
solvemanager = SubMtxManager_new() ;
SubMtxManager_init(solvemanager, NO_LOCK, 0) ;
FrontMtx_MPI_solve(frontmtx, mtxX, mtxY, solvemanager, solvemap, cpus,
                   stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
SubMtxManager_free(solvemanager) ;
\end{verbatim}
The only difference between the multithreaded and MPI solve
methods is the presence of the first tag and MPI communicator
in the latter.
\par
The last step is to permute the rows of the local solution matrix
into the original matrix ordering.
We also gather all the solution entries into one {\tt DenseMtx}
object on processor zero.
\begin{verbatim}
DenseMtx_permuteRows(mtxX, newToOldIV) ;
IV_fill(vtxmapIV, 0) ;
firsttag++ ;
mtxX = DenseMtx_MPI_splitByRows(mtxX, vtxmapIV, stats, msglvl, msgFile,
                                firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
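\par
At this point processor zero holds the entire solution.
The fragment below is a minimal sketch of how the driver might finish;
the {\tt DenseMtx\_writeForHumanEye()} call and the cleanup shown here
are assumptions patterned on the other drivers.
\begin{verbatim}
if ( myid == 0 ) {
   /* print the gathered solution in the original ordering */
   DenseMtx_writeForHumanEye(mtxX, msgFile) ;
}
MPI_Finalize() ;
\end{verbatim}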
\par
\subsection{Sample Matrix and Right Hand Side Files}
\label{subsection:MPI:input-files}
\par
\begin{center}
\begin{tabular}{|l||l||l||l||}
\hline
{\tt matrix.0.input} &
{\tt matrix.1.input} &
{\tt matrix.2.input} &
{\tt matrix.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 6
0 0  4.0
0 1 -1.0
0 3 -1.0
1 1  4.0
1 2 -1.0 
1 4 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 5
2 2  4.0
2 5 -1.0
3 3  4.0
3 4 -1.0
3 6 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 7
4 4  4.0
4 5 -1.0
4 7 -1.0
5 5  4.0
5 8 -1.0
6 6  4.0
6 7 -1.0

\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 3
7 7  4.0
7 8 -1.0
8 8  4.0
\end{verbatim}
\end{minipage}
\\
\hline
\hline
{\tt rhs.0.input} &
{\tt rhs.1.input} &
{\tt rhs.2.input} &
{\tt rhs.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
0 0.0
1 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
2 0.0
3 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
4  1.0
5  0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
3 1
6  0.0
7  0.0
8  0.0

\end{verbatim}
\end{minipage}
\\
\hline
\end{tabular}
\end{center}