\par
\vfill \eject
\section{MPI Solution of $A X = Y$ using an $LU$ factorization}
\label{section:LU-MPI}
\par
Unlike the serial and multithreaded environments, where the data
structures are global and exist in one address space,
in the MPI environment data is local: each process or processor
has its own distinct address space.
The step-by-step process to solve a linear system using MPI is
exactly the same as in the multithreaded case, with the additional
complication that the data structures are distributed and must be
re-distributed as needed.
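\par
For readers unfamiliar with MPI, the sketch below shows the
standard initialization found at the top of any such driver;
each process discovers its own rank ({\tt myid}) and the number
of processes ({\tt nproc}), names we use throughout this section.
(This is a minimal sketch using only standard MPI calls; the
surrounding driver code is omitted.)
\begin{verbatim}
#include <mpi.h>

int main ( int argc, char *argv[] ) {
   int   myid, nproc ;

   MPI_Init(&argc, &argv) ;
   MPI_Comm_rank(MPI_COMM_WORLD, &myid) ;  /* rank of this process */
   MPI_Comm_size(MPI_COMM_WORLD, &nproc) ; /* number of processes  */
   /* ... each process now operates on its own local data ... */
   MPI_Finalize() ;
   return(0) ;
}
\end{verbatim}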
\par
The ownership of the factor matrices during the factorization and
solves is exactly the same as for the multithreaded case -- the
map from fronts to processors and map from submatrices to
processors are identical to their counterparts in the multithreaded
program.
What is different is the explicit message passing of data
structures between processors.
Luckily, most of this is hidden from the user code.
\par
We will now begin to work our way through the program
found in Section~\ref{section:LU-MPI-driver}
to illustrate the use of {\bf SPOOLES} to solve a system
of linear equations in the MPI environment.
\par
\subsection{Reading the input parameters}
\label{subsection:MPI:input-data}
\par
This step is identical to the serial code, as described in
Section~\ref{subsection:serial:input-data}, with the exception
that the file names for $A$ and $Y$ are hardcoded in the driver,
and so are not part of the input parameters.
\par
\subsection{Communicating the data for the problem}
\label{subsection:MPI:communicating-data}
\par
This step is similar to the serial code described in
Section~\ref{subsection:serial:communicating-data}.
In the serial and multithreaded codes, the entire matrix $A$ was
read in from one file and placed into one {\tt InpMtx} object.
In the MPI environment, it need not be the case that one
processor holds the entire matrix $A$.
(In fact, $A$ must be distributed across the processors during the
factorization.)
\par
Each processor opens a matrix file, (possibly) reads in matrix
entries, and creates its {\it local} {\tt InpMtx} object that holds
the matrix entries it has read in.
We have hardcoded the file names: processor $q$ reads
its matrix entries from file {\tt matrix.}$q${\tt .input}
and
its right hand side entries from file {\tt rhs.}$q${\tt .input}.
The file formats are the same as for the serial and multithreaded
drivers.
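As an illustration, here is a minimal sketch of the reading step,
assuming the same {\tt InpMtx} input methods that the serial driver
uses and the {\tt nrow ncol nent} file format shown in
Section~\ref{subsection:MPI:input-files};
the right hand side file is read into {\tt mtxY} in a similar way.
\begin{verbatim}
char     buffer[128] ;
FILE     *fp ;
int      ient, irow, jcol, ncol, nent, nrow ;
double   value ;

sprintf(buffer, "matrix.%d.input", myid) ;  /* this processor's file */
fp = fopen(buffer, "r") ;
fscanf(fp, "%d %d %d", &nrow, &ncol, &nent) ;
mtxA = InpMtx_new() ;
InpMtx_init(mtxA, INPMTX_BY_ROWS, type, nent, 0) ;
for ( ient = 0 ; ient < nent ; ient++ ) {
   /* read one (row, column, value) triple and input it */
   fscanf(fp, "%d %d %lf", &irow, &jcol, &value) ;
   InpMtx_inputRealEntry(mtxA, irow, jcol, value) ;
}
fclose(fp) ;
\end{verbatim}
Here {\tt type} is {\tt SPOOLES\_REAL} for a real matrix;
if {\tt nent} is zero, the loop body never executes and the local
{\tt InpMtx} object is simply empty.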
\par
The entries need not be partitioned over the files.
For example, each processor could read in entries for disjoint sets
of finite elements.
Naturally some degrees of freedom will have support on elements
that are found on different processors.
When the entries in $A$ and $Y$ are mapped to processors, an
assembly of the matrix entries will be done automatically.
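For example, suppose the degree of freedom for row and column
{\tt 4} has support on elements held by processors {\tt 0} and
{\tt 1}; each inputs its own contribution (the values below are
made up for illustration), and the contributions are summed when
the entries are assembled during the distribution step described
below.
\begin{verbatim}
/* on processor 0, contribution from its elements */
InpMtx_inputRealEntry(mtxA, 4, 4, 2.5) ;

/* on processor 1, contribution from its elements */
InpMtx_inputRealEntry(mtxA, 4, 4, 1.5) ;

/* after the entries are mapped to processors, the owning
   processor holds the assembled value 4.0 in entry (4,4) */
\end{verbatim}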
\par
It could be the case that the matrix $A$ and right hand side $Y$
are read in by one processor. (This was the approach we took with
the {\tt LinSol} wrapper objects.)
There must still be input files for the other processors,
with zeroes on their first (and only) line,
to specify that no entries are to be read.
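Following the format of the sample files in
Section~\ref{subsection:MPI:input-files}, the files for a
processor that reads nothing might look like the following.
\begin{verbatim}
matrix.1.input :   0 0 0
rhs.1.input    :   0 0
\end{verbatim}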
\par
\subsection{Reordering the linear system}
\label{subsection:MPI:reordering}
\par
The first part is very similar to the serial code, as described in
Section~\ref{subsection:serial:reordering}.
\begin{verbatim}
graph = Graph_new() ;
adjIVL = InpMtx_MPI_fullAdjacency(mtxA, stats, msglvl,
                                  msgFile, MPI_COMM_WORLD) ;
nedges = IVL_tsize(adjIVL) ;
Graph_init2(graph, 0, neqns, 0, nedges, neqns, nedges, adjIVL, NULL, NULL) ;
frontETree = orderViaMMD(graph, seed + myid, msglvl, msgFile) ;
\end{verbatim}
While the data and computations are distributed across the
processors, the ordering process is not.
Therefore we need a global graph on each processor.
Since the matrix $A$ is distributed across the processors,
we use the distributed {\tt InpMtx\_MPI\_fullAdjacency()} method
to construct the {\tt IVL} object of the graph of $A + A^T$.
\par
At this point, each processor has computed its own minimum degree
ordering and created a front tree object.
The orderings will likely differ, because each processor
inputs a different random number seed to the ordering method.
Only one ordering can be used for the factorization, so the
processors collectively determine which of the orderings is best,
which is then broadcast to all the processors, as the code fragment
below illustrates.
\begin{verbatim}
opcounts = DVinit(nproc, 0.0) ;
opcounts[myid] = ETree_nFactorOps(frontETree, type, symmetryflag) ;
MPI_Allgather((void *) &opcounts[myid], 1, MPI_DOUBLE,
              (void *) opcounts, 1, MPI_DOUBLE, MPI_COMM_WORLD) ;
minops = DVmin(nproc, opcounts, &root) ;
DVfree(opcounts) ;
frontETree = ETree_MPI_Bcast(frontETree, root,
                             msglvl, msgFile, MPI_COMM_WORLD) ;
\end{verbatim}
\par
\subsection{Non-numeric work}
\label{subsection:MPI:non-numeric}
\par
Once the front tree is replicated across the processors, we obtain
the permutation vectors and permute the vertices in the front tree.
The local matrices for $A$ and $Y$ are also permuted.
These steps are identical to those in the serial and multithreaded
drivers, except that the local instead of the global $A$ and $Y$
matrices are permuted.
\begin{verbatim}
oldToNewIV = ETree_oldToNewVtxPerm(frontETree) ;
newToOldIV = ETree_newToOldVtxPerm(frontETree) ;
ETree_permuteVertices(frontETree, oldToNewIV) ;
InpMtx_permute(mtxA, IV_entries(oldToNewIV),
               IV_entries(oldToNewIV)) ;
if (  symmetryflag == SPOOLES_SYMMETRIC
   || symmetryflag == SPOOLES_HERMITIAN ) {
   InpMtx_mapToUpperTriangle(mtxA) ;
}
InpMtx_changeCoordType(mtxA, INPMTX_BY_CHEVRONS) ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
DenseMtx_permuteRows(mtxY, oldToNewIV) ;
\end{verbatim}
\par
The next step is to obtain the map from fronts to processors,
just as was done in the multithreaded driver.
In addition, we need a map from vertices to processors to be able
to distribute the matrix $A$ and right hand side $Y$ as necessary.
Since we have the map from vertices to fronts inside the front tree
object, the vertex map is easy to determine.
\begin{verbatim}
cutoff = 1./(2*nproc) ;
cumopsDV = DV_new() ;
DV_init(cumopsDV, nproc, NULL) ;
ownersIV = ETree_ddMap(frontETree, type, symmetryflag, cumopsDV, cutoff) ;
DV_free(cumopsDV) ;
vtxmapIV = IV_new() ;
IV_init(vtxmapIV, neqns, NULL) ;
IVgather(neqns, IV_entries(vtxmapIV), IV_entries(ownersIV),
         ETree_vtxToFront(frontETree)) ;
\end{verbatim}
At this point we are ready to assemble and distribute the entries
of $A$ and $Y$.
\begin{verbatim}
firsttag = 0 ;
newA = InpMtx_MPI_split(mtxA, vtxmapIV, stats, msglvl, msgFile,
                        firsttag, MPI_COMM_WORLD) ;
InpMtx_free(mtxA) ;
mtxA = newA ;
InpMtx_changeStorageMode(mtxA, INPMTX_BY_VECTORS) ;
newY = DenseMtx_MPI_splitByRows(mtxY, vtxmapIV, stats, msglvl,
                                msgFile, firsttag, MPI_COMM_WORLD) ;
DenseMtx_free(mtxY) ;
mtxY = newY ;
\end{verbatim}
The {\tt InpMtx\_MPI\_split()} method assembles and redistributes
the matrix entries by the vectors of the local matrix.
Recall that the coordinate type was set to chevrons above, as is
needed for the assembly of the entries into the front matrices.
The method returns a new {\tt InpMtx} object that contains the part
of $A$ that is needed by the processor.
The old {\tt InpMtx} object is freed and the new one takes its place.
\par
Now we are ready to compute the symbolic factorization, but it too
must be done in a distributed manner.
\begin{verbatim}
symbfacIVL = SymbFac_MPI_initFromInpMtx(frontETree, ownersIV, mtxA,
                stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
The {\tt symbfacIVL} object on a particular processor is only a
subset of the global symbolic factorization, containing only what
it needs to know for it to compute its part of the factorization.
\par
\subsection{The Matrix Factorization}
\label{subsection:MPI:factor}
\par
In contrast to the multithreaded environment, data structures are
local to a processor, and so locks are not needed to manage access
to critical regions of code.
The initialization of the front matrix and submatrix manager
objects is much like the serial case, with one exception.
\par
\begin{verbatim}
mtxmanager = SubMtxManager_new() ;
SubMtxManager_init(mtxmanager, NO_LOCK, 0) ;
frontmtx = FrontMtx_new() ;
FrontMtx_init(frontmtx, frontETree, symbfacIVL, type, symmetryflag,
              FRONTMTX_DENSE_FRONTS, pivotingflag, NO_LOCK, myid,
              ownersIV, mtxmanager, msglvl, msgFile) ;
\end{verbatim}
Note that the ninth and tenth arguments are {\tt myid} and {\tt
ownersIV}, not {\tt 0} and {\tt NULL} as for the serial and
multithreaded drivers.
These arguments tell the front matrix object
that it needs to initialize only
those parts of the factor matrices that it ``owns'',
which are given by the map from fronts to processors
and the processor id.
\par
The numeric factorization is performed by the
{\tt FrontMtx\_MPI\_factorInpMtx()} method.
The code segment from the sample program for the numerical
factorization step is found below.
\begin{verbatim}
chvmanager = ChvManager_new() ;
ChvManager_init(chvmanager, NO_LOCK, 0) ;
rootchv = FrontMtx_MPI_factorInpMtx(frontmtx, mtxA, tau, droptol,
              chvmanager, ownersIV, lookahead, &error, cpus,
              stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
ChvManager_free(chvmanager) ;
\end{verbatim}
Note that the {\tt ChvManager} is not locked.
The calling sequence is identical to that of
the multithreaded factorization except for the addition of the {\tt
firsttag} and MPI communicator at the end.
\par
The post-processing of the factorization is the same in principle
as in the serial code but differs in that it uses the distributed
data structures.
\begin{verbatim}
FrontMtx_MPI_postProcess(frontmtx, ownersIV, stats, msglvl,
                         msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
After the post-processing step, each local {\tt FrontMtx} object
contains the $L_{J,I}$, $D_{I,I}$ and $U_{I,J}$ submatrices
for the fronts that were owned by the particular processor.
However, the parallel solve is based on the submatrices being
distributed across the processors, not just the fronts.
\par
We must specify which processors own which submatrices,
and hence perform computations with them.
This is done by constructing a {\it ``solve--map''} object,
as we see below.
\begin{verbatim}
solvemap = SolveMap_new() ;
SolveMap_ddMap(solvemap, symmetryflag, FrontMtx_upperBlockIVL(frontmtx),
               FrontMtx_lowerBlockIVL(frontmtx), nproc, ownersIV,
               FrontMtx_frontTree(frontmtx), seed, msglvl, msgFile) ;
\end{verbatim}
This object also uses a domain decomposition map, which is the
only solve map presently found in the {\bf SPOOLES} library.
\par
Once the solve map has been created (note that it is identical
across all the processors), we redistribute the submatrices
with the following code fragment.
\begin{verbatim}
FrontMtx_MPI_split(frontmtx, solvemap, stats,
                   msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
At this point,
the submatrices that a processor owns are local to that processor.
\par
\subsection{The Forward and Backsolves}
\label{subsection:MPI:solve}
\par
If pivoting has been performed for numerical stability, then the
rows of $PY$ may not be located on the processor that needs them.
We must perform an additional redistribution of the local
{\tt DenseMtx} objects that hold $PY$, as the code fragment below
illustrates.
\begin{verbatim}
if ( FRONTMTX_IS_PIVOTING(frontmtx) ) {
   IV   *rowmapIV ;
/*
   ----------------------------------------------------------
   pivoting has taken place, redistribute the right hand side
   to match the final rows and columns in the fronts
   ----------------------------------------------------------
*/
   rowmapIV = FrontMtx_MPI_rowmapIV(frontmtx, ownersIV, msglvl,
                                    msgFile, MPI_COMM_WORLD) ;
   newY = DenseMtx_MPI_splitByRows(mtxY, rowmapIV, stats, msglvl,
                                   msgFile, firsttag, MPI_COMM_WORLD) ;
   DenseMtx_free(mtxY) ;
   mtxY = newY ;
   IV_free(rowmapIV) ;
}
\end{verbatim}
\par
Each processor now must create a local {\tt DenseMtx} object
to hold the rows of $PX$ that it owns.
\begin{verbatim}
ownedColumnsIV = FrontMtx_ownedColumnsIV(frontmtx, myid, ownersIV,
                                         msglvl, msgFile) ;
nmycol = IV_size(ownedColumnsIV) ;
mtxX = DenseMtx_new() ;
if ( nmycol > 0 ) {
   DenseMtx_init(mtxX, type, 0, 0, nmycol, nrhs, 1, nmycol) ;
   DenseMtx_rowIndices(mtxX, &nrow, &rowind) ;
   IVcopy(nmycol, rowind, IV_entries(ownedColumnsIV)) ;
}
\end{verbatim}
If $A$ is symmetric, or if pivoting for stability was not used,
then {\tt mtxX} can just be a pointer to {\tt mtxY}, i.e.,
$PX$ could overwrite $PY$.
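For instance, one could guard the fragment above with the
{\tt FRONTMTX\_IS\_PIVOTING()} macro seen earlier; this is a sketch
of the shortcut, not the driver's exact code.
\begin{verbatim}
if ( ! FRONTMTX_IS_PIVOTING(frontmtx) ) {
   mtxX = mtxY ;   /* PX overwrites PY, no separate X object needed */
}
\end{verbatim}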
\par
The parallel solve is remarkably similar to the serial solve,
as we see with the code fragment below.
\begin{verbatim}
solvemanager = SubMtxManager_new() ;
SubMtxManager_init(solvemanager, NO_LOCK, 0) ;
FrontMtx_MPI_solve(frontmtx, mtxX, mtxY, solvemanager, solvemap, cpus,
                   stats, msglvl, msgFile, firsttag, MPI_COMM_WORLD) ;
SubMtxManager_free(solvemanager) ;
\end{verbatim}
The only difference between the multithreaded and MPI solve
methods is the presence of the first tag and MPI communicator
in the latter.
\par
The last step is to permute the rows of the local solution matrix
into the original matrix ordering.
We also gather all the solution entries into one {\tt DenseMtx}
object on processor zero.
\begin{verbatim}
DenseMtx_permuteRows(mtxX, newToOldIV) ;
IV_fill(vtxmapIV, 0) ;
firsttag++ ;
mtxX = DenseMtx_MPI_splitByRows(mtxX, vtxmapIV, stats, msglvl,
                                msgFile, firsttag, MPI_COMM_WORLD) ;
\end{verbatim}
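Since {\tt vtxmapIV} is filled with zeroes, every row is mapped to
processor zero, which afterwards holds the complete solution.
One could then, for example, print it using the
{\tt DenseMtx\_writeForHumanEye()} method found throughout the
{\bf SPOOLES} drivers, as the sketch below shows.
\begin{verbatim}
if ( myid == 0 ) {
   /* only processor zero holds the gathered solution */
   DenseMtx_writeForHumanEye(mtxX, msgFile) ;
}
\end{verbatim}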
\par
\subsection{Sample Matrix and Right Hand Side Files}
\label{subsection:MPI:input-files}
\par
\begin{center}
\begin{tabular}{|l||l||l||l||}
\hline
{\tt matrix.0.input} &
{\tt matrix.1.input} &
{\tt matrix.2.input} &
{\tt matrix.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 6
0 0 4.0
0 1 -1.0
0 3 -1.0
1 1 4.0
1 2 -1.0
1 4 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 5
2 2 4.0
2 5 -1.0
3 3 4.0
3 4 -1.0
3 6 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 7
4 4 4.0
4 5 -1.0
4 7 -1.0
5 5 4.0
5 8 -1.0
6 6 4.0
6 7 -1.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
9 9 3
7 7 4.0
7 8 -1.0
8 8 4.0
\end{verbatim}
\end{minipage}
\\
\hline
\hline
{\tt rhs.0.input} &
{\tt rhs.1.input} &
{\tt rhs.2.input} &
{\tt rhs.3.input} \\
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
0 0.0
1 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
2 0.0
3 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
2 1
4 1.0
5 0.0
\end{verbatim}
\end{minipage}
&
\begin{minipage}[t]{1 in}
\begin{verbatim}
3 1
6 0.0
7 0.0
8 0.0
\end{verbatim}
\end{minipage}
\\
\hline
\end{tabular}
\end{center}