1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374
|
Stokes-like problem
-------------------
In this section we consider a saddle point system which was obtained by
discretization of the steady incompressible Stokes flow equations in a unit
cube with a locally divergence-free weak Galerkin finite element method. The
UCube(4) system studied here may be downloaded from the the `dataset
<https://doi.org/10.5281/zenodo.4134357>`_ accompanying the paper [DeMW20]_. We
will use the UCube(4) system from the dataset. The system matrix is symmetric
and has 554,496 rows and 14,292,884 nonzero values, which corresponds to an
average of 26 nonzero entries per row. The matrix sparsity portrait is shown on
the figure below.
.. _SuiteSparse Matrix Collection: https://sparse.tamu.edu/
.. figure:: ../../tutorial/4.Stokes/ucube_4.png
:width: 90%
:name: ucube_spy
UCube(4) matrix portrait
As with any Stokes-like problem, the system has a general block-wise structure:
.. math::
\begin{bmatrix} A & B_1^T \\ B_2 & C \end{bmatrix}
\begin{bmatrix} \mathbf x_u \\ \mathbf x_p \end{bmatrix} =
\begin{bmatrix} \mathbf b_u \\ \mathbf b_p \end{bmatrix}
In this case, the upper left subblock :math:`A` corresponds the the flow
unknowns, and itself has block-wise structure with small :math:`3\times3`
blocks. The lower right subblock :math:`C` corresponds to the pressure
unknowns. There is a lot of research dedicated to the efficient solution of
such systems, see [BeGL05]_ for an extensive overview. The direct approach of
using a monolithic preconditioner usually does not work very well, but we may
try it to have a reference point. The AMG preconditioning does not yield a
converging solution, but a single level ILU(0) relaxation seems to work with a
CG iterative solver::
$ solver -B -A ucube_4_A.bin -f ucube_4_b.bin \
solver.type=cg solver.maxiter=500 \
precond.class=relaxation precond.type=ilu0
Solver
======
Type: CG
Unknowns: 554496
Memory footprint: 16.92 M
Preconditioner
==============
Relaxation as preconditioner
Unknowns: 554496
Nonzeros: 14292884
Memory: 453.23 M
Iterations: 270
Error: 6.84763e-09
[Profile: 9.300 s] (100.00%)
[ reading: 0.133 s] ( 1.43%)
[ setup: 0.561 s] ( 6.03%)
[ solve: 8.599 s] ( 92.46%)
A preconditioner that takes the structure of the system into account should be
a better choice performance-wise. AMGCL provides an implementation of the Schur
complement pressure correction preconditioner. The preconditioning step
consists of solving two linear systems:
.. math::
:label: schurpc_eq
\begin{aligned}
S \mathbf x_p &= \mathbf b_p - B_2 A^{-1} \mathbf b_u, \\
A \mathbf x_u &= \mathbf b_u - B_1^T \mathbf b_p.
\end{aligned}
Here :math:`S` is the Schur complement :math:`S = C - B_2 A^{-1} B_1^T`. Note
that forming the Schur complement matrix explicitly is prohibitively expensive,
and the following approximation is used to create the preconditioner for the
first equation in :eq:`schurpc_eq`:
.. math::
\hat S = C - \mathop{\mathrm{diag}}\left( B_2 \mathop{\mathrm{diag}}(A)^{-1} B_1^T \right).
There is no need to solve the equations :eq:`schurpc_eq` exactly. It is enough
to perform a single application of the corresponding preconditioner as an
approximation to :math:`S^{-1}` and :math:`A^{-1}`. This means that the
overall preconditioner is linear, and we may use a non-flexible iterative
solver with it. The approximation matrix :math:`\hat S` has a simple band
diagonal structure, and a diagonal SPAI(0) preconditioner should have
reasonable performance.
.. _examples/solver: https://github.com/ddemidov/amgcl/blob/master/examples/solver.cpp
.. _examples/schur_pressure_correction: https://github.com/ddemidov/amgcl/blob/master/examples/schur_pressure_correction.cpp
Similar to the `examples/solver`_, the `examples/schur_pressure_correction`_
utility allows to play with the Schur pressure correction preconditioner
options before trying to write any code. We found that using the non-smoothed
aggregation with ILU(0) smoothing on each level for the flow subsystem
(``usolver``) and single-level SPAI(0) relaxation for the Schur complement
subsystem (``psolver``) works best. We also disable lumping of the diagonal of
the :math:`A` matrix in the Schur complement approximation with the
``precond.simplec_dia=false`` option, and enable block-valued backend for the
flow susbsystem with the ``--ub 3`` option. The ``-m '>456192'`` option sets
the pressure mask pattern. It tells the solver that all unknowns starting with
the 456192-th belong to the pressure subsystem::
$ schur_pressure_correction -B -A ucube_4_A.bin -f ucube_4_b.bin -m '>456192' \
-p solver.type=cg solver.maxiter=200 \
precond.simplec_dia=false \
precond.usolver.solver.type=preonly \
precond.usolver.precond.coarsening.type=aggregation \
precond.usolver.precond.relax.type=ilu0 \
precond.psolver.solver.type=preonly \
precond.psolver.precond.class=relaxation \
--ub 3
Solver
======
Type: CG
Unknowns: 554496
Memory footprint: 16.92 M
Preconditioner
==============
Schur complement (two-stage preconditioner)
Unknowns: 554496(98304)
Nonzeros: 14292884
Memory: 587.45 M
[ U ]
Solver
======
Type: PreOnly
Unknowns: 152064
Memory footprint: 0.00 B
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.25
Grid complexity: 1.14
Memory footprint: 233.07 M
level unknowns nonzeros memory
---------------------------------------------
0 152064 982416 188.13 M (80.25%)
1 18654 197826 35.07 M (16.16%)
2 2619 35991 6.18 M ( 2.94%)
3 591 7953 3.69 M ( 0.65%)
[ P ]
Solver
======
Type: PreOnly
Unknowns: 98304
Memory footprint: 0.00 B
Preconditioner
==============
Relaxation as preconditioner
Unknowns: 98304
Nonzeros: 274472
Memory: 5.69 M
Iterations: 35
Error: 8.57921e-09
[Profile: 3.872 s] (100.00%)
[ reading: 0.131 s] ( 3.38%)
[ schur_complement: 3.741 s] ( 96.62%)
[ self: 0.031 s] ( 0.79%)
[ setup: 0.301 s] ( 7.78%)
[ solve: 3.409 s] ( 88.05%)
Lets see how this translates to the code. Below is the complete listing of the
solver (`tutorial/4.Stokes/stokes_ucube.cpp`_) which uses the mixed precision approach.
.. _tutorial/4.Stokes/stokes_ucube.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/4.Stokes/stokes_ucube.cpp
.. literalinclude:: ../../tutorial/4.Stokes/stokes_ucube.cpp
:caption: The source code for the solution of the UCube(4) problem.
:language: cpp
:linenos:
Schur pressure correction is composite preconditioner. Its definition includes
definition of two nested iterative solvers, one for the "flow" (U) subsystem,
and the other for the "pressure" (P) subsystem. In lines 55--58 we define the
backends used in the outer iterative solver, and in the two nested solvers.
Note that both backends for nested solvers use single precision values, and the
flow subsystem backend has block value type:
.. literalinclude:: ../../tutorial/4.Stokes/stokes_ucube.cpp
:language: cpp
:linenos:
:lines: 55-58
:lineno-start: 55
In lines 60-79 we define the solver type. The flow solver is defined in lines
62-69, and the pressure solver -- in lines 70--77. Both are using
:cpp:class:`amgcl::solver::preonly` as "iterative" solver, which in fact only
applies the specified preconditioner once. The flow solver is defined with
:cpp:class:`amgcl::make_block_solver`, which automatically converts its
input matrix :math:`A` to the block format during the setup and reinterprets
the scalar RHS and solution vectors as having block values during solution:
.. literalinclude:: ../../tutorial/4.Stokes/stokes_ucube.cpp
:language: cpp
:linenos:
:lines: 60-79
:lineno-start: 60
In the solver parameters we disable lumping of the matrix :math:`A` diagonal
for the Schur complement approimation (line 83) and fill the pressure mask to
indicate which unknowns correspond to the pressure subsystem (lines 84--85):
.. literalinclude:: ../../tutorial/4.Stokes/stokes_ucube.cpp
:language: cpp
:linenos:
:lines: 81-85
:lineno-start: 81
Here is the output from the compiled program. The preconditioner uses 398M or
memory, as opposed to 587M in the case of the full precision preconditioner
used in the `examples/schur_pressure_correction`_, and both the setup and the
solution are about 50% faster due to the use of the mixed precision approach::
$ ./stokes_ucube ucube_4_A.bin ucube_4_b.bin 456192
Matrix ucube_4_A.bin: 554496x554496
RHS ucube_4_b.bin: 554496x1
Solver
======
Type: CG
Unknowns: 554496
Memory footprint: 16.92 M
Preconditioner
==============
Schur complement (two-stage preconditioner)
Unknowns: 554496(98304)
Nonzeros: 14292884
Memory: 398.39 M
[ U ]
Solver
======
Type: PreOnly
Unknowns: 152064
Memory footprint: 0.00 B
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.25
Grid complexity: 1.14
Memory footprint: 130.49 M
level unknowns nonzeros memory
---------------------------------------------
0 152064 982416 105.64 M (80.25%)
1 18654 197826 19.56 M (16.16%)
2 2619 35991 3.44 M ( 2.94%)
3 591 7953 1.85 M ( 0.65%)
[ P ]
Solver
======
Type: PreOnly
Unknowns: 98304
Memory footprint: 0.00 B
Preconditioner
==============
Relaxation as preconditioner
Unknowns: 98304
Nonzeros: 274472
Memory: 4.27 M
Iters: 35
Error: 8.57996e-09
[UCube4: 2.502 s] (100.00%)
[ read: 0.129 s] ( 5.16%)
[ setup: 0.240 s] ( 9.57%)
[ solve: 2.132 s] ( 85.19%)
Converting the solver to the VexCL backend in order to accelerate the solution
with GPGPU is straightforward. Below is the complete source code of the solver
(`tutorial/4.Stokes/stokes_ucube_vexcl.cpp`_), with the differences between the
OpenMP and the VexCL versions highlighted. Note that the GPU version of the
ILU(0) smoother approximates the lower and upper triangular solves in the
incomplete LU decomposition with a couple of Jacobi iterations [ChPa15]_. Here
we set the number of iterations to 4 (line 94).
.. _tutorial/4.Stokes/stokes_ucube_vexcl.cpp: https://github.com/ddemidov/amgcl/blob/master/tutorial/4.Stokes/stokes_ucube_vexcl.cpp
.. literalinclude:: ../../tutorial/4.Stokes/stokes_ucube_vexcl.cpp
:caption: The source code for the solution of the UCube(4) problem with the VexCL backend.
:language: cpp
:linenos:
:emphasize-lines: 4-5,32-39,65-67,94,99-100,104,111-113,118-120,123
The output of the VexCL version is shown below. The solution phase is about
twice as fast as the OpenMP version::
$ ./stokes_ucube_vexcl_cuda ucube_4_A.bin ucube_4_b.bin 456192
1. GeForce GTX 1050 Ti
Matrix ucube_4_A.bin: 554496x554496
RHS ucube_4_b.bin: 554496x1
Solver
======
Type: CG
Unknowns: 554496
Memory footprint: 16.92 M
Preconditioner
==============
Schur complement (two-stage preconditioner)
Unknowns: 554496(98304)
Nonzeros: 14292884
Memory: 399.66 M
[ U ]
Solver
======
Type: PreOnly
Unknowns: 152064
Memory footprint: 0.00 B
Preconditioner
==============
Number of levels: 4
Operator complexity: 1.25
Grid complexity: 1.14
Memory footprint: 131.76 M
level unknowns nonzeros memory
---------------------------------------------
0 152064 982416 106.76 M (80.25%)
1 18654 197826 19.68 M (16.16%)
2 2619 35991 3.45 M ( 2.94%)
3 591 7953 1.86 M ( 0.65%)
[ P ]
Solver
======
Type: PreOnly
Unknowns: 98304
Memory footprint: 0.00 B
Preconditioner
==============
Relaxation as preconditioner
Unknowns: 98304
Nonzeros: 274472
Memory: 4.27 M
Iters: 36
Error: 7.26253e-09
[UCube4 (VexCL): 1.858 s] (100.00%)
[ self: 0.004 s] ( 0.20%)
[ GPU matrix: 0.213 s] ( 11.46%)
[ read: 0.128 s] ( 6.87%)
[ setup: 0.519 s] ( 27.96%)
[ solve: 0.994 s] ( 53.52%)
|