Parallel Fast Fourier Transforms
================================
Parallel FFTs are computed through a combination of :ref:`global redistributions <global>`
and :ref:`serial transforms <dfts>`. In mpi4py-fft the interface for performing such
parallel transforms is the :class:`.mpifft.PFFT` class. The class is highly
configurable and best explained through a few examples.
Slab decomposition
..................
With slab decompositions we use only one group of processors and distribute
only one index of a multidimensional array at a time.
Consider the complete transform of a three-dimensional array of random numbers
of shape (128, 128, 128). We can plan the transform of such an array with
the following code snippet::

    import numpy as np
    from mpi4py import MPI
    from mpi4py_fft import PFFT, newDistArray

    N = np.array([128, 128, 128], dtype=int)
    fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float, grid=(-1,))
Here the signature ``N, axes=(0, 1, 2), dtype=float, grid=(-1,)`` tells us
that the created ``fft`` instance is *planned* so as to slab distribute (along
the first axis) and transform any 3D array of shape ``N`` and type ``float``.
Furthermore, we
plan to transform axis 2 first, and then 1 and 0, which is exactly the reverse
order of ``axes=(0, 1, 2)``. Mathematically, the planned transform corresponds
to

.. math::

    \tilde{u}_{j_0/P,k_1,k_2} &= \mathcal{F}_1( \mathcal{F}_{2}(u_{j_0/P, j_1, j_2})), \\
    \tilde{u}_{j_0, k_1/P, k_2} &\xleftarrow[P]{1\rightarrow 0} \tilde{u}_{j_0/P, k_1, k_2}, \\
    \hat{u}_{k_0,k_1/P,k_2} &= \mathcal{F}_0(\tilde{u}_{j_0, k_1/P, k_2}).
Note that axis 0 is distributed on the
input array and axis 1 on the output array. In the first step above we compute
the transforms along axes 2 and 1 (in that order), but we cannot compute the
serial transform along axis 0 since the global array is distributed in that
direction. We need to perform a global redistribution, the middle step,
that realigns the global data such that it is aligned in axis 0.
With data aligned in axis 0, we can perform the final transform
:math:`\mathcal{F}_{0}` and be done with it.
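
The same sequence of one-dimensional transforms is easily verified in serial
with plain NumPy. The following is a minimal single-process sketch (not part of
the parallel example), which only assumes a small array and the standard
``numpy.fft`` module::

    import numpy as np

    u = np.random.random((8, 8, 8))
    v = np.fft.rfft(u, axis=2)      # F_2: real-to-complex along the last axis
    v = np.fft.fft(v, axis=1)       # F_1
    u_hat = np.fft.fft(v, axis=0)   # F_0 (in parallel this requires the redistribution)
    assert np.allclose(u_hat, np.fft.rfftn(u, axes=(0, 1, 2)))

Note that the parallel ``fft.forward`` also applies a normalization factor when
``normalize=True`` (the default), so its result differs from this unnormalized
NumPy transform by a constant scale.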
Assume now that the parallel example code above is stored in a file named
``pfft_example.py``, and add to that code::

    u = newDistArray(fft, False)
    u[:] = np.random.random(u.shape).astype(u.dtype)
    u_hat = fft.forward(u, normalize=True)  # normalize=True is the default and can be omitted
    uj = np.zeros_like(u)
    uj = fft.backward(u_hat, uj)
    assert np.allclose(uj, u)
    print(MPI.COMM_WORLD.Get_rank(), u.shape)
Running this code with two processors (``mpirun -np 2 python pfft_example.py``)
should raise no exception, and the output should be::

    1 (64, 128, 128)
    0 (64, 128, 128)
This shows that the first index has been shared between the two processors
equally. The array ``u`` thus corresponds to :math:`u_{j_0/P,j_1,j_2}`. Note
that the :func:`.newDistArray` function returns a :class:`.DistArray`
object, which is a subclass of the NumPy ndarray. The :func:`.newDistArray`
function uses ``fft`` to determine the size and type of the created distributed
array, i.e., (64, 128, 128) and ``float`` for both processors.
The ``False`` argument indicates that the shape
and type should be those of the input array, as opposed to those of the output
array (:math:`\hat{u}_{k_0,k_1/P,k_2}`), which one gets with ``True``.
Note that because the input array is of real type, and not complex, the
output array will be of global shape::

    (128, 128, 65)
The output array will be distributed in axis 1, so its shape should be
(128, 64, 65) on each processor. We check this by adding
the following code and rerunning::

    u_hat = newDistArray(fft, True)
    print(MPI.COMM_WORLD.Get_rank(), u_hat.shape)
leading to an additional print of::

    1 (128, 64, 65)
    0 (128, 64, 65)
Distributing the first axis is the default, and it is the most efficient choice
for row-major C arrays. However, we can easily configure the ``fft`` instance
by modifying the ``axes`` keyword. Changing, for example, to::

    fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float)

means that axis 1 will be transformed first, such that the global output array
will be of shape (128, 65, 128). The distributed input and output arrays
will now have shape::

    0 (64, 128, 128)
    1 (64, 128, 128)
    0 (128, 33, 128)
    1 (128, 32, 128)
Note that the input array will still be distributed in axis 0 and the
output in axis 1. This order of distribution can be tweaked using the
``grid`` keyword. Setting ``grid=(1, 1, -1)`` will force the last axis
to be distributed on the input array.
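
A short sketch of this, reusing the ``axes=(2, 0, 1)`` setup from above and
assuming two processes, in which case the input shape is expected to be
(128, 128, 64)::

    fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float, grid=(1, 1, -1))
    u = newDistArray(fft, False)
    # The last axis should now be the distributed one
    print(MPI.COMM_WORLD.Get_rank(), u.shape)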
Another way to tweak the distribution is to use the :class:`.Subcomm`
class directly::

    from mpi4py_fft.pencil import Subcomm

    subcomms = Subcomm(MPI.COMM_WORLD, [1, 0, 1])
    fft = PFFT(subcomms, N, axes=(0, 1, 2), dtype=float)
Here the ``subcomms`` tuple will decide that axis 1 should be distributed,
because the only zero in the list ``[1, 0, 1]`` is along axis 1. The ones
determine that axes 0 and 2 should use one processor each, i.e., they should
be non-distributed.
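
The resulting distribution can be verified the same way as before. A small
sketch, assuming two processes so that only axis 1 is split::

    u = newDistArray(fft, False)
    # Expected shape with two processes: (128, 64, 128) on each rank,
    # since only axis 1 is distributed by the [1, 0, 1] subcommunicators.
    print(MPI.COMM_WORLD.Get_rank(), u.shape)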
The :class:`.PFFT` class has a few additional keyword arguments that one
should be aware of. The default behaviour of :class:`.PFFT` is to use
one transform object for each axis, and then use these sequentially.
Setting ``collapse=True`` will attempt to minimize the number of transform
objects by combining whenever possible. Take our example: the array
:math:`u_{j_0/P,j_1,j_2}` can be transformed along both axes 1 and 2 simultaneously,
without any intermediate global redistributions. By setting
``collapse=True`` only one transform object, ``rfftn(u, axes=(1, 2))``, will be
used instead of two (like ``fftn(rfftn(u, axes=(2,)), axes=(1,))``).
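
A short sketch of the collapsed setup itself, using the same ``N`` and slab
distribution as in the first example::

    fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float, grid=(-1,),
               collapse=True)
    u = newDistArray(fft, False)
    u[:] = np.random.random(u.shape).astype(u.dtype)
    u_hat = fft.forward(u)  # axes 1 and 2 are transformed in one collapsed step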
Note that a collapse can also be configured through the ``axes`` keyword.
Using::

    fft = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), dtype=float)

will collapse axes 1 and 2, just like one would obtain with ``collapse=True``.
If serial transforms other than :func:`.fftn`/:func:`.rfftn` and
:func:`.ifftn`/:func:`.irfftn` are required, then this can be achieved
using the ``transforms`` keyword and a dictionary pointing from axes to
the type of transform. We can for example combine real-to-real
with real-to-complex transforms like this::

    import functools
    from mpi4py_fft.fftw import rfftn, irfftn, dctn, idctn

    dct = functools.partial(dctn, type=3)
    idct = functools.partial(idctn, type=3)
    transforms = {(0,): (rfftn, irfftn), (1, 2): (dct, idct)}
    r2c = PFFT(MPI.COMM_WORLD, N, axes=((0,), (1, 2)), transforms=transforms)

    u = newDistArray(r2c, False)
    u[:] = np.random.random(u.shape).astype(u.dtype)
    u_hat = r2c.forward(u)
    uj = np.zeros_like(u)
    uj = r2c.backward(u_hat, uj)
    assert np.allclose(uj, u)
As a more complex example consider a 5-dimensional array where for some reason
you need to perform discrete cosine transforms in axes 1 and 2, discrete sine
transforms in axes 3 and 4, and a regular Fourier transform in the first axis.
Here it makes sense to collapse the (1, 2) and (3, 4) axes, which leaves only
the first axis uncollapsed. Hence we can then only use one processor group and
a slab decomposition, whereas without collapsing we could have used four groups.
A parallel transform object can be created and tested as::

    from mpi4py_fft import fftw

    N = (5, 6, 7, 8, 9)
    dctn = functools.partial(fftw.dctn, type=3)
    idctn = functools.partial(fftw.idctn, type=3)
    dstn = functools.partial(fftw.dstn, type=3)
    idstn = functools.partial(fftw.idstn, type=3)
    fft = PFFT(MPI.COMM_WORLD, N, ((0,), (1, 2), (3, 4)), grid=(-1,),
               transforms={(1, 2): (dctn, idctn), (3, 4): (dstn, idstn)})

    A = newDistArray(fft, False)
    A[:] = np.random.random(A.shape)
    C = fftw.aligned_like(A)
    B = fft.forward(A)
    C = fft.backward(B, C)
    assert np.allclose(A, C)
Pencil decomposition
....................
A pencil decomposition uses two groups of processors. Each group is then
responsible for distributing one index set of a multidimensional array.
We can perform a pencil decomposition simply by running the first example
from the previous section, but now with 4 processors. To remind you, we
put this in ``pfft_example.py``, where ``grid=(-1,)`` has now been removed
from the ``PFFT`` call::

    import numpy as np
    from mpi4py import MPI
    from mpi4py_fft import PFFT, newDistArray

    N = np.array([128, 128, 128], dtype=int)
    fft = PFFT(MPI.COMM_WORLD, N, axes=(0, 1, 2), dtype=float)

    u = newDistArray(fft, False)
    u[:] = np.random.random(u.shape).astype(u.dtype)
    u_hat = fft.forward(u)
    uj = np.zeros_like(u)
    uj = fft.backward(u_hat, uj)
    assert np.allclose(uj, u)
    print(MPI.COMM_WORLD.Get_rank(), u.shape)
The output of running ``mpirun -np 4 python pfft_example.py`` will then be::

    0 (64, 64, 128)
    2 (64, 64, 128)
    3 (64, 64, 128)
    1 (64, 64, 128)
Note that now the first two index sets are both shared, so we have a pencil
decomposition. The shared input array is now denoted as
:math:`u_{j_0/P_0,j_1/P_1,j_2}` and the complete forward transform performs
the following 5 steps:

.. math::

    \tilde{u}_{j_0/P_0,j_1/P_1,k_2} &= \mathcal{F}_{2}(u_{j_0/P_0, j_1/P_1, j_2}), \\
    \tilde{u}_{j_0/P_0, j_1, k_2/P_1} &\xleftarrow[P_1]{2\rightarrow 1} \tilde{u}_{j_0/P_0, j_1/P_1, k_2}, \\
    \tilde{u}_{j_0/P_0,k_1,k_2/P_1} &= \mathcal{F}_1(\tilde{u}_{j_0/P_0, j_1, k_2/P_1}), \\
    \tilde{u}_{j_0, k_1/P_0, k_2/P_1} &\xleftarrow[P_0]{1\rightarrow 0} \tilde{u}_{j_0/P_0, k_1, k_2/P_1}, \\
    \hat{u}_{k_0,k_1/P_0,k_2/P_1} &= \mathcal{F}_0(\tilde{u}_{j_0, k_1/P_0, k_2/P_1}).
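
As for the slab case, we can also inspect the distribution of the spectral
(output) array by adding::

    u_hat = newDistArray(fft, True)
    print(MPI.COMM_WORLD.Get_rank(), u_hat.shape)

With 4 processes each rank should report either (128, 64, 33) or (128, 64, 32),
since the global output array of shape (128, 128, 65) is now shared by two
processor groups in axes 1 and 2.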
Like for the slab decomposition, the order of the different steps is
configurable. Simply change the value of ``axes``, e.g., as::

    fft = PFFT(MPI.COMM_WORLD, N, axes=(2, 0, 1), dtype=float)
and the input and output arrays will be of shape::

    3 (64, 128, 64)
    2 (64, 128, 64)
    1 (64, 128, 64)
    0 (64, 128, 64)
    3 (64, 32, 128)
    2 (64, 32, 128)
    1 (64, 33, 128)
    0 (64, 33, 128)
We see that the input array is aligned in axis 1, because this is the direction
transformed first.
Convolution
...........
Working with Fourier series, one sometimes needs to transform the product of
two or more functions, as in

.. math::
    :label: ft_convolve

    \widehat{ab}_k = \int_{0}^{2\pi} a b e^{-i k x} dx, \quad \forall \, k \in [-N/2, \ldots, N/2-1],

which is computed with the DFT as

.. math::
    :label: dft_convolve

    \widehat{ab}_k = \frac{1}{N}\sum_{j=0}^{N-1}a_j b_j e^{-2\pi i j k / N}, \quad \forall \, k\in [-N/2, \ldots, N/2-1].

.. note::

    We are here assuming an even number :math:`N` and using wavenumbers centered
    around zero.
If :math:`a` and :math:`b` are two Fourier series with their own
coefficients:

.. math::
    :label: ab_sums

    a &= \sum_{p=-N/2}^{N/2-1} \hat{a}_p e^{i p x}, \\
    b &= \sum_{q=-N/2}^{N/2-1} \hat{b}_q e^{i q x},
then we can insert the two sums from :eq:`ab_sums` into :eq:`ft_convolve` and
get

.. math::
    :label: ab_convolve

    \widehat{ab}_k &= \int_{0}^{2\pi} \left( \sum_{p=-N/2}^{N/2-1} \hat{a}_p e^{i p x} \sum_{q=-N/2}^{N/2-1} \hat{b}_q e^{i q x} \right) e^{-i k x} dx, \quad \forall \, k \in [-N/2, \ldots, N/2-1], \\
    \widehat{ab}_k &= \sum_{p=-N/2}^{N/2-1} \sum_{q=-N/2}^{N/2-1} \hat{a}_p \hat{b}_q \int_{0}^{2\pi} e^{-i (p+q-k) x} dx, \quad \forall \, k \in [-N/2, \ldots, N/2-1].
The final integral is :math:`2\pi` for :math:`p+q=k` and zero otherwise. Consequently, we get

.. math::
    :label: ab_convolve2

    \widehat{ab}_k = 2\pi \sum_{p=-N/2}^{N/2-1}\sum_{q=-N/2}^{N/2-1} \hat{a}_p \hat{b}_{q} \delta_{p+q, k}, \quad \forall \, k \in [-N/2, \ldots, N/2-1].
Unfortunately, the convolution sum :eq:`ab_convolve2` is very expensive to
compute, and the direct application of :eq:`dft_convolve` leads to
aliasing errors. Luckily there is a fast approach that eliminates aliasing as
well.
The fast, alias-free, approach makes use of the FFT and zero-padded coefficient
vectors. The idea is to zero-pad :math:`\hat{a}` and :math:`\hat{b}` in spectral
space such that we get the extended sums

.. math::

    A_j &= \sum_{p=-M/2}^{M/2-1} \hat{\hat{a}}_p e^{2 \pi i p j/M}, \\
    B_j &= \sum_{q=-M/2}^{M/2-1} \hat{\hat{b}}_q e^{2 \pi i q j/M},
where :math:`M>N` and where the coefficients have been zero-padded such that

.. math::

    \hat{\hat{a}}_p = \begin{cases} \hat{a}_p, &\forall \, |p| \le N/2, \\
    0, &\forall \, |p| > N/2, \end{cases}

and similarly for :math:`\hat{\hat{b}}_q`.
Now evaluate the product :math:`A_j B_j` in the larger physical space and
compute the convolution as

.. math::
    :label: ab_convolve3

    \widehat{ab}_k = \frac{1}{M} \sum_{j=0}^{M-1} A_j B_j e^{- 2 \pi i k j/M}, \quad \forall \, k \in [-M/2, \ldots, M/2-1].
Finally, truncate the vector :math:`\widehat{ab}_k` back to the original range
:math:`k\in[-N/2, \ldots, N/2-1]`, simply by discarding all wavenumbers outside
this range.
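
The procedure is perhaps easiest to see in one dimension. Below is a minimal
single-process NumPy sketch (not using mpi4py-fft) that compares the padded,
truncated transform with the exact, but expensive, convolution sum
:math:`\sum_{p+q=k}\hat{a}_p\hat{b}_q`. We choose :math:`M = 3N/2`, which is the
value implied by the ``padding=1.5`` keyword used below and is sufficient to
remove aliasing from a quadratic product::

    import numpy as np

    N, M = 16, 24                            # original and padded sizes, M = 3N/2
    k = np.fft.fftfreq(N, 1./N).astype(int)  # wavenumbers in FFT ordering

    a_hat = np.random.random(N) + 1j*np.random.random(N)
    b_hat = np.random.random(N) + 1j*np.random.random(N)

    def pad(c_hat):
        # Zero-pad the coefficients from length N to length M (FFT ordering)
        c_pad = np.zeros(M, dtype=complex)
        c_pad[k] = c_hat
        return c_pad

    A = np.fft.ifft(pad(a_hat))*M       # A_j on the padded grid
    B = np.fft.ifft(pad(b_hat))*M       # B_j on the padded grid
    ab_hat = np.fft.fft(A*B)[k]/M       # transform the product and truncate

    # Direct O(N^2) convolution sum for comparison
    ab_direct = np.zeros(N, dtype=complex)
    for p in k:
        for q in k:
            if -N//2 <= p + q < N//2:
                ab_direct[p + q] += a_hat[p]*b_hat[q]
    assert np.allclose(ab_hat, ab_direct)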
With mpi4py-fft we can compute this convolution using the ``padding`` keyword
of the :class:`.PFFT` class::

    import numpy as np
    from mpi4py_fft import PFFT, newDistArray
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    N = (128, 128)      # Global shape in physical space
    fft = PFFT(comm, N, padding=[1.5, 1.5], dtype=complex)

    # Create arrays in normal spectral space
    a_hat = newDistArray(fft, True)
    b_hat = newDistArray(fft, True)
    a_hat[:] = np.random.random(a_hat.shape) + np.random.random(a_hat.shape)*1j
    b_hat[:] = np.random.random(b_hat.shape) + np.random.random(b_hat.shape)*1j

    # Transform to real space with padding
    a = newDistArray(fft, False)
    b = newDistArray(fft, False)
    assert a.shape == (192//comm.Get_size(), 192)
    a = fft.backward(a_hat, a)
    b = fft.backward(b_hat, b)

    # Do forward transform with truncation
    ab_hat = fft.forward(a*b)
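
The truncated result ``ab_hat`` lives in the regular, non-padded spectral space,
so it should have the same shape and distribution as ``a_hat`` and ``b_hat``.
A quick sanity check::

    assert ab_hat.shape == a_hat.shape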

.. note::

    The padded instance of the :class:`.PFFT` class is often used in addition
    to a regular non-padded class. The padded version is then used to handle
    non-linearities, whereas the non-padded takes care of the rest, see this `demo
    <https://github.com/mpi4py/mpi4py-fft/blob/master/examples/spectral_dns_solver.py>`_.