1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97


title: 'mpi4pyfft: Parallel Fast Fourier Transforms with MPI for Python'
tags:
 Fast Fourier transforms
 Fast Chebyshev transforms
 MPI
 Python
authors:
 name: Mikael Mortensen
orcid: 0000000232937573
affiliation: "1"
 name: Lisandro Dalcin
orcid: 0000000180860155
affiliation: "2"
 name: David Elliot Keyes
orcid: 0000000240527224
affiliation: "2"
affiliations:
 name: University of Oslo, Department of Mathematics
index: 1
 name: King Abdullah University of Science and Technology, Extreme Computing Research Center
index: 2
date: 4 March 2019
bibliography: paper.bib

# Summary
The fast Fourier transform (FFT) is an algorithm that efficiently
computes the discrete Fourier transform. Since the dawn
of our digital society, the FFT has permeated to the heart of everyday
life applications involving audio, image, and video processing. The
importance of the FFT extends beyond signal processing into scientific
computing because it diagonalizes the Poisson operator, which is
ubiquitous in descriptions of electromagnetism, gravitation,
acoustic and elastic waves, potential flow in fluids, stress in
solids, Hamiltonians of the Schroedingerâ€™s equation for probability
distribution of electrons in quantum mechanics, and diffusion of
internal energy, chemical species, and momentum. The
FFT has been named *the most important numerical algorithm of our
time* by Professor Gilbert Strang [@strang94].
``mpi4pyfft`` (https://bitbucket.org/mpi4py/mpi4pyfft) is an
opensource Python package for computing (in parallel) FFTs of
possibly very large and distributed multidimensional arrays. A
multidimensional FFT is computed in sequence, over all axes, one axis
at the time. In order to fit in the memory of multiple processors,
multidimensional arrays have to be distributed along some, but not
all, of its axes. Consequently, parallel FFTs are computed with
successive sequential (serial) transforms over undivided axes,
combined with global array redistributions (using interprocess
communication) that realign the arrays for further serial transforms.
For global redistributions, ``mpi4pyfft`` makes use of a new and
completely generic algorithm [@dalcin18] based on advanced MPI
features that allows for any index sets of a multidimensional array to
be distributed. It can distribute a single index set (slab
decomposition), two index sets (pencil decomposition), or even more
for higherdimensional arrays. The required MPI communications are
always handled under the hood by MPI for Python [@mpi4py08]. For
serial FFT transforms, ``mpi4pyfft`` uses Cython [@cython11] to wrap
most of the FFTW library [@fftw05] and provide support for
complextocomplex, realtocomplex, complextoreal and realtoreal
transforms.
``mpi4pyfft`` is highly configurable in how it distributes and
redistributes arrays. Large arrays may be globally redistributed for
alignment along any given axis, whenever needed by the user. This
flexibility has enabled the development of ``shenfun``
[@mortensen_joss, @mortensen17], which is a Python framework for
solving partial differential equations (PDEs) by the spectral Galerkin
method. ``shenfun`` is able to solve PDEs of any given dimensionality
by creating tensor product bases as outer products of onedimensional
bases. This leads to large multidimensional arrays that are
distributed effortlessly through ``mpi4pyfft``.
Throughout the ``spectralDNS`` (https://github.com/spectralDNS/spectralDNS)
project ``shenfun`` is being used extensively for
Direct Numerical Simulations (DNS) of turbulent flows
[@mortensen16, @mortensen16b, @ketcheson], using arrays with billions of
unknowns.
``mpi4pyfft`` provides a flexible distributed NumPy array interface,
which allows for further reuse in applications beyond the FFT. The
distribution requires at least one undivided axis, but apart from that
there are no restrictions nor limitations. The interface can
simply be used to boost performance of global array operations through MPI.
# Acknowledgements
M Mortensen acknowledges support from the 4DSpace Strategic Research
Initiative at the University of Oslo.
L Dalcin and D Keyes acknowledge support from the Extreme Computing Research
Center and the KAUST Supercomputing Laboratory at King Abdullah
University of Science and Technology.
# References
