.. _global:
Global Redistributions
======================
In high performance computing, large multidimensional arrays are often
distributed and shared amongst a large number of different processors.
Consider a large three-dimensional array of double (64 bit) precision and
global shape :math:`(512, 1024, 2048)`. To lift this array into RAM requires
8 GB of memory, which may be too large for a single, non-distributed
machine. If, however, you have access to a distributed architecture, you can
split the array up and share it between, e.g., four CPUs (most supercomputers
have either 2 or 4 GB of memory per CPU), each of which then only needs to
hold 2 GB of the global array. Moreover, many algorithms with varying
degrees of locality can take advantage of the distributed nature of the array
to compute local array pieces concurrently, effectively exploiting multiple
processor resources.
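A quick back-of-the-envelope check of these numbers (a small sketch; 1 GB is
counted here as :math:`2^{30}` bytes)::

    >>> 512 * 1024 * 2048 * 8 / 2**30   # bytes for double precision, in GB
    8.0
    >>> _ / 4                           # per processor when split over 4 CPUs
    2.0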
There are several ways of distributing a large multidimensional
array. Two such distributions for our three-dimensional global array
(using 4 processors) are shown below
.. image:: datastructures1.png
:width: 250px
:height: 200px
.. image:: datastructures_pencil0.png
:width: 250px
:height: 200px
Here each color represents one of the processors. We note that in the first
image only one of the three axes is distributed, whereas in the second two axes
are distributed. The first configuration corresponds to a slab, whereas the
second corresponds to a pencil distribution. With either distribution only one
quarter of the large, global array needs to be kept in rapid (RAM) memory for
each processor, which is great. However, some operations may then require
data that is not available locally in that processor's quarter of the total array. If
that is so, the processors will need to communicate with each other and
send the necessary data where it is needed. There are many such MPI routines
designed for sending and receiving data.
We are generally interested in algorithms, like the FFT, that work on the
global array, along one axis at a time. To be able to execute such algorithms,
we need to make sure that each local array has access to all of the global
data along this axis. For the figures above, the slab distribution gives each
processor data that is fully available along two axes, whereas the pencil
distribution only has data fully available along one axis. Rearranging data,
such that it becomes aligned in a different direction, is usually termed
a global redistribution, or a global transpose operation. Note that with
mpi4py-fft we always require that at least one axis of a multidimensional
array remains aligned (non-distributed).
In mpi4py-fft, distribution and global redistribution are handled by three
classes in the :mod:`.pencil` module:
* :class:`.Pencil`
* :class:`.Subcomm`
* :class:`.Transfer`
These classes are the low-level backbone of the higher-level :class:`.PFFT` and
:class:`.DistArray` classes. To use these low-level classes
directly is not recommended and usually not necessary. However, for
clarity we start by describing how these low-level classes work together.
Let us first consider a 2D data array of global shape (8, 8) that will be
distributed along axis 0. With the high-level API we could then simply
do::
import numpy as np
from mpi4py_fft import DistArray
N = (8, 8)
a = DistArray(N, [0, 1])
where the ``[0, 1]`` list determines that the first axis can be distributed,
whereas the second axis uses one processor only and as such is
aligned (non-distributed). We may now inspect the low-level
:class:`.Pencil` class associated with ``a``::
p0 = a.pencil
The ``p0`` :class:`.Pencil` object contains information about the
distribution of a 2D data array of global shape (8, 8). The
distributed array ``a`` has been created using the information that is in
``p0``, and ``p0`` is used by ``a`` to look up information about
the global array, for example::
>>> a.alignment
1
>>> a.global_shape
(8, 8)
>>> a.subcomm
(<mpi4py.MPI.Cartcomm at 0x10cc14a68>, <mpi4py.MPI.Cartcomm at 0x10e028690>)
>>> a.commsizes
[1, 1]
Naturally, the sizes of the communicators will depend on the
number of processors used to run the program. If we used 4, then
``a.commsizes`` would return ``[4, 1]``, since it is the first axis that is
distributed.
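As a small sketch, the layout can be inspected from every rank when running
under ``mpirun`` (the exact sizes depend on the number of processes used)::

    from mpi4py import MPI
    from mpi4py_fft import DistArray

    comm = MPI.COMM_WORLD
    a = DistArray((8, 8), [0, 1])
    # one communicator per axis; the distributed (first) axis holds the processes
    print(comm.Get_rank(), a.commsizes)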
We note that a low-level approach to creating such a distributed array would
be::
import numpy as np
from mpi4py_fft.pencil import Pencil, Subcomm
from mpi4py import MPI
comm = MPI.COMM_WORLD
N = (8, 8)
subcomm = Subcomm(comm, [0, 1])
p0 = Pencil(subcomm, N, axis=1)
a0 = np.zeros(p0.subshape)
Note that this last array ``a0`` is equivalent to ``a``, except that it is a
pure NumPy array (created on each processor) and it does not carry any of the
information about the global array that it is part of (``global_shape``,
``pencil``, ``subcomm``, etc.). It contains the same amount of data as ``a``,
though, and ``a0`` is as such a perfectly fine distributed array. Used
together with ``p0`` it contains exactly the same information as ``a``.
Since at least one axis needs to be aligned (non-distributed), a 2D array
can only be distributed with
one processor group. If we wanted to distribute the second axis instead
of the first, then we would have done::
a = DistArray(N, [1, 0])
With the low-level approach we would have had to use ``axis=0`` in the
creation of ``p0``, as well as ``[1, 0]`` in the creation of ``subcomm``.
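Spelled out, that low-level variant could look like this (a minimal sketch;
the names ``subcomm_alt``, ``p0_alt`` and ``a0_alt`` are just illustrative)::

    subcomm_alt = Subcomm(comm, [1, 0])      # one processor in axis 0, distribute axis 1
    p0_alt = Pencil(subcomm_alt, N, axis=0)  # aligned (non-distributed) in axis 0
    a0_alt = np.zeros(p0_alt.subshape)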
Another way to get the second pencil, which is aligned with axis 0, is to
create it from ``p0``::
p1 = p0.pencil(0)
Now the ``p1`` object will represent an (8, 8) global array distributed in the
second axis (and thus aligned in the first).
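To see the difference between the two alignments we can compare the local
shapes of the pencils (a minimal sketch; ``subshape`` is used throughout this
section, whereas reading the ``axis`` attribute directly is an assumption)::

    print(p0.subshape, p1.subshape)   # e.g. (2, 8) and (8, 2) on 4 processors
    print(p0.axis, p1.axis)           # 1 and 0: the aligned axis of each pencil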
Let us now create a complete script (``pencils.py``) that fills the array ``a``
with the rank of each processor (note that it would also work to follow the
low-level approach and use ``a0``)::
import numpy as np
from mpi4py_fft import DistArray
from mpi4py import MPI
comm = MPI.COMM_WORLD
N = (8, 8)
a = DistArray(N, [0, 1])
a[:] = comm.Get_rank()
print(a.shape)
We can run it with::
mpirun -np 4 python pencils.py
and obtain the printed results from the last line (``print(a.shape)``)::
(2, 8)
(2, 8)
(2, 8)
(2, 8)
The shape of the local ``a`` arrays is (2, 8) on all 4 processors. Now assume
that we need these data aligned in the x-direction (axis=0) instead. For this
to happen we need to perform a *global redistribution*. The easiest approach
is then to execute the following::
b = a.redistribute(0)
print(b.shape)
which would print the following::
(8, 2)
(8, 2)
(8, 2)
(8, 2)
Under the hood the global redistribution is executed with the help of the
:class:`.Transfer` class, which is designed to
transfer data between any two sets of pencils, like those represented by
``p0`` and ``p1``. With the low-level API a transfer object may be created
using the pencils and the datatype of the array that is to be sent::
    transfer = p0.transfer(p1, np.float64)
Executing the global redistribution is then simply a matter of::
a1 = np.zeros(p1.subshape)
transfer.forward(a, a1)
It is important to realise that the global array does not change. The local
``a1`` arrays will contain the same data as ``a``, only aligned differently.
The exchange is not, however, performed in-place; the new array is a copy of
the original.
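A small sanity check (a minimal sketch, assuming ``a`` was filled with the
rank as in ``pencils.py`` above)::

    a1[:] = -1                                 # modify the redistributed copy
    assert np.all(a == comm.Get_rank())        # the original local block is unchanged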
The figures :numref:`2dpencila` and :numref:`2dpencilb` illustrate the
redistribution:
.. _2dpencila:
.. figure:: 2Dpencil.png
:width: 250px
:height: 200px
Original 4 pencils (p0) of shape (2, 8) aligned in y-direction. Color
represents rank.
.. _2dpencilb:
.. figure:: 2Dpencil2.png
:width: 250px
:height: 200px
4 pencils (p1) of shape (8, 2) aligned in x-direction after receiving
data from p0. Data is the same as in :numref:`2dpencila`, only aligned
differently.
Mathematically, we will denote the entries of a two-dimensional global array
as :math:`u_{j_0, j_1}`, where :math:`j_0\in \textbf{j}_0=[0, 1, \ldots, N_0-1]`
and :math:`j_1\in \textbf{j}_1=[0, 1, \ldots, N_1-1]`. The shape of the array is
then :math:`(N_0, N_1)`. A global array
:math:`u_{j_0, j_1}` distributed in the first axis (as shown in
:numref:`2dpencila`) by processor group :math:`P`,
containing :math:`|P|` processors, is denoted as
.. math::
u_{j_0/P, j_1}
The global redistribution, from alignment in axis 1 to alignment in axis 0,
as from :numref:`2dpencila` to :numref:`2dpencilb` above, is denoted as
.. math::
u_{j_0, j_1/P} \xleftarrow[P]{1\rightarrow 0} u_{j_0/P, j_1}
This operation corresponds exactly to the forward transfer defined above::
transfer.forward(a0, a1)
If we need to go the other way
.. math::
u_{j_0/P, j_1} \xleftarrow[P]{0\rightarrow 1} u_{j_0, j_1/P}
this corresponds to::
transfer.backward(a1, a0)
Note that the directions (forward/backward) here depend on how the transfer
object is created. Under the hood all transfers execute calls to
`MPI.Alltoallw <https://www.mpich.org/static/docs/v3.2/www3/MPI_Alltoallw.html>`_.
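A forward transfer followed by the corresponding backward transfer is a round
trip that leaves the local data unchanged. A minimal sketch, reusing the
pencils and the ``transfer`` object from above::

    a0 = np.random.random(p0.subshape)   # data distributed as p0 (aligned in axis 1)
    a1 = np.zeros(p1.subshape)

    transfer.forward(a0, a1)             # redistribute to alignment in axis 0
    a0_back = np.zeros(p0.subshape)
    transfer.backward(a1, a0_back)       # and back again

    assert np.allclose(a0, a0_back)      # the round trip recovers the original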
Multidimensional distributed arrays
-----------------------------------
The procedure discussed above remains the same for any type of array, of any
dimensionality. With mpi4py-fft we can distribute any array of arbitrary dimensionality
using any number of processor groups. The only requirement is that the number
of processor groups is at most one less than the number of dimensions, since at
least one axis must remain aligned. Apart from this the distribution is
completely configurable through
the classes in the :mod:`.pencil` module.
We denote a global :math:`d`-dimensional array as :math:`u_{j_0, j_1, \ldots, j_{d-1}}`,
where :math:`j_m\in\textbf{j}_m` for :math:`m=[0, 1, \ldots, d-1]`.
A :math:`d`-dimensional array distributed with only one processor group in the
first axis is denoted as :math:`u_{j_0/P, j_1, \ldots, j_{d-1}}`. If using more
than one processor group, the groups are indexed, like :math:`P_0, P_1` etc.
Let us illustrate using a 4-dimensional array with 3 processor groups. Let the
array be aligned only in axis 3 first (:math:`u_{j_0/P_0, j_1/P_1, j_2/P_2, j_3}`),
and then redistribute for alignment along axes 2, 1 and finally 0. Mathematically,
we will now be executing the three following global redistributions:
.. math::
:label: 4d_redistribute
u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2} \xleftarrow[P_2]{3 \rightarrow 2} u_{j_0/P_0, j_1/P_1, j_2/P_2, j_3} \\
u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2} \xleftarrow[P_1]{2 \rightarrow 1} u_{j_0/P_0, j_1/P_1, j_2, j_3/P_2} \\
u_{j_0, j_1/P_0, j_2/P_1, j_3/P_2} \xleftarrow[P_0]{1 \rightarrow 0} u_{j_0/P_0, j_1, j_2/P_1, j_3/P_2}
Note that in the first step it is only processor group :math:`P_2` that is
active in the redistribution, and the output (left hand side) is now aligned
in axis 2. This can be seen since there is no processor group there to
share the :math:`j_2` index.
In the second step processor group :math:`P_1` is the active one, and
in the final step :math:`P_0`.
Now, it is not necessary to use three processor groups just because we have a
four-dimensional array. We could just as well use 2 or 1. The advantage
of using more groups is that you can then use more processors in total. Assuming
:math:`N = N_0 = N_1 = N_2 = N_3`, you can use a maximum of :math:`N^p` processors,
where :math:`p` is
the number of processor groups. So for an array of shape :math:`(8,8,8,8)`
it is possible to use up to 8, 64 or 512 processors for 1, 2 and 3
processor groups, respectively. On the other hand, if you can get away with it,
or if you do not have access to a great number of processors, then fewer groups
are usually found to be faster for the same number of processors in total.
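For example, to use only two processor groups for the same four-dimensional
array, one could do something along these lines (a minimal sketch; here both
axes 2 and 3 are kept aligned)::

    from mpi4py_fft import DistArray

    N = (8, 8, 8, 8)
    # two zeros -> two distributed axes, i.e. two processor groups
    a = DistArray(N, [0, 0, 1, 1])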
We can implement the global redistribution using the high-level :class:`.DistArray`
class::
N = (8, 8, 8, 8)
a3 = DistArray(N, [0, 0, 0, 1])
a2 = a3.redistribute(2)
a1 = a2.redistribute(1)
a0 = a1.redistribute(0)
Note that the three redistribution steps correspond exactly to the three steps
in :eq:`4d_redistribute`.
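To see how the alignment moves from axis 3 to axis 0, we can print the local
shapes (a small sketch; with 8 processes each of the three distributed axes is
typically split in two, e.g. giving ``a3.shape == (4, 4, 4, 8)``, but the exact
decomposition is chosen by MPI)::

    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    print(comm.Get_rank(), a3.shape, a2.shape, a1.shape, a0.shape)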
Using the low-level API the same can be achieved with slightly more elaborate
coding. We start by creating pencils for the 4 different alignments::
subcomm = Subcomm(comm, [0, 0, 0, 1])
p3 = Pencil(subcomm, N, axis=3)
p2 = p3.pencil(2)
p1 = p2.pencil(1)
p0 = p1.pencil(0)
Here we have defined 4 different pencil groups, ``p0, p1, p2, p3``, aligned in
axes 0, 1, 2 and 3, respectively. Transfer objects for arrays of type ``np.float64``
are then created as::
    transfer32 = p3.transfer(p2, np.float64)
    transfer21 = p2.transfer(p1, np.float64)
    transfer10 = p1.transfer(p0, np.float64)
Note that we can create transfer objects between any two pencils, not just
neighbouring axes. We may now perform three different global redistributions
as::
    a0 = np.zeros(p0.subshape)
    a1 = np.zeros(p1.subshape)
    a2 = np.zeros(p2.subshape)
    a3 = np.zeros(p3.subshape)
    a3[:] = np.random.random(a3.shape)
    transfer32.forward(a3, a2)
    transfer21.forward(a2, a1)
    transfer10.forward(a1, a0)
Storing this code as ``pencils4d.py``, we can run it with 8 processors, which
gives us 3 processor groups with 2 processors in each group::
mpirun -np 8 python pencils4d.py
Note that with the low-level approach we can now easily go back using the
reverse ``backward`` method of the :class:`.Transfer` objects::
transfer10.backward(a0, a1)
A different approach is also possible with the high-level API, using the
:class:`.DistArray` objects ``a0``, ``a1``, ``a2`` and ``a3`` created above::
a0.redistribute(out=a1)
a1.redistribute(out=a2)
a2.redistribute(out=a3)
which corresponds to the backward transfers. However, with the high-level
API the transfer objects are created (and deleted on exit) during the call
to ``redistribute`` and as such this latter approach may be slightly less
efficient.
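Putting the pieces together, a complete high-level round trip may be sketched
as follows (a minimal sketch; after redistributing forth and back, every rank
should recover its original local block)::

    import numpy as np
    from mpi4py import MPI
    from mpi4py_fft import DistArray

    comm = MPI.COMM_WORLD
    N = (8, 8, 8, 8)

    a3 = DistArray(N, [0, 0, 0, 1])
    a3[:] = comm.Get_rank()          # fill the local block with the rank

    # forward: align in axes 2, 1 and finally 0
    a2 = a3.redistribute(2)
    a1 = a2.redistribute(1)
    a0 = a1.redistribute(0)

    # backward: return to alignment in axis 3
    a0.redistribute(out=a1)
    a1.redistribute(out=a2)
    a2.redistribute(out=a3)

    assert np.all(a3[...] == comm.Get_rank())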