File: algorithms.md

# Algorithms

Index of algorithms provided by Gloo and their semantics.

Variables used:
* **P**: Number of processes/machines
* **N**: Number of buffers per process
* **S**: Size of buffer

Terms used:
* **Communication steps**: the number of sequential communication
  rounds an algorithm performs. Every communication step has some
  latency, depending on the transport. Therefore, the fewer steps an
  algorithm uses, the better it is suited to higher latency
  transports. Lower latency transports tolerate more communication
  steps.
* **Bytes on the wire**: total number of bytes transmitted per
  participating process. The higher this number, the sooner an
  algorithm will be bound by the network bandwidth.

## Allreduce

Compute a user-specified reduction operation (e.g. sum) over N arrays per
process, across P processes. This computation happens in place; all input
arrays contain the resulting reduction after the algorithm completes.

There are 3 phases to each implementation of this algorithm:
1. Local reduction of N buffers
2. Allreduce between processes
3. Broadcast result back to N buffers
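
As a rough illustration of this structure, here is a minimal, hypothetical
Python sketch; the function name and the abstracted phase-2 callable are
illustrative, not Gloo APIs.

```python
# Hypothetical sketch of the three-phase, in-place structure.
# `buffers` holds the N local arrays of one process; the inter-process
# allreduce of phase 2 is left abstract (see the algorithms below).

def allreduce_inplace(buffers, allreduce_between_processes):
    # Phase 1: local reduction of the N buffers.
    local = list(buffers[0])
    for buf in buffers[1:]:
        for i, v in enumerate(buf):
            local[i] += v

    # Phase 2: allreduce of the local result across processes
    # (ring, halving-doubling, bcube, ...).
    local = allreduce_between_processes(local)

    # Phase 3: broadcast the result back into all N buffers.
    for buf in buffers:
        buf[:] = local
    return buffers
```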

### allreduce_ring

* Communication steps: P-1
* Bytes on the wire: P\*S

Phase 2 is implemented as follows:
1. Transmit local result to right side neighbor
2. Receive buffer from left side neighbor and reduce into local result
3. Transmit incoming buffer to right side neighbor
4. Repeat 2-3 until process has seen all data
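
The following is a minimal single-process simulation of phase 2, with lists
standing in for buffers and neighbor reads standing in for the transport;
names are illustrative, not Gloo APIs.

```python
# Simulate the ring: each rank sends its outgoing buffer to the right
# neighbor, receives from the left neighbor, reduces, and then forwards
# the received buffer on the next step. P - 1 steps in total.

def ring_allreduce(local_results):
    P = len(local_results)
    results = [list(buf) for buf in local_results]    # in-place result per rank
    outgoing = [list(buf) for buf in local_results]   # what each rank sends next

    for _ in range(P - 1):
        # Every rank receives the left neighbor's outgoing buffer and
        # reduces it into its local result.
        incoming = [list(outgoing[(r - 1) % P]) for r in range(P)]
        for r in range(P):
            for i, v in enumerate(incoming[r]):
                results[r][i] += v
        # The received buffer is what gets forwarded on the next step.
        outgoing = incoming

    return results

# Example: 4 ranks, each contributing [rank, rank]; all end with [6, 6].
print(ring_allreduce([[r, r] for r in range(4)]))
```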

### allreduce_ring_chunked

* Communication steps: 4\*P
* Bytes on the wire: 2\*S

Phase 2 is implemented in 2 sub-phases:
1. First, the algorithm iterates over the local reduction,
   transmitting chunks of the buffer and reducing at every step. The
   number of chunks is equal to 2\*P, allowing double buffering to be
   used. This means there is always one chunk in flight while
   reduction is done on another chunk concurrently. At the end of this
   phase, each process holds 1/P of the reduced result.
2. Second, the algorithm iterates over the local reduction again, now
   broadcasting the local results.

With 2\*P chunks and two sub-phases, we arrive at 4\*P communication
steps.

These sub-phases are implemented as follows (roughly):

First:
1. Compute offset into local reduction buffer based on process rank
2. Transmit chunk at offset to right side neighbor
3. Receive chunk at offset-1 from left side neighbor and reduce into
   local result
4. Subtract 1 from offset, wrapping when needed
5. Repeat 2-4 until process has walked entire buffer

Second:
1. Transmit chunk at offset+1 (containing the global reduction) to
   right side neighbor
2. Receive chunk at offset from left side neighbor and copy into local
   result
3. Subtract 1 from offset, wrapping when needed
4. Repeat 1-3 until process has walked entire buffer
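
Below is a rough single-process simulation of the two sub-phases. It uses P
chunks per buffer rather than the 2\*P chunks the implementation uses for
double buffering; names are illustrative, not Gloo APIs.

```python
def ring_chunked_allreduce(bufs):
    P = len(bufs)
    n = len(bufs[0])
    assert n % P == 0
    chunk = n // P
    data = [list(b) for b in bufs]

    def idx(c):                        # element indices of chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Sub-phase 1 (reduce/scatter): rank r starts at chunk r and walks
    # backwards, receiving chunk (offset - 1) from its left neighbor and
    # reducing it into the local buffer.
    for step in range(P - 1):
        snapshot = [list(d) for d in data]     # values "in flight" this step
        for r in range(P):
            left = (r - 1) % P
            c = (r - 1 - step) % P             # chunk received this step
            for i in idx(c):
                data[r][i] += snapshot[left][i]
    # Rank r now holds the fully reduced chunk (r + 1) % P.

    # Sub-phase 2 (broadcast/allgather): walk the ring again, copying the
    # fully reduced chunks from the left neighbor instead of reducing.
    for step in range(P - 1):
        snapshot = [list(d) for d in data]
        for r in range(P):
            left = (r - 1) % P
            c = (r - step) % P                 # chunk received this step
            for i in idx(c):
                data[r][i] = snapshot[left][i]
    return data

# Example: 4 ranks, each contributing [r, r, r, r]; all end with [6, 6, 6, 6].
print(ring_chunked_allreduce([[r] * 4 for r in range(4)]))
```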

### allreduce_halving_doubling

* Communication steps: 2\*lg(P)
* Bytes on the wire: 2\*S

Phase 2 is implemented in two sub-phases:

1. First, a reduce-scatter is performed in lg(P) steps using a recursive
vector-halving, distance-doubling approach. In the first step of this algorithm
processes communicate in pairs (rank 0 with 1, 2 with 3, etc.), sending and
receiving for different halves of their input buffer. For example, process 0
sends the second half of its buffer to process 1 and receives and reduces data
for the first half of the buffer from process 1. A reduction over the received
data is performed before proceeding to the next communication step, where the
distance to the destination rank is doubled while the data sent and received is
halved. After the reduce-scatter phase is finished, each process has a portion
of the final reduced array.

2. The second sub-phase of Phase 2 performs an allgather. This is again done
using a recursive algorithm, retracing the communication steps from the
reduce-scatter in reverse, but this time simply concatenating the received data
at each step. At each process and step, the portion of the buffer that was being
sent in the reduce-scatter is received in the allgather, and the portion that was
being received in the reduce-scatter is now sent.

Across the steps of the reduce-scatter, data is received into different buffers
and there is no potential for race conditions. However, mirrored steps of the
reduce-scatter and allgather (e.g. last step of the reduce-scatter and first
step of the allgather) write into the same buffers. To prevent race conditions,
a notification is sent after data is processed in the reduce-scatter
subphase. This notification is processed in the allgather subphase prior to
performing the send. In the majority of cases these notification messages will
arrive long before the step of the allgather where they are processed, so their
effect on performance should be minimal.
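
For illustration, here is a rough single-process simulation of the two
sub-phases, assuming a power-of-two P and ignoring the notification mechanism
described above; names are illustrative, not Gloo APIs.

```python
import math

def halving_doubling_allreduce(bufs):
    P = len(bufs)
    n = len(bufs[0])
    steps = int(math.log2(P))
    assert P == 1 << steps and n % P == 0
    data = [list(b) for b in bufs]

    lo = [0] * P            # region [lo, hi) each rank is responsible for
    hi = [n] * P
    regions = []            # pre-split regions per step, reused by the allgather

    # Sub-phase 1: reduce-scatter (vector halving, distance doubling).
    for k in range(steps):
        snapshot = [list(d) for d in data]
        regions.append([(lo[r], hi[r]) for r in range(P)])
        for r in range(P):
            peer = r ^ (1 << k)
            mid = (lo[r] + hi[r]) // 2
            # The lower rank of each pair keeps the first half, the higher
            # rank keeps the second half (as in the example above).
            lo[r], hi[r] = (lo[r], mid) if r < peer else (mid, hi[r])
        for r in range(P):
            peer = r ^ (1 << k)
            for i in range(lo[r], hi[r]):
                data[r][i] += snapshot[peer][i]
    # Each rank now owns n / P fully reduced elements.

    # Sub-phase 2: allgather, retracing the steps in reverse and copying
    # the peer's owned region instead of reducing.
    for k in reversed(range(steps)):
        snapshot = [list(d) for d in data]
        owned = [(lo[r], hi[r]) for r in range(P)]
        for r in range(P):
            peer = r ^ (1 << k)
            plo, phi = owned[peer]
            for i in range(plo, phi):
                data[r][i] = snapshot[peer][i]
            # The rank is responsible for both halves again.
            lo[r], hi[r] = regions[k][r]
    return data

# Example: 4 ranks, 4 elements each; every rank ends up with the full sum.
print(halving_doubling_allreduce([[r] * 4 for r in range(4)]))
```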

When running on a non-power-of-two number of processes, the algorithm works by
breaking up execution into blocks whose sizes are powers of two and communicating
interblock after the intrablock reduce-scatter. Non-power-of-two cases will have
some degree of load imbalance compared to power-of-two ones, but cases with a few
large blocks (e.g. 8 + 4 or 16 + 8) should still perform relatively well.
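
For example, a small hypothetical helper that decomposes P into such binary
blocks:

```python
# Decompose P into power-of-two blocks, largest first.
def binary_blocks(P):
    blocks = []
    bit = 1
    while P:
        if P & bit:
            blocks.append(bit)
            P ^= bit
        bit <<= 1
    return sorted(blocks, reverse=True)

print(binary_blocks(12))   # [8, 4]
print(binary_blocks(24))   # [16, 8]
print(binary_blocks(10))   # [8, 2]  -- more imbalanced
```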

The halving-doubling / binary-blocks algorithm is described and analyzed in
(Thakur et al., Optimization of Collective Communication Operations in MPICH,
IJHPCA, 2005).

### allreduce_bcube

Additional variables used:
* **B**: Base (maximum number of peers per step)

* Communication steps: 2\*log_B(P)
* Bytes on the wire: 2\*Sum(S/B^s) {s: 0 to log_B(P) - 1}

This is another allreduce implementation. Bcube is a scheme in which nodes are
divided into groups. In the reduce-scatter stage, within each group, a node
peers with `base - 1` other nodes. In the first step data is reduced between
nodes of the same group. In the next step each node of a group peers with
`base - 1` nodes from exclusively different groups; since each of those nodes
already holds reduced data, communicating with one of them is equivalent to
communicating with `base` nodes/groups from the previous step. This process
continues until all groups are covered, which takes log_base(n) steps. In each
step a node reduces totalNumElems / (base^step) elements. At the end of the
reduce-scatter stage each node holds a fully reduced chunk of elements. The
all-gather stage then follows the reverse of the reduce-scatter process to
distribute the reduced data to the other nodes.
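
As a hedged illustration of the peering pattern, the sketch below assumes
P = B^k and that the peers of a rank at step s are the ranks that differ from
it only in the s-th base-B digit (with B = 2 this reduces to the
halving-doubling pairing). It is not the Gloo implementation.

```python
# Hypothetical helper: which peers does `rank` talk to at each step?
def bcube_peers(rank, P, B):
    steps = 0
    while B ** steps < P:
        steps += 1
    assert B ** steps == P, "P must be a power of B"
    schedule = []
    for s in range(steps):
        stride = B ** s
        digit = (rank // stride) % B
        peers = [rank + (d - digit) * stride for d in range(B) if d != digit]
        schedule.append(peers)
    return schedule

# Example: 9 ranks with base 3; every rank has two steps of 2 peers each.
print(bcube_peers(4, P=9, B=3))   # [[3, 5], [1, 7]]
```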

### cuda_allreduce_ring

CUDA-aware implementation of `allreduce_ring`. GPU side buffers are
copied to system memory in parallel, prior to running local reduction
on CPU. After phase 2 completes, CPU side result is copied back to GPU
side buffers in parallel.

### cuda_allreduce_ring_chunked

CUDA-aware implementation of `allreduce_ring_chunked`. GPU side
buffers are reduced into GPU buffer 0 (using NCCL). The result is
copied to system memory asynchronously. After phase 2 completes, the
CPU side result is copied back to GPU buffer 0, and then broadcast to
other GPU buffers in parallel (using NCCL).

Both the local reduction in phase 1 and the broadcast in phase 3 are
pipelined with the communication steps where this data is needed or
becomes available.

### cuda_allreduce_halving_doubling

CUDA-aware implementation of `allreduce_halving_doubling` with no
pipelining between reduction/broadcast steps and the communication.

### cuda_allreduce_halving_doubling_pipelined

CUDA-aware implementation of `allreduce_halving_doubling` with pipelining
between local reduction/broadcast steps and communication. The local reduction
step is split into two steps (since the first communication step sends half the
buffer). The final broadcast is pipelined across lg(P) steps, with each step
corresponding to a receive during the allgather phase.

## Reduce-Scatter

Compute a user-specified reduction operation (e.g. sum) over N arrays per
process, across P processes. This computation happens in place. The result is
scattered across processes as specified by the user; all input arrays contain
the scattered result after the algorithm completes.

There are 3 phases to each implementation of this algorithm:

1. Local reduction of N buffers

2. Reduce-Scatter between processes

3. Broadcast result back to N buffers

### reduce_scatter_halving_doubling

* Communication steps: lg(P)
* Bytes on the wire: S (when the result is scattered evenly among P processes)

Phase 2 is implemented in two sub-phases:

1. First, a reduce-scatter is performed in lg(P) steps using a recursive
vector-halving, distance-doubling approach. In the first step of this algorithm
processes communicate in pairs (rank 0 with 1, 2 with 3, etc.), sending and
receiving for different halves of their input buffer. For example, process 0
sends the second half of its buffer to process 1 and receives and reduces data
for the first half of the buffer from process 1. A reduction over the received
data is performed before proceeding to the next communication step, where the
distance to the destination rank is doubled while the data sent and received is
halved.

2. After the reduce-scatter phase is finished, each process in the largest
binary block has a portion of the final reduced array. The next step is to
scatter/distribute the data based on the user-specified distribution. Note that,
due to the nature of the recursive halving algorithm, the chunks within the
largest binary block do not end up in the correct order. The correct order is
restored by exchanging data between processes p and p', where p' is the
bit-reverse of p (see the sketch below).
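
A small hypothetical helper illustrating the bit-reverse pairing, assuming P
is a power of two:

```python
# Bit-reverse of rank p over lg(P) bits.
def bit_reverse(p, P):
    bits = P.bit_length() - 1          # lg(P)
    result = 0
    for _ in range(bits):
        result = (result << 1) | (p & 1)
        p >>= 1
    return result

# With P = 8, ranks 1 and 4 exchange data (0b001 <-> 0b100), while
# ranks 0, 2, 5 and 7 are their own bit-reverse.
print([bit_reverse(p, 8) for p in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```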

When running on a non-power-of-two number of processes, the algorithm works by
breaking up execution into blocks whose sizes are powers of two and communicating
interblock after the intrablock reduce-scatter. Non-power-of-two cases will have
some degree of load imbalance compared to power-of-two ones, but cases close to
a power of two (e.g. 16 + 2) should still perform relatively well.

The halving-doubling / binary-blocks algorithm is described and analyzed in
(Thakur et al., Optimization of Collective Communication Operations in MPICH,
IJHPCA, 2005).

Data re-ordering is described in (Sack et al., Faster topology-aware collective
algorithms through non-minimal communication, PPoPP, 2012).

## Barrier

Synchronization point between processes.

### barrier_all_to_all

* Communication steps: 1
* Bytes on the wire: P

Every process sends a notification to every other process.
Then, it waits for a notification from every other process.
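
A minimal simulation of this pattern, with threads standing in for processes
and queues standing in for the transport; illustrative only, not the Gloo API.

```python
import threading
import queue

def barrier_all_to_all(P):
    inboxes = [queue.Queue() for _ in range(P)]

    def run(rank):
        # Send a notification to every other process ...
        for peer in range(P):
            if peer != rank:
                inboxes[peer].put(rank)
        # ... then wait for a notification from every other process.
        for _ in range(P - 1):
            inboxes[rank].get()
        print(f"rank {rank} passed the barrier")

    threads = [threading.Thread(target=run, args=(r,)) for r in range(P)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

barrier_all_to_all(4)
```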

### barrier_all_to_one

* Communication steps: 2
* Bytes on the wire: 1 for non-root, P for root

_Non-root processes_: send notification to root, wait for notification
from root.

_Root process_: wait for notification from P-1 processes, send
notification to P-1 processes.

## Broadcast

Broadcast contents of buffer on one process to other P-1 processes.

### broadcast_one_to_all

* Communication steps: 1
* Bytes on the wire: P\*S

_Non-root processes_: receive buffer from root.

_Root process_: send buffer to P-1 processes.

## pairwise_exchange

* Communication steps: variable
* Bytes on the wire: S

A communication pattern similar to the halving-doubling reduce-scatter,
simplified for benchmarking purposes. The number of communication steps, C
(which must be between 1 and lg(P)), is specified as input to the algorithm. In
each step, the set of nodes is partitioned into pairs as in the corresponding
step of halving-doubling reduce-scatter. Unlike the reduce-scatter, however, the
buffer size sent in each step is the same (equal to S/C).
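
A hedged sketch of the resulting schedule, assuming the step-k peer is
rank XOR 2^k (the halving-doubling pairing) and integer division for the
per-step size:

```python
# For a given rank, return (peer, bytes) for each of the C steps.
def pairwise_exchange_schedule(rank, P, S, C):
    assert 1 <= C <= P.bit_length() - 1    # C must be between 1 and lg(P)
    return [(rank ^ (1 << k), S // C) for k in range(C)]

# Example: rank 5 of 16 processes, 1 MB buffer, 3 steps.
print(pairwise_exchange_schedule(5, P=16, S=1 << 20, C=3))
# [(4, 349525), (7, 349525), (1, 349525)]
```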