File: lower_double_buffer.h

#pragma once

#include <c10/macros/Export.h>

#include <torch/csrc/jit/codegen/cuda/ir_all_nodes.h>
#include <torch/csrc/jit/codegen/cuda/kernel_ir.h>
#include <torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.h>

// Double buffering a tensor doubles its allocation size and uses two
// buffers to facilitate computation and memory access
// overlapping. The basic form of the code looks as follows:
//
// Before:
// for i
//   x[S]; // allocation
//   for j:
//     x[j] = y[i, j]
//   for j:
//     ... = x[j]
//
// After:
// x[S * 2]; // allocation
// for i in 0 to 1: // Prologue
//   for j:
//     x[j] = y[i, j]
//
// for i in 0 to N-1: // Main
//   for j:
//     x[j + (1 - i % 2) * S] = y[i + 1, j]
//   for j:
//     ... = x[j + (i % 2) * S]
//
// for i in N-1 to N: // Epilogue
//   for j:
//     ... = x[j + (i % 2) * S]
//
// Here, S is the original size of tensor x.
//
// The i loop is the double buffer loop of tensor x, i.e., the loop
// where double buffering is applied to the tensor. The first step of
// lowering is to find the double buffering axis for each double
// buffered tensor. It must not be parallelized, as it isn't possible
// to double buffer parallelized loops. An unrolled axis also
// conflicts with double buffering, since unrolling expands the
// allocation and is meant to eliminate the loop entirely. So,
// basically, the double buffering axis is the inner-most axis among
// the axes to the left of the CA (compute-at) position; when that
// axis is parallelized or unrolled, an axis further to the left is
// picked instead.
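//
// A minimal sketch of this selection, assuming the TensorView /
// IterDomain accessors used below (the actual logic is behind
// getDoubleBufferAxisPosition(), declared later in this file; the
// helper name here is hypothetical):
//
//   int pickDoubleBufferAxis(const TensorView* tv) {
//     // Scan from the CA position toward the outer axes and return
//     // the inner-most serial axis. Requiring ParallelType::Serial
//     // rejects both parallelized and unrolled axes.
//     for (int i = (int)tv->getComputeAtPosition() - 1; i >= 0; --i) {
//       if (tv->axis(i)->getParallelType() == ParallelType::Serial) {
//         return i;
//       }
//     }
//     return -1; // no valid double buffer axis found
//   }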
//
// Once the double buffer axis is determined, the main task is to
// replicate the corresponding double buffer loop as illustrated
// above. The Prologue loop just fetches the first element to
// populate the buffer. The Main loop is mostly the same as the
// original loop, except for the indexing change to switch between
// the two buffers: when x is used as a consumer (written), an offset
// of (1 - i % 2) * S is added, whereas (i % 2) * S is added when it
// is used as a producer (read). Here, i is the index of the double
// buffer loop. The Epilogue loop handles just the last iteration.
// Since the Main loop loads one element ahead of the current
// iteration, using it for the last iteration as well would require
// an additional guard to prevent a buffer overrun against the
// producer of the double buffered tensor. However, the value fetched
// by that out-of-bounds load would never be used, so instead of
// adding the extra predicate, the Epilogue loop is replicated from
// the original loop with the unused load expression omitted. Note
// that this overrun does not happen when the producer is in gmem, so
// in that case the additional replication is not done.
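//
// As a concrete trace with S = 8 (upper bounds exclusive): the
// Prologue (i = 0) writes x[0..8). Main iteration i = 0 writes
// x[8..16), since (1 - 0 % 2) * 8 = 8, while reading x[0..8);
// iteration i = 1 writes x[0..8) while reading x[8..16); and so on,
// alternating halves so that the load for iteration i + 1 overlaps
// the use of the values for iteration i.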
//
// When creating those three types of loops, additional care must be
// taken when multiple tensors are double buffered. When multiple
// tensors use the same loop as their double buffer loop, one pass of
// replication takes care of them at once, meaning the same Prologue,
// Main, and Epilogue loops are shared by all of those tensors.
//
// Other tasks to do for a double buffer tensor include:
// - Move allocation to outside of the double buffer loop
// - Double the allocation size
// - Omit the RAW sync in the Main and Epilogue loops

// [Circular buffer] A generalization of double buffering.
// On sm80+ hardware there is asynchronous copy infrastructure that
//  motivates a circular buffering generalization of double buffering.
// Almost all analyses previously done for double buffering carry over
//  unchanged to circular buffering, except for the introduction of a
//  new concept: `stage depth`.
//
// The `stage depth` is defined as the multiplier applied to the
//  buffering space, i.e., the number of stages in the circular
//  buffer. In the case of double buffering, the stage depth is 2.
//
// A circular buffered loop structure looks as follows. It exactly
//  parallels the double buffered loop structure above, since circular
//  buffering is an exact generalization serving the same purpose.
//
// Here S is the original allocation size as above, and D is the
//  stage depth. With D=2, the loop structure below becomes exactly
//  the double buffering case above.
//
// allocate x[S*D] // allocation
// for i in 0..D-1: // prolog
//   for j in ...
//     if pred:
//       x[i*S+j] = y[i, j];
//
// for i in 0..N: // main loop
//   for j in ...
//     if pred:
//       x[((i+D-1)%D)*S+j] = y[i+D-1, j];
//   for j in ...
//     .. = x[(i%D)*S+j]
//
// (Epilogue omitted, since this only makes sense when using
// cp.async, where the producer is in global memory and the consumer
// in shared memory.)
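//
// For example, with D = 3 the Prologue loads tiles 0 and 1 into
// slots 0 and 1. Main iteration i = 0 loads tile 2 into slot
// (0 + 2) % 3 = 2 while reading slot 0; i = 1 loads tile 3 into slot
// 0 while reading slot 1; i = 2 loads tile 4 into slot 1 while
// reading slot 2; and so on, so a slot is only rewritten after its
// previous contents have been consumed.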
//
// The profitability of this optimization comes from the extra
//  tolerance of global memory pipeline latency: at the expression
//  `.. = x[(i%D)*S+j]` we only need the data for the current
//  iteration to be complete, while the remaining D-2 in-flight loads
//  can still be in progress and overlap with the compute of the
//  current iteration.
//
// To express this pattern on sm80+ hardware we can group the loads
//  in each iteration of the circular buffered loop into one
//  "transaction", and specify how many transactions may remain in
//  flight when we insert the async barriers.
//
// allocate x[S*D] // allocation
// for i in 0..D-1: // prolog
//   for j in ...
//     if pred:
//       x[i*S+j] = y[i, j];
//   cp.async.commit; // mark the transaction boundary
//
// # At this point we have D-1 transactions in flight. For the
//   first iteration of the main loop we need one transaction
//   completed, so we leave D-2 transactions in flight, which is
//   the operand given to the barrier instruction.
//
// cp.async.wait D-2 // ensure all but the last D-2 transactions complete.
//
// for i in 0..N: // main loop
//   # At this point we always have D-2 transactions in flight
//      and one completed.
//   for j in ...
//     if pred:
//       x[((i+D-1)%D)*S+j] = y[i+D-1, j];
//   for j in ...
//     .. = x[(i%D)*S+j]
//   cp.async.commit; // mark the transaction boundary for the
//                       load issued in this iteration.
//   # At this point we have D-1 transactions in flight,
//       and none guaranteed complete.
//   cp.async.wait D-2; // Ensure all but the last D-2 transactions complete.
//   __syncthreads(); // Needed because each thread only ensures
//                      completion of its own async copies, so a
//                      block-wide sync is required to guarantee
//                      completion of the whole tile.
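//
// As a standalone illustration (not part of this file's API), the
// same pattern can be written with the CUDA pipeline primitives from
// <cuda_pipeline.h>, which lower to cp.async on sm80+. This is a
// minimal sketch; names such as `circularBufferedCopy`, `TILE`, and
// `D` are hypothetical, and the predicate is simplified to a bounds
// check on the prefetched tile:
//
//   #include <cuda_pipeline.h>
//
//   // One thread per element of a TILE-wide tile; D is the stage depth.
//   template <int D, int TILE>
//   __global__ void circularBufferedCopy(const float* y, float* out, int n) {
//     __shared__ float x[D * TILE]; // circular buffer: D stages of size TILE
//     const int j = threadIdx.x;
//     // Prologue: issue the first D-1 tiles, one commit per tile.
//     for (int i = 0; i < D - 1; ++i) {
//       __pipeline_memcpy_async(
//           &x[i * TILE + j], &y[i * TILE + j], sizeof(float));
//       __pipeline_commit();
//     }
//     // Leave D-2 transactions in flight; tile 0 is now ready.
//     __pipeline_wait_prior(D - 2);
//     __syncthreads();
//     for (int i = 0; i < n; ++i) { // main loop (epilogue folded in)
//       if (i + D - 1 < n) { // issue the load D-1 tiles ahead
//         __pipeline_memcpy_async(&x[((i + D - 1) % D) * TILE + j],
//                                 &y[(i + D - 1) * TILE + j], sizeof(float));
//       }
//       out[i * TILE + j] = x[(i % D) * TILE + j]; // consume current tile
//       __pipeline_commit();          // transaction boundary
//       __pipeline_wait_prior(D - 2); // tile for iteration i+1 is ready
//       __syncthreads();              // block-wide completion of the tile
//     }
//   }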

namespace torch {
namespace jit {
namespace fuser {
namespace cuda {

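//! Returns the position of the double buffer axis of tv, following
//!  the selection rule described in the comment above.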
unsigned int getDoubleBufferAxisPosition(const TensorView* tv);

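//! Returns the iterdomain of tv that is used as its double buffer
//!  axis.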
IterDomain* getDoubleBufferAxis(const TensorView* tv);

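//! Validates that tv can legally be double buffered (e.g., that a
//!  valid double buffer axis exists); errors out otherwise.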
void validateDoubleBufferedTensor(const TensorView* tv);

class TORCH_CUDA_CU_API DoubleBufferPass {
 public:
  //! Apply double buffering transformations
  static std::vector<Expr*> run(const std::vector<Expr*>& exprs);
};
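
// A minimal usage sketch (the surrounding pipeline shown here is
// hypothetical; in practice this pass runs as one step of kernel
// lowering):
//
//   std::vector<Expr*> exprs = /* result of earlier lowering passes */;
//   exprs = DoubleBufferPass::run(exprs);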

class TORCH_CUDA_CU_API DoubleBufferInfo {
  // Lowering information of double buffered tensors.
  struct TvInfo {
    IterDomain* double_buffer_axis = nullptr;
    Val* original_alloc_size = nullptr;
  };

 public:
  void build(Fusion* fusion);

  void setDoubleBufferAxis(const TensorView* tv, IterDomain* id);

  IterDomain* getDoubleBufferAxis(const TensorView* tv);

  //! Get a loop that matches with a given double-buffer axis. If
  //! ignore_prologue is true, a matched loop is ignored if it's a
  //! prologue loop.
  static kir::ForLoop* getDoubleBufferLoop(
      IterDomain* axis,
      const std::vector<kir::ForLoop*>& loops,
      bool ignore_prologue = false);

  //! Get a loop that matches with the double-buffer axis of a given
  //! double-buffered tensor. If ignore_prologue is true, a matched
  //! loop is ignored if it's a prologue loop.
  kir::ForLoop* getDoubleBufferLoop(
      const TensorView* tv,
      const std::vector<kir::ForLoop*>& loops,
      bool ignore_prologue = false);

  void setOriginalAllocSize(const TensorView* tv, Val* size);

  Val* getOriginalAllocSize(const TensorView* tv);

  //! Returns true if the iterdomain will be realized
  //!  as a double buffer loop.
  bool isDoubleBufferedIterDomain(IterDomain* id);

  //! Get the number of circular buffer stages for the given axis.
  //!  The number of stages is 2 in the case of a double buffer loop.
  unsigned int getStageDepthFor(IterDomain* circular_buffered_id);

 private:
  TvInfo& getTvInfo(const TensorView* tv);

  //! Set the number of circular buffer stages for the given
  //!  circular_buffered_id. Current code generation only supports one
  //!  stage depth per loop disjoint set, so this function will throw
  //!  an error when trying to set different stage depths on
  //!  iterdomains that are loop mapped.
  void setStageDepth(
      IterDomain* circular_buffered_id,
      unsigned int stage_depth);

 private:
  //! Keeps track of information for lowering double buffered tensors
  std::unordered_map<const TensorView*, TvInfo> map_;

  //! Keeps track of the concrete loop-mapped iterdomains that are
  //!  realized as double buffer loops.
  std::unordered_set<const IterDomain*> concrete_double_buffered_loop_id_;

  //! Keeps track of double buffer loop stage depths.
  //!  Currently, for each disjoint set of loop mapped iterdomains,
  //!  only one stage depth is supported, so that the loops can indeed
  //!  share the same Prologue extent and Main loop offset.
  std::unordered_map<IterDomain*, unsigned int> stage_depth_;
};
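
// A minimal usage sketch of DoubleBufferInfo (`fusion` and `tv` are
// hypothetical; in practice an instance is built once during
// lowering and queried by later passes):
//
//   DoubleBufferInfo info;
//   info.build(fusion); // collect per-tensor double buffer info
//   IterDomain* axis = info.getDoubleBufferAxis(tv);
//   Val* alloc_size = info.getOriginalAllocSize(tv);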

} // namespace cuda
} // namespace fuser
} // namespace jit
} // namespace torch