#pragma once
#include <c10/macros/Export.h>
#include <torch/csrc/jit/codegen/cuda/ir_all_nodes.h>
#include <torch/csrc/jit/codegen/cuda/kernel_ir.h>
#include <torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.h>
// Double buffering a tensor doubles its allocation size and uses two
// buffers so that computation and memory access can overlap. The
// basic form of the generated code looks as follows:
//
// Before:
// for i
// x[S]; // allocation
// for j:
// x[j] = y[i, j]
// for j:
// ... = x[j]
//
// After:
// X[S * 2]; // allocation
// for i in 0 to 1: // Prologue
// for j:
// x[j] = y[i, j]
//
// for i in 0 to N-1: // Main
// for j:
// x[j + (1 - i % 2) * S] = y[i + 1, j]
// for j:
// ... = x[j + (i % 2) * S]
//
// for i in N-1 to N: // Epilogue
// for j:
// ... = x[j + (i % 2) * S]
//
// Here, S is the original size of tensor x.
//
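// As a concrete illustration, a hand-written CUDA equivalent of the
// transformed code could look like the following (a minimal sketch;
// smem_x, gmem_y, use, S, and N are illustrative names, not
// identifiers produced by the lowering):
//
//   __shared__ float smem_x[2 * S];
//   for (int j = 0; j < S; ++j)            // Prologue
//     smem_x[j] = gmem_y[0 * S + j];
//   for (int i = 0; i < N - 1; ++i) {      // Main
//     for (int j = 0; j < S; ++j)          // load the next tile
//       smem_x[(1 - i % 2) * S + j] = gmem_y[(i + 1) * S + j];
//     for (int j = 0; j < S; ++j)          // compute the current tile
//       use(smem_x[(i % 2) * S + j]);
//   }
//   for (int j = 0; j < S; ++j)            // Epilogue
//     use(smem_x[((N - 1) % 2) * S + j]);
//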
// The i loop is the double buffer loop of tensor x, i.e., the loop
// where double buffering is applied to the tensor. The first step of
// lowering is to find the double buffering axis for each double
// buffered tensor. The axis must not be parallelized, as it isn't
// possible to double buffer parallelized loops. An unrolled axis
// conflicts with double buffering as well, since unrolling expands
// the allocation and is meant to make the loop completely unrolled.
// So, the double buffering axis is basically the inner-most axis
// among the axes left of the CA (compute-at) position; when that
// axis is parallelized or unrolled, an axis further to the left is
// picked.
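// For example, given axes [i0, i1, i2] with i1 parallelized with
// threadIdx.x and CA position 2, i1 is the inner-most axis left of
// the CA position but is parallelized, so i0 is picked as the double
// buffer axis (an illustrative schedule, not one from a specific
// fusion).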
//
// Once the double buffer axis is determined, the main task is to
// replicate the corresponding double buffer loop as illustrated
// above. The Prologue loop just performs the first load to populate
// the buffer. The Main loop is mostly the same as the original loop,
// except for the indexing change that switches between the two
// buffers: when the tensor is used as a consumer, an offset of
// (1 - i % 2) * S is added, whereas (i % 2) * S is added when it is
// used as a producer, where i is the index of the double buffer
// loop. The Epilogue loop handles just the last iteration of the
// loop. Since the Main loop loads one iteration ahead of the
// producer of the double buffered tensor, using it for the last
// iteration as well would require an additional guard to prevent
// buffer overruns with the producer. However, the value fetched by
// that invalid load would never be used, so instead of adding the
// additional predicate, the Epilogue loop is replicated from the
// original loop with the load expression omitted, since the loaded
// value is not used there. Note that this overrun does not happen
// when the producer is on gmem, so in that case this additional
// replication is not done.
//
// When creating those three types of loops, additional care must be
// taken when multiple tensors are double buffered. When multiple
// tensors use the same loop as their double buffer loop, one pass of
// replication takes care of all of them at once, meaning the same
// Prologue, Main, and Epilogue loops are shared by those tensors.
//
// Other tasks to do for a double buffer tensor include:
// - Move allocation to outside of the double buffer loop
// - Double the allocation size
// - Omit the RAW sync in the Main and Epilogue loops (see the note below)
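// The RAW sync can be omitted because, within a Main loop iteration,
// the copy writes buffer (1 - i % 2) while the reads access buffer
// (i % 2), so the copy and the reads never touch the same buffer in
// the same iteration.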
//
// [Circular buffer] A generalization of double buffering.
// On sm80+ hardware there is asynchronous copy infrastructure that
// motivates a circular buffering generalization of double buffering.
// Almost all of the analyses done for double buffering carry over
// unchanged to circular buffering, except for the introduction of a
// new concept: `stage depth`.
//
// The `stage depth` is defined as the multiplier of extra buffering
// space used. In the case of double buffering, the stage depth would
// be 2.
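// For example, with D = 4, the allocation becomes x[S * 4] and up to
// D - 1 = 3 tiles can be in flight ahead of the tile being computed.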
//
// A circular buffered loop structure looks as follows. It exactly
// parallels the double buffered loop structure above, since it is an
// exact generalization serving the same purpose.
//
// Here S is the original allocation size as above and D is the stage
// depth. With D=2, the loop structure below becomes exactly the same
// as the double buffering case.
//
// allocate X[S*D] // allocation
// for i in 0..D-1: // prolog
// for j in ...
// if pred:
// x[i*S+j] = y[i, j];
//
// for i in 0..N: // main loop
// for j in ...
// if pred:
// x[((i+D-1)%D)*S+j] = y[i+D-1, j];
// for j in ...
// .. = x[(i%D)*S+j]
//
// (The Epilog is omitted since this pattern only makes sense when
// using cp.async, where the producer is in global memory and the
// consumer is in shared memory.)
//
// The profitability of this optimization comes from the extra
// tolerance of global memory pipeline latency: at the expression
// `.. = x[(i%D)*S+j]` we only need to make sure the data for the
// current iteration is complete, while the remaining D-2 in-flight
// loads can still be in progress and overlap with the compute of the
// current iteration.
//
// To express this pattern on sm80+ hardware, we can group the loads
// in each iteration of the circular buffered loop into one
// "transaction", and specify how many transactions we want to ensure
// completion of when we insert the async barriers.
//
// allocate X[S*D] // allocation
// for i in 0..D-1: // prolog
// for j in ...
// if pred:
// x[i*S+j] = y[i, j];
// cp.async.commit; // mark the transaction boundary
//
// # At this point we have D-1 transactions in flight. For the
// #  first iteration of the main loop we need one transaction
// #  completed, so we leave D-2 transactions in flight, which is
// #  the operand given to the wait instruction.
//
// cp.async.wait D-2 // ensure all but the last D-2 transactions complete.
//
// for i in 0..N: // main loop
// # At this point we always have D-2 transactions in flight,
// #  and at least one completed.
// for j in ...
// if pred:
// x[((i+D-1)%D)*S+j] = y[i+D-1, j];
// for j in ...
// .. = x[(i%D)*S+j]
// cp.async.commit; // mark the transaction boundary for the
// load issued in this iteration.
// # At this point we have D-1 transactions in flight,
// #  and none guaranteed complete.
// cp.async.wait D-2; // Ensure all but the last D-2 transactions complete.
// __syncthreads(); // Needed because each thread only ensures
//                  // completion of its own async copies, so a
//                  // block-wide sync is required at this point to
//                  // ensure completion of the whole tile.
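//
// For reference, the same pattern can be written by hand with the
// CUDA pipeline primitives (a minimal sketch rather than the code
// this pass emits; it assumes sm80+, one float per thread per tile,
// blockDim.x == S, and illustrative names smem_x/gmem_y/S/D/N):
//
//   #include <cuda_pipeline.h>
//
//   constexpr int D = 4;    // stage depth
//   constexpr int S = 128;  // tile size, assumed == blockDim.x
//
//   __global__ void circular_buffered(
//       const float* gmem_y, float* out, int N) {
//     __shared__ float smem_x[D * S];
//     const int j = threadIdx.x;
//     for (int i = 0; i < D - 1; ++i) {     // prolog
//       __pipeline_memcpy_async(&smem_x[i * S + j],
//                               &gmem_y[i * S + j], sizeof(float));
//       __pipeline_commit();                // transaction boundary
//     }
//     __pipeline_wait_prior(D - 2);         // tile 0 is now ready
//     __syncthreads();
//     for (int i = 0; i < N; ++i) {         // main loop
//       if (i + D - 1 < N) {                // predicate off the tail
//         __pipeline_memcpy_async(
//             &smem_x[((i + D - 1) % D) * S + j],
//             &gmem_y[(i + D - 1) * S + j], sizeof(float));
//       }
//       out[i * S + j] = smem_x[(i % D) * S + j] * 2.0f;  // compute
//       __pipeline_commit();
//       __pipeline_wait_prior(D - 2);       // keep D-2 in flight
//       __syncthreads();                    // whole-tile visibility
//     }
//   }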
namespace torch {
namespace jit {
namespace fuser {
namespace cuda {
//! Returns the position of the double buffer axis of the given tensor
unsigned int getDoubleBufferAxisPosition(const TensorView* tv);
//! Returns the iterdomain of the double buffer axis of the given tensor
IterDomain* getDoubleBufferAxis(const TensorView* tv);
//! Validates that the given tensor is properly set up for double buffering
void validateDoubleBufferedTensor(const TensorView* tv);
class TORCH_CUDA_CU_API DoubleBufferPass {
public:
//! Apply double buffering transformations
static std::vector<Expr*> run(const std::vector<Expr*>& exprs);
};
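// Example (hypothetical call site; the pass is normally invoked as
// part of the lowering pipeline):
//
//   std::vector<Expr*> lowered = DoubleBufferPass::run(exprs);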
class TORCH_CUDA_CU_API DoubleBufferInfo {
  //! Lowering information of a double buffered tensor
struct TvInfo {
IterDomain* double_buffer_axis = nullptr;
Val* original_alloc_size = nullptr;
};
public:
void build(Fusion* fusion);
void setDoubleBufferAxis(const TensorView* tv, IterDomain* id);
IterDomain* getDoubleBufferAxis(const TensorView* tv);
//! Get a loop that matches with a given double-buffer axis. If
//! ignore_prologue is true, a matched loop is ignored if it's a
//! prologue loop.
static kir::ForLoop* getDoubleBufferLoop(
IterDomain* axis,
const std::vector<kir::ForLoop*>& loops,
bool ignore_prologue = false);
//! Get a loop that matches with the double-buffer axis of a given
//! double-buffered tensor. If ignore_prologue is true, a matched
//! loop is ignored if it's a prologue loop.
kir::ForLoop* getDoubleBufferLoop(
const TensorView* tv,
const std::vector<kir::ForLoop*>& loops,
bool ignore_prologue = false);
void setOriginalAllocSize(const TensorView* tv, Val* size);
Val* getOriginalAllocSize(const TensorView* tv);
//! Returns true if the iterdomain will be realized
//! as a double buffer loop.
bool isDoubleBufferedIterDomain(IterDomain* id);
//! Get the number of circular buffer stages for the given axis. The
//! number of stages will be 2 in the case of a double buffer loop.
unsigned int getStageDepthFor(IterDomain* circular_buffered_id);
private:
TvInfo& getTvInfo(const TensorView* tv);
//! Set the number of circular buffer stages for the given
//! circular_buffered_id. Current code generation only supports one
//! stage depth per loop disjoint set, so this function will throw an
//! error when trying to set different stage depths on iterdomains
//! that are loop mapped.
void setStageDepth(
IterDomain* circular_buffered_id,
unsigned int stage_depth);
private:
//! Keeps track of information for lowering double buffered tensors
std::unordered_map<const TensorView*, TvInfo> map_;
//! Keeps track of the concrete loop-map iterdomains that are
//! realized as double buffer loops
std::unordered_set<const IterDomain*> concrete_double_buffered_loop_id_;
//! Keeps track of double buffer loop stage depth.
//! Currently, for each disjoint set of loop-mapped iterdomains, only
//! one stage depth is supported, so that the loops can indeed be
//! shared with the same prolog extent and main loop offset.
std::unordered_map<IterDomain*, unsigned int> stage_depth_;
};
} // namespace cuda
} // namespace fuser
} // namespace jit
} // namespace torch