#include <torch/csrc/jit/codegen/cuda/iter_visitor.h>
#include <torch/csrc/jit/codegen/cuda/kernel_ir_dispatch.h>
#include <torch/csrc/jit/codegen/cuda/lower2device.h>
#include <torch/csrc/jit/codegen/cuda/lower_magic_zero.h>
#include <torch/csrc/jit/codegen/cuda/lower_instrument.h>

namespace torch {
namespace jit {
namespace fuser {
namespace cuda {

namespace {

class Instrumentor : private kir::IrVisitor {
 public:
  Instrumentor(const std::vector<Expr*>& exprs) {
    IrVisitor::handle(exprs);
    if (profile_.getNumberOfProfileEntries() == 0) {
      exprs_ = exprs;
      return;
    }
    // Allocate a new TensorView as a backing buffer
    allocateBuffer();
    profile_.setBuffer(buffer_);
    // Insert the allocation expression at the beginning of the
    // top-level expressions. exprs_ is still empty here, so push_back
    // places the allocation first, followed by the original expressions.
    exprs_.push_back(buffer_alloc_);
    exprs_.insert(exprs_.end(), exprs.begin(), exprs.end());
  }

  const kir::KernelPerformanceProfile& profile() const {
    return profile_;
  }

  const std::vector<Expr*>& exprs() const {
    return exprs_;
  }

 private:
  using IrVisitor::handle;
  //! Profile all grid reductions, including the grouped variant.
  //! TODO: support other grid operations
  void handle(kir::GridReduction* expr) final {
    profile_.registerExpr(expr);
  }

  void handle(kir::GroupedGridReduction* expr) final {
    profile_.registerExpr(expr);
  }
  void allocateBuffer() {
    const auto num_profile_entries = profile_.getNumberOfProfileEntries();
    // If there is nothing to profile, do not allocate anything
    if (num_profile_entries == 0) {
      return;
    }
    // Allocate two integers for each entry. One is used for accumulating
    // cycles, and the other for counting the number of hits.
    const std::vector<IterDomain*> new_buffer_ids = {
        IterDomainBuilder(
            GpuLower::current()->kernel()->zeroVal(),
            IrBuilder::create<Int>(num_profile_entries))
            .build(),
        IterDomainBuilder(
            GpuLower::current()->kernel()->zeroVal(), IrBuilder::create<Int>(2))
            .build()};
    const auto buffer_domain = IrBuilder::create<TensorDomain>(new_buffer_ids);
    buffer_ = IrBuilder::create<TensorView>(
        buffer_domain, DataType::Int, MemoryType::Global);
    buffer_alloc_ = IrBuilder::create<kir::Allocate>(
        buffer_, buffer_->getMemoryType(), nullptr, true);
  }

 private:
  std::vector<Expr*> exprs_;
  kir::KernelPerformanceProfile profile_;
  TensorView* buffer_ = nullptr;
  kir::Allocate* buffer_alloc_ = nullptr;
};

} // namespace

std::vector<Expr*> instrumentKernel(const std::vector<Expr*>& exprs) {
  if (!isOptionEnabled(EnableOption::KernelProfile)) {
    return exprs;
  }
  Instrumentor inst(exprs);
  GpuLower::current()->profile() = inst.profile();
  return inst.exprs();
}
} // namespace cuda
} // namespace fuser
} // namespace jit
} // namespace torch