# Instruction count microbenchmarks
## Quick start
### To run the benchmark:
```
# From pytorch root
cd benchmarks/instruction_counts
python main.py
```
Currently `main.py` contains a very simple threadpool (so that run time isn't
unbearably onerous) and simply prints the results. These components will be
upgraded in subsequent PRs.
### To define a new benchmark:
* `TimerArgs`: Low-level definition which maps directly to
  `torch.utils.benchmark.Timer`. (A minimal sketch follows this list.)
* `GroupedStmts`: Benchmark a snippet (Python, C++, or both). Can automatically
  generate TorchScript and autograd variants.
* `GroupedModules`: Like `GroupedStmts`, but takes `nn.Module`s.
* `GroupedVariants`: Benchmark-per-line to define many related benchmarks in a
  single code block.
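For reference, a `TimerArgs` definition is little more than the constructor
arguments of `torch.utils.benchmark.Timer` captured as data. The sketch below
is illustrative only; the import path and field names are assumptions, and the
authoritative definition lives in `core/api.py`:
```
from core.api import TimerArgs  # import path is an assumption

# Hypothetical low-level benchmark: each field corresponds directly to an
# argument of `torch.utils.benchmark.Timer`.
add_benchmark = TimerArgs(
    stmt="y = x + x",
    setup="x = torch.ones((4, 4))",
    num_threads=1,
)
```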
## Architecture
### Benchmark definition.
One primary goal of this suite is to make it easy to define semantically
related clusters of benchmarks. The crux of this effort is the
`GroupedBenchmark` class, which is defined in `core/api.py`. It takes a
definition for a set of related benchmarks, and produces one or more concrete
cases. It's helpful to see an example to understand how the machinery works.
Consider the following benchmark:
```
# `GroupedStmts` is an alias of `GroupedBenchmark.init_from_stmts`
benchmark = GroupedStmts(
    py_stmt=r"y = x * w",
    cpp_stmt=r"auto y = x * w;",

    setup=GroupedSetup(
        py_setup="""
            x = torch.ones((4, 4))
            w = torch.ones((4, 4), requires_grad=True)
        """,

        cpp_setup="""
            auto x = torch::ones({4, 4});
            auto w = torch::ones({4, 4});
            w.set_requires_grad(true);
        """,
    ),

    signature="f(x, w) -> y",
    torchscript=True,
    autograd=True,
)
```
It is trivial to generate Timers for the eager forward mode case (ignoring
`num_threads` for now):
```
Timer(
    stmt=benchmark.py_fwd_stmt,
    setup=benchmark.setup.py_setup,
)

Timer(
    stmt=benchmark.cpp_fwd_stmt,
    setup=benchmark.setup.cpp_setup,
    language="cpp",
)
```
Moreover, because `signature` is provided we know that creation of `x` and `w`
is part of setup, and the overall computation uses `x` and `w` to produce `y`.
As a result, we can derive TorchScript'd and AutoGrad variants as well. We can
deduce that a TorchScript model will take the form:
```
@torch.jit.script
def f(x, w):
    # Paste `benchmark.py_fwd_stmt` into the function body.
    y = x * w
    return y  # Set by `-> y` in signature.
```
And because we will want to use this model in both Python and C++, we save it to
disk and load it as needed. At this point Timers for TorchScript become:
```
Timer(
    stmt="""
        y = jit_model(x, w)
    """,
    setup="""
        # benchmark.setup.py_setup
        # jit_model = torch.jit.load(...)
        # Warm up jit_model
    """,
)

Timer(
    stmt="""
        std::vector<torch::jit::IValue> ivalue_inputs{
            torch::jit::IValue(x),
            torch::jit::IValue(w)
        };
        auto y = jit_model.forward(ivalue_inputs);
    """,
    setup="""
        // benchmark.setup.cpp_setup
        // jit_model = torch::jit::load(...)
        // Warm up jit_model
    """,
    language="cpp",
)
```
While nothing above is particularly complex, there is non-trivial bookkeeping
(managing the model artifact, setting up IValues) which, if done manually,
would be bug-prone and hard to read.
The story is similar for autograd: because we know the output variable (`y`)
and we make sure to assign it when calling TorchScript models, testing AutoGrad
is as simple as appending `y.backward()` (or `y.backward();` in C++) to the
`stmt` of the forward-only variant. Of course this requires that `signature` be
provided, as there is nothing special about the name `y`.
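For example, the Python autograd variant of the eager benchmark above would
look roughly like the following sketch (the actual `stmt` is assembled
automatically from `benchmark.py_fwd_stmt`):
```
Timer(
    stmt="""
        y = x * w
        y.backward()
    """,
    setup=benchmark.setup.py_setup,
)
```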
The logic for the manipulations above is split between `core/api.py` (for
generating `stmt` based on language, Eager/TorchScript, with or without AutoGrad)
and `core/expand.py` (for larger, more expansive generation). The benchmarks
themselves are defined in `definitions/standard.py`. The current set is chosen
to demonstrate the various model definition APIs, and will be expanded when the
benchmark runner infrastructure is better equipped to deal with a larger run.
### Benchmark execution.
Once `expand.materialize` has flattened the abstract benchmark definitions into
`TimerArgs`, they can be sent to a worker (`worker/main.py`) subprocess for
execution. This worker has no concept of the larger benchmark suite; `TimerArgs`
maps one-to-one and directly onto the `torch.utils.benchmark.Timer` instance
that the worker instantiates.
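Conceptually, the worker's job reduces to something like the following sketch
(assuming `TimerArgs` exposes `stmt`, `setup`, `num_threads`, and `language`
fields; the real worker also handles reporting results back to the parent
process):
```
from torch.utils.benchmark import Timer

def run(timer_args):
    # Build a Timer directly from the fields of TimerArgs.
    # (Field names here are assumptions; see `core/api.py`.)
    timer = Timer(
        stmt=timer_args.stmt,
        setup=timer_args.setup,
        num_threads=timer_args.num_threads,
        language=timer_args.language,
    )
    # Instruction counts come from Callgrind rather than wall-clock timing.
    return timer.collect_callgrind()
```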