# Distributed Data Parallel Benchmark

This tool is used to measure distributed training iteration time. This
is helpful for evaluating the performance impact of code changes to
`torch.nn.parallel.DistributedDataParallel`, `torch.distributed`, or
anything in between.

It optionally produces a JSON file with all measurements, allowing for
an easy A/B comparison of code, configuration, or environment. This
comparison can be produced by `diff.py`.

## Requirements

This benchmark depends on PyTorch and torchvision.
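
Both can be installed with pip, for example (this assumes prebuilt
wheels are suitable for your platform; see the official install
instructions for CUDA-specific variants):

```
pip install torch torchvision
```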

## How to run

Run as many copies of this script as you have model replicas.

If you launch a single task per machine with multiple GPUs, consider
using [`torch.distributed.launch`][launch] to spawn multiple processes
per machine, as in the sketch below.

[launch]: https://pytorch.org/docs/stable/distributed.html#launch-utility
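
For example, a two-machine job with eight GPUs per machine could be
launched like this on the first machine (the launcher flags are the
standard `torch.distributed.launch` ones; the script name
`benchmark.py` and any flags other than `--json` are illustrative, so
consult the script's `--help` for its actual interface):

```
python3 -m torch.distributed.launch --nnodes=2 --node_rank=0 \
    --nproc_per_node=8 --master_addr=$MASTER_ADDR --master_port=29500 \
    benchmark.py --json baseline.json
```

The same command runs on the second machine with `--node_rank=1`.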

Example output (only on rank 0). In the tables below, a row label like
`2M/8G` denotes the number of machines (M) and GPUs per machine (G)
used for that measurement:
```
-----------------------------------
PyTorch distributed benchmark suite
-----------------------------------
* PyTorch version: 1.4.0a0+05140f0
* CUDA version: 10.0
* Distributed backend: nccl
--- nvidia-smi topo -m ---

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_2  mlx5_0  mlx5_3  mlx5_1  CPU Affinity
GPU0    X       NV1     NV1     NV2     NV2     SYS     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU1    NV1     X       NV2     NV1     SYS     NV2     SYS     SYS     SYS     PIX     SYS     PHB     0-19,40-59
GPU2    NV1     NV2     X       NV2     SYS     SYS     NV1     SYS     SYS     PHB     SYS     PIX     0-19,40-59
GPU3    NV2     NV1     NV2     X       SYS     SYS     SYS     NV1     SYS     PHB     SYS     PIX     0-19,40-59
GPU4    NV2     SYS     SYS     SYS     X       NV1     NV1     NV2     PIX     SYS     PHB     SYS     0-19,40-59
GPU5    SYS     NV2     SYS     SYS     NV1     X       NV2     NV1     PIX     SYS     PHB     SYS     0-19,40-59
GPU6    SYS     SYS     NV1     SYS     NV1     NV2     X       NV2     PHB     SYS     PIX     SYS     0-19,40-59
GPU7    SYS     SYS     SYS     NV1     NV2     NV1     NV2     X       PHB     SYS     PIX     SYS     0-19,40-59
mlx5_2  SYS     SYS     SYS     SYS     PIX     PIX     PHB     PHB     X       SYS     PHB     SYS
mlx5_0  PIX     PIX     PHB     PHB     SYS     SYS     SYS     SYS     SYS     X       SYS     PHB
mlx5_3  SYS     SYS     SYS     SYS     PHB     PHB     PIX     PIX     PHB     SYS     X       SYS
mlx5_1  PHB     PHB     PIX     PIX     SYS     SYS     SYS     SYS     SYS     PHB     SYS     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

--------------------------

Benchmark: resnet50 with batch size 32

                       sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec
 1 GPUs -- no ddp:  p50: 0.097s  329/s  p75: 0.097s  329/s  p90: 0.097s  329/s  p95: 0.097s  329/s
 1 GPUs --  1M/1G:  p50: 0.100s  319/s  p75: 0.100s  318/s  p90: 0.100s  318/s  p95: 0.100s  318/s
 2 GPUs --  1M/2G:  p50: 0.103s  310/s  p75: 0.103s  310/s  p90: 0.103s  310/s  p95: 0.103s  309/s
 4 GPUs --  1M/4G:  p50: 0.103s  310/s  p75: 0.103s  310/s  p90: 0.103s  310/s  p95: 0.103s  310/s
 8 GPUs --  1M/8G:  p50: 0.104s  307/s  p75: 0.104s  307/s  p90: 0.104s  306/s  p95: 0.104s  306/s
16 GPUs --  2M/8G:  p50: 0.104s  306/s  p75: 0.104s  306/s  p90: 0.104s  306/s  p95: 0.104s  306/s

Benchmark: resnet101 with batch size 32

                       sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec
 1 GPUs -- no ddp:  p50: 0.162s  197/s  p75: 0.162s  197/s  p90: 0.162s  197/s  p95: 0.162s  197/s
 1 GPUs --  1M/1G:  p50: 0.171s  187/s  p75: 0.171s  186/s  p90: 0.171s  186/s  p95: 0.172s  185/s
 2 GPUs --  1M/2G:  p50: 0.176s  182/s  p75: 0.176s  181/s  p90: 0.176s  181/s  p95: 0.176s  181/s
 4 GPUs --  1M/4G:  p50: 0.176s  182/s  p75: 0.176s  181/s  p90: 0.176s  181/s  p95: 0.176s  181/s
 8 GPUs --  1M/8G:  p50: 0.179s  179/s  p75: 0.179s  178/s  p90: 0.180s  178/s  p95: 0.180s  177/s
16 GPUs --  2M/8G:  p50: 0.179s  178/s  p75: 0.180s  177/s  p90: 0.183s  174/s  p95: 0.188s  170/s

Benchmark: resnext50_32x4d with batch size 32

                       sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec
 1 GPUs -- no ddp:  p50: 0.145s  220/s  p75: 0.145s  220/s  p90: 0.145s  220/s  p95: 0.145s  220/s
 1 GPUs --  1M/1G:  p50: 0.147s  217/s  p75: 0.147s  217/s  p90: 0.148s  216/s  p95: 0.148s  216/s
 2 GPUs --  1M/2G:  p50: 0.153s  209/s  p75: 0.153s  209/s  p90: 0.153s  209/s  p95: 0.153s  209/s
 4 GPUs --  1M/4G:  p50: 0.153s  208/s  p75: 0.153s  208/s  p90: 0.154s  208/s  p95: 0.154s  208/s
 8 GPUs --  1M/8G:  p50: 0.157s  204/s  p75: 0.157s  204/s  p90: 0.157s  203/s  p95: 0.157s  203/s
16 GPUs --  2M/8G:  p50: 0.157s  203/s  p75: 0.157s  203/s  p90: 0.158s  203/s  p95: 0.158s  202/s

Benchmark: resnext101_32x8d with batch size 32

                       sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec     sec/iter ex/sec
 1 GPUs -- no ddp:  p50: 0.415s   77/s  p75: 0.415s   77/s  p90: 0.416s   76/s  p95: 0.417s   76/s
 1 GPUs --  1M/1G:  p50: 0.425s   75/s  p75: 0.426s   75/s  p90: 0.426s   75/s  p95: 0.426s   75/s
 2 GPUs --  1M/2G:  p50: 0.438s   73/s  p75: 0.439s   72/s  p90: 0.439s   72/s  p95: 0.439s   72/s
 4 GPUs --  1M/4G:  p50: 0.439s   72/s  p75: 0.439s   72/s  p90: 0.440s   72/s  p95: 0.440s   72/s
 8 GPUs --  1M/8G:  p50: 0.447s   71/s  p75: 0.447s   71/s  p90: 0.448s   71/s  p95: 0.448s   71/s
16 GPUs --  2M/8G:  p50: 0.450s   71/s  p75: 0.451s   70/s  p90: 0.451s   70/s  p95: 0.451s   70/s
```

## How to diff

Run the benchmark with the `--json PATH_TO_REPORT_FILE` argument to
produce the JSON file that the diff script can consume.
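
For example, to compare two configurations, run the benchmark twice
(launched as described in "How to run") and write one report per run.
Only `--json` is documented above; the `--bucket-size` flag is a
hypothetical stand-in for whatever configuration is being compared:

```
# Baseline run with default settings.
python3 benchmark.py --json baseline.json
# Test run with the configuration under test (illustrative flag).
python3 benchmark.py --json test.json --bucket-size 1
```
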
Then, run the diff script as follows:
```
$ python3 diff.py PATH_TO_BASELINE_FILE PATH_TO_TEST_FILE
                                 baseline      test
                     --------------------      --------------------
bucket_size:                           25  vs  1
cuda_version:                        10.0  vs  10.0
distributed_backend:                 nccl  vs  nccl
pytorch_version:          1.4.0a0+05140f0  vs  1.4.0a0+05140f0

Benchmark: resnet50 with batch size 32

             sec/iter ex/sec    diff     sec/iter ex/sec    diff
 1 GPUs:  p75: 0.101s  317/s   -0.3%  p95: 0.101s  317/s   -0.4%
 2 GPUs:  p75: 0.104s  306/s   -1.0%  p95: 0.104s  306/s   -1.0%
 4 GPUs:  p75: 0.105s  305/s   -1.6%  p95: 0.105s  304/s   -1.8%
 8 GPUs:  p75: 0.107s  299/s   -2.6%  p95: 0.107s  298/s   -2.7%
16 GPUs:  p75: 0.108s  294/s   -3.8%  p95: 0.122s  262/s  -16.4%

Benchmark: resnet101 with batch size 32

             sec/iter ex/sec    diff     sec/iter ex/sec    diff
 1 GPUs:  p75: 0.172s  185/s   -1.2%  p95: 0.172s  185/s   -1.3%
 2 GPUs:  p75: 0.179s  178/s   -2.1%  p95: 0.179s  178/s   -2.0%
 4 GPUs:  p75: 0.180s  177/s   -2.6%  p95: 0.180s  177/s   -2.6%
 8 GPUs:  p75: 0.184s  173/s   -3.5%  p95: 0.184s  173/s   -3.5%
16 GPUs:  p75: 0.187s  170/s   -0.1%  p95: 0.204s  157/s   -7.9%

Benchmark: resnext50_32x4d with batch size 32

             sec/iter ex/sec    diff     sec/iter ex/sec    diff
 1 GPUs:  p75: 0.149s  214/s   -1.0%  p95: 0.149s  214/s   -0.9%
 2 GPUs:  p75: 0.156s  205/s   -1.5%  p95: 0.156s  205/s   -1.6%
 4 GPUs:  p75: 0.156s  204/s   -1.6%  p95: 0.157s  204/s   -1.8%
 8 GPUs:  p75: 0.159s  200/s   -1.5%  p95: 0.159s  200/s   -1.5%
16 GPUs:  p75: 0.161s  198/s   -1.9%  p95: 0.162s  197/s   -2.3%

Benchmark: resnext101_32x8d with batch size 32

             sec/iter ex/sec    diff     sec/iter ex/sec    diff
 1 GPUs:  p75: 0.427s   74/s   -0.8%  p95: 0.428s   74/s   -0.7%
 2 GPUs:  p75: 0.444s   72/s   -1.3%  p95: 0.445s   71/s   -0.7%
 4 GPUs:  p75: 0.444s   72/s   -1.1%  p95: 0.445s   71/s   -0.8%
 8 GPUs:  p75: 0.452s   70/s   -1.3%  p95: 0.452s   70/s   -1.3%
16 GPUs:  p75: 0.455s   70/s   -0.7%  p95: 0.456s   70/s   -0.6%
```

This example compares throughput between `bucket_cap_mb=25` (the
default) and `bucket_cap_mb=1` on 8 DGX machines with V100 GPUs. It
confirms that, even for a relatively small model on machines with a
very fast interconnect (4x 100Gb InfiniBand per machine), it still
pays off to batch allreduce calls.
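
The bucket size compared here corresponds to the `bucket_cap_mb`
argument of the `DistributedDataParallel` constructor, which controls
how many megabytes of gradients are batched into a single allreduce. A
minimal sketch of where it plugs in, assuming the process group is
already initialized and this process owns one GPU:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already been
# called and the model lives on this process's GPU.
model = nn.Linear(1024, 1024).cuda()

# DDP groups gradients into buckets of roughly bucket_cap_mb megabytes
# and issues one allreduce per bucket. 25 is the default; the diff
# above compares it against bucket_cap_mb=1 (many small allreduces).
ddp_model = DDP(model, bucket_cap_mb=25)
```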