# CIFAR10 Example with Ignite
In this example, we show how to use _Ignite_ to:
- train a neural network on 1 or more GPUs or TPUs
- compute training/validation metrics
- log the learning rate, metrics, etc.
- save the best model weights
Configurations:
- [x] single GPU
- [x] multi GPUs on a single node
- [x] multi GPUs on multiple nodes
- [x] TPUs on Colab
## Requirements:
- pytorch-ignite: `pip install pytorch-ignite`
- [torchvision](https://github.com/pytorch/vision/): `pip install torchvision`
- [tqdm](https://github.com/tqdm/tqdm/): `pip install tqdm`
- [tensorboardx](https://github.com/lanpa/tensorboard-pytorch): `pip install tensorboardX`
- [python-fire](https://github.com/google/python-fire): `pip install fire`
- Optional: [clearml](https://github.com/allegroai/clearml): `pip install clearml`
Alternatively, install all the requirements with `pip install -r requirements.txt`.
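For reference, a minimal unpinned `requirements.txt` covering the packages above could look like the following (the file shipped with the example may pin versions or differ slightly):
```
pytorch-ignite
torchvision
tqdm
tensorboardX
fire
clearml
```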
## Usage:
Run the example on a single GPU:
```bash
python main.py run
```
For more details on accepted arguments:
```bash
python main.py run -- --help
```
If the CIFAR10 dataset is already downloaded, its location can be passed as a parameter:
```bash
--data_path="/path/to/cifar10/"
```
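For example, a full command reusing a previously downloaded dataset (the path is a placeholder):
```bash
python main.py run --data_path="/path/to/cifar10/"
```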
### Distributed training
#### Single node, multiple GPUs
Let's start training on a single node with 2 GPUs:
```bash
# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl"
```
or
```bash
# spawning processes from inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
```
##### Using [Horovod](https://horovod.readthedocs.io/en/latest/index.html) as distributed backend
Please make sure Horovod is installed before running.
Let's start training on a single node with 2 GPUs:
```bash
# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"
```
or
```bash
# spawning processes from inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2
```
#### Colab, on 8 TPUs
The same code can be run on 8 TPUs in Colab: [Open in Colab](https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx)
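Outside of the notebook, a rough sketch of the equivalent launch would be the following; the `xla-tpu` backend name and the presence of `torch_xla` in the environment are assumptions, since this README only links to the Colab notebook:
```bash
# sketch: assumes torch_xla is installed and the example accepts the "xla-tpu" backend
python -u main.py run --backend="xla-tpu" --nproc_per_node=8
```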
#### Multiple nodes, multiple GPUs
Let's start training on two nodes with 2 GPUs each. We assume that the master node is reachable by the hostname `master`, e.g. `ping master` works.
1. Execute on master node
```bash
torchrun \
--nnodes=2 \
--nproc_per_node=2 \
--node_rank=0 \
--master_addr=master --master_port=2222 \
main.py run --backend="nccl"
```
2. Execute on worker node
```bash
torchrun \
--nnodes=2 \
--nproc_per_node=2 \
--node_rank=1 \
--master_addr=master --master_port=2222 \
main.py run --backend="nccl"
```
### Check resume training
#### Single GPU
Initial training, stopped at iteration 1000 (~11 epochs):
```bash
python main.py run --stop_iteration=1000
```
Resume from the latest checkpoint
```bash
python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
```
#### Distributed training
##### Single node, multiple GPUs
Initial training on a single node with 2 GPUs, stopped at iteration 1000 (~11 epochs):
```bash
# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl" --stop_iteration=1000
```
Resume from the latest checkpoint
```bash
torchrun --nproc_per_node=2 main.py run --backend="nccl" \
--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt
```
Similar commands can be adapted for other cases.
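For example, a sketch of resuming the 2-process Horovod run from above; the checkpoint path is illustrative and only assumes the same naming pattern as the paths shown earlier:
```bash
# illustrative checkpoint path, following the "resnet18_backend-<backend>-<nprocs>_stop-on-<iteration>" pattern above
horovodrun -np=2 python -u main.py run --backend="horovod" \
    --resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-1000/training_checkpoint_1000.pt
```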
## ClearML fileserver
If a `ClearML` server is used (i.e. the `--with_clearml` argument is passed), artifact uploading must be configured by
modifying the `ClearML` configuration file `~/clearml.conf` generated by `clearml-init`. According to the
[documentation](https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html), the `output_uri` argument can be
configured via `sdk.development.default_output_uri` to point to the fileserver URI. For a self-hosted server, the `ClearML`
fileserver URI is `http://localhost:8081`.
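As a sketch, the relevant fragment of `~/clearml.conf` for a self-hosted server could look like this (the rest of the generated file is left as-is):
```
sdk {
    development {
        # upload artifacts (e.g. checkpoints) to the ClearML fileserver
        default_output_uri: "http://localhost:8081"
    }
}
```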
For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html