# CIFAR10 Example with Ignite
In this example, we show how to use _Ignite_ to:
- train a neural network on 1 or more GPUs or TPUs
- compute training/validation metrics
- log the learning rate, metrics, etc.
- save the best model weights
Configurations:
- [x] single GPU
- [x] multi GPUs on a single node
- [x] multi GPUs on multiple nodes
- [x] TPUs on Colab
## Requirements:
- pytorch-ignite: `pip install pytorch-ignite`
- [torchvision](https://github.com/pytorch/vision/): `pip install torchvision`
- [tqdm](https://github.com/tqdm/tqdm/): `pip install tqdm`
- [tensorboardx](https://github.com/lanpa/tensorboard-pytorch): `pip install tensorboardX`
- [python-fire](https://github.com/google/python-fire): `pip install fire`
- Optional: [clearml](https://github.com/allegroai/clearml): `pip install clearml`
Alternatively, install all the requirements with `pip install -r requirements.txt`.
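For reference, a minimal unpinned `requirements.txt` covering the packages above could look like the following (the file shipped with the example may pin versions or differ slightly):
```
pytorch-ignite
torchvision
tqdm
tensorboardX
fire
clearml
```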
## Usage:
Run the example on a single GPU:
```bash
python main.py run
```
For more details on accepted arguments:
```bash
python main.py run -- --help
```
If the CIFAR10 dataset is already downloaded, its location can be passed as a parameter:
```bash
--data_path="/path/to/cifar10/"
```
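For example, a full command reusing a previously downloaded dataset (the path is a placeholder):
```bash
python main.py run --data_path="/path/to/cifar10/"
```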
### Distributed training
#### Single node, multiple GPUs
Let's start training on a single node with 2 GPUs:
```bash
# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl"
```
or
```bash
# spawning processes from inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
```
##### Using [Horovod](https://horovod.readthedocs.io/en/latest/index.html) as distributed backend
Please make sure Horovod is installed before running.
Let's start training on a single node with 2 GPUs:
```bash
# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"
```
or
```bash
# spawning processes from inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2
```
#### Colab, on 8 TPUs
The same code can be run on 8 TPUs in Colab: [Open in Colab](https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx)
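Outside of the notebook, a rough sketch of the equivalent launch would be the following; the `xla-tpu` backend name and the presence of `torch_xla` in the environment are assumptions, since this README only links to the Colab notebook:
```bash
# sketch: assumes torch_xla is installed and the example accepts the "xla-tpu" backend
python -u main.py run --backend="xla-tpu" --nproc_per_node=8
```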
#### Multiple nodes, multiple GPUs
Let's start training on two nodes with 2 GPUs each. We assume that the master node is reachable by the hostname `master`, e.g. `ping master` works.
1. Execute on master node
```bash
torchrun \
--nnodes=2 \
--nproc_per_node=2 \
--node_rank=0 \
--master_addr=master --master_port=2222 \
main.py run --backend="nccl"
```
2. Execute on worker node
```bash
torchrun \
--nnodes=2 \
--nproc_per_node=2 \
--node_rank=1 \
--master_addr=master --master_port=2222 \
main.py run --backend="nccl"
```
### Check resume training
#### Single GPU
Initial training, stopped at iteration 1000 (~11 epochs):
```bash
python main.py run --stop_iteration=1000
```
Resume from the latest checkpoint
```bash
python main.py run --resume-from=/tmp/output-cifar10/resnet18_backend-None-1_stop-on-1000/training_checkpoint_1000.pt
```
#### Distributed training
##### Single node, multiple GPUs
Initial training on a single node with 2 GPUs, stopped at iteration 1000 (~11 epochs):
```bash
# using torchrun
torchrun --nproc_per_node=2 main.py run --backend="nccl" --stop_iteration=1000
```
Resume from the latest checkpoint
```bash
torchrun --nproc_per_node=2 main.py run --backend="nccl" \
--resume-from=/tmp/output-cifar10/resnet18_backend-nccl-2_stop-on-1000/training_checkpoint_1000.pt
```
Similar commands can be adapted for other cases.
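For example, a sketch of resuming the 2-process Horovod run from above; the checkpoint path is illustrative and only assumes the same naming pattern as the paths shown earlier:
```bash
# illustrative checkpoint path, following the "resnet18_backend-<backend>-<nprocs>_stop-on-<iteration>" pattern above
horovodrun -np=2 python -u main.py run --backend="horovod" \
    --resume-from=/tmp/output-cifar10/resnet18_backend-horovod-2_stop-on-1000/training_checkpoint_1000.pt
```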
## ClearML fileserver
If a `ClearML` server is used (i.e. the `--with_clearml` argument is passed), artifact uploading must be configured by
modifying the `ClearML` configuration file `~/clearml.conf` generated by `clearml-init`. According to the
[documentation](https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html), the `output_uri` argument can be
configured via `sdk.development.default_output_uri` to point to the fileserver URI. For a self-hosted server, the `ClearML`
fileserver URI is `http://localhost:8081`.
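As a sketch, the relevant fragment of `~/clearml.conf` for a self-hosted server could look like this (the rest of the generated file is left as-is):
```
sdk {
    development {
        # upload artifacts (e.g. checkpoints) to the ClearML fileserver
        default_output_uri: "http://localhost:8081"
    }
}
```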
For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html