# Transformers Example with Ignite
In this example, we show how to use _Ignite_ to fine-tune a transformer model:
- train on one or more GPUs or TPUs
- compute training/validation metrics
- log the learning rate, metrics, etc.
- save the best model weights
Configurations:
- [x] single GPU
- [x] multi GPUs on a single node
- [x] TPUs on Colab
## Requirements:
- pytorch-ignite: `pip install pytorch-ignite`
- [transformers](https://github.com/huggingface/transformers): `pip install transformers`
- [datasets](https://github.com/huggingface/datasets): `pip install datasets`
- [tqdm](https://github.com/tqdm/tqdm/): `pip install tqdm`
- [tensorboardx](https://github.com/lanpa/tensorboard-pytorch): `pip install tensorboardX`
- [python-fire](https://github.com/google/python-fire): `pip install fire`
- Optional: [clearml](https://github.com/allegroai/clearml): `pip install clearml`
Alternatively, install all the requirements using `pip install -r requirements.txt`.
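A `requirements.txt` covering the packages above might look like the sketch below (version pins omitted; adjust as needed):
```
pytorch-ignite
transformers
datasets
tqdm
tensorboardX
fire
clearml
```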
## Usage:
Run the example on a single GPU:
```bash
python main.py run
```
If needed, adjust the batch size to your GPU memory with the `--batch_size` argument.
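For example (the value below is only illustrative; pick one that fits your device):
```bash
python main.py run --batch_size=16
```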
The default model is `bert-base-uncased`. To change it, use the `--model` argument; for details on which models can be used, refer [here](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification).
Example:
```bash
# Using DistilBERT, which has 40% fewer parameters than bert-base-uncased
python main.py run --model="distilbert-base-uncased"
```
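Any model name accepted by the Auto classes linked above should work. As a rough sketch of how such a name is typically resolved (note: `num_labels=2` is an assumption for a binary classification task, not a value taken from this example):
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # same value as passed via --model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 assumes a binary classification task
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```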
For details on accepted arguments:
```bash
python main.py run -- --help
```
### Distributed training
#### Single node, multiple GPUs
Let's start training on a single node with 2 GPUs:
```bash
# using torch.distributed.launch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"
```
or
```bash
# spawning processes from inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
```
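Under the hood, the spawn option typically relies on Ignite's `idist.Parallel` context manager; a minimal sketch of that pattern is shown below (assuming the example follows this usual structure; the `training` body and the config dict are placeholders):
```python
import ignite.distributed as idist

def training(local_rank, config):
    # model, data loaders, trainer and evaluator setup would go here
    ...

# spawn nproc_per_node processes (one per GPU) using the NCCL backend
with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
    parallel.run(training, config={})
```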
##### Using [Horovod](https://horovod.readthedocs.io/en/latest/index.html) as distributed backend
Please make sure Horovod is installed before running.
Let's start training on a single node with 2 GPUs:
```bash
# horovodrun
horovodrun -np=2 python -u main.py run --backend="horovod"
```
or
```bash
# spawning processes from inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2
```
#### Colab or Kaggle kernels, on 8 TPUs
```python
# setup TPU environment
import os
assert 'COLAB_TPU_ADDR' in os.environ, 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
```
```python
# Colab cell: install the nightly torch_xla build
VERSION = "nightly"
!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION > /dev/null
```
```python
from main import run
run(backend="xla-tpu", nproc_per_node=8)
```
## ClearML fileserver
If a `ClearML` server is used (i.e. the `--with_clearml` argument is passed), artifact uploading must be configured by
modifying the `ClearML` configuration file `~/clearml.conf`, which is generated by `clearml-init`. According to the
[documentation](https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html), the `output_uri` argument can be
set via `sdk.development.default_output_uri` to point to the fileserver URI. If the server is self-hosted, the `ClearML` fileserver URI is
`http://localhost:8081`.
For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html
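For example, the relevant part of `~/clearml.conf` could look like the sketch below (the URI is the self-hosted fileserver default mentioned above; adjust it to your deployment):
```
sdk {
    development {
        # destination for uploaded artifacts (self-hosted ClearML fileserver)
        default_output_uri: "http://localhost:8081"
    }
}
```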