# Source Separation Example
This directory contains reference implementations for source separation. For the details of each model, please check out the following.
- [Conv-TasNet](./conv_tasnet/README.md)
## Usage
### Overview
To train a model, you can use [`train.py`](./train.py). This script takes the form of
`train.py [parameters for distributed training] -- [parameters for model/training]`
```
python train.py \
[--worker-id WORKER_ID] \
[--device-id DEVICE_ID] \
[--num-workers NUM_WORKERS] \
[--sync-protocol SYNC_PROTOCOL] \
-- \
<model specific training parameters>
# For the details of the distributed training parameters, use:
python train.py --help
# For the details of the model/training parameters, use:
python train.py -- --help
```
If you would like to just try out the training script, run it without any parameters
for distributed training: `train.py -- --sample-rate 8000 --batch-size <BATCH_SIZE> --dataset-dir <DATASET_DIR> --save-dir <SAVE_DIR>`
This script runs training with the Distributed Data Parallel (DDP) framework and has two major
operation modes, depending on whether the `--worker-id` argument is given.
1. (`--worker-id` is not given) Launches training worker subprocesses that perform the actual training.
2. (`--worker-id` is given) Performs the training as part of a distributed training run.
When launching the script without any distributed training parameters (operation mode 1),
it checks the number of GPUs available on the local system and spawns the same
number of training subprocesses (each running as operation mode 2). You can reduce the number of GPUs used with
`--num-workers`. If no GPU is available, only one subprocess is launched, and providing
`--num-workers` larger than 1 results in an error.
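For reference, operation mode 1 amounts to the usual one-process-per-GPU spawning pattern. The sketch below is illustrative only and does not reproduce the actual internals of `train.py`; `_worker` and its arguments are placeholders:

```python
# Illustrative sketch of operation mode 1 (not the actual implementation of train.py):
# spawn one worker subprocess per available GPU, or a single worker on CPU-only machines.
import torch
import torch.multiprocessing as mp

def _worker(worker_id: int, num_workers: int):
    # Placeholder for what a spawned subprocess does: it behaves like operation mode 2,
    # joining the process group as rank `worker_id` and training on device `worker_id`.
    print(f"worker {worker_id} of {num_workers} started")

def main():
    num_workers = max(torch.cuda.device_count(), 1)
    mp.spawn(_worker, args=(num_workers,), nprocs=num_workers)

if __name__ == "__main__":
    main()
```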
When launching the script as a worker process of a distributed training, you need to configure
the coordination of the workers.
- `--num-workers` is the total number of training processes being launched.
- `--worker-id` is the process rank (must be unique across all the processes).
- `--device-id` is the GPU device ID (should be unique within a node).
- `--sync-protocol` is how the worker processes communicate and synchronize.
If the training is carried out on a single node, then the default `"env://"` should do.
If the training processes span across multiple nodes, then you need to provide a protocol that
can communicate over the network. If you know where the master node is located, you can use
`"env://"` in combination with `MASTER_ADDR` and `MASER_PORT` environment variables. If you do
not know where the master node is located beforehand, you can use `"file://..."` protocol to
indicate where the file to which all the worker process have access is located. For other
protocols, please refer to the official documentation.
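For context, these protocol strings are the standard `init_method` values of `torch.distributed.init_process_group`. The sketch below shows the two forms discussed above; the backend choice, host name, port, and file path are placeholder assumptions:

```python
import os
import torch.distributed as dist

rank, world_size = 0, 8  # correspond to --worker-id and --num-workers

# "env://": the master address/port are read from environment variables.
os.environ.setdefault("MASTER_ADDR", "node0.example.com")  # placeholder host
os.environ.setdefault("MASTER_PORT", "29500")              # placeholder port
dist.init_process_group("nccl", init_method="env://", rank=rank, world_size=world_size)

# "file://...": rendezvous through a file on a shared filesystem instead.
# dist.init_process_group(
#     "nccl",
#     init_method="file:///shared/jobs/source_separation/sync",  # placeholder path
#     rank=rank,
#     world_size=world_size,
# )
```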
### Distributed Training Notes
<details><summary>Quick overview on DDP (distributed data parallel)</summary>
DDP is a single-program, multiple-data training paradigm.
With DDP, the model is replicated on every process,
and every model replica will be fed with a different set of input data samples.
- **Process**: Worker process (as in a Linux process). There are `P` processes per node.
- **Node**: A machine. There are `N` machines, each of which holds `P` processes.
- **World**: The network of nodes, composed of `N` nodes and `N * P` processes.
- **Rank**: Global process ID (unique across nodes), in `[0, N * P)`
- **Local Rank**: Local process ID (unique only within a node), in `[0, P)`
```
Node 0 Node 1 Node N-1
┌────────────────────────┐┌─────────────────────────┐ ┌───────────────────────────┐
│╔══════════╗ ┌─────────┐││┌───────────┐ ┌─────────┐│ │┌─────────────┐ ┌─────────┐│
│║ Process ╟─┤ GPU: 0 ││││ Process ├─┤ GPU: 0 ││ ││ Process ├─┤ GPU: 0 ││
│║ Rank: 0 ║ └─────────┘│││ Rank:P │ └─────────┘│ ││ Rank:NP-P │ └─────────┘│
│╚══════════╝ ││└───────────┘ │ │└─────────────┘ │
│┌──────────┐ ┌─────────┐││┌───────────┐ ┌─────────┐│ │┌─────────────┐ ┌─────────┐│
││ Process ├─┤ GPU: 1 ││││ Process ├─┤ GPU: 1 ││ ││ Process ├─┤ GPU: 1 ││
││ Rank: 1 │ └─────────┘│││ Rank:P+1 │ └─────────┘│ ││ Rank:NP-P+1 │ └─────────┘│
│└──────────┘ ││└───────────┘ │ ... │└─────────────┘ │
│ ││ │ │ │
│ ... ││ ... │ │ ... │
│ ││ │ │ │
│┌──────────┐ ┌─────────┐││┌───────────┐ ┌─────────┐│ │┌─────────────┐ ┌─────────┐│
││ Process ├─┤ GPU:P-1 ││││ Process ├─┤ GPU:P-1 ││ ││ Process ├─┤ GPU:P-1 ││
││ Rank:P-1 │ └─────────┘│││ Rank:2P-1 │ └─────────┘│ ││ Rank:NP-1 │ └─────────┘│
│└──────────┘ ││└───────────┘ │ │└─────────────┘ │
└────────────────────────┘└─────────────────────────┘ └───────────────────────────┘
```
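In code, the quantities above map onto `torch.distributed` roughly as follows. This is a minimal sketch assuming the rank layout from the diagram; the toy `torch.nn.Linear` model stands in for the real one:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(node_id: int, local_rank: int, N: int, P: int) -> torch.nn.Module:
    rank = node_id * P + local_rank    # global Rank, unique across nodes, in [0, N * P)
    world_size = N * P
    dist.init_process_group("nccl", init_method="env://", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)  # Local Rank selects the GPU on this node

    model = torch.nn.Linear(10, 10).to(local_rank)  # toy model as a stand-in
    # Every rank holds a replica; DDP synchronizes gradients across all N * P processes.
    return DDP(model, device_ids=[local_rank])
```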
</details>
### SLURM
When launched as a SLURM job, the following environment variables correspond to
- **SLURM_PROCID**: `--worker-id` (Rank)
- **SLURM_NTASKS** (or the legacy **SLURM_NPROCS**): the total number of processes (`--num-workers` == world size)
- **SLURM_LOCALID**: Local Rank (to be mapped to the GPU index\*)

\* Even when GPU resources are allocated with `--gpus-per-task=1`, if there are multiple
tasks allocated on the same node (and thus multiple GPUs of the node are allocated to the job),
each task can see all the GPUs allocated to those tasks. Therefore we need to use
`SLURM_LOCALID` to tell each task which GPU it should use.
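For illustration, a worker process could read these values directly from the environment (a minimal sketch; the example scripts below instead pass them to `train.py` as command-line arguments):

```python
import os
import torch

# Hypothetical mapping from SLURM environment variables to the script's arguments.
worker_id = int(os.environ["SLURM_PROCID"])    # --worker-id (Rank)
num_workers = int(os.environ["SLURM_NTASKS"])  # --num-workers (world size)
device_id = int(os.environ["SLURM_LOCALID"])   # --device-id (GPU index on this node)

torch.cuda.set_device(device_id)  # each task sees all allocated GPUs, so pin by Local Rank
```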
<details><summary>Example scripts for running the training on SLURM cluster</summary>
- **launch_job.sh**
```bash
#!/bin/bash
#SBATCH --job-name=source_separation
#SBATCH --output=/checkpoint/%u/jobs/%x/%j.out
#SBATCH --error=/checkpoint/%u/jobs/%x/%j.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=16G
#SBATCH --gpus-per-task=1
#srun env
srun wrapper.sh "$@"
```
- **wrapper.sh**
```bash
#!/bin/bash
num_speakers=2
this_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
save_dir="/checkpoint/${USER}/jobs/${SLURM_JOB_NAME}/${SLURM_JOB_ID}"
dataset_dir="/dataset/wsj0-mix/${num_speakers}speakers/wav8k/min"
if [ "${SLURM_JOB_NUM_NODES}" -gt 1 ]; then
protocol="file:///checkpoint/${USER}/jobs/source_separation/${SLURM_JOB_ID}/sync"
else
protocol="env://"
fi
mkdir -p "${save_dir}"
python -u \
"${this_dir}/train.py" \
--worker-id "${SLURM_PROCID}" \
--num-workers "${SLURM_NTASKS}" \
--device-id "${SLURM_LOCALID}" \
--sync-protocol "${protocol}" \
-- \
--num-speakers "${num_speakers}" \
--sample-rate 8000 \
--dataset-dir "${dataset_dir}" \
--save-dir "${save_dir}" \
--batch-size $((16 / SLURM_NTASKS))
```
</details>