# Speech Recognition with wav2vec2.0
This example demonstrates how you can use torchaudio's I/O features and models to run speech recognition in a C++ application.
**NOTE**
This example uses the `"sox_io"` backend for loading audio, which does not work on Windows. To make it work on
Windows, you need to replace the part that loads the audio and converts it to a Tensor object.
## 1. Create a transcription pipeline TorchScript file
We will create a TorchScript pipeline that performs the following steps:
1. Load audio from a file.
1. Pass the audio to the encoder, which produces a sequence of probability distributions over labels.
1. Pass the encoder output to the decoder, which generates the transcript.
For building the encoder, we borrow the pre-trained weights published by `fairseq` and/or Hugging Face Transformers, then convert them to `torchaudio`'s format, which supports TorchScript.
### 1.1. From `fairseq`
For `fairseq` models, you can download pre-trained weights from the [`fairseq` repository](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec). Here, we will use the `Base / 960h` model. You also need to download [the letter dictionary file](https://github.com/pytorch/fairseq/tree/main/examples/wav2vec#evaluating-a-ctc-model).
For the decoder part, we use [simple_ctc](https://github.com/mthrok/ctcdecode), which also supports TorchScript.
```bash
mkdir -p pipeline-fairseq
python build_pipeline_from_fairseq.py \
--model-file "wav2vec_small_960.pt" \
--dict-dir <DIRECTORY_WHERE_dict.ltr.txt_IS_FOUND> \
--output-path "./pipeline-fairseq/"
```
The above command should create the following TorchScript object files in the output directory.
```
decoder.zip encoder.zip loader.zip
```
* `loader.zip` loads an audio file and generates a waveform Tensor.
* `encoder.zip` receives the waveform Tensor and generates a sequence of probability distributions over the labels.
* `decoder.zip` receives the probability distributions over the labels and generates a transcript.
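To make the encoder-to-decoder hand-off concrete, here is a minimal pure-Python sketch of greedy CTC decoding. It is not the `simple_ctc` decoder used by this example; it only illustrates how a sequence of per-frame distributions can be turned into text. The function name, the label list, the blank index, and the use of `|` as a word delimiter are assumptions made for illustration.

```python
from typing import Sequence


def greedy_ctc_decode(
    probs: Sequence[Sequence[float]],
    labels: Sequence[str],
    blank: int = 0,
) -> str:
    """Greedy CTC decoding (illustrative; not the simple_ctc algorithm).

    probs:  one probability distribution over labels per frame.
    labels: label for each index (hypothetical label set).
    blank:  index of the CTC blank label (assumed to be 0 here).
    """
    # Pick the most likely label index for every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]

    chars = []
    prev = None
    for idx in best:
        # Collapse repeated indices, then drop blanks.
        if idx != prev and idx != blank:
            chars.append(labels[idx])
        prev = idx

    # Assumption: "|" marks a word boundary in the label set.
    return "".join(chars).replace("|", " ").strip()


# Toy input: frames vote for H, H, blank, I, I, "|".
toy_labels = ["-", "|", "H", "I"]
toy_probs = [
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.60, 0.20],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.10, 0.10, 0.70],
    [0.10, 0.70, 0.10, 0.10],
]
print(greedy_ctc_decode(toy_probs, toy_labels))  # prints "HI"
```

A beam-search decoder can return a better transcript than this greedy rule, at the cost of more computation.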
### 1.2. From Hugging Face Transformers
[Hugging Face Transformers](https://huggingface.co/transformers/index.html) and [Hugging Face Model Hub](https://huggingface.co/models) provide `wav2vec2.0` models fine-tuned on a variety of datasets and languages.
We can also import a model published on Hugging Face Hub and run it in our C++ application.
In the following example, we will try the German model ([facebook/wav2vec2-large-xlsr-53-german](https://huggingface.co/facebook/wav2vec2-large-xlsr-53-german/tree/main)) on the [VoxForge German dataset](http://www.voxforge.org/de/downloads).
```bash
mkdir -p pipeline-hf
python build_pipeline_from_huggingface_transformers.py \
--model facebook/wav2vec2-large-xlsr-53-german \
--output-path ./pipeline-hf/
```
The resulting TorchScript object files should be the same as in the `fairseq` example.
## 2. Build the application
Please refer to [the top level README.md](../README.md)
## 3. Run the application
Now we run the C++ application [`transcribe`](./transcribe.cpp) with the TorchScript objects we created in step 1.1 and an input audio file.
```bash
../build/speech_recognition/transcribe ./pipeline-fairseq ../data/input.wav
```
This will output something like the following.
```
Loading module from: ./pipeline/loader.zip
Loading module from: ./pipeline/encoder.zip
Loading module from: ./pipeline/decoder.zip
Loading the audio
Running inference
Generating the transcription
I HAD THAT CURIOSITY BESIDE ME AT THIS MOMENT
Done.
```
## 4. Evaluate the pipeline on Librispeech dataset
Let's evaluate the word error rate (WER) of this application using the [Librispeech dataset](https://www.openslr.org/12).
### 4.1. Create a list of audio paths
To keep our C++ code simple, we will first parse the Librispeech dataset to get a list of audio paths.
```bash
python parse_librispeech.py <PATH_TO_YOUR_DATASET>/LibriSpeech/test-clean ./flist.txt
```
The list should look like the following:
```bash
head flist.txt
1089-134691-0000 /LibriSpeech/test-clean/1089/134691/1089-134691-0000.flac HE COULD WAIT NO LONGER
```
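As a rough illustration of the file list layout, the sketch below splits one line into an utterance ID, an audio path, and a reference transcript. It assumes the three fields are separated by whitespace and that the audio path itself contains no whitespace; the `Entry` type and `parse_flist_line` name are hypothetical and not part of this example's code.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    utterance_id: str
    audio_path: str
    transcript: str


def parse_flist_line(line: str) -> Entry:
    """Split a file list line into (id, path, transcript).

    Assumption: fields are whitespace-separated and the path has no spaces.
    Only the first two separators are consumed, so the transcript keeps
    its internal spaces.
    """
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return Entry(*parts)


sample = (
    "1089-134691-0000 "
    "/LibriSpeech/test-clean/1089/134691/1089-134691-0000.flac "
    "HE COULD WAIT NO LONGER"
)
entry = parse_flist_line(sample)
print(entry.utterance_id)  # prints "1089-134691-0000"
print(entry.transcript)    # prints "HE COULD WAIT NO LONGER"
```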
### 4.2. Run the transcription
[`transcribe_list`](./transcribe_list.cpp) processes the input file list, feeds the audio paths one by one to the pipeline, then generates a reference file and a hypothesis file.
```bash
../build/speech_recognition/transcribe_list ./pipeline-fairseq ./flist.txt <OUTPUT_DIR>
```
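For orientation, the sketch below shows one way reference and hypothesis files could be laid out in the `trn` style that `sclite` reads, where each line is the transcript followed by the utterance ID in parentheses. This is an assumption about the general `trn` convention, not a description of what `transcribe_list` writes byte for byte; the helper names are hypothetical.

```python
from typing import Iterable, Tuple


def to_trn_line(transcript: str, utterance_id: str) -> str:
    """Format one line as 'TRANSCRIPT (utterance-id)'."""
    return f"{transcript} ({utterance_id})"


def build_trn(results: Iterable[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Build the contents of a reference file and a hypothesis file.

    results: (utterance_id, reference_text, hypothesis_text) triples.
    Returns (ref_text, hyp_text); lines appear in the same order in both.
    """
    ref_lines = []
    hyp_lines = []
    for utterance_id, reference, hypothesis in results:
        ref_lines.append(to_trn_line(reference, utterance_id) + "\n")
        hyp_lines.append(to_trn_line(hypothesis, utterance_id) + "\n")
    return "".join(ref_lines), "".join(hyp_lines)


ref_text, hyp_text = build_trn(
    [("1089-134691-0000", "HE COULD WAIT NO LONGER", "HE COULD WAIT NO LONGER")]
)
print(ref_text, end="")  # prints "HE COULD WAIT NO LONGER (1089-134691-0000)"
```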
### 4.3. Score WER
You need `sclite` for this step. You can download the code from [SCTK repository](https://github.com/usnistgov/SCTK).
```bash
# in the output directory
sclite -r ref.trn -h hyp.trn -i wsj -o pralign -o sum
```
WER can be found in the resulting `hyp.trn.sys`. Check the row that starts with `Sum/Avg`; the first column of the third block is `100 - WER`.
In our test, we got the following results.
| model | Fine Tune | test-clean | test-other |
|:-----------------------------------------:|----------:|:----------:|:----------:|
| Base<br/>`wav2vec_small_960` | 960h | 3.1 | 7.7 |
| Large<br/>`wav2vec_big_960` | 960h | 2.6 | 5.9 |
| Large (LV-60)<br/>`wav2vec2_vox_960h_new` | 960h | 2.9 | 6.2 |
| Large (LV-60) + Self Training<br/>`wav2vec_vox_960h_pl` | 960h | 1.9 | 4.5 |
You can also check the `hyp.trn.pra` file to see what errors were made.
```
id: (3528-168669-0005)
Scores: (#C #S #D #I) 7 1 0 0
REF: there is a stone to be RAISED heavy
HYP: there is a stone to be RACED heavy
Eval: S
```
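The `#C #S #D #I` counts in the alignment above can be reproduced with a word-level edit distance. The following is a minimal sketch of that idea, not `sclite`'s implementation: it picks one minimum-cost alignment, so for ambiguous cases the split between substitutions, deletions, and insertions may differ from `sclite`'s, and it applies no text normalization. The function names are hypothetical.

```python
from typing import Sequence, Tuple


def count_errors(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int, int]:
    """Return (correct, substitutions, deletions, insertions) for one utterance."""
    # Each cell holds (cost, subs, dels, ins) for ref[:i] against hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i, ref_word in enumerate(ref, 1):
        cur = [(i, 0, i, 0)]
        for j, hyp_word in enumerate(hyp, 1):
            cost, subs, dels, ins = prev[j - 1]
            if ref_word == hyp_word:
                diagonal = (cost, subs, dels, ins)          # match
            else:
                diagonal = (cost + 1, subs + 1, dels, ins)  # substitution
            cost, subs, dels, ins = prev[j]
            deletion = (cost + 1, subs, dels + 1, ins)
            cost, subs, dels, ins = cur[j - 1]
            insertion = (cost + 1, subs, dels, ins + 1)
            cur.append(min(diagonal, deletion, insertion, key=lambda t: t[0]))
        prev = cur
    _, subs, dels, ins = prev[-1]
    correct = len(ref) - subs - dels
    return correct, subs, dels, ins


def wer(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Word error rate: (S + D + I) / number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    _, subs, dels, ins = count_errors(ref, hyp)
    return (subs + dels + ins) / len(ref)


ref_words = "there is a stone to be raised heavy".split()
hyp_words = "there is a stone to be raced heavy".split()
print(count_errors(ref_words, hyp_words))  # prints (7, 1, 0, 0)
print(wer(ref_words, hyp_words))           # prints 0.125
```

For this utterance the single substitution over eight reference words gives a WER of 12.5%.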
## 5. Evaluate the pipeline on VoxForge dataset
Now we use the pipeline we created in step 1.2, this time with the German language dataset from VoxForge.
### 5.1. Create a list of audio paths
Download an archive from http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/, and extract it to your local file system, then run the following to generate the file list.
```bash
python parse_voxforge.py <PATH_TO_YOUR_DATASET> > ./flist-de.txt
```
The list should look like the following:
```bash
head flist-de.txt
de5-001 /datasets/voxforge/de/guenter-20140214-afn/wav/de5-001.wav ES SOLL ETWA FÜNFZIGTAUSEND VERSCHIEDENE SORTEN GEBEN
```
### 5.2. Run the application and score WER
This process is the same as in the Librispeech example. We just use the pipeline with the German model and the file list of the German dataset. Refer to the corresponding section in the Librispeech evaluation.
```bash
../build/speech_recognition/transcribe_list ./pipeline-hf ./flist-de.txt <OUTPUT_DIR>
```
Then score the result:
```bash
# in the output directory
sclite -r ref.trn -h hyp.trn -i wsj -o pralign -o sum
```
You can find the details of the evaluation result in the `hyp.trn.pra` file.
```
id: (guenter-20140214-afn/mfc/de5-012)
Scores: (#C #S #D #I) 4 1 1 0
REF: die ausgaben kÖnnen gigantisch STEIGE N
HYP: die ausgaben kÖnnen gigantisch ****** STEIGEN
Eval: D S
```