.. py:module:: torchaudio.pipelines
torchaudio.pipelines
====================
.. currentmodule:: torchaudio.pipelines
The ``torchaudio.pipelines`` module packages pre-trained models with support functions and metadata into simple APIs tailored to perform specific tasks.
When using pre-trained models to perform a task, in addition to instantiating the model with pre-trained weights, the client code also needs to build pipelines for feature extraction and post-processing in the same way they were done during training. This requires carrying over information used during training, such as the type of transforms and their parameters (for example, the sampling rate and the number of FFT bins).
To tie this information to a pre-trained model and make it easily accessible, the ``torchaudio.pipelines`` module uses the concept of a `Bundle` class, which defines a set of APIs to instantiate pipelines, and the interface of the pipelines.
The following figure illustrates this.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-intro.png
A pre-trained model and associated pipelines are expressed as an instance of ``Bundle``. Different instances of the same ``Bundle`` share the interface, but their implementations are not constrained to be of the same types. For example, :class:`SourceSeparationBundle` defines the interface for performing source separation, but its instance :data:`CONVTASNET_BASE_LIBRI2MIX` instantiates a model of :class:`~torchaudio.models.ConvTasNet` while :data:`HDEMUCS_HIGH_MUSDB` instantiates a model of :class:`~torchaudio.models.HDemucs`. Still, because they share the same interface, the usage is the same.
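For example, a minimal sketch of the common workflow, using a hypothetical local file ``speech.wav`` as input, looks like this: fetch a bundle, instantiate the model, and run inference.

.. code-block:: python

    import torchaudio

    # Pick a bundle; swapping in another instance of the same
    # ``Bundle`` class leaves the rest of the code unchanged.
    bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H

    # Instantiate the model with pre-trained weights.
    model = bundle.get_model()

    # ``speech.wav`` is a hypothetical input file.
    waveform, sample_rate = torchaudio.load("speech.wav")

    # The bundle carries the metadata needed to match the training
    # setup, such as the expected sampling rate.
    waveform = torchaudio.functional.resample(
        waveform, sample_rate, bundle.sample_rate
    )
    emission, _ = model(waveform)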
.. note::
Under the hood, the implementations of ``Bundle`` use components from other ``torchaudio`` modules, such as :mod:`torchaudio.models` and :mod:`torchaudio.transforms`, or even third party libraries like `SentencePiece <https://github.com/google/sentencepiece>`__ and `DeepPhonemizer <https://github.com/as-ideas/DeepPhonemizer>`__. But this implementation detail is abstracted away from library users.
.. _RNNT:
RNN-T Streaming/Non-Streaming ASR
---------------------------------
Interface
~~~~~~~~~
``RNNTBundle`` defines ASR pipelines and consists of three steps: feature extraction, inference, and de-tokenization.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-rnntbundle.png
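A minimal sketch of the non-streaming path, using a hypothetical local file ``speech.wav``, follows the three steps above:

.. code-block:: python

    import torchaudio

    bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH

    # Step 1: feature extraction.
    feature_extractor = bundle.get_feature_extractor()
    # Step 2: inference (beam search over the RNN-T model).
    decoder = bundle.get_decoder()
    # Step 3: de-tokenization.
    token_processor = bundle.get_token_processor()

    # ``speech.wav`` is a hypothetical input file, assumed to be
    # recorded at ``bundle.sample_rate``.
    waveform, sample_rate = torchaudio.load("speech.wav")
    features, length = feature_extractor(waveform.squeeze())
    hypotheses = decoder(features, length, 10)  # beam width of 10
    # Convert the predicted tokens of the best hypothesis to text.
    transcript = token_processor(hypotheses[0][0])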
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
RNNTBundle
RNNTBundle.FeatureExtractor
RNNTBundle.TokenProcessor
.. rubric:: Tutorials using ``RNNTBundle``
.. minigallery:: torchaudio.pipelines.RNNTBundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
EMFORMER_RNNT_BASE_LIBRISPEECH
wav2vec 2.0 / HuBERT / WavLM - SSL
----------------------------------
Interface
~~~~~~~~~
``Wav2Vec2Bundle`` instantiates models that generate acoustic features, which can be used for downstream inference and fine-tuning.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2bundle.png
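For instance, a minimal sketch of extracting acoustic features, using a hypothetical local file ``speech.wav``:

.. code-block:: python

    import torchaudio

    bundle = torchaudio.pipelines.HUBERT_BASE
    model = bundle.get_model()

    # ``speech.wav`` is a hypothetical input file.
    waveform, sample_rate = torchaudio.load("speech.wav")
    # Resample to the rate the model was trained on.
    waveform = torchaudio.functional.resample(
        waveform, sample_rate, bundle.sample_rate
    )
    # Extract acoustic features; returns one tensor per transformer layer.
    features, _ = model.extract_features(waveform)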
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
Wav2Vec2Bundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
WAV2VEC2_BASE
WAV2VEC2_LARGE
WAV2VEC2_LARGE_LV60K
WAV2VEC2_XLSR53
WAV2VEC2_XLSR_300M
WAV2VEC2_XLSR_1B
WAV2VEC2_XLSR_2B
HUBERT_BASE
HUBERT_LARGE
HUBERT_XLARGE
WAVLM_BASE
WAVLM_BASE_PLUS
WAVLM_LARGE
wav2vec 2.0 / HuBERT - Fine-tuned ASR
-------------------------------------
Interface
~~~~~~~~~
``Wav2Vec2ASRBundle`` instantiates models that generate a probability distribution over pre-defined labels, which can be used for ASR.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2asrbundle.png
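A minimal sketch, using a hypothetical local file ``speech.wav`` and simple greedy CTC decoding (a beam-search decoder would typically give better results):

.. code-block:: python

    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
    model = bundle.get_model()
    labels = bundle.get_labels()

    # ``speech.wav`` is a hypothetical input file.
    waveform, sample_rate = torchaudio.load("speech.wav")
    waveform = torchaudio.functional.resample(
        waveform, sample_rate, bundle.sample_rate
    )

    # Emission: probability distribution over the labels per time frame.
    emission, _ = model(waveform)

    # Greedy CTC decoding: best label per frame, collapse repeats,
    # drop the blank token ("-"); "|" marks word boundaries.
    indices = torch.unique_consecutive(emission[0].argmax(dim=-1))
    transcript = "".join(
        labels[int(i)] for i in indices if labels[int(i)] != "-"
    ).replace("|", " ")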
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
Wav2Vec2ASRBundle
.. rubric:: Tutorials using ``Wav2Vec2ASRBundle``
.. minigallery:: torchaudio.pipelines.Wav2Vec2ASRBundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
WAV2VEC2_ASR_BASE_10M
WAV2VEC2_ASR_BASE_100H
WAV2VEC2_ASR_BASE_960H
WAV2VEC2_ASR_LARGE_10M
WAV2VEC2_ASR_LARGE_100H
WAV2VEC2_ASR_LARGE_960H
WAV2VEC2_ASR_LARGE_LV60K_10M
WAV2VEC2_ASR_LARGE_LV60K_100H
WAV2VEC2_ASR_LARGE_LV60K_960H
VOXPOPULI_ASR_BASE_10K_DE
VOXPOPULI_ASR_BASE_10K_EN
VOXPOPULI_ASR_BASE_10K_ES
VOXPOPULI_ASR_BASE_10K_FR
VOXPOPULI_ASR_BASE_10K_IT
HUBERT_ASR_LARGE
HUBERT_ASR_XLARGE
wav2vec 2.0 / HuBERT - Forced Alignment
---------------------------------------
Interface
~~~~~~~~~
``Wav2Vec2FABundle`` bundles a pre-trained model and its associated dictionary. Additionally, it supports appending a ``star`` token dimension.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-wav2vec2fabundle.png
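A minimal sketch of aligning a transcript to audio with :data:`MMS_FA`, using a hypothetical local file ``speech.wav`` and an example transcript:

.. code-block:: python

    import torchaudio

    bundle = torchaudio.pipelines.MMS_FA
    model = bundle.get_model()
    tokenizer = bundle.get_tokenizer()
    aligner = bundle.get_aligner()

    # ``speech.wav`` and the transcript are hypothetical inputs.
    waveform, sample_rate = torchaudio.load("speech.wav")
    waveform = torchaudio.functional.resample(
        waveform, sample_rate, bundle.sample_rate
    )

    # Frame-wise label probabilities from the acoustic model.
    emission, _ = model(waveform)

    # Convert the transcript (a list of words) into token IDs, then align.
    tokens = tokenizer("i had that curiosity beside me".split())
    token_spans = aligner(emission[0], tokens)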
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
Wav2Vec2FABundle
Wav2Vec2FABundle.Tokenizer
Wav2Vec2FABundle.Aligner
.. rubric:: Tutorials using ``Wav2Vec2FABundle``
.. minigallery:: torchaudio.pipelines.Wav2Vec2FABundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
MMS_FA
.. _Tacotron2:
Tacotron2 Text-To-Speech
------------------------
``Tacotron2TTSBundle`` defines text-to-speech pipelines and consists of three steps: tokenization, spectrogram generation, and vocoding. The spectrogram generation is based on the :class:`~torchaudio.models.Tacotron2` model.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-tacotron2bundle.png
``TextProcessor`` can be rule-based tokenization in the case of characters, or it can be a neural-network-based G2P model that generates a sequence of phonemes from input text.
Similarly, ``Vocoder`` can be an algorithm without learned parameters, like `Griffin-Lim`, or a neural-network-based model like `WaveGlow`.
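Putting the three steps together, a minimal sketch looks like this (the input text is arbitrary):

.. code-block:: python

    import torch
    import torchaudio

    bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

    # Step 1: tokenization (here, a phoneme-based text processor).
    processor = bundle.get_text_processor()
    # Step 2: spectrogram generation with Tacotron2.
    tacotron2 = bundle.get_tacotron2()
    # Step 3: vocoding (here, WaveRNN).
    vocoder = bundle.get_vocoder()

    text = "Hello world! Text to speech!"
    with torch.inference_mode():
        processed, lengths = processor(text)
        spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        waveforms, lengths = vocoder(spec, spec_lengths)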
Interface
~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
Tacotron2TTSBundle
Tacotron2TTSBundle.TextProcessor
Tacotron2TTSBundle.Vocoder
.. rubric:: Tutorials using ``Tacotron2TTSBundle``
.. minigallery:: torchaudio.pipelines.Tacotron2TTSBundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
TACOTRON2_WAVERNN_PHONE_LJSPEECH
TACOTRON2_WAVERNN_CHAR_LJSPEECH
TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
TACOTRON2_GRIFFINLIM_CHAR_LJSPEECH
Source Separation
-----------------
Interface
~~~~~~~~~
``SourceSeparationBundle`` instantiates source separation models, which take single-channel audio and generate multi-channel audio.
.. image:: https://download.pytorch.org/torchaudio/doc-assets/pipelines-sourceseparationbundle.png
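A minimal sketch, using a hypothetical single-channel mixture file ``mixture.wav``, assumed to be recorded at the bundle's sampling rate:

.. code-block:: python

    import torchaudio

    bundle = torchaudio.pipelines.CONVTASNET_BASE_LIBRI2MIX
    model = bundle.get_model()

    # ``mixture.wav`` is a hypothetical input file, assumed to be
    # at ``bundle.sample_rate`` already.
    mixture, sample_rate = torchaudio.load("mixture.wav")

    # The model expects a (batch, channel, time) tensor.
    mixture = mixture.reshape(1, 1, -1)
    # The output has one channel per separated source.
    estimated_sources = model(mixture)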
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
SourceSeparationBundle
.. rubric:: Tutorials using ``SourceSeparationBundle``
.. minigallery:: torchaudio.pipelines.SourceSeparationBundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
CONVTASNET_BASE_LIBRI2MIX
HDEMUCS_HIGH_MUSDB_PLUS
HDEMUCS_HIGH_MUSDB
Squim Objective
---------------
Interface
~~~~~~~~~
:py:class:`SquimObjectiveBundle` defines a speech quality and intelligibility measurement (SQUIM) pipeline that can predict **objective** metric scores given an input waveform.
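A minimal sketch, using a hypothetical local file ``speech.wav``:

.. code-block:: python

    import torchaudio

    bundle = torchaudio.pipelines.SQUIM_OBJECTIVE
    model = bundle.get_model()

    # ``speech.wav`` is a hypothetical input file.
    waveform, sample_rate = torchaudio.load("speech.wav")
    waveform = torchaudio.functional.resample(
        waveform, sample_rate, bundle.sample_rate
    )

    # Predict objective metrics without needing a clean reference.
    stoi, pesq, si_sdr = model(waveform)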
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
SquimObjectiveBundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
SQUIM_OBJECTIVE
Squim Subjective
----------------
Interface
~~~~~~~~~
:py:class:`SquimSubjectiveBundle` defines a speech quality and intelligibility measurement (SQUIM) pipeline that can predict **subjective** metric scores given an input waveform.
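A minimal sketch, using hypothetical local files ``speech.wav`` (the audio to assess) and ``reference.wav`` (a non-matching clean reference):

.. code-block:: python

    import torchaudio

    bundle = torchaudio.pipelines.SQUIM_SUBJECTIVE
    model = bundle.get_model()

    waveform, sample_rate = torchaudio.load("speech.wav")
    waveform = torchaudio.functional.resample(
        waveform, sample_rate, bundle.sample_rate
    )
    reference, sample_rate = torchaudio.load("reference.wav")
    reference = torchaudio.functional.resample(
        reference, sample_rate, bundle.sample_rate
    )

    # Predict the Mean Opinion Score (MOS) of the input speech.
    mos = model(waveform, reference)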
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_class.rst
SquimSubjectiveBundle
Pretrained Models
~~~~~~~~~~~~~~~~~
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/bundle_data.rst
SQUIM_SUBJECTIVE