Speech
******
.. warning::

    WARNING! This is still work in progress; we reserve the right to change this API as development continues.

    The quality of the speech is not great, merely "good enough". Given the
    constraints of the device you may encounter memory errors and / or
    unexpected extra sounds during playback. It's early days and we're
    improving the code for the speech synthesiser all the time. Bug reports
    and pull requests are most welcome.

.. py:module:: speech

This module makes the micro:bit talk, sing and make other speech-like sounds,
provided that you connect a speaker to your board as shown below:

.. image:: speech.png

.. note::

    This work is based upon the amazing reverse engineering efforts of
    Sebastian Macke, who studied an old text-to-speech (TTS) program called SAM
    (Software Automatic Mouth), originally released in 1982 for the
    Commodore 64. The result is a small C library that we have adopted and
    adapted for the micro:bit. You can find out more from
    `his homepage <http://simulationcorner.net/index.php?page=sam>`_. Much of
    the information in this document was gleaned from the original user's
    manual which can be found
    `here <http://www.apple-iigs.info/newdoc/sam.pdf>`_.

The speech synthesiser can produce around 2.5 seconds worth of sound from up to
255 characters of textual input.
To access this module you need to::

    import speech

We assume you have done this for the examples below.
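
Because the synthesiser accepts at most 255 characters of input at a time, longer
text has to be fed to it in pieces. The following is only a sketch of one way to do
that: the helper name ``split_for_speech`` and its ``limit`` parameter are our own
invention (they are not part of the ``speech`` module), and it simply breaks the
text on spaces.

```python
def split_for_speech(text, limit=255):
    """Split text into chunks of at most `limit` characters, breaking on spaces."""
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut into limit-sized pieces.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            # The word does not fit, so close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

On the device, each chunk could then be passed to ``speech.say`` in turn, for
example ``for chunk in split_for_speech(long_text): speech.say(chunk)``.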
Functions
=========
.. py:function:: translate(words)

    Given English words in the string ``words``, return a string containing
    a best guess at the appropriate phonemes to pronounce. The output is
    generated from this
    `text to phoneme translation table <https://github.com/s-macke/SAM/wiki/Text-to-phoneme-translation-table>`_.

    This function should be used to generate a first approximation of phonemes
    that can be further hand-edited to improve accuracy, inflection and
    emphasis.

.. py:function:: pronounce(phonemes, \*, pitch=64, speed=72, mouth=128, throat=128)

    Pronounce the phonemes in the string ``phonemes``. See below for details of
    how to use phonemes to finely control the output of the speech synthesiser.
    Override the optional pitch, speed, mouth and throat settings to change the
    timbre (quality) of the voice.

.. py:function:: say(words, \*, pitch=64, speed=72, mouth=128, throat=128)

    Say the English words in the string ``words``. The result is semi-accurate
    for English. Override the optional pitch, speed, mouth and throat
    settings to change the timbre (quality) of the voice. This is a short-hand
    equivalent of: ``speech.pronounce(speech.translate(words))``

.. py:function:: sing(phonemes, \*, pitch=64, speed=72, mouth=128, throat=128)

    Sing the phonemes contained in the string ``phonemes``. Changing the pitch
    and duration of the note is described below. Override the optional pitch,
    speed, mouth and throat settings to change the timbre (quality) of the
    voice.

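
The relationship between ``say``, ``translate`` and ``pronounce`` described above
can be sketched as a two-step helper. This is an illustration rather than how the
module is implemented: ``say_in_two_steps`` and its ``speech_module`` parameter are
hypothetical names, and we are assuming that any voice settings are simply
forwarded to ``pronounce``. Passing the module in as an argument means the sketch
can also be tried away from the device with a stand-in object.

```python
def say_in_two_steps(speech_module, words, **voice):
    """Translate words to phonemes, pronounce them, and return the phonemes."""
    # First approximation of the phonemes, as produced by translate().
    phonemes = speech_module.translate(words)
    # Speak them, forwarding any pitch, speed, mouth or throat overrides.
    speech_module.pronounce(phonemes, **voice)
    # Returning the phonemes lets you inspect and hand-edit them afterwards.
    return phonemes
```

On a micro:bit this might be called as
``say_in_two_steps(speech, "Hello", pitch=50)``; the returned string is a handy
starting point for hand-editing.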
Punctuation
===========
Punctuation is used to alter the delivery of speech. The synthesiser
understands four punctuation marks: hyphen, comma, full-stop and question mark.
The hyphen (``-``) marks clause boundaries by inserting a short pause in the
speech.
The comma (``,``) marks phrase boundaries and inserts a pause of approximately
double that of the hyphen.
The full-stop (``.``) and question mark (``?``) end sentences.
The full-stop inserts a pause and causes the pitch to fall.
The question mark also inserts a pause but causes the pitch to rise. This works
well with yes/no questions such as, "are we home yet?" rather than more complex
questions such as "why are we going home?". In the latter case, use a
full-stop.
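
The advice about question marks can be turned into a small helper that picks the
closing punctuation for you. Treat this as a rough sketch: ``end_sentence`` is a
hypothetical function, and the list of words that we take to start a yes/no
question is our own guess rather than anything the synthesiser knows about.

```python
# Words assumed to open a yes/no question (an illustrative, incomplete list).
YES_NO_STARTERS = ("are", "is", "do", "does", "did", "can", "will", "shall", "have", "has")

def end_sentence(text):
    """Append '?' to likely yes/no questions and '.' to everything else."""
    words = text.strip().rstrip(".?").split()
    if not words:
        return ""
    sentence = " ".join(words)
    if words[0].lower() in YES_NO_STARTERS:
        # Rising pitch suits yes/no questions.
        return sentence + "?"
    # Falling pitch for statements and for more complex questions.
    return sentence + "."
```

For example, ``speech.say(end_sentence("why are we going home?"))`` would be
spoken with a falling pitch because the question mark is swapped for a full-stop.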
Timbre
======
The timbre of a sound is the quality of the sound. It's the difference between
the voice of a DALEK and the voice of a human (for example). To control the
timbre change the numeric settings of the ``pitch``, ``speed``, ``mouth`` and
``throat`` arguments.
The pitch (how high or low the voice sounds) and speed (how quickly the speech
is delivered) settings are rather obvious and generally fall into the following
categories:
Pitch:
* 0-20 impractical
* 20-30 very high
* 30-40 high
* 40-50 high normal
* 50-70 normal
* 70-80 low normal
* 80-90 low
* 90-255 very low
(The default is 64)
Speed:
* 0-20 impractical
* 20-40 very fast
* 40-60 fast
* 60-70 fast conversational
* 70-75 normal conversational
* 75-90 narrative
* 90-100 slow
* 100-225 very slow
(The default is 72)
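
If you want to check which category a setting falls into, the two lists above can
be written down as lookup tables. This is a sketch only: ``PITCH_BANDS``,
``SPEED_BANDS`` and ``describe`` are hypothetical names, a value that sits on the
boundary between two categories (such as 20) is assumed to belong to the later
one, and every speed from 100 upwards is treated as very slow.

```python
# Each entry is (exclusive upper bound, label), in ascending order.
PITCH_BANDS = (
    (20, "impractical"),
    (30, "very high"),
    (40, "high"),
    (50, "high normal"),
    (70, "normal"),
    (80, "low normal"),
    (90, "low"),
    (256, "very low"),
)

SPEED_BANDS = (
    (20, "impractical"),
    (40, "very fast"),
    (60, "fast"),
    (70, "fast conversational"),
    (75, "normal conversational"),
    (90, "narrative"),
    (100, "slow"),
    (256, "very slow"),
)

def describe(value, bands):
    """Return the label of the first band whose upper bound exceeds value."""
    if not 0 <= value <= 255:
        raise ValueError("value must be between 0 and 255")
    for upper, label in bands:
        if value < upper:
            return label
```

Calling ``describe(64, PITCH_BANDS)`` and ``describe(72, SPEED_BANDS)`` looks up
the categories of the two default settings.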
The mouth and throat values are a little harder to explain and the following
descriptions are based upon our aural impressions of speech produced as the
value of each setting is changed.
For mouth, the lower the number the more it sounds like the speaker is talking
without moving their lips. In contrast, higher numbers (up to 255) make it
sound like the speech is enunciated with exaggerated mouth movement.
For throat, the lower the number the more relaxed the speaker sounds. In
contrast, the higher the number, the more tense the tone of voice becomes.
The important thing is to experiment and adjust the settings until you get the
effect you desire.
To get you started here are some examples::

    speech.say("I am a little robot", speed=92, pitch=60, throat=190, mouth=190)
    speech.say("I am an elf", speed=72, pitch=64, throat=110, mouth=160)
    speech.say("I am a news presenter", speed=82, pitch=72, throat=110, mouth=105)
    speech.say("I am an old lady", speed=82, pitch=32, throat=145, mouth=145)
    speech.say("I am E.T.", speed=100, pitch=64, throat=150, mouth=200)
    speech.say("I am a DALEK - EXTERMINATE", speed=120, pitch=100, throat=100, mouth=200)

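
Once you have found settings you like, it can be convenient to keep them together
as named presets. The sketch below reuses three of the voices from the examples
above; the ``VOICES`` dictionary and the ``voice_settings`` helper are hypothetical
names of our own and are not provided by the ``speech`` module.

```python
# Timbre presets copied from the examples above.
VOICES = {
    "robot": {"speed": 92, "pitch": 60, "throat": 190, "mouth": 190},
    "elf": {"speed": 72, "pitch": 64, "throat": 110, "mouth": 160},
    "dalek": {"speed": 120, "pitch": 100, "throat": 100, "mouth": 200},
}

def voice_settings(name, **overrides):
    """Return a copy of the named preset with any overrides applied."""
    settings = dict(VOICES[name])  # copy, so the preset itself is not changed
    settings.update(overrides)
    return settings
```

The result can be unpacked straight into a call, for example
``speech.say("EXTERMINATE", **voice_settings("dalek", speed=100))``.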
Phonemes
========
The ``say`` function makes it easy to produce speech - but often it's not
accurate. To make sure the speech synthesiser pronounces things
*exactly* how you'd like, you need to use phonemes: the smallest
perceptually distinct units of sound that can be used to distinguish different
words. Essentially, they are the building-block sounds of speech.
The ``pronounce`` function takes a string containing a simplified and readable
version of the `International Phonetic Alphabet <https://en.wikipedia.org/wiki/International_Phonetic_Alphabet>`_ and optional annotations to indicate
inflection and emphasis.
The advantage of using phonemes is that you don't have to know how to spell!
Rather, you only have to know how to say the word in order to spell it
phonetically.
The table below lists the phonemes understood by the synthesiser.
.. note::

    The table contains the phoneme as characters, and an example word. The
    example words have the sound of the phoneme (in parentheses), but not
    necessarily the same letters.

    Often overlooked: the symbol for the "H" sound is ``/H``. A glottal stop
    is a forced stoppage of sound.

::

    SIMPLE VOWELS                          VOICED CONSONANTS
    IY           f(ee)t                    R        (r)ed
    IH           p(i)n                     L        a(ll)ow
    EH           b(e)g                     W        a(w)ay
    AE           S(a)m                     W        (wh)ale
    AA           p(o)t                     Y        (y)ou
    AH           b(u)dget                  M        Sa(m)
    AO           t(al)k                    N        ma(n)
    OH           c(o)ne                    NX       so(ng)
    UH           b(oo)k                    B        (b)ad
    UX           l(oo)t                    D        (d)og
    ER           b(ir)d                    G        a(g)ain
    AX           gall(o)n                  J        (j)u(dg)e
    IX           dig(i)t                   Z        (z)oo
                                           ZH       plea(s)ure
    DIPHTHONGS                             V        se(v)en
    EY           m(a)de                    DH       (th)en
    AY           h(igh)
    OY           b(oy)
    AW           h(ow)                     UNVOICED CONSONANTS
    OW           sl(ow)                    S        (S)am
    UW           cr(ew)                    SH       fi(sh)
                                           F        (f)ish
                                           TH       (th)in
    SPECIAL PHONEMES                       P        (p)oke
    UL           sett(le) (=AXL)           T        (t)alk
    UM           astron(om)y (=AXM)        K        (c)ake
    UN           functi(on) (=AXN)         CH       spee(ch)
    Q            kitt-en (glottal stop)    /H       a(h)ead

The following non-standard symbols are also available to the user::

    YX           diphthong ending (weaker version of Y)
    WX           diphthong ending (weaker version of W)
    RX           R after a vowel (smooth version of R)
    LX           L after a vowel (smooth version of L)
    /X           H before a non-front vowel or consonant - as in (wh)o
    DX           T as in pi(t)y (weaker version of T)

Here are some seldom used phoneme combinations (and suggested alternatives)::

    PHONEME        YOU PROBABLY WANT:     UNLESS IT SPLITS SYLLABLES LIKE:
    COMBINATION
    GS             GZ e.g. ba(gs)         bu(gs)pray
    BS             BZ e.g. slo(bs)        o(bs)cene
    DS             DZ e.g. su(ds)         Hu(ds)on
    PZ             PS e.g. sla(ps)        -----
    TZ             TS e.g. cur(ts)y       -----
    KZ             KS e.g. fi(x)          -----
    NG             NXG e.g. singing       i(ng)rate
    NK             NXK e.g. bank          Su(nk)ist

If you use anything other than the phonemes described above, a ``ValueError``
exception will be raised. Pass in the phonemes as a string like this::

    speech.pronounce("/HEHLOW")  # "Hello"

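
Since an unknown phoneme raises a ``ValueError``, it can be useful to catch the
exception while you experiment with spellings. The helper below is a minimal
sketch of that idea; ``try_pronounce`` is a hypothetical name, and the function to
call is passed in as an argument so the sketch can be tried with a stand-in away
from the device.

```python
def try_pronounce(pronounce, phonemes, **voice):
    """Call pronounce and report whether the phoneme string was accepted."""
    try:
        pronounce(phonemes, **voice)
    except ValueError:
        # The string contained something the synthesiser does not understand.
        return False
    return True
```

On a micro:bit you would pass in the real function, for example
``try_pronounce(speech.pronounce, "/HEHLOW")``.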
The phonemes are classified into two broad groups: vowels and consonants.
Vowels are further subdivided into simple vowels and diphthongs. Simple vowels
don't change their sound as you say them whereas diphthongs start with one
sound and end with another. For example, when you say the word "oil" the "oi"
vowel starts with an "oh" sound but changes to an "ee" sound.
Consonants are also subdivided into two groups: voiced and unvoiced. Voiced
consonants require the speaker to use their vocal cords to produce the sound.
For example, consonants like "L", "N" and "Z" are voiced. Unvoiced consonants
are produced by rushing air, such as "P", "T" and "SH".
Once you get used to it, the phoneme system is easy. To begin with some
spellings may seem tricky (for example, "adventure" has a "CH" in it) but the
rule is to write what you say, not what you spell. Experimentation is the best
way to resolve problematic words.
It's also important that speech sounds natural and understandable. To help
with improving the quality of spoken output it's often good to use the built-in
stress system to add inflection or emphasis.
There are eight stress markers indicated by the numbers ``1`` - ``8``. Simply
insert the required number after the vowel to be stressed. For example, the
lack of expression of "/HEHLOW" is much improved (and friendlier) when
spelled out "/HEH3LOW".
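
Inserting a stress marker is just a matter of string editing, which the following
sketch illustrates. The ``add_stress`` helper is hypothetical, and it uses a plain
substring search, so it assumes the vowel you name does not also appear by
accident across the boundary between two other phonemes.

```python
def add_stress(phonemes, vowel, level):
    """Insert a stress digit (1-8) after the first occurrence of vowel."""
    if not 1 <= level <= 8:
        raise ValueError("stress level must be between 1 and 8")
    index = phonemes.find(vowel)
    if index < 0:
        raise ValueError("vowel not found: " + vowel)
    cut = index + len(vowel)
    # Splice the digit in directly after the vowel.
    return phonemes[:cut] + str(level) + phonemes[cut:]
```

With this helper, ``add_stress("/HEHLOW", "EH", 3)`` builds the friendlier
spelling mentioned above.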
It's also possible to change the meaning of words through the way they are
stressed. Consider the phrase "Why should I walk to the store?". It could be
pronounced in several different ways::

    # You need a reason to do it.
    speech.pronounce("WAY2 SHUH7D AY WAO5K TUX DHAH STOH5R.")
    # You are reluctant to go.
    speech.pronounce("WAY7 SHUH2D AY WAO7K TUX DHAH STOH5R.")
    # You want someone else to do it.
    speech.pronounce("WAY5 SHUH7D AY2 WAO7K TUX DHAH STOHR.")
    # You'd rather drive.
    speech.pronounce("WAY5 SHUHD AY7 WAO2K TUX7 DHAH STOHR.")
    # You want to walk somewhere else.
    speech.pronounce("WAY5 SHUHD AY WAO5K TUX DHAH STOH2OH7R.")

Put simply, different stresses in the speech create a more expressive tone of
voice.
They work by raising or lowering pitch and elongating the associated vowel
sound depending on the number you give:
#. very emotional stress
#. very emphatic stress
#. rather strong stress
#. ordinary stress
#. tight stress
#. neutral (no pitch change) stress
#. pitch-dropping stress
#. extreme pitch-dropping stress
The smaller the number, the more extreme the emphasis will be. Stress markers
can also help the synthesiser pronounce difficult words correctly. For example,
if a syllable is not enunciated sufficiently, put in a neutral stress marker.
It's also possible to elongate words with stress markers::

    speech.pronounce("/HEH5EH4EH3EH2EH2EH3EH4EH5EHLP.")

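
Long strings like the one above are easier to build than to type. As a sketch
(``elongate`` is a hypothetical helper, not part of the ``speech`` module), the
same drawn-out cry for help can be assembled from a vowel and a list of stress
levels:

```python
def elongate(vowel, levels):
    """Repeat vowel once per stress level, tagging each copy with its level."""
    return "".join(vowel + str(level) for level in levels)

# The stress falls from 5 to 2 and rises back again, stretching out the vowel.
cry = "/H" + elongate("EH", [5, 4, 3, 2, 2, 3, 4, 5]) + "EHLP."
```

The resulting ``cry`` string can then be handed to ``speech.pronounce``.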
Singing
=======
It's possible to make MicroPython sing phonemes.
This is done by annotating a pitch related number onto a phoneme. The lower the
number, the higher the pitch. Numbers roughly translate into musical notes as
shown in the diagram below:
.. image:: speech-pitch.png
Annotations work by prepending a hash (``#``) sign and the pitch number to the
phoneme. The pitch will remain the same until a new annotation
is given. For example, make MicroPython sing a scale like this::

    solfa = [
        "#115DOWWWWWW",   # Doh
        "#103REYYYYYY",   # Re
        "#94MIYYYYYY",    # Mi
        "#88FAOAOAOAOR",  # Fa
        "#78SOHWWWWW",    # Soh
        "#70LAOAOAOAOR",  # La
        "#62TIYYYYYY",    # Ti
        "#58DOWWWWWW",    # Doh
    ]
    song = ''.join(solfa)
    speech.sing(song, speed=100)

In order to sing a note for a certain duration extend the
note by repeating vowel or voiced consonant phonemes (as demonstrated in
the example above). Beware diphthongs - to extend them you need to break them
into their component parts. For example, "OY" can be extended with
"OHOHIYIYIY".
Experimentation, listening carefully and adjusting is the only sure way to work
out how many times to repeat a phoneme so the note lasts for the desired
duration.
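
The two techniques described above, repeating a phoneme to hold a note and
splitting a diphthong into its parts, can be sketched as a pair of string
helpers. Both ``sung_note`` and ``stretch_diphthong`` are hypothetical names, and
the number of repeats needed for a given duration still has to be found by ear.

```python
def sung_note(pitch, phonemes, sustain, repeats):
    """Build one note: '#', the pitch number, the phonemes, then the held phoneme."""
    return "#" + str(pitch) + phonemes + sustain * repeats

def stretch_diphthong(start, end, start_repeats, end_repeats):
    """Extend a diphthong by repeating each of its component sounds."""
    return start * start_repeats + end * end_repeats

# Two notes joined into a short phrase, holding the final "Y" and "W" sounds.
phrase = sung_note(94, "MIY", "Y", 5) + sung_note(58, "DOW", "W", 5)
```

On the device the phrase could be performed with
``speech.sing(phrase, speed=100)``, and ``stretch_diphthong("OH", "IY", 2, 3)``
produces the extended "OY" mentioned above.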
How Does it Work?
=================
The original manual explains it well:

    First, instead of recording the actual speech waveform, we only store the
    frequency spectrums. By doing this, we save memory and pick up other
    advantages. Second, we [...] store some data about timing. These are
    numbers pertaining to the duration of each phoneme under different
    circumstances, and also some data on transition times so we can know how
    to blend a phoneme into its neighbors. Third, we devise a system of rules
    to deal with all this data and, much to our amazement, our computer is
    babbling in no time.

    --- S.A.M. owner's manual.

The output is piped through the functions provided by the ``audio`` module and,
hey presto, we have a talking micro:bit.
Example
=======
.. include:: ../examples/speech.py
    :code: python