Whatever you may think of the robotic voices foisted upon the world by Google Voice Search and Siri, you’re unlikely to mistake them for human voices. For years, the state of the art in computer speech synthesis has been stuck at a fairly low level. However, new software called WaveNet, from the brainiacs at DeepMind, is setting a high-water mark in the field of speech synthesis and giving AI a voice eerily similar to that of a human.
For years roboticists have spoken about something called the uncanny valley – the creepy feeling one gets when observing a robot that is too mechanistic to be mistaken for a human, but not quite mechanical enough to be distinctly robotic, either.
Perhaps one reason there has been no parallel concept for robotic speech is that, to date, no speech synthesizer has attained a quality close enough to a human voice to be disturbingly similar. With DeepMind’s WaveNet, we may be witnessing the emergence of something like an uncanny waveform: a robotic voice close enough to our own to be distinctly creepy. Or, like me, you may just rejoice that there’s finally hope for an ebook reader that doesn’t sound like the reanimated corpse of a 1980s Commodore computer.
The secret sauce behind this new standard in robotic speech, ironically enough, is artificial intelligence — albeit with a little help from some smart software engineers along the way.
Side-by-side comparison of text-to-speech methods as rated by human listeners. (Image source: DeepMind, www.deepmind.com)
We may as well get used to this state of affairs, as it increasingly looks as though advances in fields like robotics and AI will be realized with the help of artificial intelligence itself. While this virtuous feedback loop still includes human intermediaries, a trend toward self-improving AI may be in the offing, along with all the concomitant existential risks this betokens. Regardless, let’s take a closer look at WaveNet and see how artificial intelligence enables, and indeed forms the backbone of, DeepMind’s new speech synthesizer.
To date, most speech synthesizers have been of two types: concatenative text-to-speech and parametric text-to-speech. Concatenative text-to-speech is the method behind the so-called “high quality” speech synthesizers used by Google Voice and Siri. It produces a more realistic sound by drawing on large audio files of real people’s voices, chopped up and recombined to form whatever words the computer is enunciating. The downside is that it is difficult to color the speech with changes in emotion or emphasis.
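As a rough sketch, the concatenative approach amounts to a lookup-and-join operation over a database of recorded units. The word-level database, clip lengths, and function names below are invented purely for illustration; real systems like those behind Siri work with far smaller sub-word units (such as diphones) and smooth the joins between them:

```python
# Toy illustration of concatenative text-to-speech (not Google's or
# Apple's actual pipeline): pre-recorded snippets of a real voice are
# looked up and joined to form the requested utterance.

SAMPLE_RATE = 16_000  # samples per second

# Hypothetical "unit database": each word maps to a short recorded
# waveform (placeholder zeros here). Real systems store thousands of
# sub-word units recorded from a voice actor.
unit_database = {
    "hello": [0.0] * (SAMPLE_RATE // 2),  # a 0.5 s clip
    "world": [0.0] * (SAMPLE_RATE // 2),
}

def synthesize(text: str) -> list[float]:
    """Concatenate the stored clip for each word, with brief gaps."""
    gap = [0.0] * (SAMPLE_RATE // 10)  # 0.1 s of silence between words
    samples: list[float] = []
    for word in text.lower().split():
        samples.extend(unit_database[word])
        samples.extend(gap)
    return samples[:-len(gap)] if samples else samples  # drop last gap

audio = synthesize("hello world")
# 0.5 s + 0.1 s + 0.5 s of audio at 16 kHz
```

Because the output is stitched together from fixed recordings, changing the emotion or emphasis of a sentence would require re-recording the underlying units, which is exactly the limitation described above.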
The alternative method, parametric text-to-speech, uses a rule-based system derived by applying statistical models to speech patterns. The stilted, robotic-sounding speech synthesizers are mostly of this latter type, since they rely on the computer to generate the audio signal itself rather than on recordings of real human voices.
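The parametric approach can be sketched in the same toy style. Here no recordings are involved at all: the waveform is generated from numeric parameters. The parameter table and values below are made up for illustration; real parametric systems model many more parameters (spectral envelope, aperiodicity, and so on) through a vocoder:

```python
# Toy illustration of parametric text-to-speech (not any production
# vocoder): the computer generates the waveform itself from numeric
# parameters such as pitch and duration, with no recorded audio.
import math

SAMPLE_RATE = 16_000  # samples per second

# Hypothetical parameter table of the kind a statistical model might
# produce: each "phoneme" gets a fundamental frequency (Hz) and a
# duration (seconds). The values are invented for this sketch.
params = {
    "ah": (120.0, 0.20),
    "ee": (210.0, 0.15),
}

def synthesize(phonemes: list[str]) -> list[float]:
    """Emit a pure sine tone per phoneme: fully computer-made audio."""
    samples: list[float] = []
    for p in phonemes:
        f0, duration = params[p]
        for t in range(int(SAMPLE_RATE * duration)):
            samples.append(math.sin(2 * math.pi * f0 * t / SAMPLE_RATE))
    return samples

audio = synthesize(["ah", "ee"])
# 0.20 s + 0.15 s of synthetic tone at 16 kHz
```

The buzzy, robotic timbre of parametric systems follows directly from how simple the generative rules are compared with the richness of a real human voice.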
The WaveNet system can be thought of as an improvement upon concatenative text-to-speech, in that it still employs recordings of real human voices, though as raw material for learning rather than as snippets to be stitched together.
Credit to extremetech.com