Advanced Text-To-Speech Promises New Possibilities for Siri (and Others)

In our “Voice-First” world, we focus most of our attention on technologies that enable intelligent systems to understand what people are saying and to respond with an appropriate answer. But at some point, having systems that can speak back to people effectively will be important.

That’s not to say that text-to-speech technologies have been standing still for the past decade. Far from it. But while we seem to be unforgiving of Voice-First applications and assistants that don’t understand us perfectly all the time, we’re willing to put up with Text-To-Speech (TTS) technologies offering voice output that’s far from human.

In the advice columns popping up for developers of voice apps, a common theme is to avoid applications that are too “verbose.” Don’t make the voice assistant say too much. And if your application requires a lot of talking, the advice is to stream human-narrated audio rather than rely on TTS.

These are workarounds for what is hopefully a temporary problem with current technology. Most voice assistants today rely on concatenative synthesis for speech output. Short snippets of speech, including individual words and phonemes, are recorded from a single speaker. To make full sentences, these small snippets are spliced together on the fly. While this technology makes it possible for a voice assistant to say pretty much anything, the snippets are fixed recordings, so the result lacks the natural fluctuations in intonation, emphasis, and phrasing that we associate with human speech.
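To make the splicing idea concrete, here is a toy sketch in Python. It is only an illustration under loud assumptions: the "recordings" are synthetic sine tones rather than real speech, the inventory holds whole words only, and the names `UNITS`, `fake_recording`, `crossfade_join`, and `synthesize` are hypothetical. Production systems are far more elaborate, so treat this as a picture of the principle, not of any shipping assistant.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this sketch)


def fake_recording(freq_hz, duration_ms=200):
    """Stand-in for a recorded snippet: a plain sine tone, not real speech."""
    n = SAMPLE_RATE * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


# Hypothetical unit inventory: exactly one fixed recording per word.
UNITS = {
    "today": fake_recording(220.0, 200),
    "is": fake_recording(330.0, 100),
    "sunny": fake_recording(440.0, 300),
}


def crossfade_join(left, right, overlap):
    """Join two snippets, linearly blending `overlap` samples at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    start = len(left) - overlap
    blended = [
        left[start + i] * (1 - i / overlap) + right[i] * (i / overlap)
        for i in range(overlap)
    ]
    return left[:start] + blended + right[overlap:]


def synthesize(text, overlap=80):
    """Splice the stored snippet for each word into one waveform."""
    words = text.lower().split()
    missing = [w for w in words if w not in UNITS]
    if missing:
        raise KeyError(f"no recorded unit for: {missing}")
    audio = []
    for word in words:
        audio = crossfade_join(audio, UNITS[word], overlap)
    return audio


waveform = synthesize("today is sunny")
print(len(waveform) / SAMPLE_RATE, "seconds of audio")
```

Notice that `synthesize` reuses the identical stored snippet for “sunny” no matter where the word falls in the sentence. That is the limitation in miniature: the crossfade hides the seams, but nothing in the process can change how a word is said.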

At this week’s WWDC, Apple unveiled two new voices for Siri, one female and one male. Both voices rely on machine learning to produce more natural-sounding speech that includes different vocal inflections. Apple showcased how the female Siri voice is able to enunciate the word “sunny” in three different ways while providing a weather forecast.

Google has also been working on improved speech synthesis technology that leverages deep neural networks developed by DeepMind. WaveNet is a Google project that models the raw audio waveforms of human speech. Its deep neural networks are trained on large datasets of recorded human speech. The goal is to train complex algorithms to generate natural-sounding synthesized speech on the fly.

Truly natural-sounding TTS will open up new worlds of possibilities for voice app developers, brands, and customer support teams. Automated IVR systems powered by intelligent assistance technologies can already retrieve the best answers for customers. In the near future, automated TTS systems may be able to deliver those answers in voices indistinguishable from a human’s. While some may view this achievement as a further threat to jobs, the technology is likely to be welcomed by customers eager for better self-service options and by call center agents who would gladly hand off routine question-answering to their virtual partners.



Categories: Conversational Intelligence, Intelligent Assistants
