Microsoft’s Head of Research Showcases Speech-to-Speech Translation

Microsoft’s chief research officer Rick Rashid showcased real-time speech-to-speech translation (English to Chinese) at Microsoft Research Asia’s 21st Century Computing event in early November 2012. In his address, Rashid noted that the software underlying the service benefited from 60 years of research in speech processing, research that accelerated with the application of the Hidden Markov Model (HMM) in the late 1970s and, more recently, from Microsoft’s work with “deep neural networking” and Big Data.

Then he segued into a demonstration that begins about 7 minutes into this video (but really hits its stride about a minute later).

The key points made by Rashid include: Speech recognition is still fraught with errors. Word error rates remain in the 15% range for “raw” speech, but those errors can be compensated for. The advent of Big Data and deep neural networks has made it possible to translate individual words and “re-order” them into fairly accurate phrases and sentences.
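
To make that 15% figure concrete, here is a minimal sketch of how a word error rate (WER) is conventionally computed: the word-level edit distance between the recognizer’s output and a reference transcript, divided by the length of the reference. The example sentences below are invented for illustration and are not from Rashid’s demo.

```python
# Minimal WER computation: word-level Levenshtein distance divided by
# reference length. Example sentences are invented for illustration.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.17
```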

But the secret sauce for speech-to-speech translation comes with the next step. Microsoft’s technology takes a sample of Rashid’s spoken words, compares it to a large database of utterances from Chinese speakers, and uses the pitch, timbre, and prosody of the voices that most closely match Rashid’s speaking patterns. As a result, it is able to render Chinese text-to-speech output in a voice that closely resembles Rashid’s own. That’s pretty impressive.
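
To illustrate that matching step, here is a toy sketch under stated assumptions: the features (pitch, a timbre proxy, speech rate), the speaker names, and all values are hypothetical, and a real system would extract far richer acoustic features from audio. This is not Microsoft’s pipeline, only the nearest-neighbor idea behind it.

```python
# Hedged sketch of voice matching: pick the database voice whose crude
# acoustic features are closest to the input speaker's. All features,
# names, and numbers here are invented for illustration.
import math

# Hypothetical (pitch_hz, timbre_brightness, words_per_second) per speaker.
database = {
    "speaker_a": (118.0, 0.42, 4.1),
    "speaker_b": (196.0, 0.61, 5.0),
    "speaker_c": (125.0, 0.40, 3.8),
}

def closest_voice(sample: tuple[float, float, float]) -> str:
    # Nearest neighbor by Euclidean distance in the feature space.
    return min(database, key=lambda s: math.dist(database[s], sample))

input_speaker = (124.0, 0.41, 3.9)
print(closest_voice(input_speaker))  # -> "speaker_c"
```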

Rashid firmly believes that this technology will break down language barriers “within a few years.” To me, that means it will be at least three years before Microsoft and its peers (Google comes closest) roll out production versions of speech-to-speech translation. In the meantime, we’ve seen some interesting offerings from small, entrepreneurial firms like Lexifone, SpeechTrans, iTranslate and a few other companies that have cobbled together proprietary machine translation technologies with speech processing resources from Nuance, iSpeech or others.

Once the three-year gestation period is over, we anticipate that mobile personal virtual assistants will routinely translate instructions and content on the fly, rendering them in the appropriate language.


