Ah, Firenze! It’s the site of Interspeech 2011, the 12th annual conference of the International Speech Communication Association (ISCA). This year’s theme is “speech science and technology for real life.” At the outset, ISCA spokespeople said they expected a gathering of more than 1,000 international speech scientists, with more than 800 papers presented in oral and poster sessions.
Scientists affiliated with the Speech at Microsoft Group upped the speech-applications ante this year when they posted this story on the Microsoft Research blog. For years, we’ve been hearing that automated speech processing is on the cusp of a quantum leap in “accuracy,” both in recognizing utterances and in understanding language. This post makes it clear that neural networks play an important part in powering that leap forward.
As always, the ideal solutions are “out-of-the-box, speaker-independent speech-recognition services” provided by a system that requires no training and works well for all users under all conditions. Certainly advances in acoustic processing and directional microphones will play a part. But Interspeech is the place where speech science is showcased, and the relevant three-letter abbreviation is DNN (for “deep neural network”).
Past solutions were based on capturing utterances as “phonemes” and interpreting them with so-called context-dependent Gaussian mixture model HMMs (CD-GMM-HMMs). Dong Yu, a researcher at Microsoft’s speech labs in Redmond, Washington, observes that DNNs can instead model smaller components (or “building blocks”) of spoken utterances called “senones.” In short, today’s computing platforms can apply a new technique (DNN) to understand what is being said by looking at incredibly small pieces of spoken material.
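To make the idea concrete, here is a minimal, hypothetical sketch of the hybrid DNN-HMM arrangement described above: a small feed-forward network maps a window of acoustic feature frames to posterior probabilities over senones (tied context-dependent HMM states), which a decoder can then convert into scaled likelihoods. The layer sizes, window width, senone count, and random weights are all illustrative placeholders, not the actual Microsoft model.

```python
# Sketch of senone classification in a hybrid DNN-HMM system.
# All dimensions and parameters below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 39        # e.g., 13 MFCCs + deltas + delta-deltas (assumed)
CONTEXT = 11          # window of 11 frames fed to the network (assumed)
HIDDEN = 512          # hidden-layer width (assumed)
NUM_SENONES = 2000    # tied-state count; real systems use thousands

def init_layer(n_in, n_out):
    """Random weights stand in for trained parameters."""
    return rng.normal(0, 0.01, (n_in, n_out)), np.zeros(n_out)

layers = [
    init_layer(FRAME_DIM * CONTEXT, HIDDEN),
    init_layer(HIDDEN, HIDDEN),
    init_layer(HIDDEN, NUM_SENONES),
]

def senone_posteriors(frame_window):
    """Forward pass: sigmoid hidden layers, softmax senone output."""
    h = frame_window.reshape(-1)                   # flatten the window
    for w, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ w + b)))    # sigmoid activation
    w, b = layers[-1]
    logits = h @ w + b
    e = np.exp(logits - logits.max())              # numerically stable softmax
    return e / e.sum()

# In a hybrid decoder, posteriors are divided by senone priors to obtain
# scaled likelihoods for the HMM search (Bayes' rule, up to a constant).
priors = np.full(NUM_SENONES, 1.0 / NUM_SENONES)   # uniform placeholder
window = rng.normal(size=(CONTEXT, FRAME_DIM))     # fake acoustic frames
scaled_likelihoods = senone_posteriors(window) / priors
print(scaled_likelihoods.argmax())                 # best-scoring senone
```

The key contrast with the GMM approach is visible in the output layer: rather than fitting a mixture of Gaussians per state, the network directly scores every senone at once from a wide window of frames.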
Microsoft’s team delivered their paper at Interspeech today (August 29, 2011).