The annual SpeechTek Conferences provide a great venue in which to take stock of how well and how quickly automated speech processing technologies are evolving. In past years, discussions revolved around the accuracy of speech recognition and the human-like qualities of text-to-speech rendering. Attendees’ questions most often centered on whether speech-enabled IVRs (interactive voice response systems) would ever be cost-effective replacements for touch-tone systems and on whether superior forms of “natural language understanding” might dissuade people from automatically pressing “zero” or barking “agent” or “representative” repeatedly until they connected with a fellow human being.
At this year’s gathering, both formal presentations and informal hallway discussions centered on a decidedly different set of topics. Instead of debating whether improving the accuracy of speech recognition resources would lead to higher automation rates and make it easier to replace live agents, most of the conversations I engaged in revolved around the value of integrating speech processing resources, such as they are, with the computing power and capacious memory that only the cloud can support. As a result, the discussions had a new framing. We talked about how the combination of NLP (natural language processing), ML (machine learning) and PA (predictive analytics) is already compensating for shortcomings in today’s statistical language models. From their lofty perch in the cloud, speech-enabled systems do a better job of disambiguating what an individual has said, and they can use large data sets and analytics to recognize, or even anticipate, each caller’s intent.
In the new reality, the old-school problems created by inaccurate recognition of spoken words have been rendered moot. Self-service professionals, speech scientists and VUI (voice user interface) experts alike are grappling with a new set of challenges born of the need to support search, command and control, and e-commerce in a “post-PC” but “pre-PVA” world. We’re post-PC because wireless and mobile devices (especially smartphones and tablets) have largely eclipsed PCs and laptops for routine daily commerce. At the same time, the general public has been tantalized by the promise (but not quite the reality) of Personal Virtual Assistants. In most people’s minds, PVAs have an identity crisis. There is a cacophony of evolving technology platforms and services. The list includes anthropomorphic wunderkinder like Apple’s Siri, Nuance’s Nina and SRI’s Lola. But there is also a plethora of speech-enabled search and “action” tools, notably Google Now, along with chatbots, textbots and others.
I participated in a Keynote Panel on Tuesday, August 20, sharing the stage with Vlad Sejnoha, CTO of Nuance, and Mazin Gilbert, AVP of Technical Research at AT&T. Both fellow panelists did an excellent job of articulating that “intelligent systems” in no way implies systems that recognize speech more accurately or sound more human-like. Instead, looking deep into the predictable future (meaning three years from now), the most apparent advancement will be the use of machine learning, natural language processing and analytics, in conjunction with touch, gesture, sentiment recognition and even biometrics, to do a better job of recognizing who an individual is and of ascertaining and responding to that person’s intent. Another way to say this is that the speech technology community has had to come to grips with the fact that other technologies are being employed to compensate for shortcomings in core speech recognition and text rendering.
The emphasis during the coming years will be on systems that learn. For PVAs, “learning” means detecting user preferences and usage patterns and applying “deep analysis” to become predictive. Google Now is the closest thing to this sort of assistant that is generally available. Meanwhile, Apple, Nuance and others will continue to forge stronger bonds with e-commerce sites and industry-specific Web-based resources. They are already pretty good at helping to book a restaurant reservation, buy movie tickets or arrange other aspects of travel. When Amazon finally cranks up a talking Kindle or an e-commerce assistant/advisor on its various Web sites, you can expect some spectacularly robust applications across retailing, media and other areas of strength for the master of Web-based commerce. And Nuance has demonstrated some impressive applications for Nina on mobile devices as well as home entertainment units, combining text, speech, touch and gesture-based input.
In all cases, “understanding” and “learning” are the subtext for constant improvement in the user experience. We, the humans using these virtual assistants, must do some learning of our own. We must focus on understanding what the machines can do well and, to paraphrase one of Tom Cruise’s memorable lines from the film Jerry Maguire, “Help them help us.”