Voice Assistants to Benefit from Generative AI’s Popularity

It took the viral adoption of OpenAI’s ChatGPT to inspire Microsoft and its cohort of fellow hyperscalers to expand availability of their own large language models (LLMs) and generative AI resources. Collectively, they precipitated a sudden, fundamental rise in the expectations mere humans have for automated assistants. Thanks to Microsoft, Salesforce, Facebook, Google, Adobe, NVIDIA and many smaller solution providers, Conversational AI is now a standard layer in the new IT solution stack. Hundreds of millions of “Gen G’ers” are sharing their (generally positive) experiences as they take turns with ChatGPT, Bard, Claude, Firefly, Perplexity.ai or their Generative AI platform of choice.

Conversations most often start with simple requests, ranging from assistance in writing a business email to specifications for a 400-page essay on Virgil’s Aeneid. A growing number culminate in completing complex tasks, like making travel reservations for multi-city trips, concocting menus for upcoming social gatherings, or putting together a detailed plan for purchasing materials and taking on a home improvement project. The list of plug-ins initially released for ChatGPT reflects where OpenAI expects ChatGPT to take its users. It included Expedia, FiscalNote, Instacart, KAYAK, Klarna, Milo, OpenTable, Shopify, Slack, Speak, Wolfram, and Zapier, representing a sampling of Web-based resources for travel, financial planning, grocery shopping, dining out and a plethora of other common activities.

Shouldn’t Voice Assistants Be Doing All This?

In short, the new generation of Conversational AI portals is performing tasks on behalf of users that, a decade ago, were the charter for their voice-based precursors: Siri, Alexa and Google Assistant. After more than a decade of performing a short list of routine tasks through smartphones, home speakers and automobile control panels, voice assistants and their corporate cousins, voicebots and speech-enabled IVRs, have new cause for hope. As documented in the “Voice Consumer Index” issued by Vixen Labs in 2022 <https://vixenlabs.co/research/voice-consumer-index-2022>, there is a short list of features or functions that regular users turn to on a daily basis – limited to weather, music, news and entertainment. In the U.S., add shopping, food delivery and even healthcare and wellness. A significant number, though not a majority, have stepped up to use voice commands to control lighting systems or entertainment units already integrated into the Internet of Things. In the U.S. and U.K., only 38% of respondents used their voice assistants at least once a day, including 28% reporting multiple uses per day.

Voice assistants lack two attributes that would fuel momentum toward more frequent use. They don’t engage in conversations and are capable of completing only the simplest of tasks: “turn on the lights,” “tell me the news,” “play ‘Misty’ for me.” As the authors of Vixen’s “Voice Consumer Index” explain, “Voice in daily life means voice that’s conversational, convenient and helpful.” Elsewhere in the report they note that the primary goal of each user is “to find assistance, not an assistant.” The aspiration of voice application developers is to provide services that are invoked frequently, conversationally and successfully.

These objectives proved elusive when application developers used hard-coded dialog modules along with “state-of-the-art” speech recognition and text-to-speech resources that often employed live voices. Solutions were expensive, hard to scale and required extensive training and testing. Today, LLMs and Generative AI change all that. In the most frequently used categories – search, shopping, and task completion – the prospects for success are greatly enhanced when voice assistants are informed through an API that requires no training and can be deployed almost instantaneously to hundreds of millions of users.
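To make that contrast concrete, the sketch below shows how little custom work a voice front end needs once an LLM sits behind it: the speech recognizer’s transcript is posted to a hosted chat-completions API and the reply is handed back for text-to-speech playback. This is a minimal illustration only; the model name, system prompt and the surrounding speech components are assumptions for the example, not any vendor’s reference design.

```python
import os
import requests

# Illustrative sketch: route a transcribed voice command through a hosted LLM.
# Speech-to-text and text-to-speech are placeholders here; any device-native
# or cloud engine could fill those roles.

API_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]  # assumes a valid API key is configured

def llm_reply(utterance: str) -> str:
    """Send the transcribed utterance to the LLM and return its text reply."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-3.5-turbo",  # assumed model name; substitute as needed
            "messages": [
                {"role": "system",
                 "content": "You are a voice assistant. Keep answers short enough to speak aloud."},
                {"role": "user", "content": utterance},
            ],
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # In a real assistant, `utterance` would come from a speech recognizer
    # and the reply would be passed to a text-to-speech engine.
    utterance = "Put together a menu for a dinner party of six this Saturday."
    print(llm_reply(utterance))
```

No dialog flows are hand-coded and no domain-specific model is trained; the general-purpose LLM supplies the conversational competence, which is the shift the paragraph above describes.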

As Jon Stine of the Open Voice Network noted in a recent post, “voice” is not a product or a capability; rather, it is “a subset of an interface to language models large and small.” Being a subset of anything means being a small part of a significantly larger whole, and considerable energy needs to be dedicated to determining “where voice fits” in Conversational AI-based workflows and when the spoken word is the ideal user interface. Both these factors are uncertain, but, for the first time, end-users are finding an abundance of applications that benefit from conversational user interfaces.
