A Rising Tide Raises All (Voice) Bots

OpenAI rocked the bot world with its introduction of GPT-4o on May 13. The “o” is for “omni,” and it refers to the model’s ability to support text, vision and audio rapidly and routinely, even on the free version of ChatGPT. As CTO Mira Murati explains in an introductory video, the new model eliminates a good deal of latency “in voice mode” by orchestrating three basic functions – transcription, intelligence and text-to-speech – natively. As a result, conversing with a voicebot has a totally natural feel. The bot is highly responsive; it can be interrupted; and it can carry out instructions while remembering important pieces of context.

The most dramatic demonstration, from my point of view, is the one in the video below, where one AI (bot) is prompted to interact with another AI whose phone/host is equipped with a camera. What better way to illustrate the power of vision and voice? (If you stay until the end, you will hear the bots singing to each other.)

Floodgates Open For AI-Informed Voice Assistants

Developers of alternative voice assistants were already accelerating their efforts to popularize GenAI-informed services. One example is SoundHound’s technology partnership with Perplexity, which was announced on May 9. As this demo from 2015 illustrates, SoundHound has been delivering impressive demonstrations of its core technologies’ ability to answer rapid-fire questions, while keeping track of context in ways that rivals like Siri, Google, and Microsoft could not. Leaning on Perplexity’s LLM and Generative AI resources facilitates more lifelike responses and showcases the power of Voice AI with great relevance in automobiles and mobile situations.

Interactions LLC, a winner of an Opus Research Conversational AI Award for its implementation at TXU Energy, has been supporting lifelike, conversational interactive voice response (IVR) systems for over twenty years. It has relied on a unique solution architecture that places great importance on live agents as “Intent Analysts” who disambiguate unclear input and propel conversations to successful resolution.

Gridspace, a two-time winner of the Conversational AI Award, enables Memorial Hermann Health System to automate outreach to existing patients, using phone calls to schedule annual wellness visits and follow up on medical procedures, among other activities. Gridspace positions its core virtual agent product, Grace, as an AI voice bot with the tone and skill of a top-performing contact center agent. In advance of the introduction of a new Siri, Opus Research expects to see heightened activity among firms targeting the growing population of citizen developers with APIs that are optimized to support humanlike conversations with Generative AI-informed resources.

PolyAI, a seven-year-old startup, has just raised $40 million in a Series B funding round, reflecting an appreciation of the growth potential of voice AI among the investment community. PolyAI’s CEO, Nikola Mrkšić, cites greater than 10x growth over the past two years or so, achieved by providing voicebots that “speak to people like people speak to each other” and tackle complex challenges over the phone for the likes of Marriott, PG&E, Caesars Entertainment, Volkswagen, SelectQuote and others.

ASAPP formally launched its GenerativeAgent last month. A video demo shows how its technology, like PolyAI’s, is able to understand complex questions and resolve them by consulting live agents or accessing back-office systems that can make reservations, issue refunds, add a driver to an auto insurance policy and the like.

Other solution providers seeking to speed up development of voicebots that can carry on complex conversations and respond based on the output of an LLM include:

VAPI, which counts Groq as a technology partner. Groq, which was founded by former Google engineer Jonathan Ross, manufactures Language Processing Units (LPUs), which are computer chips that promise to accelerate chatbot response generation.

Bland.ai, which has been demoing its lifelike voice agent, including a promotion that displayed a telephone number to call so that interested individuals could experience the quality of its voicebots for a variety of enterprise use cases.

Retell.ai, which spun out of Y Combinator in 2023, offers a custom LLM and access to an API that lets contact center operators as well as small-to-medium-sized businesses build voicebots.

Up Next: A Better, AI-Informed Siri

It is no coincidence that OpenAI’s introduction of GPT-4o was timed for the eve of Google I/O 2024, and a month prior to Apple’s Worldwide Developers Conference (WWDC2024). When ChatGPT burst on the scene 18 months ago, many analysts saw it as a replacement for Web-based search that posed an existential threat to Google. While that is unlikely to be the case, there are impressive demos of voice search that spans text, images and video, and enables searchers to converse with an AI in ways that are already making Google Assistant obsolete and requiring Gemini to up its game.

Apple’s Worldwide Developers Conference (WWDC2024) will convene on June 10, and it is expected to be the coming-out party for a brand new Siri. In a major update, the voice assistant that debuted as an App Store app in 2010, and that Apple built into the iPhone 4S in 2011, will be retooled and ready to rely on pretrained large language models that Apple has been refining for the past several years. Informed speculation anticipates that Apple, after sacking 121 people responsible for improving the “old” Siri earlier this year, is focused on an AI-informed Siri whose Generative AI resources can run “locally” on the souped-up processors and other resources of its iPhones. In that case, there is really fertile ground for Siri to transform itself into a true personal assistant or agent informed by a “domain specific language model” built on permissioned and secure access to user-generated content. Yet, in the wake of OpenAI’s Spring Release, I can’t help but think that GPT-4o is a live version of what Siri ought to be.

Recall that Apple has long differentiated itself in the personal computer and mobile domains as a protector of privacy and security. The new Siri can be trained on search history and profile information, enhanced by location awareness, activity streams, payment preferences and “status” indicators (from loyalty programs and other sources). In short, Apple is on a short list of candidates to provide its customers with “Personal AI,” and Siri is the ideal voice user interface for that resource.

Collectively, the evidence is building that new chip technology and access to custom LLMs are poised to create voice user interfaces that the general public finds useful and usable… FINALLY!
