This post is triggered by an ongoing thread among contact center professionals on LinkedIn as they lament the imminent “end-of-life” for Nuance’s roster of premises-based speech processing resources. The process began in 2021 when parent company Microsoft issued formal notice to its go-to-market partners (Genesys, Cisco and Avaya, among them) that Nuance SpeechSuite products, including Nuance Vocalizer (Text-to-Speech, or TTS), Nuance Recognizer (Automatic Speech Recognition, or ASR) and the associated Dialog Modules, would reach “end-of-sale” status in August of this year.
The change has global impact. Prior to its acquisition by Microsoft, Nuance was, hands down, the market share leader for premises-based ASR and TTS (and voice biometrics). Its dominance was the result of the 2005 mega-merger of ScanSoft/SpeechWorks and California-based Nuance. Both had a history of Pac-Man-like ingestion of voice processing specialists, including BeVocal, VoiceSignal, Vocada, Spinvox, SVOX, Loquendo, Vlingo and many others. Nuance’s departure leaves enterprise customers with a very short list of premises-based solution providers, led by LumenVox by Capacity.
Meanwhile, looking exclusively at ASR, Nuance has been joined by an impressive new generation of transcription specialists. Small firms like Otter.ai, Speechmatics and Deepgram have made significant strides toward recognizing and rendering spoken words and phrases, including the proper names, technical terms and other jargon that have plagued precursor technologies (and humans). At the same time, Google, Microsoft (independent of Nuance), Amazon Web Services and IBM have made steady investments and progress toward highly accurate transcription.
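To make the shift concrete, here is a minimal sketch of what consuming one of these cloud ASR services looks like in practice, using Amazon Transcribe’s batch API via boto3. The bucket, audio file and job name are placeholders of my own, not drawn from any vendor’s documentation or announcement:

```python
import time
import boto3

# A minimal sketch of cloud-based batch transcription with Amazon Transcribe.
# The S3 URI and job name below are illustrative placeholders.
transcribe = boto3.client("transcribe", region_name="us-east-1")

transcribe.start_transcription_job(
    TranscriptionJobName="contact-center-call-001",
    Media={"MediaFileUri": "s3://example-bucket/call.wav"},
    MediaFormat="wav",
    LanguageCode="en-US",
)

# Batch jobs are asynchronous: poll until the service finishes.
while True:
    job = transcribe.get_transcription_job(
        TranscriptionJobName="contact-center-call-001"
    )["TranscriptionJob"]
    if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
        break
    time.sleep(5)

# On success, the transcript JSON is available at this URI.
print(job.get("Transcript", {}).get("TranscriptFileUri"))
```

The contrast with a premises-based deployment is the point: there is no recognizer to license, install or tune, just an API call against a continuously improving hosted model.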
The Move to Speech-Enable LLMs
Voicebots had been slower to evolve than text-based chatbots, largely because popular foundation models were trained on text input. That situation changed dramatically when a new strain of “Voice AI” was showcased by OpenAI (and described here) in May 2024, with GPT-4o demonstrating an uncanny ability to carry on life-like conversations. A personable, emotive voice assistant that didn’t mind being interrupted became a darling of the GenAI world. Since then, the pace of innovation has only accelerated, culminating yesterday (August 13) with Google’s introduction of its latest rendition of Gemini, including Gemini Live, which, like GPT-4o at its debut, lets people engage in conversations, ask for advice, and carry out daily commerce.
Both assistants converse through mobile phones. For owners of phones in the Pixel 9 series (Pixel 9 Pro, 9 Pro XL and 9 Pro Fold), Gemini takes the place of the Google Assistant at the touch of a button. On iPhones, voice control of GPT-4o is available, but I couldn’t figure out how to make it carry on a conversation. Both assistants rely on LLMs that reside “in the cloud”. They are architected to overcome latency issues, yet somehow they are unable to perform simple tasks like setting alarms or playing music, because those functions are not running on the phone.
Premises-based Speech Solutions Get the Same Treatment
The emergence of these AI-powered voice assistants spells more bad news for the “legacy” services that have been a mainstay of our analysis for decades. This is especially true in “enterprise” settings, where Machine Learning and Natural Language Processing were foundational to the conversion of Interactive Voice Response (IVR) systems’ touch-tone input (Press “1” for Sales) to spoken input.
Even though text still has primacy, we now have the technology for LLMs to understand spoken words and respond in kind. The question is what that means for the installed base of IVR systems (as well as the staff charged with the care and feeding of existing voice applications). For premises-based implementations, the answer has been clear for a while, thanks to Microsoft’s decision, in the wake of its acquisition of Nuance, to end-of-life premises-based ASR and TTS software.
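As a rough illustration of what “respond in kind” entails, here is a minimal speech-in, speech-out sketch built on OpenAI’s hosted APIs. The model names, file names and system prompt are illustrative assumptions on my part, not a prescription:

```python
from openai import OpenAI

client = OpenAI()

# 1) ASR: transcribe the caller's recorded audio.
with open("caller.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# 2) LLM: formulate a reply to the transcribed request.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a contact center voice assistant."},
        {"role": "user", "content": transcript.text},
    ],
)

# 3) TTS: render the reply back into speech.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("reply.mp3", "wb") as out:
    out.write(speech.content)
```

In production, of course, these stages would be streamed rather than run as three blocking calls; that streaming is where much of the latency engineering mentioned above actually happens.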
OpenAI and the other foundation model providers continually raise the ASR bar in ways that stand-alone solutions can’t match. Imagine an IVR system (or voicebot) with access to all the knowledge and connections it takes to rapidly ascertain a customer’s needs and objectives, along with the ability to recognize the caller’s intents, state of mind and sentiment. This is no longer ASR. It’s Intelligent Assistance.
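One hedged sketch of that shift: instead of matching a caller’s words against a hand-built grammar, the transcript can simply be handed to an LLM that returns intent and sentiment directly. The JSON schema, model choice and sample utterance here are my own assumptions for illustration:

```python
import json
from openai import OpenAI

client = OpenAI()

# A transcribed caller utterance (illustrative).
utterance = "I've been double-billed two months in a row and I want it fixed today."

result = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # request machine-readable output
    messages=[
        {"role": "system",
         "content": "Classify the caller's request. Return JSON with keys "
                    "'intent', 'sentiment' and 'urgency'."},
        {"role": "user", "content": utterance},
    ],
)

analysis = json.loads(result.choices[0].message.content)
# e.g. {"intent": "billing_dispute", "sentiment": "negative", "urgency": "high"}
```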
Now the stage is set for understanding recent events that are shaping voice-based self-service. Two recent news items highlight the acceleration.
Item 1: PolyAI Rests on AWS’ Bedrock
Earlier this month, London-based PolyAI named Amazon Web Services (AWS) its “preferred cloud provider”, with the intent to leverage the capabilities of its LLM management resources, Amazon SageMaker and Amazon Bedrock. The collaboration gives PolyAI and its enterprise customers easy access to a range of foundation models in a cloud-based environment that is hardened for security and safety.
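For a sense of what “easy access to a range of foundation models” means in practice, here is a minimal sketch of calling a hosted model through Bedrock’s Converse API via boto3. The model ID and prompt are illustrative placeholders; nothing here is drawn from the PolyAI announcement:

```python
import boto3

# Minimal sketch: one call to a hosted foundation model via Amazon Bedrock.
# The model ID and prompt below are illustrative placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "Where is my order #12345?"}]}
    ],
)

# The assistant's reply text sits inside the response envelope.
print(response["output"]["message"]["content"][0]["text"])
```

Swapping in a different foundation model is a one-line change to the model ID, which is precisely the flexibility a conversational AI vendor like PolyAI would want to pass through to its enterprise customers.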
Item 2: SoundHound Adds Voice to Amelia
SoundHound made a surprise acquisition of Amelia AI. The voice user interface specialist became a public company in April 2022 through a merger with a Special Purpose Acquisition Company (SPAC), at a market capitalization of $2.1 billion. SoundHound has since made inroads into the quick-service restaurant and automotive markets. Amelia is a 26-year-old Enterprise Intelligent Assistant provider with a global installed base of large enterprise customers spanning financial services, travel and telecommunications.
Expectations are rising for people to be able to talk with LLM-powered agents or assistants. They have a growing number of options, and we expect more mergers, acquisitions and strategic agreements to accelerate AI-powered IVR offerings.
It’s been a slow process. The disruption caused by large language models was anticipated years ago and will continue to ripple its way through the industry. Technology providers offering premises-based ASR and TTS were the first to be impacted. As the ripples expand outward, expect more roll-ups and collaborative initiatives as solution providers find ways to protect their existing moats and avoid being completely submerged by the waves.
Categories: Intelligent Assistants