In recent years, AI-powered voice assistants have made remarkable strides. Automatic Speech Recognition (ASR) now rivals human-level understanding, large language models (LLMs) generate intelligent responses, and text-to-speech (TTS) has become more natural than ever. Yet, many voicebots still sound robotic, struggle with emotional nuance, and fail to adapt fluidly to conversation, undermining the customer experience.
Voice AI start-up Sesame is tackling this challenge head-on, pioneering a new generation of TTS that makes AI voices more lifelike and expressive. By combining advanced TTS models with conversational memory and emotional intelligence, Sesame strives to close the gap between synthetic and human speech. While the company's primary focus is consumer personal assistants and, eventually, smart glasses, its advancements in TTS could also open new possibilities for enterprise voice assistants.
How Is Sesame Different from Traditional TTS?
Most AI-powered voice bots today follow a simple pipeline:
- ASR transcribes the customer’s speech into text.
- An LLM processes the text and generates a response.
- A TTS engine converts the response into spoken words.
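The three-stage pipeline above can be sketched as a simple turn loop. The function names here (`transcribe`, `generate_reply`, `synthesize`) are placeholders standing in for whichever ASR, LLM, and TTS services you actually use, not any vendor's real API:

```python
# Minimal sketch of the classic voicebot pipeline: ASR -> LLM -> TTS.
# All three stage functions are hypothetical placeholders.

def transcribe(audio: bytes) -> str:
    """ASR stage: speech audio -> transcript (placeholder)."""
    return "I'd like to check my order status."

def generate_reply(transcript: str) -> str:
    """LLM stage: transcript -> response text (placeholder)."""
    return f"Sure, I can help with that. You asked: {transcript}"

def synthesize(text: str) -> bytes:
    """TTS stage: response text -> audio (placeholder)."""
    return text.encode("utf-8")  # stand-in for real audio bytes

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn, chaining the three stages."""
    transcript = transcribe(audio_in)
    reply_text = generate_reply(transcript)
    return synthesize(reply_text)
```

Note that in this classic design each stage is stateless: the TTS stage sees only the final response text, which is exactly why it cannot adapt its delivery to the conversation.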
While ASR and LLMs have seen dramatic advancements, TTS remains the weakest link. Even a well-crafted AI response can fall flat if delivered in a monotone, robotic voice.
Sesame is solving this by replacing traditional TTS with a system that:
- Remembers recent conversation context to adjust tone dynamically.
- Modulates pitch, pauses, and rhythm to reflect emotional cues.
- Sounds more natural and adaptive rather than like a generic synthesized voice.
For example, if a customer says:
“I’m really frustrated. This is the third time I’m calling!”
A traditional TTS bot would deliver its reply in the same neutral tone it would use for a simple greeting.
Sesame’s model, however, would soften its voice, slow its pace, or adjust intonation to convey empathy—enhancing the overall customer experience.
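To make the contrast concrete, here is a toy illustration of emotion-aware delivery. This is not Sesame's actual method (their model learns prosody end to end rather than from keyword rules); it simply maps a crude frustration heuristic to prosody settings of the kind mainstream TTS engines expose via SSML:

```python
# Toy illustration (NOT Sesame's model): pick speaking-rate and pitch
# adjustments from simple keyword cues in the customer's utterance.
# The cue list and parameter values are illustrative assumptions.

FRUSTRATION_CUES = ("frustrated", "third time", "angry", "unacceptable")

def choose_prosody(customer_utterance: str) -> dict:
    """Return prosody settings for the bot's next reply."""
    text = customer_utterance.lower()
    if any(cue in text for cue in FRUSTRATION_CUES):
        # Slow down and lower pitch slightly to convey empathy.
        return {"rate": 0.85, "pitch_semitones": -2.0, "style": "empathetic"}
    return {"rate": 1.0, "pitch_semitones": 0.0, "style": "neutral"}
```

A rule-based layer like this is brittle; the point of a learned conversational speech model is to infer these adjustments from context instead of hand-written cues.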
How Sesame AI Works
Sesame’s Conversational Speech Model (CSM) is built using a transformer-based architecture (the same AI technology that powers GPT models), but instead of just processing text, it learns from both text and speech to create more lifelike responses.
Here’s how it works in a real-time customer service scenario:
- Customer speaks → ASR converts it to text (e.g., using Deepgram, Whisper, etc.).
- LLM generates a response based on customer intent.
- Sesame AI transforms the response into expressive speech that matches the conversation’s tone.
Unlike traditional TTS, Sesame remembers the past 2 minutes of conversation and adjusts dynamically, so every response sounds contextually appropriate. The Sesame CSM was trained on a large dataset of audio, mostly of people conversing in English. It learned to mimic the tone, inflections, and prosody of people talking about many different subjects and expressing a wide range of emotions.
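A rolling two-minute context window like the one described can be sketched as follows. This is an assumed bookkeeping structure for illustration, not Sesame's implementation; the timestamps are in seconds from the start of the call:

```python
# Sketch of a rolling conversation-context buffer that keeps roughly
# the last two minutes of dialogue, per Sesame's description.
# The data layout and eviction rule are illustrative assumptions.

from collections import deque

class ContextWindow:
    def __init__(self, max_seconds: float = 120.0):
        self.max_seconds = max_seconds
        self.turns = deque()  # entries: (start_time, duration, text)

    def add_turn(self, start_time: float, duration: float, text: str) -> None:
        self.turns.append((start_time, duration, text))
        self._evict(now=start_time + duration)

    def _evict(self, now: float) -> None:
        # Drop turns that ended more than max_seconds before `now`.
        while self.turns:
            start, duration, _ = self.turns[0]
            if now - (start + duration) > self.max_seconds:
                self.turns.popleft()
            else:
                break

    def context_text(self) -> str:
        """Concatenated text of all turns still inside the window."""
        return " ".join(text for _, _, text in self.turns)
```

The model would then condition each synthesized response on `context_text()` (or, in practice, on the audio and text of those turns) rather than on the reply text alone.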
How This Could Benefit Customer Service AI
If you’re building voice-based AI agents for customer support, sales, or virtual assistants, the way the AI speaks directly impacts customer perception. Customers engage better with AI that sounds human rather than robotic. A voice that adjusts tone and emotion based on context makes interactions feel more authentic and personal. If customers feel they are talking to an empathetic AI, they may be less likely to escalate to a human agent, reducing operational costs. This human-like interaction not only enhances the customer experience but also drives efficiency, making AI a more valuable asset to businesses.
Advancing AI Voicebots with Human-Like Speech
The Sesame Conversational Speech Model represents an advancement in the state-of-the-art for TTS. However, it’s still just one component in the overall technology stack for an AI Voicebot. You’ll still need to select a capable ASR to convert speech to text and rely on your LLM of choice for understanding intent and generating responses. But the good news is that you can integrate Sesame into your existing voice AI system without overhauling everything. The team at Sesame also stated plans to open-source key components of their model.
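The integration point described above, swapping the TTS stage without touching ASR or the LLM, is easiest when the voicebot codes against a small TTS interface. The sketch below shows that pattern; both engine classes are hypothetical stand-ins, not real Sesame or vendor APIs:

```python
# Sketch of a swappable TTS interface. The rest of the voicebot depends
# only on the TTSEngine protocol, so replacing a legacy engine with an
# expressive, context-aware one is a one-line change at construction time.
# Both concrete engines here are hypothetical placeholders.

from typing import Protocol

class TTSEngine(Protocol):
    def synthesize(self, text: str, context: str = "") -> bytes: ...

class LegacyTTS:
    def synthesize(self, text: str, context: str = "") -> bytes:
        return text.encode("utf-8")  # ignores conversation context

class ExpressiveTTS:
    def synthesize(self, text: str, context: str = "") -> bytes:
        # A contextual model would condition prosody on recent dialogue;
        # here we just tag the output to show the context is consumed.
        prefix = b"[contextual] " if context else b"[neutral] "
        return prefix + text.encode("utf-8")

def respond(engine: TTSEngine, text: str, context: str = "") -> bytes:
    """The voicebot's speech step, written against the protocol only."""
    return engine.synthesize(text, context)
```

Because `respond` depends only on the protocol, upgrading from `LegacyTTS` to `ExpressiveTTS` requires no changes elsewhere in the pipeline.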
Sesame’s CSM represents a possible shift toward more human-like AI voice interactions in customer service. For businesses relying on AI-driven voice automation, this technology could soon offer a way to improve customer engagement, reduce call escalations, and create more natural AI interactions.
You can try out the Sesame demo here.