Believe it or not, there are still people out there who think that one can deploy a voicebot by designing and building a text-based chatbot and then adding a layer of voice on top of it. Build the Natural Language Processing (NLP) model, build the text-based dialog rules, craft the prompts carefully, test the thing by having people type queries, make sure that it works well enough, and then add a voice layer to it (Automatic Speech Recognition and Text to Speech), and there you are: you now have two systems that you can deploy. One is a chatbot that you can launch on your website, as a text messaging service, on Facebook Messenger, or in your mobile app. The other is a voicebot that you can deliver as an Amazon Alexa skill, a Google Assistant action, a Samsung Bixby capsule, an IVR application, a mobile voice app (or feature within an app), etc.
(I remember a partners session in the offices of Amazon in New York City, back in 2017, when the Amazon Lex service was being introduced. Someone from the Amazon team earnestly declared, with everyone in the audience hanging on his words and taking what he was saying very seriously, that the beauty of Lex consisted essentially in enabling you to build your application once and then deliver it on multiple platforms and multiple channels, voice and text, so that one needed to maintain only one representation rather than deal with separate streams of work.)
Obviously, the proposition is fundamentally flawed and its flaws are easily provable.
For instance, voice-based exchanges are ephemeral: they are spoken, they are heard, and then they are no more, except, at best, as something remembered. In contrast, with a text-based chatbot, the text of the exchange persists, so that one can scroll up and down to refresh one’s memory.
The implications of this basic difference are several and not trivial, but the most consequential is this one: while one must consider cognitive load carefully when designing a voicebot, one need not do so nearly as much when designing a text-based chatbot.
Correction and error handling are also far more prevalent and complex with voice than with text. One certainly can mistype and misspell words, but if you don’t mistype or misspell your words, the chatbot will not misread what you wrote. In speech, you can pronounce words as well as you can, but there will always be a non-trivial chance that you will be misheard. Recovering from such “mishearings” can be complicated and is one of the least pleasant aspects of voicebot interactions.
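To make the point about “mishearings” concrete, here is a minimal sketch of the kind of recovery logic a voicebot needs and a text-only chatbot largely does not. It assumes the speech recognizer returns a confidence score between 0 and 1 with each result. The names Hypothesis and next_action and both threshold values are hypothetical, chosen only for illustration, and do not come from any particular platform.

```python
from dataclasses import dataclass

# Hypothetical thresholds for illustration; real systems tune these per application.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.45


@dataclass
class Hypothesis:
    text: str          # what the recognizer thinks was said
    confidence: float  # assumed recognizer score in [0, 1]


def next_action(hyp: Hypothesis) -> str:
    """Decide how the voicebot reacts to a single recognition result."""
    if hyp.confidence >= ACCEPT_THRESHOLD:
        # Confident enough: move on without bothering the user.
        return f"ACCEPT: {hyp.text}"
    if hyp.confidence >= CONFIRM_THRESHOLD:
        # Unsure: spend an extra turn confirming what was heard.
        return f"CONFIRM: Did you say '{hyp.text}'?"
    # Too unreliable to even confirm: ask the user to repeat.
    return "REPROMPT: Sorry, I didn't catch that. Could you say it again?"


if __name__ == "__main__":
    for hyp in (Hypothesis("Boston", 0.93),
                Hypothesis("Austin", 0.60),
                Hypothesis("Houston", 0.20)):
        print(next_action(hyp))
```

Every confirmation or reprompt costs the user an extra spoken turn, which is part of why recovery feels so much heavier in voice than in text.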
Keep the Conversation Going
Then there is the matter of time. With the chatbot, you can read, think, look stuff up, and then respond to the bot. You can take your time. In voice, you can’t take long pauses: you need to respond quickly to keep the flow going. In currently deployed voicebots, for instance, the user is given two to three seconds to respond; otherwise, they are threatened with conversational termination. The effect is that the user is continuously on edge throughout their voicebot conversation: they need to listen carefully, understand quickly, and respond promptly.
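The time pressure described above can be sketched as a simple no-input policy. This is a toy simulation, not any platform's actual behavior: the three-second timeout is taken from the “two to three seconds” mentioned above, while MAX_REPROMPTS and the function name run_turn are assumptions made for the example.

```python
NO_INPUT_TIMEOUT_S = 3.0  # assumed, based on the "two to three seconds" in the text
MAX_REPROMPTS = 2         # hypothetical limit before the bot gives up


def run_turn(response_delays):
    """Simulate one voicebot turn.

    response_delays holds, for each attempt, the number of seconds the user
    waited before speaking, or None if the user said nothing at all.
    Returns the list of events the bot produced. Running out of attempts
    or exceeding the reprompt limit ends the session.
    """
    events = []
    for attempt, delay in enumerate(response_delays):
        if delay is not None and delay <= NO_INPUT_TIMEOUT_S:
            # The user answered inside the listening window.
            events.append("heard user")
            return events
        if attempt >= MAX_REPROMPTS:
            # Reprompt budget is used up: stop asking.
            break
        events.append("reprompt")
    events.append("end session")
    return events


if __name__ == "__main__":
    print(run_turn([1.5]))               # prompt answered in time
    print(run_turn([5.0, 1.2]))          # too slow once, then in time
    print(run_turn([None, None, None]))  # silence until the bot gives up
```

A text chatbot has no equivalent of this loop: a user who goes away for ten minutes simply comes back to the same conversation.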
In text, you can be given links to click on and images to look at. And you, too, can do the same: copy and paste links and text, and upload images and videos. In voice, you can’t. You have your voice and the voice of the bot, and that’s it. (Multi-modal interfaces are another matter, with their own allotted set of advantages and challenges.)
In text, you can be a simpler version of yourself, or even adopt a whole new persona, in a way that is hard to do with voice, unless you are a trained artist or a well-oiled conman. (Imagine trying to mess around with someone you are talking to over the phone vs. via chat.) Text, in a sense, gives some space to the persona that you want to adopt for the moment. Moreover, the distancing from the self that text enables, and that voice does not, provides the user with a level of anonymity that empowers them to express things that they would feel uncomfortable expressing with voice. That’s one reason why flaming and bullying are endemic on text-based platforms (Facebook, Twitter), whereas it’s hard to imagine flaming and bullying sessions taking place on a voice-based medium such as Clubhouse or Discord, for instance.
Finding Your Voice
And then there are things that voice gives you that text can’t. Voice gives you a chance to be yourself, personality and all, and to express your emotions efficiently and subtly. Voice also lets you listen to rich meaning in a way that text can’t. A joke well delivered has no match in text. Listening to stories well told, interactive or not, has only a pale shadow of an equivalent in text.
Crucially, of course, with voice, you can engage in interactions eyes free and hands free: preparing food, cleaning, folding laundry, fixing a tire, painting the walls, writing, eating, driving, walking, lying on your bed with your eyes closed, taking a shower, staring at the fridge, and so on. In other words, you can use voice during the vast majority of the time that you are going about your day, doing your human things.
So which one is “better”: the voicebot or the chatbot? The question is, of course, a silly one, and the answer is boring, but nevertheless worth stating. A chatbot is good for complex interactions that require time, where pauses are necessary, and where rich-data information needs to be shared and integrated in the conversation. A chatbot that helps you book a trip or get an estimate on a car repair job is a perfect fit. A voicebot is of course the best medium when your eyes and hands are busy, but it is also good for short interactions, where you are looking for content that can be quickly delivered by voice (the weather, the time, the sum of 73 and 19), or content that cannot be rendered through text, for instance, rich-emotion content, such as music, or someone telling a story or a joke, or reading a poem.
Remember, as Marshall McLuhan taught us many decades ago, the medium is the message.
Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based company that builds tools for publishing sonic experiences, such as Alexa skills, Google actions, Bixby capsules, Microcasts, and crowd-sourced audio streams. Prior to Witlingo, Dr. Bouzid was Head of Product at Amazon Alexa and VP of Product at Genesys. Dr. Bouzid holds 12 patents in Human Language Technology, is Ambassador at The Open Voice Network, Editor at The Social Epistemology Review and Reply Collective (SERRC), and was recognized as a “Speech Luminary” by Speech Technology Magazine and among the Top 11 Speech Technologists by Voicebot.ai.
Hi Ahmed Bouzid, thanks for an informative blog. The main difference between a voicebot and a chatbot is that the voice assistant communicates only via voice. Voicebots are AI-powered software that uses interactive voice response so that customers can speak to the bot and communicate with it using natural language. Chatbots, on the other hand, are a text mode of conversation, so one has to scroll up and down the entire chat to find the exact conversation. That is the reason voicebots are preferred over chatbots.