Voicebots: The New Channel of First Resort

When the Amazon Echo launched in November 2014, it came with what may now feel like a meager handful of features: weather, the time, music, and answering questions. The total number of skills available at the time was zero (the Alexa Skills Kit was still several months away, launching in June 2015), and basic features we now take for granted (timers, reminders, home automation, asking for local information) were yet to be deployed. And yet, from Day One, to borrow Amazon's famous tagline, the Echo was beloved by its users. It consistently scored an unprecedented 4.5 stars out of 5 across tens of thousands of enthralled users who found the new gizmo not only useful but delightful and novel, and who felt that with it they were participating in the ushering in of a new technological era.

But its novelty was not what made the Echo the smash hit it turned out to be. After all, an endless parade of novel, slick, attractive, exciting, imagination-capturing gadgets comes and goes every year under the glitzy lights of Las Vegas during the enormous CES trade show.

No, the main reason the Echo was a smash hit was its ability to do very well what it claimed it could do: quickly answer questions without the user having to do anything other than speak and listen. Sure, Siri was already doing that, as was Google's Assistant, but with both of them the user had to fetch their phone, swipe, and tap, and only then speak. As the Echo showed, not having to do any of that (being able to just speak without interrupting whatever you were doing, be it typing or preparing food) was not a mere incremental UX enhancement; it was a game changer.

We are a whole world away today from November 2014, and yet the basic value proposition that makes smart speakers so compelling remains: being able to ask for something, or do something, with minimal disruption to your activity flow. You can ask for the news while slicing chicken, start a podcast while folding your laundry, ask what time the post office closes while keeping your eyes on the screen and your hands on the keyboard as you craft an important email, start your car outside while putting on your shoes, and so on.

What’s the Big Deal about “Micro-optimizations”?
The deal is big because these small savings enable a way of being that is superior to one in which you must pause and pick up your smartphone to do what you want to do. The cost of reaching for the phone is twofold: physical and psychological. The physical work alone is not small: putting down the knife as you pause your chicken cutting, thoroughly washing your hands (it's raw chicken!), drying them, locating your phone, unlocking it, tapping and swiping, reading what you need to read, putting the phone down, going back to the kitchen, and picking up the knife to continue cutting. Compared to all of that, speaking and listening is an order of magnitude less work.

But in addition to the sheer effort cost, there is also a context-switching cost: we humans don't like to break our flow. We are happiest when we are in the zone, whether preparing food while listening to a podcast, folding laundry while watching an episode of our favorite TV series, reading, potting plants, writing, or playing the piano. We enjoy being absorbed in whatever we are doing and find interruptions, forced or willed, jarring, especially when, once the interruption ends, we just can't get back into the zone we left.

Keeping that value proposition in mind, businesses should start thinking of ways to engage potential buyers and serve current customers through this distinctly new channel, one that delivers value in a wholly novel way: through eyes-free, hands-free, voice-based conversational interactions.

Two Obvious General Use Cases:

I. In your audio commercial, whether it runs on traditional radio, a podcast, or a live audio stream, why not add a call to action that leverages the smart speaker likely to be within your listeners' earshot (it is estimated that by the end of this year, 50% of broadband homes will have one)? After all, if they are listening to something, chances are their hands and their eyes are busy. Any of the usual calls to action will require them to remember something (for instance, the name of the company or the website URL) or write something down (the phone number), which means that if they fail to do either, your ad will have fallen on ears that may as well have been deaf. For instance, instead of the call to action being, "To find out more, go to www.mtrustcompany.com, that's www.mtrustcompany.com," more actionable, and immediately so, would be, "To learn more, ask your speaker to 'Talk to Millennium Trust Company.'"

II. Imagine a future (and it's probably here, just not everywhere yet) where every physical product you buy, whether it's a wifi router, an Ikea furniture kit, a TV set, or a bottle of wine from Costco, has an Amazon Alexa, Google Assistant, or Samsung Bixby call to action on the packaging, so that buyers can engage the product and its brand simply by following the invocation printed on the package. Troubleshooting the wifi router? Just ask Alexa (e.g., "Alexa, launch the Roqos Helper"). Want a bit of guidance while assembling the Ikea bookshelf? Easy: just ask Google with the specific call to action. Want to register the warranty for your new TV set? Bixby can help. Want to find out more about that bottle of wine? No problem: just ask the voicebot as directed on the label. In essence, in addition to being a branding opportunity, packaging can now be extended into an interaction jump-off point that lets the user engage the brand at the very moment of interest.
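To make that packaging-to-voicebot hand-off concrete, here is a minimal sketch of what the backend of such a skill might look like, using the Alexa Skills Kit SDK for Python deployed as an AWS Lambda function. The "Roqos Helper" name follows the example above; the TroubleshootIntent, its utterances, and the spoken prompts are hypothetical and would be defined by the brand in its skill's interaction model.

```python
# Minimal sketch of a brand's custom Alexa skill backend (ask-sdk-core),
# reachable via "Alexa, launch the Roqos Helper". Intent names and prompts
# below are illustrative assumptions, not an actual Roqos product.
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_request_type, is_intent_name


class LaunchRequestHandler(AbstractRequestHandler):
    """Fires when the user says the invocation printed on the package."""

    def can_handle(self, handler_input):
        return is_request_type("LaunchRequest")(handler_input)

    def handle(self, handler_input):
        speech = ("Welcome to the Roqos Helper. "
                  "Say 'my wifi is down' and I'll walk you through a fix.")
        return (handler_input.response_builder
                .speak(speech)
                .ask("How can I help with your router?")
                .response)


class TroubleshootIntentHandler(AbstractRequestHandler):
    """Handles a hypothetical TroubleshootIntent ('my wifi is down', etc.)."""

    def can_handle(self, handler_input):
        return is_intent_name("TroubleshootIntent")(handler_input)

    def handle(self, handler_input):
        speech = ("First, unplug the router, wait ten seconds, "
                  "and plug it back in.")
        return (handler_input.response_builder
                .speak(speech)
                .ask("Did that restore your connection?")
                .response)


sb = SkillBuilder()
sb.add_request_handler(LaunchRequestHandler())
sb.add_request_handler(TroubleshootIntentHandler())

# Entry point wired to the skill in the Alexa developer console.
lambda_handler = sb.lambda_handler()
```

The same pattern (a launch handler plus a handful of task-specific intents) covers the warranty registration, assembly guidance, and wine-label scenarios above; only the interaction model and the prompts change.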

In a nutshell, because it is so easy to do, users are starting to develop the habit of first asking their voicebot, without having to stop whatever they are doing, and listening to the answer; only if they don't get what they want will they move on to the more traditional channels for help (visiting the website, starting a chat session, sending an email, or calling). In other words, the smart speaker will soon become a "Channel of First Resort," and buyers and customers will expect to be able to receive help by just asking.


Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based company that builds tools for publishing sonic experiences, such as Alexa skills, Google actions, Bixby capsules, microcasts, and crowd-sourced audio streams. Prior to Witlingo, Dr. Bouzid was Head of Product at Amazon Alexa and VP of Product at Genesys. He holds 12 patents in human language technology, is an Ambassador at The Open Voice Network and an Editor at The Social Epistemology Review and Reply Collective (SERRC), and was recognized as a "Speech Luminary" by Speech Technology Magazine and among the Top 11 Speech Technologists by Voicebot.ai.


