On The Dogma of The Self-Contained Voicebot

In a previous essay, I proposed that Voice User Interface (VUI) design, as currently practiced, has been informed by two dogmas. The first is the dogma of emulation: the belief that VUI designers should aim to build voicebots that emulate how a human being interacts with another human being. For instance, the voicebot’s text-to-speech should sound as human as possible, its prosody should be crafted to convey the right emotion at the right time, it should open interactions with pleasant greetings, it should speak “naturally” and “conversationally,” and so forth. I argued that this dogma not only sets up the designer for failure, inflating the human user’s expectations only to deflate them as soon as the voicebot makes an error that a human being would not make (for instance, failing to understand something the human said and believes the voicebot should have easily understood), but also needlessly limits the designer’s ability to innovate: to use non-human sounds, to establish new protocols, to adopt new patterns and strategies, all focused on one thing: delivering the most effective voice interface, one that enables the human user to get the job of solving their problem done using the voicebot.

In this essay, I propose to highlight a second dogma that I believe is inhibiting effective voicebot design: what I call The Dogma of The Self-Contained Voicebot. This dogma holds that the aim of the VUI designer, deploying the full power of their talent, skills, and knowledge, should be to deliver voicebots that any human user can use effectively, even a user who comes to the voicebot cold, not knowing what it does or why it was created. According to this dogma, the designer should build a “robust” voicebot that can take a user who arrives almost as a blank slate and guide them to success. It is, in other words, the responsibility of the VUI designer to ensure that any human user can learn what the voicebot does in real time, on the fly, in the heat of the exchange.

An example of a rule that flows directly from this dogma is the following: ‘Never open a voicebot conversation by simply asking the user: “How may I help you?”’ Instead, the proposed best practice advises us to first give the human user a general sense of what the voicebot is about and then provide them with a list of options to select from. For instance: “Welcome to Dominion One. I am here to help you with your banking needs. Which of the following do you want me to help you with: check your balance, transfer money, or something else?”
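
To make the menu device concrete, here is a minimal sketch of such an opening in Python. Everything in it (the MENU_OPTIONS mapping, the synonym lists, the simple substring matching) is an illustrative assumption standing in for a real speech-recognition grammar, not a reference implementation:

```python
# A minimal sketch of a menu-driven opening. The option names and
# synonym lists are hypothetical stand-ins for phrasings that real
# UX research would surface.

MENU_PROMPT = (
    "Welcome to Dominion One. I am here to help you with your banking "
    "needs. Which of the following do you want me to help you with: "
    "check your balance, transfer money, or something else?"
)

# Map each menu option to a few phrasings a caller might plausibly use.
MENU_OPTIONS = {
    "check_balance": ["check your balance", "check my balance", "balance"],
    "transfer_money": ["transfer money", "transfer", "send money"],
    "something_else": ["something else", "other", "none of those"],
}

def match_menu_option(utterance: str) -> str | None:
    """Return the first option whose phrasing appears in the utterance."""
    text = utterance.lower()
    for option, phrasings in MENU_OPTIONS.items():
        if any(phrase in text for phrase in phrasings):
            return option
    return None  # nothing matched: reprompt with the menu

print(MENU_PROMPT)
print(match_menu_option("I want to check my balance"))  # -> "check_balance"
```

The power of the device is precisely its narrowness: because the caller has just heard the full set of choices, recognition reduces to picking among a handful of known phrasings.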

The Simplicity of Voicebot Menus Enables Swift, Accurate Conversations

Before I elaborate on why I believe that this dogma is not only unnecessary but actually undermines the very goal it is earnestly trying to deliver on (the goal of usability), let me point out two things. First, I am not a detractor of clear and simple voicebot menus. On the contrary, I am a fan of simplicity, and voicebot menus are a powerful instrument that, when crafted with care, can help the human user move swiftly through a voicebot conversation. Moreover, I am attracted to the simple menu device precisely because menus are not how human beings talk to other human beings, which, for me, is a refreshing violation of the first dogma, the dogma of human emulation.

Which brings me to my second point: although I caution against the dogma of emulation, I do not hold the opposite dogma of never emulating human behavior under any circumstances. If there is a dogma or principle that I follow, it is the one that cautions against all dogmas, against any and all rigid rules that will trap us and force us to act against our ultimate goal of delivering effective voicebots for the situation that we are designing for.

And so, against the often-cited best practice of ‘Never open your voicebot conversation by simply asking the user: “How may I help you?”’ I propose the following best practice: ‘Whenever possible, open your voicebot conversation by simply asking the user: “How may I help you?”’

Why would I say such a heretical thing? Isn’t this how human beings open their conversations after they announce themselves? And if so, doesn’t this emulation fly in the face of the first dogma that I am denouncing?

The answer is twofold. First, in countering the dogma of emulation, I am, again, not condemning instances where the designer emulates the behavior of a human being, but rather the dogma itself, which strives to emulate a human being always, or whenever one can. In contrast, I propose that the designer should lean on the human-to-human model whenever they feel it is appropriate, but do so not as a matter of principle or dogma but opportunistically, when the emulation will lead to a felicitous interaction.

Why Voicebots Should Engage with Open-Ended Questions

But more importantly, I propose the best practice of having the voicebot open with the open-ended question ‘How may I help you?’ for the following reason: for a voicebot that starts with that bold open question to succeed, the human users who come to it must come with a set of wants and goals that the voicebot is ready to understand and deliver on. And for that to happen, that is, for the voicebot to systematically encounter only humans who come to it with the expected, limited set of questions that it was built to handle successfully, two sets of crucial activities that are not within the VUI designer’s bailiwick must take place: (1) voice UX research on who the users of the voicebot will be and what problems those users wish to solve, and (2) post-launch socialization of the voicebot to ensure that those for whom it was built are aware of its existence, what its purpose is, and what they should expect it to help them with.
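
To illustrate what “ready to understand and deliver on” might mean in practice, here is a hedged sketch of an open-ended greeting backed by a closed set of intents that, by assumption, the UX research in step (1) has identified ahead of launch. The intent names, example phrases, and word-overlap scoring are all illustrative placeholders for a real NLU model:

```python
# Sketch: an open-ended opening backed by a closed intent set.
# Assumes prior UX research established exactly these intents; the
# phrase lists and scoring heuristic stand in for a trained NLU model.

EXPECTED_INTENTS = {
    "check_balance": ["what is my balance", "how much money do i have"],
    "transfer_money": ["transfer money to my savings", "send money to someone"],
    "report_lost_card": ["i lost my card", "my card was stolen"],
}

def classify(utterance: str) -> str:
    """Pick the expected intent sharing the most words with the utterance."""
    words = set(utterance.lower().split())
    best_intent, best_score = "out_of_scope", 0
    for intent, examples in EXPECTED_INTENTS.items():
        score = max(len(words & set(example.split())) for example in examples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent

print("How may I help you?")
print(classify("How much money do I have in checking"))  # -> "check_balance"
print(classify("what's the weather like today"))         # -> "out_of_scope"
```

The open question is only as bold as the research behind EXPECTED_INTENTS: if the actual traffic matches that set, the opening is swift and accurate; if it does not, no amount of in-dialog cleverness will rescue the conversation.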

In other words, the mark of a great voicebot, one that will deliver value to as many humans as can benefit from that value, is a voicebot that can boldly open its engagement with the human being by asking the open question: ‘How may I help you?’ A voicebot that cannot afford to ask that question is a voicebot that is usually failing on one, or both, of the following fronts: (1) the voicebot is engaging with people who come in with the expected, closed set of questions and problems to solve, but it is not able to understand what they are saying or fails to help them solve their problems; or (2) the voicebot is engaging with people who come in with questions and problems that it was not designed to field in the first place. Only the first of these two is the fault of the designer. The second problem, the one that accounts for the vast majority of voicebot failures and that leads VUI designers to avoid the open-question conversation opening, is not the fault of the designer but rather that of the voicebot’s Product Manager, who is supposed to ensure that: (1) solid UX research is conducted so that we know what the people being targeted will ask for; (2) that research is taken seriously both by the Product Manager who writes the functional requirements and by the VUI designer who designs the voicebot; and (3) the voicebot is marketed and surfaced to the users for whom it was designed and built in the first place.
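
The two fronts can at least be told apart after launch. The sketch below is a hypothetical triage of call logs that separates in-scope failures (the designer’s front) from out-of-scope requests (the Product Manager’s front); the log fields and labels are assumptions for illustration only:

```python
# Hypothetical post-launch triage: separate in-scope failures (design
# and recognition problems) from out-of-scope requests (research and
# marketing problems). The field names are illustrative assumptions.

from collections import Counter

# Each record: was the request in the designed scope, and did the task complete?
call_log = [
    {"in_scope": True,  "completed": True},
    {"in_scope": True,  "completed": False},   # front (1): design failure
    {"in_scope": False, "completed": False},   # front (2): scope/marketing failure
    {"in_scope": False, "completed": False},
]

def triage(log):
    """Count successes, in-scope failures, and out-of-scope requests."""
    counts = Counter()
    for call in log:
        if not call["in_scope"]:
            counts["out_of_scope"] += 1      # the Product Manager's front
        elif call["completed"]:
            counts["succeeded"] += 1
        else:
            counts["in_scope_failure"] += 1  # the designer's front
    return counts

print(triage(call_log))
# Counter({'out_of_scope': 2, 'succeeded': 1, 'in_scope_failure': 1})
```

If the out_of_scope count dominates, the remedy is better research and better post-launch messaging, not a more defensive opening prompt.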

In a nutshell, I propose a rejection of the dogma of the “Self-Contained Voicebot,” which puts the burden of delivering a usable and robust voicebot almost wholly on the shoulders of the VUI designer, because I believe that the only way to deliver a great voicebot is by elevating the (almost always) neglected activities and findings of both UX research and post-launch marketing. Build a voicebot that can consistently handle “How may I help you?” and you will know that you have pinned down exactly who your target users are and what problems they want to solve, that you have designed your voicebot well, and that you have messaged the voicebot’s existence, how to engage with it, and what it was built to do, to exactly those who will benefit the most from giving the voicebot a chance to help them help themselves.


Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that builds products and solutions that enable brands to engage with their clients and prospects using voice, audio, and conversational AI. Prior to Witlingo, Dr. Bouzid was Head of Alexa’s Smart Home Product at Amazon and VP of Product and Innovation at Angel.com. Dr. Bouzid holds 12 patents in the Speech Recognition and Natural Language Processing field and was recognized as a “Speech Luminary” by Speech Technology Magazine and as one of the Top 11 Speech Technologists by Voicebot.ai. He is also an Open Voice Network Ambassador, heading their Social Audio initiative, and an author at Opus Research. His new book, The Elements of Voice First Style (O’Reilly Media, 2022), is co-authored with Dr. Weiye Ma.



