It’s Too Early for VUI Taboos

One can usually assess the health of a field of practice by examining the rigidity of its dogmas and the willingness of its practitioners, experienced or novice, to entertain challenges to established, received notions.

The Voice User Interface (VUI) design space, as a field with an active community of practitioners and a critical mass of activity, is between 20 and 30 years old. If we point to the launch of Siri in 2011 and of the Amazon Echo and smart speakers in 2015-2016 as milestones that mark a new phase of interest in the design of VUIs, we are talking about a field of practice that is less than 10 years old. Young compared to other disciplines (mainstream high-level computer programming, for instance, has a solid 50 years under its belt), the field is nevertheless, in my opinion, far more dogmatic and less flexible than it should be, given its tender age.

In this essay, I would like to challenge five examples of orthodoxies that I believe are premature. I issue the challenge not to state that the orthodoxies are fundamentally flawed, because they are probably not, nor to accuse those who subscribe to them, or tolerate them, of being knowing or unwitting agents of a stodgy establishment, because I am sure that most of them are not, but rather to see if anything interesting can be elicited by agitating against some sacred cows.

1. “No” to personas
I start with the one that I personally find the most irksome: the proposition that when I am designing a voicebot, I need to draw up a “persona” — i.e., a portrait of a fully fleshed-out human being, whether in anticipation of what a human user of the voicebot will say or do, or as a handy technique for fashioning a consistent, coherent sound-and-feel for the voicebot.

My objection against the first is straightforward: I should not be modeling my voicebot to be optimally used by a specific kind of person, because my voicebot will be used by a whole spectrum of human beings. Even if my users do indeed share many characteristics with each other, the reality is that the prototypical, average user — whether a senior, a toddler, a professional woman, a college student, or any other such category — is an illusion. No such person exists. What is real, however, and what should preoccupy me as a designer, is the set of relevant problems that I need to surface to deliver an effective voicebot. If most of my users are going to be seniors who have either never used a smart speaker or have only just started using one, then let me design with that in mind, rather than conjure up a 73-year-old widow named Nancy, who lives in Fountain Hills, Arizona, who has a dog named Sandy, and who has been a mean backgammon player for more than 40 years.

I also oppose modeling my voicebot by conjuring up a human being with a name, an age, an address, a marital status, and a hobby. Instead, it would be more useful for me to spend my time and energy understanding the role of the voicebot and the brand of the company that it represents. Say the voicebot is going to help users manage their water service account: it will help them obtain information about their outstanding bill and payment options, and perhaps even enable them to make a payment while engaged with the voicebot. The brand in this case needs to be that of the water services company the voicebot represents: probably a to-the-point, transactional, let’s-get-this-over-with-so-that-you-can-go-back-to-your-life kind of feel. Do I need anything else to carry on with my design work? Probably not.

2. “Yes” to open-ended questions
Here, the taboo I would like to challenge is the rule that the voicebot should never open its dialog with a human being with a flat “How can I help you?” Instead, we are told, it should tell the user what they can say, so that they know how to react when the voicebot cedes the conversational turn to them. Now, I have nothing against telling users what to say — and in fact, #4 below is all about promoting that. What I do challenge, however, is the blanket underlying proposition that we should design our voicebots on the assumption that the users who engage with them have no idea why the voicebot was created or why they should be engaging with it, and therefore need to be told, in the heat of their conversational engagement with the voicebot, what they can say and do.

But why are we designing for such an unnatural user state? When I engage any customer service agent, for instance, the agent greets me and then asks something along the lines of: “How can I help you today?” Why is that the case? Well, simply because: (a) when I reach out for help, I do so with a specific problem that I want to solve, and the agent on the other side knows that; and (b) when I reach out for help, I expect the person who is engaging with me to be able to help me in some specific ways — or at the very least, to help me with the one specific problem that I need them to help me with. Given that the person has a specific role that they are playing (say, the role of bank teller), when I kick off the conversation with them, we are both operating within the context of our respective roles: I am the bank customer with a bank problem; they are the bank agent with the knowledge and the tools to help me.

The main driver behind my objection here is not that the voicebot is not acting like a human being. In fact, I am a strong supporter of the proposition that voicebots should not aim to emulate human beings. It is, rather, that such a design frame of mind betrays perhaps the biggest cause of voicebot failure: the malignant neglect that owners of voicebots (say, the customer care unit in a company) inflict upon their voicebots. Rarely is an effort made to make the user aware of the existence of the voicebot, let alone of what the bot can do and how it can help users solve problems, before the user has engaged with it. The result: if the voicebot says, “How can I help you?” the user will indeed not know what to say. In other words, by subscribing to this rule against open-ended questions, one is in effect unwittingly tolerating the voicebot neglect that is at the root of most voicebot usability problems.

3. “Yes” to directed dialogs
A directed dialog is one where the voicebot is in control of the conversation, step by step, from start to finish. A good example of a directed dialog is a customer satisfaction survey: the voicebot greets you, asks you the first question, you answer it, it asks the next one, and so on, until the session is done. The opposite of a directed dialog is a “mixed-initiative” dialog, where the user may at any point decide to ignore the voicebot’s question and take the conversation in a different direction. For instance, if I am speaking with a voicebot that is helping me book a table for lunch, I may give the voicebot the number of people in the party rather than answer the question, “What day are you making your reservation for?” A robust mixed-initiative voicebot would be able to handle the deviation without a pause.
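To make the contrast concrete, here is a minimal sketch in Python of the two styles for a toy table-reservation bot. Everything in it (the slot names, the keyword extractor, the function names) is illustrative and assumes no particular VUI framework, speech recognizer, or NLU engine.

```python
# Toy table-reservation bot contrasting a directed dialog with a
# mixed-initiative one. Purely illustrative; no real VUI stack assumed.

SLOTS = ["date", "time", "party_size"]

PROMPTS = {
    "date": "What day are you making your reservation for?",
    "time": "What time would you like?",
    "party_size": "How many people are in your party?",
}

def extract_slots(utterance):
    """Toy keyword-based slot extractor; a real system would use an
    NLU model here."""
    found = {}
    days = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}
    for word in utterance.lower().replace(",", " ").split():
        if word in days:
            found["date"] = word.capitalize()
        elif word == "noon":
            found["time"] = "12:00"
        elif word.isdigit():
            found["party_size"] = int(word)
    return found

def directed_dialog(ask):
    """Directed: ask for each slot in a fixed order, and treat each
    answer as filling exactly the slot that was just asked about."""
    return {slot: ask(PROMPTS[slot]) for slot in SLOTS}

def mixed_initiative_dialog(ask, max_turns=10):
    """Mixed-initiative: parse every utterance for any slot the user
    may have volunteered, and re-ask only for what is still missing."""
    filled = {}
    for _ in range(max_turns):
        missing = [slot for slot in SLOTS if slot not in filled]
        if not missing:
            break
        filled.update(extract_slots(ask(PROMPTS[missing[0]])))
    return filled

if __name__ == "__main__":
    # Console stand-in for the speech channel. Answering the first
    # question with "Friday at noon for 4" fills all three slots at once.
    print(mixed_initiative_dialog(lambda prompt: input(prompt + " ")))
```

Even in this toy, the cost difference is visible: the directed version needs no parsing beyond the answer to the question it just asked, while the mixed-initiative version must consider every slot at every turn, and each such consideration is a grammar, a test case, and a failure mode to design for.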

Pulling off a mixed-initiative voicebot is hard, and it consumes a lot of design energy. In the real world, energy is scarce, money is tight, and deadlines are often hard. There rarely is a true need for building a mixed-initiative dialog. A simple directed dialog, tightly designed, well tested, beta tested, hardened, and deployed within budget, is going to be a far more usable voicebot than a sophisticated but flabby one that tries to do too much.

4. “Yes” to menus
Although I am a fan of starting a dialog boldly with “How can I help you?” I am also a fan of the explicit “You can say A, B, or C,” either to help the user if they don’t know what to say from the outset, or later on in the flow, once a specific task has been picked. Smoothly traversing a well-structured menu, where the choices are clear and the options natural, is a highly satisfying experience because it gets the user in and out quickly, and few things are more satisfying than not having to engage with a voicebot one second longer than one needs to.
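As a concrete illustration of how #2 and #4 fit together, here is a minimal sketch, with hypothetical option and function names, of a bot that opens with “How can I help you?” and falls back to an explicit “You can say A, B, or C” menu only when the user isn’t understood:

```python
# Minimal sketch: open-ended greeting first, explicit menu as the
# fallback. Option names and function names are hypothetical.

OPTIONS = ["billing", "payments", "start service", "stop service"]

def menu_prompt(options):
    """Render an explicit 'You can say A, B, or C' prompt."""
    return "You can say " + ", ".join(options[:-1]) + ", or " + options[-1] + "."

def open_then_menu(listen, max_retries=2):
    """Open boldly; restate the options on a no-match; hand off to a
    human (return None) rather than loop forever."""
    choice = listen("How can I help you?").strip().lower()
    retries = 0
    while choice not in OPTIONS and retries < max_retries:
        choice = listen("Sorry. " + menu_prompt(OPTIONS)).strip().lower()
        retries += 1
    return choice if choice in OPTIONS else None

if __name__ == "__main__":
    # Console stand-in for the speech channel.
    print(open_then_menu(lambda prompt: input(prompt + " ")))
```

Note that a user who already knows what they want never hears the menu at all; the explicit options surface only for the user who genuinely needs them, which is the sense in which the open-ended opening of #2 and the menus of #4 complement rather than contradict each other.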

5. “No” to human-sounding voices
Here, I am a bit dogmatic about text-to-speech (TTS) systems that sound human in the context of voicebots. I don’t mind fully human prompts (that is, audio that was recorded by a human), I don’t mind TTS that sounds human in the context of reading long-form text, and I am partial to TTS that knows its place and does not try to be more than it is. What I do mind is TTS, in the context of voicebots, that tries hard to sound and act human. For instance, I don’t like the tinkering with emotive prosody, especially the sort that is devoid of functional value. Equally irritating to me are voicebots that senselessly mimic social behavior, such as wishing me a good day, or saying my name (unless, again, saying my name fulfills some discernible functional purpose), or expressing sympathy or regret, or giving me praise. I find every bit of it distracting, irritating, and unnecessary.

The VUI design space is way too young to be setting itself in its ways. These should be heady times, when we are trying things out, experimenting, pushing the envelope, smashing idols, spawning new concepts, introducing novel techniques, and erecting competing schools of thought that relish every opportunity to duke it out. This is the time to challenge rules, hard or soft, to question best practices, new or well established, to scrutinize guidelines, commonsensical or strange, and to see what can resist the challenges, and what falls apart and makes way for something more robust — in other words, more useful. Unless we do this, we will earn the mediocrity that is sure to settle in with the passage of quiet time.


Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that builds products and solutions that enable brands to engage with their clients and prospects using voice, audio, and conversational AI. Prior to Witlingo, Dr. Bouzid was Head of Alexa’s Smart Home Product at Amazon and VP of Product and Innovation at Angel.com. Dr. Bouzid holds 12 patents in the speech recognition and natural language processing field and was recognized as a “Speech Luminary” by Speech Technology Magazine and as one of the Top 11 Speech Technologists by Voicebot.ai. He is also an Open Voice Network Ambassador, heading their Social Audio initiative, and an author at Opus Research. Some of his articles and media appearances can be found here and here. His new book, The Elements of Voice First Style, co-authored with Dr. Weiye Ma, is slated to be released by O’Reilly Media in early 2022.


