VUI Aesthetics: What does it mean to create a Beautiful Voice First Experience?

By Ahmed Bouzid on June 1, 2022

In his 2004 instant classic, Emotional Design: Why We Love (or Hate) Everyday Things, Prof. Don Norman tells us that aesthetics play a surprisingly functional role in product usability. In a nutshell, Prof. Norman convincingly shows us that attractive products are not only perceived to work better than plain or unattractive products, but in fact are outright objectively more usable.

“Attractive things,” Prof. Norman writes, “make people feel good, which in turn makes them think more creatively…[and] happy people are more effective in finding alternative solutions and, as a result, are tolerant of minor difficulties.”

An attractive product is one that people use more often, and the more they use it, the better they learn how to use it. So, whether or not the product was “designed right” (i.e., its affordances were optimally leveraged and its interface abides by best practices), a beautiful product becomes more usable as a consequence of its beauty than a product that is used less often, or less lovingly, simply because it is less attractive.

How can we apply this basic but powerful insight to our work as voicebot builders?

First, we must address the question: what does it mean for a voicebot (physically invisible, temporally fleeting and ephemeral) to be aesthetically attractive?

The aesthetics of a voicebot, meaning what makes someone say that a voicebot is “attractive” or, in present-day parlance, “sweet”, can be outlined across three broad dimensions.

First is the Acoustic dimension

How easy on the ear are the sounds that emanate from a voicebot? Here, we can point to three main properties:

Clarity: Is the audio clean and clear, or is it full of static?
Volume: Is the default audio level (one that does not force the user to adjust the volume) loud enough for most users, or is it too loud or too quiet?
Consistency: Is the audio consistently clear and consistently loud, or does it vary in clarity and volume? Audio that is clear and at the right volume and that remains so throughout the experience is far less taxing on the human ear.
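
One rough way to check the consistency property is to compare the level of each recorded prompt against the others. The sketch below is only an illustration: it assumes each prompt is available as a list of float samples between -1.0 and 1.0, it uses RMS level in dBFS as a crude stand-in for perceived loudness, and the function names and the 3 dB tolerance are hypothetical choices, not an established standard.

```python
import math
import statistics


def rms_dbfs(samples):
    """Return the RMS level of float samples (range -1.0..1.0) in dBFS."""
    if not samples:
        raise ValueError("no samples")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10 * math.log10(mean_square)


def inconsistent_prompts(prompts, tolerance_db=3.0):
    """Return the names of prompts whose level strays from the median level.

    prompts maps a prompt name to its samples. The 3 dB default tolerance
    is an assumption made for this sketch.
    """
    levels = {name: rms_dbfs(samples) for name, samples in prompts.items()}
    median = statistics.median(levels.values())
    return sorted(
        name for name, level in levels.items()
        if abs(level - median) > tolerance_db
    )


# Toy data: two prompts at the same level and one that is much quieter.
prompts = {
    "welcome": [0.5, -0.5] * 100,
    "menu": [0.5, -0.5] * 100,
    "goodbye": [0.05, -0.05] * 100,
}
flagged = inconsistent_prompts(prompts)
```

A check like this only catches gross level differences between prompts; it says nothing about clarity, and a proper loudness measure would weight frequencies the way the ear does.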

Second is the Linguistic dimension

How concise and potent are the words and phrases that the voicebot uses? Here too we can point to three properties:

Clarity: Are the words easy for the target user to understand (for instance, a child, or a person for whom English is not a native language)?
Punch: Is the language tight and action-oriented, telling the user exactly what they need to do, or is it loose, wordy, and tedious, leaving the user wondering what to do?
Consistency: Does the voicebot use the same words for the same objects it refers to, or does it change words (using “case” at times and “ticket” at other times in the same conversation)? Does it always offer three choices, or does it sometimes offer two and other times four? Does it formulate its choice offerings using the same construct (e.g. “You can ask me…” or “Here’s what I can do for you…”), or does it use one construct in some places of the conversation and another one elsewhere?
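
The vocabulary part of the consistency property lends itself to a simple automated check over a voicebot's prompt texts. The following is a minimal sketch under stated assumptions: the synonym groups are hypothetical and would have to be curated by the design team, and the matching is on whole lowercase words only, so plurals such as "tickets" are not caught.

```python
import re

# Hypothetical synonym groups: within each group, a voicebot should pick
# one term and stick with it throughout the conversation.
SYNONYM_GROUPS = [
    {"case", "ticket"},
    {"agent", "representative"},
]


def mixed_terms(prompts, synonym_groups=SYNONYM_GROUPS):
    """Return the groups of competing terms that the prompts mix together."""
    words = set()
    for prompt in prompts:
        words.update(re.findall(r"[a-z']+", prompt.lower()))
    conflicts = []
    for group in synonym_groups:
        used = sorted(group & words)
        if len(used) > 1:  # more than one term from the same group
            conflicts.append(used)
    return conflicts


prompts = [
    "I found your case.",
    "Would you like to close this ticket?",
    "An agent can help you with that.",
]
conflicts = mixed_terms(prompts)
```

Here the prompts mix "case" and "ticket", so that pair is reported, while "agent" appears alone and is not.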

Third is the Conversational dimension

How smoothly is the flow between the voicebot and the user proceeding? A “sweet” flow is one where the conversation moves forward effortlessly toward its goal (say, securing a parking spot, obtaining the balance due, or purchasing a movie ticket). Such a “flowing” conversation usually emerges when the following three mechanisms are managed well:

Turn ownership: To the extent that at any given time the turn is monopolized by either the voicebot or the human, the conversation flow will be marred. For instance, a voicebot that speaks long prompts, or one that lets the user speak for too long, is a voicebot that will sooner or later get into trouble.
Turn ownership transfer: Here, the concern is twofold. The voicebot should not reclaim the turn before the human is ready to give it back, but should reclaim it as soon as the user has given it back. By the same token, the human should not be placed in a position where they become impatient and cut the voicebot off in mid-sentence (for instance, when a voicebot treats a power user the same way as a novice user).
Error recovery: Here, the concern is with navigating situations where the conversation has gone sideways: for instance, the user did not say anything, or said something but the voicebot did not hear them, or said something but the voicebot could not recognize the words spoken or could not extract any meaning from those words. A voicebot that is designed to navigate out of such difficult moments smoothly (for instance, when the user doesn’t know what to say, the voicebot might give them a concrete example of what to say) is a voicebot that will increase its chances of delivering that sought-after smooth experience.
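
The error recovery idea can be sketched as an escalating reprompt strategy. Everything in this example is an assumption made for illustration: the event names "no_input" and "no_match", the prompt wording, the three-attempt limit, and the final hand-off to a person are hypothetical design choices, not a prescribed method. The only part taken directly from the discussion above is that the voicebot offers a concrete example of what to say once a plain reprompt has failed.

```python
def recovery_prompt(event, attempt, example="Pay my bill"):
    """Return what the voicebot says after a failed turn.

    event   -- "no_input" (user said nothing) or "no_match" (not understood)
    attempt -- 1-based count of consecutive failures at this point
    example -- a concrete utterance the user could try (hypothetical)
    """
    if event not in ("no_input", "no_match"):
        raise ValueError(f"unknown event: {event}")
    if attempt >= 3:
        # Assumed escalation: stop reprompting and hand the user off.
        return "Let me connect you with someone who can help."
    if event == "no_input":
        opener = "I didn't hear anything."
    else:
        opener = "Sorry, I didn't catch that."
    if attempt == 1:
        # First failure: keep the reprompt short.
        return f"{opener} What would you like to do?"
    # Second failure: give the user a concrete example of what to say.
    return f"{opener} You can say something like, '{example}'."


first = recovery_prompt("no_input", 1)
second = recovery_prompt("no_match", 2, example="Check my balance")
third = recovery_prompt("no_match", 3)
```

Keeping each reprompt short and changing it between attempts avoids the repetitive loop that makes a struggling conversation feel even worse.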

Although invisible, temporally fleeting and ephemeral, voicebots can deliver experiences that one can describe as pleasant and wonderful, or unpleasant and awful. Beauty, when it comes to voicebots, boils down to how the voicebot sounds, the words and constructions it uses, and how elegantly it handles the back and forth of human-to-voicebot conversation. Build such attractive voicebots and users will be more patient, and hence more disposed to investing the time and effort to learn how to use them.
