When you are about to enter a building and a flat door comes between you and the inside of that building, do you push on that door or do you pull on it? If all you are given is a flat surface, then you have no choice but to push. If you are given a handle, then you are probably inclined to pull on that handle.
“Affordance” is the word used to refer to the way an object physically presents itself to the user, enabling the user to easily determine what action to take (in this case, to push or to pull). The affordance of the first door (the flat surface) and that of the second door (the handle) let the user know exactly what to do simply by looking at the shape of the door.
In the world of Voice First, where users don’t have any visible objects that they can scrutinize and therefore no visual presentations that they can leverage to infer what action to take, can one meaningfully talk about affordances?
The answer is yes, if we abstract affordance away from the visual world and redefine it this way: an affordance is the way an object presents itself to the user, enabling the user to easily determine what action to take next.
To get at what affordance means in Conversational Voice First, we need to answer three questions: (a) What is an object in the context of a human conversing with a voicebot? (b) What is an action that a user can take as a result of the perceived presentation of an object? (c) What does it mean for an object to present itself to a user?
Let’s start with objects. It takes a lot of “things” to make a conversation happen. Things like participants, utterances, conversational turns, and even the very conversation itself.
Along with those things there are actions, such as starting a conversation, offering a turn, requesting a turn, pausing, starting over, and so on. This much is straightforward.
The tricky part is the notion of an object presenting itself to a user so that the user knows what to do as a result of the perceived presentation.
In the visual world, where physical properties persist and linger over time, the presentation of an object is a set of visual properties: the shape (a flat surface, or a handle that one can pull), the color (the red, amber, or green of a traffic light), the structure (the porous colander, the sharp knife), and so on. In the world of Voice First, the closest we can come to such a property is the sound heard by the user. The key difference between the visual and the auditory, however, is that while visual cues persist independently of time, in Conversational Voice First the cues are tightly coupled with time. You launch your voicebot with a wake word: by doing so, you have signaled, at that point in time, that you wish to engage in a conversation. The assistant makes a sound immediately after you have finished speaking, and perhaps flashes something visual, indicating that it heard you and that it is ready to engage, and then listens to you (gives you back the conversational turn). The sound it made and whatever visuals it flashed are the affordances that tell you what to do next, namely, to tell it what you want.
You: Hey Google
Google: (Sound) (Lights On)
You: Talk to The Motley Fool.
Once you have made your request and stopped talking, you cede the conversational turn back to the voicebot. That act (ceasing to speak and ceding the turn) is itself a cue, an affordance, telling the voicebot what it needs to do next: process the speech you have just spoken.
Google: The Motley Fool. Give me the name of a company or ask me about how your watch list is doing.
You: My watch list.
Google: (Sound indicating whether your watch list of stocks gained or lost during the last day of active trading.) (Details about your watch list.)
Other examples of voice affordances: (a) the intonation of an utterance, indicating that the assistant is asking you a question and that you are expected to answer it; (b) a double beep when you don’t respond in time, indicating that the voicebot didn’t hear you and that you need to speak again; (c) percolating sounds, indicating that the assistant is working on something, retains the turn, and will re-engage you as soon as it has something to say, and that, in the meantime, you need to wait patiently (your affordance).
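To make this concrete, here is a minimal sketch, in Python, of how a designer might catalog such auditory cues and the user action that each one affords. The cue names, the data structures, and the mapping are purely illustrative assumptions for this article; they do not correspond to any platform’s actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class AffordanceCue(Enum):
    """Auditory cues a voicebot can emit. Names are illustrative, not a standard."""
    READY_CHIME = auto()           # "I heard the wake word; the turn is yours."
    QUESTION_INTONATION = auto()   # Rising pitch: "I expect an answer."
    REPROMPT_DOUBLE_BEEP = auto()  # "I didn't hear you; please speak again."
    PROCESSING_PERCOLATE = auto()  # "I'm working on it; I retain the turn."

@dataclass
class ExpectedUserAction:
    description: str       # what the cue invites the user to do
    user_holds_turn: bool   # whether the turn now belongs to the user

# Hypothetical mapping from each cue to the action it affords the user.
AFFORDANCES = {
    AffordanceCue.READY_CHIME: ExpectedUserAction("state your request", True),
    AffordanceCue.QUESTION_INTONATION: ExpectedUserAction("answer the question", True),
    AffordanceCue.REPROMPT_DOUBLE_BEEP: ExpectedUserAction("repeat or rephrase", True),
    AffordanceCue.PROCESSING_PERCOLATE: ExpectedUserAction("wait patiently", False),
}
```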
To fully grasp what affordance means in the context of Voice First, it is useful to distinguish between the notion of an “affordance” and that of a “signifier”.
An affordance leverages the physical properties of an object to enable the user to infer what action to take next. A signifier is a literal message that tells the user explicitly what to do. For instance, the word “Pull” can be posted next to the handle to inform the user that they are expected to pull to open the door, and the word “Push” can similarly be posted on the flat surface to indicate that they are expected to push. In voice, the words of an opening greeting, “Welcome to Millennium Trust Company,” indicate that the conversation has started and that the user should get ready to engage; the words, “You can interrupt me at any time,” serve as a signifier, since they tell the user in explicit, plain language what to say to accomplish something (in this case, to take the turn back).
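The distinction can be captured in a design artifact. Below is a minimal sketch, assuming a hypothetical SystemTurn structure (the field names are mine, not any vendor’s schema), showing one prompt that relies on an affordance (a non-verbal earcon plus the question’s intonation) next to one that adds an explicit signifier spelled out in words.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SystemTurn:
    """One system turn. Field names are illustrative, not any vendor's schema."""
    speech: str                      # what the voicebot says
    earcon: Optional[str] = None     # non-verbal affordance, e.g. a short chime
    signifier: Optional[str] = None  # an explicit instruction, spelled out in words

# Affordance only: the earcon and the question's rising intonation invite a reply.
watch_list_turn = SystemTurn(
    speech="Give me the name of a company or ask me about how your watch list is doing.",
    earcon="listening_chime.wav",
)

# Affordance plus signifier: the expectation is stated in plain language.
greeting_turn = SystemTurn(
    speech="Welcome to Millennium Trust Company.",
    signifier="You can interrupt me at any time.",
)
```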
Conventions for a Social Interactional Context
Last, it is important not to make the mistake of thinking of objects, actions, and the relationships between users and such objects and actions as innate and absolute, as naturally given things in and of themselves. All affordances have meaning only in a social, interactional context. Even a handle on a door, which seems an obviously natural invitation to pull, is socially and contextually constructed. To know what to do with a handle, one has to first perceive it as an entity at all, and then perceive it as a handle; one has to perceive the surface to which the handle is attached as a door; one has to understand that a door is a thing one can “open” to get into a “building”; one has to know what a “building” is; and so on. All of these objects, properties, and actions are taught and learned, not handed down from the sky. Deep down, they really are “mere” conventions that we acquire through ceaseless socialization.
This point is important because, in order to build a useful system of Voice First affordances, we need to settle on a set of conventions that will make possible the affordances that users will be able to use in their conversations with voicebots. Affordances do not exist in the wild or emerge ex nihilo; without such conventions, they simply do not arise. We need to agree on how to signal the start of a conversation, the ceding of a turn, the reclaiming of a turn, the need to pause, etc., and we need to adopt such conventions in such a way that any user engaging with any voice assistant knows what to do, regardless of who designed, built, and deployed that voicebot.
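What such conventions could look like, at least in skeleton form, is sketched below: a hypothetical shared vocabulary of turn-taking events, each paired with a common audible cue. Nothing like this is standardized today; the event names and cue files are assumptions made purely for illustration.

```python
from enum import Enum, auto

class ConversationalEvent(Enum):
    """A hypothetical shared vocabulary of turn-taking events."""
    CONVERSATION_STARTED = auto()
    TURN_CEDED_TO_USER = auto()
    TURN_RECLAIMED_BY_BOT = auto()
    PAUSED = auto()
    CONVERSATION_ENDED = auto()

# If every assistant signaled each event with the same audible cue, that cue
# would work as a true affordance across platforms. The file names below are
# placeholders, not an actual proposal.
SHARED_CUES = {
    ConversationalEvent.CONVERSATION_STARTED: "start_chime.wav",
    ConversationalEvent.TURN_CEDED_TO_USER: "your_turn_chime.wav",
    ConversationalEvent.TURN_RECLAIMED_BY_BOT: "percolating_loop.wav",
    ConversationalEvent.PAUSED: "pause_tone.wav",
    ConversationalEvent.CONVERSATION_ENDED: "end_chime.wav",
}
```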
Unfortunately, as things stand, given the lack of a standard and a common protocol (Amazon Echo, Google Assistant, Samsung Bixby, and Apple Siri all have their own particular ways of cueing what the user is expected to say), we can’t truly speak of affordances in the same way that we speak about them in the world of visual and tactile interfaces. Until we can, users will continue to face frustration. [But fortunately, The Open Voice Network is working hard to propose standards and protocols. For more on their work, please visit: https://openvoicenetwork.org/]
Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that builds products and solutions that enable brands to engage with their clients and prospects using voice, audio, and conversational AI. Prior to Witlingo, Dr. Bouzid was Head of Alexa’s Smart Home Product at Amazon and VP of Product and Innovation at Angel.com. Dr. Bouzid holds 12 patents in the Speech Recognition and Natural Language Processing field and was recognized as a “Speech Luminary” by Speech Technology Magazine and as one of the Top 11 Speech Technologists by Voicebot.ai. He is also an Open Voice Network Ambassador, heading their Social Audio initiative, and an author at Opus Research. Some of his articles and media appearances can be found here and here. His new book, The Elements of Voice First Style, co-authored with Dr. Weiye Ma, is slated to be released by O’Reilly Media in 2022.