What if it were the “Eyes-First Revolution”?

Have you ever heard someone say: “Designing a highly usable website is difficult because the graphical user interface presents inherent challenges and limitations?” Neither have I. How about: “Designing a highly usable website is not something that anyone can do. It requires skill, knowledge, and a great deal of experience?” All the time. (Selecting pre-designed templates, by the way, doesn’t count as designing.)

In contrast, I have heard, and continue to hear, people say: “Designing a highly usable voice user interface is difficult because the voice user interface presents inherent challenges and limitations.” And rarely do I hear: “Designing a highly usable voice interface is not something that anyone can do. It requires skill, knowledge, and a great deal of experience.”

For the purposes of this essay, I will set aside the latter proposition and focus on the former: the proposition that there are some inherent attributes of voice user interfaces (VUIs) that make them perennially difficult not only to design for but also to use.

The usual “limitations” or “challenges” that people have in mind when they say that the voice user interface is difficult to design for and to use revolve around speech recognition accuracy (you say “Newark” and it thinks you said “New York”) and ephemerality (while images and text on a web page persist, the spoken words of a voice interface vanish as soon as they are spoken), which forces the user to remember what was said to them (cognitive load). This is compounded by the interface’s mispronunciations (saying “conVERT” when it should say “CONvert”) and exacerbated by the fact that one has to listen patiently to the interface while it speaks, one word at a time, sometimes several sentences in a row.

For me, the interesting question is: Why is this the case? Why do we not start from the premise that graphical user interfaces have inherent limitations that we need to work around, and yet that is exactly where we start with the voice interface?

Where Voicebots Rule the World

Here’s an exercise you may find interesting.

Imagine that technology had evolved in such a way that we had no graphical interfaces around us (no screens, no smartphones, no tablets) and that all of the technologies that help us do what we want to do are voice and audio based. In this universe, voicebots would be able, for instance, to take dictation better than any human, enable us to control our environment by just speaking, and enable us to book appointments, purchase books and concert tickets, and listen to music — all of it by just saying a few words. And the voicebots would execute our commands with accuracy, speed, and efficiency. In this world, this technology (taken for granted and almost invisible) would be there mainly to help us cut to the chase: I’m feeling hungry and I want to place my usual order from Jimmy John’s. The voicebot, having interacted with me long enough to know what I want, takes care of the order for me; it knows exactly what to do when it hears me say, “Get me a Jimmy John’s.”

Then imagine that some entrepreneur suddenly introduces the Graphical User Interface — an unheard-of innovation up to that point that promises to revolutionize the way we interact with the world around us. Dazzling visions are cast of this new way of interacting with our world — colors, blinking lights, screens, graphics — visions that tantalize the imagination of even the most jaded of venture capitalists. However, this amazingly exciting, brand-new interface, unlike the old-fashioned voice one, doesn’t seem to do the “cutting to the chase” thing very well. Instead, it wants to take its time with you. It wants you to slow down, preferably sit down, pick up something, look at it, touch it, read the things that are on it, and click and swipe at it to get things done. Imagine you were a creature who evaluated interfaces by assessing how good they were at cutting to the chase. How would you rate this newfangled interface?

Here’s what I think we would say. We would say that this interface presents several challenges and limitations. To do anything that I want to do, it needs me to halt whatever I am doing, fetch a thing and put it in my hands, look at the thing for as long as I am using the interface, and specify every little thing that I want to do! “My Lord, is this thing dumb,” I would most probably say out loud as it repeatedly asked me to fill out text boxes that, as far as I am concerned, I had already filled out, including those that ask for my name, my email address, my phone number, my home address. And then there is all that tedious typing and tapping and swiping. Why? Why all this when I can just speak and get what I want?

But lo and behold, I discover that this interface, limited and tedious as it is, is really good for certain things where my aim is not to cut to the chase the way I can with the voice interface: situations where I want to slow down and take my time making choices, where I need to reflect on things. And so I grow fond of the interface because it allows me to interact with what’s around me in a very novel way.

But to my disappointment, I notice that the innovators and entrepreneurs around me who have jumped on the “Eyes-First Revolution” are not really taking advantage of this new interface to solve new problems, but rather are repurposing it to solve old problems — problems that were nicely solved by the good old voice interface. So, they give me a keyboard and tell me that I can now type instead of dictating, and buttons so that I can control my appliances without speaking, and a web page where I can schedule meetings on my calendar, and an app on a device that I can carry with me all the time for purchasing books and concert tickets, and another one for putting together song lists, including one that I call “Optimistic.” All of this, I am told, is progress because, you see, in those situations where I am not able to speak or hear (referred to as ears-busy/mouth-busy situations), I can now do what I want to do using my eyes and hands instead.

The real opportunity when a new interface is introduced is not in figuring out how best to do with the new interface what existing interfaces can already do (and probably do very well), but in figuring out what new problems it can solve, or what new value it can deliver, that can’t easily be delivered by existing interfaces. If the new interface needs you to slow down, to focus, to take your time, then find those use cases where such “limitations” are in fact enablers that unlock new ways of experiencing the world.

Getting back to the real world as it is today: What use cases can innovators think of where imperfect speech recognition, ephemerality, the need to remember what you heard, the need to put up with mispronunciations, and the need to listen patiently as the voicebot speaks, one word at a time, are not limitations but exactly the attributes one needs to deliver something that the user will find useful in some newly meaningful way? Taking that question seriously, I believe, is the first step toward delivering the still largely untapped promise of the voice user interface.


Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that builds products and solutions that enable brands to engage with their clients and prospects using voice, audio, and conversational AI. Prior to Witlingo, Dr. Bouzid was Head of Alexa’s Smart Home Product at Amazon and VP of Product and Innovation at Angel.com. Dr. Bouzid holds 12 patents in the Speech Recognition and Natural Language Processing field and was recognized as a “Speech Luminary” by Speech Technology Magazine and as one of the Top 11 Speech Technologists by Voicebot.ai. He is also an Open Voice Network Ambassador, heading their Social Audio initiative, and an author at Opus Research. Some of his articles and media appearances can be found here and here. His new book, The Elements of Voice First Style, co-authored with Dr. Weiye Ma, is slated to be released by O’Reilly Media in early 2022.

 


