“Audio Ergo Sum!” (Part 2)

In Part One of this essay, I argued that the emergence and reception of voice and audio are following a familiar pattern, one we witnessed with other recent major digital disruptions. Just as with the Internet, the Web, cell phones, smartphones, and social media, audio is now passing through a phase where its increasing ubiquity (podcasts, AirPods, smart speakers, voice assistants, social audio) is setting off alarms in the heads of Marketers, with a growing minority beginning to understand that mainstream adoption is on the near horizon and simply a matter of time.

Source: Witlingo

But this not-so-new medium of voice and audio, I propose, is at the same time fundamentally different from the previous ones. The Internet, the Web, smartphones, and social media have all been essentially visual-tactile: they have all needed our eyes and our hands for us to engage with them. The emerging medium of audio, in contrast, engages a different pair of faculties: hearing and speaking.

So, in a sense, although the larger pattern of disruption is familiar, the challenges facing Marketers (what to do with this new way of engaging their prospects and customers, how to cast their brand and create differentiation) are new ones. In my view, they therefore represent a greater challenge to organizations than, say, the challenge of going from delivering a website to a desktop or a laptop to delivering a version of that website to a tablet or a smartphone. That earlier challenge was essentially one of interactional real estate and usability: i.e., how do we translate what we were able to deliver effectively on a large screen with an expansive keyboard to a smaller screen and a very limited keyboard?

In a nutshell, then, up to now the challenge has been primarily (though not always just) one of user experience design. In the case of the emergence of voice and audio, in contrast, the challenge seems to be primarily one of use case identification. What are the use cases where voice and audio can deliver value that the visual-tactile interface cannot deliver at all, or can deliver only poorly? What are the use cases where the visual-tactile interface is the poor man’s version of the superior, purely voice and audio one?

(It should come as no surprise to you that a product manager, yours truly, is claiming that the center of the usability universe should no longer sit with the UX designer but should instead shift to the product manager.)

And so, let’s go back to the 20 basic “Cartesian observations” that we made in Part One and ask ourselves: How can those observations help us articulate some basic heuristics for identifying use cases where voice and audio are essential to solving the problem faced in the use case, or to seizing the opportunity for value delivery that it presents?

Here are some abstractions that I can think of.

1. Your attention is urgently needed
Sound barges in on you: it doesn’t wait for you to look at something. It impolitely invades your space and demands your attention. Such behavior is almost always boorish, except when it’s not: i.e., when you really want to know what the voice or the sound is trying to tell you. You need to get out of the building now. You need to put out the fire in the kitchen now. (15, 16, 17, and 19 apply here.)

2. Time is of the essence
Here the situation is less dire, but being alerted as soon as possible would be nice. Your toasted bread is ready for you. The longer you wait, the less you will enjoy that toasted bread. Ditto with the cup of tea in the microwave. (6, 7, 8, and 12 apply here.)

3. Your eyes and/or your hands are busy
Here, it may be that time is not of the essence, but rather that your eyes and/or your hands are busy when something happens or when you want to do something (say, find out what the weather will be like tonight). You are looking at your smartphone as the elevator arrives at your floor. (4 and 5 apply here.)

4. You are likely to space out
You are in your car at a gas station, the gas is pumping, and you are scanning your Twitter feed and forget where you are. The thud of the pump when it finishes snaps you out of your trance. (1, 2, 3, 19, and 20 apply here.)

5. You are likely to neglect something worth your attention
You didn’t close the fridge door. You forgot to take your money or your card at the ATM. (9, 11, 13, and 14 apply here.)
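
For readers who like to see such heuristics made concrete, here is a minimal, purely illustrative sketch (in Python) of how these five abstractions might be encoded as a screening checklist for a candidate use case. The class, field names, and example below are my own assumptions for illustration, not part of the method itself:

# A hypothetical sketch: the five heuristics above as a screening checklist
# for a candidate use case. Names and structure are illustrative only.
from dataclasses import dataclass


@dataclass
class UseCase:
    """A candidate use case, described by the situations it involves."""
    name: str
    urgent_attention: bool = False     # 1. Your attention is urgently needed
    time_sensitive: bool = False       # 2. Time is of the essence
    eyes_hands_busy: bool = False      # 3. Your eyes and/or your hands are busy
    likely_to_space_out: bool = False  # 4. You are likely to space out
    likely_to_neglect: bool = False    # 5. You are likely to neglect something

    def audio_heuristics_met(self) -> list[str]:
        """Return the names of the heuristics this use case satisfies."""
        checks = {
            "urgent attention": self.urgent_attention,
            "time is of the essence": self.time_sensitive,
            "eyes/hands busy": self.eyes_hands_busy,
            "likely to space out": self.likely_to_space_out,
            "likely to neglect": self.likely_to_neglect,
        }
        return [label for label, met in checks.items() if met]


# Example: the elevator arriving while you are looking at your phone.
elevator = UseCase("elevator arrival chime", eyes_hands_busy=True)
print(elevator.audio_heuristics_met())  # ['eyes/hands busy']

The more of these boxes a candidate use case checks, the stronger, presumably, the case that voice and audio are essential to it rather than a bolt-on.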

Can you think of any additional real-life situations (and it is crucial, in compliance with the Cartesian method, that they be real-life situations that you have lived first-hand and not hypothetical scenarios) that can help us come up with new abstractions and heuristics in our quest to pinpoint value-delivering use cases?


Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that builds products and solutions that enable brands to engage with their clients and prospects using voice, audio, and conversational AI. Prior to Witlingo, Dr. Bouzid was Head of Alexa’s Smart Home Product at Amazon and VP of Product and Innovation at Angel.com. Dr. Bouzid holds 12 patents in the Speech Recognition and Natural Language Processing field and was recognized as a “Speech Luminary” by Speech Technology Magazine and as one of the Top 11 Speech Technologists by Voicebot.ai. He is also an Open Voice Network Ambassador, heading their Social Audio initiative, and an author at Opus Research. Some of his articles and media appearances can be found here and here.



