The Five Key Constituencies of Voice & Audio

Art by Juj Winn

Audio and voice are on a healthy and steady ascent, and that, upon first consideration, strikes me as an interesting phenomenon. It is interesting because it seems to run counter to the ethos of more-and-more: more modalities, more channels, more features, more data, and more choices. And yet, here’s a medium that ostensibly offers less: no colors, no shapes, no faces, no buttons to push, nothing to swipe, nothing to tap, no windows to open or close, just sound heard and sound spoken. So, who on earth is going to be attracted to something that gives us not more, but less than what we are accustomed to getting?

Here are the five key constituencies who are, or at least should be, cheering for voice and audio.

The Creator
Unless you are a graduate student cooped up all day in the library, reading reams and reams of text and churning out paragraph after paragraph, chances are that you consume and create at least an order of magnitude more information by listening and speaking than by reading and writing. Indeed, we start speaking long before we start writing, let alone before we become competent enough writers to express ourselves clearly and succinctly through text. So, for us humans, speaking and listening are more than expressive modalities: they are a fundamental manifestation of our humanness. Beyond this, recording high-quality audio, or capturing a stream of speech to be processed by software, is a far less expensive proposition than capturing and managing high-resolution video.

What is also compelling and exciting about audio is how seamlessly unobtrusive and minimally disruptive it is to the flow of a naturally occurring event. Whether you are capturing an interesting conversation or registering a thought or an insight, one click on your smartphone to start recording and that’s it. No tripods to set up, no lighting to adjust, no, “Hold off, let me get this hair under control,” before the recording starts.

The Advertiser
For the advertiser, audio, whether linear or on demand, whether listened to via a smartphone, a smart speaker, or earbuds such as AirPods, is a medium that almost guarantees that the consumer of an ad will listen to the whole ad. You can avert your eyes from a magazine ad, mute or turn away from a TV commercial or change the channel, and skip or mute a Google ad with a click on your tablet, but when you are listening and your eyes or your hands (or both) are busy, you have no easy way to skip an ad (if it can be skipped at all). Endure you must, and we usually do. And that, whether we like it or not, is music to the ears of ad men and ad women out there.

The Physically Busy
Whether you are driving, combing your hair, trimming your flowers, shining your shoes, folding your clothes, taking a shower, polishing your nails, washing your dog, or sunbathing with your eyes closed and a piña colada in hand, voice and audio are your go-to interface for doing things (adding something to your shopping list, firing up some music) and for consuming information (asking for the time, getting the answer to a question that just popped into your head while taking in the sun). Try doing any of that with a screen while your eyes are busy (or closed) or your hands are busy (or dirty), and let me know how much you enjoy the experience.

The Physically Challenged
Through my work with the AARP Foundation, I have had many opportunities to witness firsthand just how liberating smart speakers are for us human beings as we age and gradually (and sometimes very quickly and suddenly) lose our freedom and our ability to exercise our will upon the world around us. Simple things that we once never thought twice about doing suddenly become very difficult or even impossible: turning the lights on and off, tuning in to our favorite radio station, setting a timer, getting the answer to a question (“When is the next Orioles game?”). Constantly needing to have someone at the ready to help with such “simple” things is not something that we would wish on anyone. And even when we do have someone around who can help us, unless it is that person’s job to fulfill these requests, we feel loath to impose, sometimes even when the person we are asking is being paid to be at our beck and call.

In other words, voice and audio give us the ability to retain our freedom and our dignity as we grow older and less capable. Yes, it is most certainly not healthy to completely rely on yourself: seeking help from other humans and letting others help you is part of living a healthy social life. But becoming completely, or nearly completely, reliant on other people to do simple things all the time is tragic.

The Young
Perhaps the one constituency of voice and audio (and specifically, of smart speakers and far-field voice assistants) that may turn out to be impacted by the technology in the most fundamental way, and, given that its members are at the start of their lives, with the most lasting consequence, is the Alpha Generation. The impact of audio has already started with Gen Z, who, according to studies, “don’t just listen to music, they are living it to amplify or escape their surroundings.” But for the Alpha Generation, the children who are growing up with the Amazon Echo and Google Assistant as a fact of life (they have not known a world without smart speakers), the always-available smart speaker has enabled them to do things on their own before they could read or write or type. Yes, with the smartphone they could tap and swipe when they were two, but they couldn’t search and explore on their own: say, find games, or get answers to questions such as “What is the biggest dinosaur?” or “Who is the tallest person alive?” These smart speakers can even answer the notorious “Why” questions that toddlers are experts at asking, such as “Why do we need to drink water?”, “Why do dogs bark?”, and “Why do bees like flowers?” In other words, the smart speaker for this generation stands for independence, for the ability to explore the world on their own, without having to rely on, or wait for, adults (those large people who know everything and can do anything) or, God forbid, older siblings to be around. And this, a very young human being not suffering the yoke of anyone, is, as far as I am concerned, a new and wonderful way to kick off life.


Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that builds products and solutions that enable brands to engage with their clients and prospects using voice, audio, and conversational AI. Prior to Witlingo, Dr. Bouzid was Head of Alexa’s Smart Home Product at Amazon and VP of Product and Innovation at Angel.com. Dr. Bouzid holds 12 patents in the Speech Recognition and Natural Language Processing field and was recognized as a “Speech Luminary” by Speech Technology Magazine and as one of the Top 11 Speech Technologists by Voicebot.ai. He is also an Open Voice Network Ambassador, heading their Social Audio initiative, and an author at Opus Research. His new book, The Elements of Voice First Style, co-authored with Dr. Weiye Ma, is slated to be released by O’Reilly Media in early 2022.


