The Five Voice-First Fallacies

The Fallacy of the Fish and the Bird
In the beginning, God created the Fish and the demigods marveled at this creature that God had created. “Look at how it glides through the oceans,” they exulted, “and how it weaves through the corals.” Then God created the Bird and the demigods were not impressed. And they said, “This creature of Yours, O Creator of all, does not swim! Nor does it glide through the oceans nor does it weave through the corals.” Then they said, “And behold how many fish You have created and how few birds You have created. We expected more, O Creator of all, far more than this thing which does not swim but merely flies.”

The iPhone is the Fish and Alexa is the Bird.

The point: It is nonsensical to measure the success of the latter with the metrics of the former. Many say: Look at how many mobile apps have been published on the App Store, and then they compare that number with how many Amazon Alexa skills and Google Assistant actions have been published on the Alexa skills store or the Assistant actions directory. By that logic, the mobile revolution has been a dismal failure: compare how many websites exist out there with how many mobile apps have been deployed. Not that the comparison is completely without merit (there is a bit of truth in it), but still, we would not say that mobile has been a failure or a disappointment. We understand that websites and mobile apps are different animals that serve different purposes. In the case of voice, perhaps the number of skills and actions is not the way to measure the success of the medium. Perhaps it is something else, such as: How often do we use the interface in our daily life without interrupting the flow of that life? In the case of mobile, the answer is nearly zero. In the case of voice, when done well, it is decidedly not zero (that is the whole point of the interface). Divide any positive number by something approaching zero and the quotient grows without bound, so one could say, at least on that particular measure (and why not adopt that measure?), that voice is infinitely better than mobile!

The moral: All of this comparing and contrasting is not very useful.

The Fallacy of More is More
I have heard it said many times, as running subtext as well as spoken thought, that voice-only interfaces are barren and bereft, poor and painful, to be used only when one must for lack of better, richer means. Many truly believe that there is a hierarchy of ideal interfaces in a realm worthy of Plato himself, where the highest of the order are called multi-modal, lower sit the visual and the tactile, and lowest of them all the voice-only: the “blind interface.” But such believers fail to see that they are committing an age-old fallacy: the belief that the more modalities an interface offers, the better it is. In reality, this is not the case. How powerful an interface is depends on the context of its use and on what that context requires. Where I need to see something and say something at the same time, the multi-modal interface is best; where I do not wish to (or cannot) see anything and do not wish to (or cannot) touch anything, the voice-only interface is best.

The Linguistic Fallacy
This fallacy is committed by those who speak of “Voice First” with the fervor and zeal that only the new adherent dares to summon. To them, “Voice First” means that the voice interface is the culmination of all interfaces yet devised, the one at the sharp end of the evolutionary spear, the interface of the bold and the daring who embrace our future without fear, casting aside the visual and the tactile interfaces as so much unwanted baggage. But the fault here is a simple one: a failure to understand what “Voice First” is meant to mean. “Voice First” does not mean that voice is “first,” the winner of some contest. It means merely that voice is the primary interface for a given form factor, such as the smart speaker or the earbud, in the scenarios that suit it (lying on your bed, eyes closed, arms folded, resting). It is the modality through which the user will first try to accomplish their task before turning to another interface.

The Foreign Speaker Fallacy
This fallacy is committed by those who react to a robot not understanding their speech by speaking louder or more slowly than a native speaker of the language would normally speak to another native. In such instances, the robot is treated like a foreign speaker of sorts, an entity that must be addressed in near shouts, with the words spoken in distinct, separate, small bites. But the fallacy committed here stems from the speaker not understanding how robots are trained. The robot, which has no brain or cognitive capabilities, “understands” by doing nothing more than recognizing similarities with what it has heard before (i.e., by matching the sounds it hears against the sounds on which it was trained). And such robots are never trained on loud, staccato speech, but on speech that is average in volume and average in flow and speed.
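To make that matching idea concrete, here is a deliberately toy sketch in Python (numpy only). It is illustrative, not how any production recognizer actually works: the feature vectors below are made-up stand-ins for real acoustic features such as MFCCs, and real systems use statistical models rather than raw nearest-neighbor lookup. The point it demonstrates survives the simplification: an utterance is “recognized” by how close it sits to the examples the system was trained on, so shouting, which shifts the features away from everything in the training data, makes recognition worse, not better.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy "training set": 50 feature vectors for a phrase spoken at
    # normal volume and pace. (Stand-ins for real acoustic features.)
    train = rng.normal(loc=1.0, scale=0.1, size=(50, 16))

    def nearest_distance(utterance, templates):
        """Distance from an incoming utterance to its closest training example."""
        return float(np.min(np.linalg.norm(templates - utterance, axis=1)))

    normal = rng.normal(loc=1.0, scale=0.1, size=16)  # spoken normally
    shouted = normal * 2.5                            # same phrase, much louder

    print(nearest_distance(normal, train))   # small: close to the training data
    print(nearest_distance(shouted, train))  # large: far from anything ever "heard"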

The Anthropomorphism Fallacy
This fallacy is committed by those who believe that the human-to-human model of interaction is the model to emulate when building a conversational system for human-to-voicebot interaction. Such believers operate, not unreasonably, on the proposition that humans not only already know how to interact with other human beings, but that they do so by dint of habit (some claim by genetic inheritance), without needing to think, pulling up and deploying patterns that require little effort to follow. They greet each other when they open a conversation, they try not to interrupt, they are polite, they end the conversation with a series of quick back-and-forth utterances, and so on. And so, when designers sit down to design a voice interface, they imagine in their mind a human speaking to a human. “What would a human say?” they ask, or “How would a human react to this?” But the robot is not a human: it does not feel, it has no face to save, it does not hold grudges, it cannot (and should not be able to) hurt you, and so on. The designer should keep all of that in mind and be driven always by one question: How do I enable the user to use this robot, through language, in such a way that the interaction is most enjoyable, most effective, and most efficient for the human user? Lean on human-to-human patterns where they make sense; break away from them and create new ones where they don’t.


Dr. Ahmed Bouzid is CEO of Witlingo, a McLean, Virginia-based startup that builds products and solutions that enable brands to engage with their clients and prospects using voice, audio, and conversational AI. Prior to Witlingo, Dr. Bouzid was Head of Alexa’s Smart Home Product at Amazon and VP of Product and Innovation at Angel.com. Dr. Bouzid holds 12 patents in the fields of speech recognition and natural language processing, was recognized as a “Speech Luminary” by Speech Technology Magazine, and was named one of the Top 11 Speech Technologists by Voicebot.ai. He is also an Open Voice Network Ambassador, heading its Social Audio initiative, and an author at Voicebot.ai and Opus Research.


