Making Voice Great Again: A Ubiquitous User Interface for AI

Headlines in the trade press read like the output of a BuzzPhrase Generator app. An IPO legitimizes TPaaS (Telco Platform as a Service). Huge firms seek to corner “Big Data in the Cloud.” Start-ups take command of “The Internet of Things.” These highly ambiguous terms have replaced “Mobile,” “Social” and “Collaboration” in the word clouds of the Digital Age. They are shiny objects with great growth potential and transformative power.

Information industry giants, namely (and alphabetically) Amazon, Alphabet (Google), Apple, Facebook, IBM, Microsoft, Oracle and Salesforce, have entered a pitched battle to define and dominate the emerging Digital Commerce marketscape. In their sights are companies and individuals with mad skills and intellectual property governing a set of core technologies that include Automated Speech Processing, Natural Language Understanding, Machine Learning, Semantic Search and Knowledge Management. For better or worse, they collectively refer to these diverse disciplines as “Artificial Intelligence” or “AI.”

First Comes Branding
For branding purposes, each leviathan assigns its own advantageous nomenclature to AI. For IBM, it’s “Cognitive Computing.” Alphabet offers “DeepMind.” Oracle considers it “Adaptive Intelligence.” The others have some variation on the theme of cognition, deep learning, neural networking and understanding. All represent a move away from rules-based, statistical language models toward applying computing power to recognizing meaning and understanding the intent of individuals engaging in conversations, commerce or other activities on digital networks and in the real world.

The giants of the information industry invested billions of dollars and millions of person-hours to develop their algorithms, write their code and, most importantly, gain access to raw data and what Google calls its “knowledge graph”: the voluminous amounts of structured and unstructured data that serve as the grist for accurate answers and suggestions. Yet they have learned that their investments have no value unless they support consumption models, or user interfaces, that make the product of their knowledge graphs accessible to the broadest audience possible.

Then Comes Democratization
When introducing Einstein, Salesforce referred to this as “democratizing AI.” Facebook’s bots on Messenger represent its effort to make AI-powered digital assistants available to everyone with a smartphone. These efforts build on the growing confidence individuals have found in putting Apple’s Siri to work. They take on added importance as people assign new tasks to Amazon’s Alexa running on the Amazon Echo family of hardware, which introduced the world to a highly accurate voice user interface throughout the home. Echo, in turn, has attracted a formidable set of competing products from Google, carrying the name Assistant and running on a new line of hardware under the Google Home brand.

Meanwhile, Facebook’s measure of success has been the recasting of “Conversational Commerce” as “bots on messaging platforms.” Even though Facebook gives primacy to bots that take text (and emoji) input through messaging platforms, its effort has actually served to burnish the reputation of long-time providers of software and services that support a voice user interface (VUI), because it makes self-expression, in all its forms, foundational to self-service.

Text has its place on the smartphones and tablets that billions of users carry on their person, but voice is ubiquitous, and it is the sole modality for taking command of electronic devices in the home, as well as command centers in cars, kiosks, offices and public places.

As a prime example, spoken input played a prominent role at the “Made by Google” event on Oct 4, when Google CEO Sundar Pichai observed that the digital world is making a fundamental shift from “mobile-first” to “AI-first.” His subtext, of course, was that voice and natural language input take on increased importance as we, mere humans, take best advantage of all the computing power and knowledge that resides in the public cloud.

By introducing Google Assistant along with Google Home, a set of voice-activated speakers and other appliances, Alphabet takes direct aim at Amazon’s Echo, a set of devices whose installed base probably reached millions of homes in its first eight months on the market and is ramping up geometrically as it adds “skills” and expands its geographic coverage. “Made by Google” demonstrated how well-positioned the huge company is to expand beyond search and video display (YouTube is one of its big properties, and Chromecasting is a major force toward integrating TV sets into cloud-based distribution).

To prepare for the new age, Google has done its share of acquisitions and acqui-hires. Having used GOOG-411 as a mechanism for collecting a corpus of proper names, addresses and other hard-to-understand utterances, and Voice Search (invoked with “OK Google”) to bring a constant stream of spoken queries into its database, it is confident that it can support highly accurate recognition of, and responses to, everyday commands and queries. That corpus of spoken input, along with the rest of the “world’s information” that Google organizes, has become the basis of the answers, suggestions and other responses that are evolving into an “individual Google for each user.”
 
No Leviathan Left Behind
Apple and IBM have their own “Democratize AI” initiatives. The latest versions of iOS (iOS 10) and macOS (Sierra) have made the venerable Siri more capable, more understanding and supportive of more tasks. Under an initiative called IBM Watson Virtual Assistant, Big Blue has developed a freemium model that enables enterprise executives to build their own virtual assistants “out of the box.” Meanwhile, Microsoft is making sure that Cortana ships with every Windows-based device, as well as a growing number of Android and iOS devices.

But the latest news is from Samsung, the one company whose dominant position among smartphone makers is under siege on two fronts. On the outside, Google’s Pixel phone (manufactured to Google’s specs by HTC) will bring Assistant into every pocket in the United States. Internally, it is fighting a series of fires (literally) surrounding the shortcomings of its wildly popular Galaxy Note 7.

With the heat turned up in its core markets, Samsung is acquiring Viv, the Intelligent Assistance start-up run by two of Siri’s original founders. No personal communications product, home entertainment unit or kitchen appliance will be complete without the ability to hear, understand and react to what is said to it. S Voice has been featured on Samsung smartphones since 2012, but it was largely for voice search, dictation, and command and control of features and functions. Viv will add the ability to understand the intent of the speaker, along with the results of a couple of years spent building relationships with businesses large and small, which must comprise an “ecosystem” to rival Amazon Alexa’s roster of thousands of skills, a roster that is growing daily.

At the time of the acquisition, Dag Kittlaus, who played founding roles at both Siri and Viv, explained, “Our objective is ubiquity.” That is true for all the initiatives in Intelligent Assistance, and voice and natural language input will be core to promoting ubiquity.



Categories: Conversational Intelligence, Intelligent Assistants

2 replies

  1. Nice high-level survey of the stakes and the players. The field is just emerging; it is still at its beginnings and quite immature. Seen from a practitioner’s point of view: today, most exchanges with virtual agents are still confined to ONE user utterance (a question, a command, maybe an expletive; in most implementations you can’t yet tell the VA facts about yourself and have it remember them). Siri now asks “Mobile or Home?” when you say “Call XYZ,” but that’s about the extent of its multi-turn dialog support (i.e., multi-step dialog in a given context). A particular virtual assistant can do a form of multi-turn, but when you look closer you see it’s basically trying to fill a form. If you don’t answer to the point, but instead try to clarify what was asked of you (very humanlike :), your mileage will definitely vary. I haven’t yet seen virtual assistants where you can do complex tasks (except ours, at http://www.northsideinc.com 🙂 or that can navigate in 3D and (virtually) manipulate 3D objects, like the robots in our Bot Colony game. Since almost all implementations are Machine Learning based, there’s a latent degree of imprecision built in, which makes it difficult to address applications requiring precise understanding, such as financial services. The ability to converse about everyday life and have a virtual assistant understand WHY you said something, or ask a relevant clarification question, is non-existent. We hope to have a demo of conversation about everyday life next month as part of a free game we will announce. It’s exciting to be in Natural Language Understanding now. It’s a mountain we’ve been climbing for the last 15 years (that’s why our North Side logo is a stylized Everest North Side :). Some days the peak looks closer than others, but there’s still a long hike ahead.
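
     To make the “filling a form” point concrete, a minimal sketch of that slot-filling pattern in Python might look like the following. The intent, slot names, prompts and the toy parse function are made-up illustrations, not any vendor’s actual API; the loop simply re-prompts until every required slot has a value.

         REQUIRED_SLOTS = {  # hypothetical "book_flight" intent
             "origin": "Where are you flying from?",
             "destination": "Where are you flying to?",
             "date": "What day do you want to travel?",
         }

         def parse(utterance):
             # Toy "NLU": accept any non-empty statement as the slot value.
             # A clarifying question from the user is simply rejected,
             # which is the rigidity described above.
             text = utterance.strip()
             return text if text and not text.endswith("?") else None

         def slot_filling_dialog():
             slots = {}
             while len(slots) < len(REQUIRED_SLOTS):
                 # Prompt for the next unfilled slot.
                 name = next(s for s in REQUIRED_SLOTS if s not in slots)
                 value = parse(input(REQUIRED_SLOTS[name] + " "))
                 if value is None:
                     continue  # off-point answer: just re-ask the same slot
                 slots[name] = value
             return slots

         print(slot_filling_dialog())

     If you answer a prompt with a question of your own, the loop above just asks again; genuine multi-turn dialog would have to recognize the clarification request and respond to it before returning to the task.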

  2. Hi Eugene: Thanks for providing your perspective. I have trouble thinking of Intelligent Assistance and Conversational Commerce as “at its beginnings.” I agree that the field is “emerging” and “immature,” but believe that the accelerating cycle at which new technologies are introduced (faster processing, cheap memory, faster I/O to and from the Cloud) is shrinking the time it takes for a new implementation to be mounted and to start learning and refining its ability to recognize and respond to intents. As for conversations and 3D interactions, I’m looking forward to seeing what you are doing.
    -Dan