Promising a report in about a year, the W3C has launched a new “Speech Incubator Group”, chaired by Voxeo’s Dan Burnett (who is also a co-chair of the VoiceXML 3.0 standards committee). “Initiating members” come from Microsoft, Google, AT&T, Mozilla and OpenStream. The group will converse primarily through email and is chartered to define a set of requirements that will make it easier to integrate automated speech processing (both recognition and synthesis) into HTML5-based Web pages. In essence, its goal is to make voice-based interaction with Web pages more seamless.
For those of us who thought VoiceXML, whose latest working draft (version 3.0) was officially released by the W3C on August 31, was designed to support voice-based interaction through the Web, the Speech Incubator Group introduces a new wrinkle. Apparently there is enough friction between HTML5-based browsers and the slew of speech-processing standards to get in the way of an easy-to-use “voice user interface”. That’s why one of the group’s chartered deliverables is the “Creation of change requests to SRGS, SSML, SISR, VoiceXML 3, or other languages and specifications where appropriate for the purposes of consistency.”
For those who need a glossary of abbreviations: SRGS is the Speech Recognition Grammar Specification; SSML is the Speech Synthesis Markup Language; and SISR refers to “Semantic Interpretation for Speech Recognition”. The fact that so many standards exist illustrates how RC (Recombinant Communications) can be complicated at times. In this article, Joab Jackson quotes Dan Burnett explaining that VoiceXML, by itself, would not be suitable for multi-modal Web-based interactions because it is too voice-driven. By contrast, “the voice capabilities of HTML would be stateless, or not require a dedicated session with the user.”
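To make the alphabet soup a bit more concrete, here is a minimal sketch (my own illustration, not taken from the article) of the kind of markup these standards define: an SSML prompt that a synthesizer would speak aloud, and an SRGS grammar, with SISR tags, that constrains what a recognizer will accept for a simple yes/no confirmation. The element names and namespaces follow the W3C SSML 1.0 and SRGS 1.0 recommendations; the confirmation dialog itself is a hypothetical example.

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical SSML prompt: tells the speech synthesizer what to say and how. -->
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      Welcome back.
      <break time="300ms"/>
      <emphasis level="moderate">Would you like to hear your messages?</emphasis>
    </speak>

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Hypothetical SRGS grammar: constrains what the recognizer will accept;
         the SISR <tag> elements attach a semantic result to each utterance. -->
    <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
             xml:lang="en-US" root="confirm" tag-format="semantics/1.0">
      <rule id="confirm">
        <one-of>
          <item>yes <tag>out = true;</tag></item>
          <item>no <tag>out = false;</tag></item>
        </one-of>
      </rule>
    </grammar>

Each spec covers one narrow slice of a voice interaction, which is precisely why keeping them consistent with one another (and now with HTML5) takes a dedicated group.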
I’m having déjà vu here as I think back to the implementations of XHTML+Voice (so-called X+V) from a few years back. It seems we continue to use the standards-making process to try to smooth out the rough spots between the existing base of (standards-based) applications and scripts and the latest shiny objects out there. If it makes it easier for developers to continue to build cool, multimodal mashups, I’m all for it.