The Web has become successful due to standards, the simple point-and-click interface, the visual nature of the content, and the ability to search for content. Due to this success, we have begun to see the emergence of devices designed to access this content. Many of these new devices have reduced screens and limited support for the point-and-click paradigm. WML was created specifically to address devices with limited bandwidth and small screens. VoiceXML was created for devices with no screen at all. What is the role of voice on these devices, and what impact does it have on the corresponding markup languages? How can we best leverage existing content? How can we best share approaches across the various markup languages? We consider these questions and others as we explore how we might voice-enable the Internet.
Perhaps surprisingly, we can use voice to interact with existing HTML content extremely well. Anchors provide a natural means of navigation if we create grammars on-the-fly as the browser renders each page. Conversa Web is an excellent example of such a voice browser. Additionally, Conversa Web includes voice commands for page navigation, favorites, history, and other browser control functions. To interact with images, we automatically insert numbers for voice reference. We use the same technique to resolve ambiguity (when two anchors share the same anchor text but lead to different places) and to handle form elements. Voice interaction with existing HTML content is not only a pleasant experience; it also makes Web pages accessible to a wider audience and serves a variety of vertical markets. We have recently completed a similarly voice-enabled browser for WML using the Phone.com simulator.
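The on-the-fly grammar construction described above can be sketched in a few lines. The following is a minimal illustration, not the Conversa Web implementation: the class and function names are our own, and a real browser would also handle frames, images, and form elements. It collects anchor texts as a page is parsed and builds a JSGF-style rule from them, numbering duplicate anchor texts so that each spoken phrase maps to exactly one target.

```python
from collections import Counter
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs as a page is parsed."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._href = None
        self._text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor = False
            # Normalize whitespace in the anchor text to get a spoken phrase.
            spoken = " ".join("".join(self._text).split())
            if spoken and self._href:
                self.anchors.append((spoken, self._href))

def build_grammar(anchors):
    """Build a JSGF rule from anchor texts, appending a number to any
    ambiguous text (same words, different targets)."""
    counts = Counter(text for text, _ in anchors)
    seen = Counter()
    commands = {}
    for text, href in anchors:
        if counts[text] > 1:
            seen[text] += 1
            spoken = f"{text} {seen[text]}"
        else:
            spoken = text
        commands[spoken] = href
    rule = " | ".join(commands)
    grammar = f"#JSGF V1.0;\ngrammar page;\npublic <link> = {rule};"
    return grammar, commands
```

Given a page with one "home" link and two "more" links, this yields the commands "home", "more 1", and "more 2", each bound to its own target.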
Whenever possible, we have tried to adhere to a consistent set of design considerations.
By providing voice-enabled players for "legacy" content, we start with a base of billions or millions of pages (HTML and WML, respectively) rather than zero. By providing a minimally complete set of enhancements, we make the enhancements easier to learn and apply. By interacting with existing mechanisms, we amplify the effect of the voice additions and integrate well with existing tools. To leverage the work of others, we base our TTS and grammar syntaxes on JSML and JSGF, respectively. To support multiple frameworks, we have successfully used the same types of TTS and grammar markups within both HTML and WML.
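On the grammar side, a page's navigable phrases can be expressed as alternatives in a small JSGF rule. The fragment below is illustrative only; the grammar name and link phrases are hypothetical:

```
#JSGF V1.0;

grammar pagelinks;

public <link> = home page | latest news | contact us;
```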
The Web is generally thought to be "stateless", but it does offer mechanisms that can be adopted for dialog and state information. It's important to realize that mechanisms such as cookies, hidden form fields, and URL query parameters already allow a server to carry state from one page to the next.
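Hidden form fields are one such mechanism: the server embeds dialog state in each page it generates and receives it back on the next submission. The field names and values below are hypothetical:

```
<form action="/order" method="post">
  <!-- dialog state carried across the stateless HTTP exchange -->
  <input type="hidden" name="step" value="2">
  <input type="hidden" name="selected-item" value="A-100">
  <input type="text" name="quantity">
  <input type="submit" value="Continue">
</form>
```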
Can we do better? For example, can we effectively express dialogs in a declarative manner? For all but the simplest cases, it seems that we need multiple pages or some sort of procedural scripting language involved in the process.
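VoiceXML shows how far a declarative approach can go for the simple cases: a single form with a field, a prompt, and a grammar expresses a complete one-question dialog. The sketch below is a minimal VoiceXML 1.0-style document; the prompt text and inline JSGF grammar are illustrative:

```
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <field name="drink">
      <prompt>Would you like coffee or tea?</prompt>
      <grammar type="application/x-jsgf">coffee | tea</grammar>
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Anything beyond a linear sequence of fields (conditional branching, computed state, calls to back-end logic) quickly pushes the author toward multiple pages or embedded scripting, which is the limitation noted above.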
As we move forward, we hope that members of the various standards bodies such as the W3C, WAP Forum, and VoiceXML Forum consider the Design Considerations outlined above. The world of the Web seems to be gravitating toward browsers on devices with a full screen, reduced screen, or no screen. While we expect entirely different interaction appropriate for the various form factors, we hope that voice-enablement mechanisms might be shared across such devices whenever possible.