The Web has become successful due to standards, the simple point-and-click interface, the visual nature of the content, and the ability to search for content. Due to this success, we have begun to see the emergence of devices designed to access this content. Many of these new devices have reduced screens and limited support for the point-and-click paradigm. WML was created specifically to address devices with limited bandwidth and small screens. VoiceXML was created for devices with no screen at all. What is the role of voice on these devices, and what impact does it have on the corresponding markup languages? How can we best leverage existing content? How can we best share approaches across the various markup languages? We consider these questions and others as we explore how we might voice-enable the Internet.
Perhaps surprisingly, we can use voice to interact with existing HTML content extremely well. Anchors provide a natural means of navigation if we create grammars on-the-fly as the browser renders each page. Conversa Web is an excellent example of such a voice browser. Additionally, Conversa Web includes voice commands for page navigation, favorites, history, and other browser control functions. To interact with images, we automatically insert numbers for voice reference. We use the same technique to resolve ambiguity (when two anchors share the same anchor text but lead to different places) and to handle form elements. Voice interaction with existing HTML content is not only a pleasant experience; it also makes Web pages accessible to a wider audience and serves a variety of vertical markets. We have recently completed a similarly voice-enabled browser for WML using the Phone.com simulator.
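The on-the-fly grammar construction described above can be sketched in a few lines. The following is a minimal illustration, not the Conversa Web implementation: the class and function names are our own, and a real browser would also handle frames, images, and form elements. It collects anchor texts as a page is parsed and builds a JSGF-style rule from them, numbering duplicate anchor texts so that each spoken phrase maps to exactly one target.

```python
from collections import Counter
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs as a page is parsed."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._href = None
        self._text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self._in_anchor = False
            # Normalize whitespace in the anchor text to get a spoken phrase.
            spoken = " ".join("".join(self._text).split())
            if spoken and self._href:
                self.anchors.append((spoken, self._href))

def build_grammar(anchors):
    """Build a JSGF rule from anchor texts, appending a number to any
    ambiguous text (same words, different targets)."""
    counts = Counter(text for text, _ in anchors)
    seen = Counter()
    commands = {}
    for text, href in anchors:
        if counts[text] > 1:
            seen[text] += 1
            spoken = f"{text} {seen[text]}"
        else:
            spoken = text
        commands[spoken] = href
    rule = " | ".join(commands)
    grammar = f"#JSGF V1.0;\ngrammar page;\npublic <link> = {rule};"
    return grammar, commands
```

Given a page with one "home" link and two "more" links, this yields the commands "home", "more 1", and "more 2", each bound to its own target.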
Whenever possible, we have tried to adhere to a consistent set of design considerations.
By providing voice-enabled players for "legacy" content, we start with a base of billions or millions of pages (HTML and WML, respectively) rather than zero. By providing a minimally complete set of enhancements, we make the enhancements easier to learn and apply. By interacting with existing mechanisms, we amplify the effect of the voice additions and integrate well with existing tools. To leverage the work of others, we base our TTS and grammar syntaxes on JSML and JSGF, respectively. To support multiple frameworks, we have successfully used the same types of TTS and grammar markups within both HTML and WML.
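On the grammar side, a page's navigable phrases can be expressed as alternatives in a small JSGF rule. The fragment below is illustrative only; the grammar name and link phrases are hypothetical:

```
#JSGF V1.0;

grammar pagelinks;

public <link> = home page | latest news | contact us;
```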
The Web is generally thought to be "stateless", but it does offer mechanisms that can be adopted for dialog and state information. It's important to realize that mechanisms such as cookies, hidden form fields, and URL query parameters already allow a server to carry state from one page to the next.
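Hidden form fields are one such mechanism: the server embeds dialog state in each page it generates and receives it back on the next submission. The field names and values below are hypothetical:

```
<form action="/order" method="post">
  <!-- dialog state carried across the stateless HTTP exchange -->
  <input type="hidden" name="step" value="2">
  <input type="hidden" name="selected-item" value="A-100">
  <input type="text" name="quantity">
  <input type="submit" value="Continue">
</form>
```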
Can we do better? For example, can we effectively express dialogs in a declarative manner? For all but the simplest cases, it seems that we need multiple pages or some sort of procedural scripting language involved in the process.
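VoiceXML shows how far a declarative approach can go for the simple cases: a single form with a field, a prompt, and a grammar expresses a complete one-question dialog. The sketch below is a minimal VoiceXML 1.0-style document; the prompt text and inline JSGF grammar are illustrative:

```
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="order">
    <field name="drink">
      <prompt>Would you like coffee or tea?</prompt>
      <grammar type="application/x-jsgf">coffee | tea</grammar>
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Anything beyond a linear sequence of fields (conditional branching, computed state, calls to back-end logic) quickly pushes the author toward multiple pages or embedded scripting, which is the limitation noted above.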
As we move forward, we hope that members of the various standards bodies such as the W3C, WAP Forum, and VoiceXML Forum consider the Design Considerations outlined above. The world of the Web seems to be gravitating toward browsers on devices with a full screen, reduced screen, or no screen. While we expect entirely different interaction appropriate for the various form factors, we hope that voice-enablement mechanisms might be shared across such devices whenever possible.