Nuance Position Paper

1. Introduction

This document provides Nuance's perspective on the importance of integrating speech technologies with the emerging Multimodal Web, and on some of the issues that need to be addressed. We view this workshop as an exploratory investigation: an opportunity to understand different perspectives and share ideas, so as to reach a collective understanding of what is required to create the infrastructure and applications for a Multimodal Web. The W3C and WAP Forum standards are being embraced by the industry. However, this adoption has been accompanied by some confusion, especially about how to support both sets of standards when building applications for newer mobile devices capable of voice and visual input/output. Note also that statements, suggestions, or omissions of any topic in this document do not imply Nuance's strategy or lack thereof.

2. Nuance's Position

Nuance has extensive experience in speech technologies, especially in the development of speech recognition and voice authentication software. Nuance has focused on creating scalable infrastructure components and tools, and has substantial real-world application development and deployment expertise supporting large numbers of users in many languages. Nuance's products embrace open industry standards wherever possible, and we continue to work with standards bodies to define additional standards and/or extensions that enable the development of richer multimodal interactions over speech-enabled devices.

Nuance's voice browser product, Voyager, can be used from any telephone to access content and people across the Voice Web. Recently, the number of phones capable of visual output (e.g. WAP-enabled phones) has been increasing dramatically. Furthermore, other mobile devices (e.g. PDAs) are evolving to allow the use of audio input/output. This convergence of technologies highlights the need for infrastructure (standards, servers, tools, utilities, etc.) that enables the creation of even richer multimodal applications. The Nuance Voyager product already facilitates this convergence in a number of ways:

A user-based personalization component allows users to define their preferences and profiles, including information such as their preferred device.
Voice authentication and identification facilities can be leveraged to authenticate users, protect users' privacy, and enable secure voice-based transactions (V-Commerce).
Voyager is flexible enough to accept different input mechanisms and to generate multiple output formats.

Voyager is a key component of the Voice Web, an interconnected network that allows secure access to any information through the use of speech via any voice browser. The Multimodal Web could leverage some of the functionality of the Voice Web, for example voice links (similar to hyperlinks) and cookies.

3. What needs to be addressed in this workshop.

VoiceXML is emerging as the de facto standard for dialog markup. WML and WMLScript have already been embraced by a large community of developers as the de facto standard for building applications to be deployed on wireless networks.

Addressing a number of issues would significantly help to clarify the confusion that presently surrounds the WAP and W3C standards. For example:

1. What is the nature of the standards collaboration between the W3C and WAP Forum standards bodies?

2. How does the W3C view the WAP standards now and in the future, when bandwidth over wireless networks increases to levels amenable to transporting data in HTML format? Are there changes or extensions to WAP and HTML that will bridge any interoperability issues?

3. Who drives the multimodal application? For example, is it the WAP microbrowser on a mobile phone or a voice browser?

4. How are multimodal dialogs synchronized? For example, one could envisage VoiceXML containing directives to output data in WML, or a WML application invoking a VoiceXML dialog to gather one or more pieces of information.

5. For client devices (mobile phones) to support simultaneous speech input (VoiceXML) and visual output (WAP), what additional capabilities are required of the handset? Are these capabilities exposed through standards-based APIs? If so, who defines these standards?

4. Possible outcomes

We expect that this workshop will likely result in the formation of a working group to focus on one or more of the following:

1. Define and promote new standards and/or extensions to facilitate the creation of rich multimodal dialogs for multiple client devices, including mobile devices, PCs, TVs, network appliances, etc.

2. Establish a set of requirements that addresses the issues of creating multimodal dialogs using existing standards:

speech (VoiceXML)
wireless data standards for operating in a Multimodal Web using mobile devices (WAP)

3. Identify and establish working relationships with other dependent subgroups or external standards bodies.

4. Together with other standards bodies, establish the overall architecture and set the future direction for the emerging Multimodal Web.

Position Paper on W3C/WAP Workshop on the Multimodal Web
