Position Paper for W3C/WAP Workshop on the Multimodal Web

Abstract

Mobile access to the internet would benefit tremendously from a combined speech and graphical user interface. Speech as an input modality is natural and can be very efficient - for example, on a device with a small display and keypad, speech can bypass multiple layers of menus. On the other hand, a display is often a better way of presenting information to the user - output which is long-winded when rendered as speech can be quickly browsed on a display.

This note presents the work on multimodal dialog that is one of the activities of the W3C Voice Browser Working Group. It also introduces some work at HP Labs on multimodal systems.

What the workshop needs to address

Contributions to discussions at this workshop

Multimodal markup language - status

Multimodal browsers allow users to interact via a combination of modalities, for instance, speech recognition and synthesis, displays, keypads and pointing devices. The W3C Voice Browser Working Group has been developing requirements and specifications for a Speech Interface Framework. One part of this activity is Multimodal Dialog. The work is concerned with a markup language that allows an author to write an application that uses spoken dialog interaction together with other modalities (e.g. a visual interface). The focus is on multimodal dialog for devices with a small screen and keypad, with or without a pointing device. Within the multimodal activity, we have not specifically addressed universal access, i.e. the issue of rendering the same pages of markup to devices with different capabilities.

At our May face-to-face meeting, six companies demonstrated multimodal systems. Since then we have published a prioritized list of requirements for multimodal dialog interaction. We are now working on a language specification.

Synchronization of visual and voice markup: Maverick system

Summary

How can we make it easy to create multimodal services? At HP Labs, we have defined extensions to an experimental voice markup language that allow us to build multimodal systems. We have a demonstrator, Maverick, which synchronizes the extended voice browser with an HTML browser. Maverick allows speech input and output to be used as alternatives to other modalities, and also to be combined with them. We have submitted our language extensions to the W3C Voice Browser Working Group as a contribution to the work on multimodal dialog. The approach we used is described below.

Approach used with Maverick multimodal system

In essence, DialogML (the dialog markup language being defined by the W3C Voice Browser Working Group) is a scripting language for filling out forms using speech input and output. Context-free grammars are used to interpret what the user says in response to spoken prompts. Grammar rules have the side effect of filling out particular form fields. DialogML makes it easy to specify sub-dialogs that ask the user to fill out any form fields that have not yet been filled, and that deal with various kinds of errors. DialogML also allows users to answer questions by pressing keys on a regular phone keypad; the key presses are transmitted as DTMF tones and are interpreted in exactly the same way as spoken responses. A DTMF grammar binds the keys to values that fill out form fields.
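The form-filling idea can be illustrated with a small sketch. This is plain Python, not DialogML syntax. The field names, phrases and key bindings are invented for the example, and a simple phrase lookup stands in for a real context-free grammar. It shows only that a spoken response and a DTMF key press can fill the same form field, and that the fields still empty determine which question comes next.

```python
# Illustrative sketch only: plain Python, not DialogML syntax.
# The form fields, phrases and key bindings are invented for this example.
from typing import Dict, Optional, Tuple

Form = Dict[str, Optional[str]]

# Toy stand-in for a speech grammar: each phrase fills one form field.
SPEECH_GRAMMAR: Dict[str, Tuple[str, str]] = {
    "economy": ("cabin", "economy"),
    "business": ("cabin", "business"),
    "one way": ("trip", "one-way"),
    "round trip": ("trip", "return"),
}

# DTMF grammar: binds keypad keys to the same field values.
DTMF_GRAMMAR: Dict[str, Tuple[str, str]] = {
    "1": ("cabin", "economy"),
    "2": ("cabin", "business"),
}

GRAMMARS = {"speech": SPEECH_GRAMMAR, "dtmf": DTMF_GRAMMAR}


def interpret(modality: str, token: str, form: Form) -> bool:
    """Fill a form field if the input matches a rule; report whether it did."""
    rule = GRAMMARS[modality].get(token)
    if rule is None:
        return False  # no match: a real browser would start an error sub-dialog
    field, value = rule
    form[field] = value
    return True


def next_unfilled(form: Form) -> Optional[str]:
    """Return the first field that still needs a value, or None when done."""
    for name, value in form.items():
        if value is None:
            return name
    return None
```

In this sketch, pressing "2" and saying "business" leave the form in the same state, which is the property the paragraph above describes.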

The extension of DialogML to support visual prompts and mouse clicks follows the same pattern as spoken responses and DTMF input. Just as you specify a speech grammar for spoken input, you can specify a visual Web page for users to click on or type into. The end result is the same: the user's response is translated into attribute-value pairs that fill out form fields in the DialogML page. Likewise, spoken inputs that fill fields in the DialogML form are used to update the visual Web page.
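The two-way synchronization described here can be sketched as follows. This is a hypothetical Python model, not Maverick code: the class names `DialogForm` and `VisualPage`, the element id and the field names are all assumptions made for illustration. A click on the page is translated into attribute-value pairs that fill the form, and any change to the form, whatever its source, is pushed back to the page.

```python
# Illustrative sketch only: DialogForm and VisualPage are hypothetical names,
# not part of DialogML or of the Maverick demonstrator.
from typing import Callable, Dict, List, Optional

Pairs = Dict[str, str]


class DialogForm:
    """Holds the form fields and notifies listeners when fields are filled."""

    def __init__(self, fields: List[str]) -> None:
        self.values: Dict[str, Optional[str]] = {name: None for name in fields}
        self.listeners: List[Callable[[Pairs, str], None]] = []

    def fill(self, pairs: Pairs, source: str) -> None:
        # Accept only attribute-value pairs that name a known field.
        accepted = {k: v for k, v in pairs.items() if k in self.values}
        self.values.update(accepted)
        for listener in self.listeners:
            listener(accepted, source)


class VisualPage:
    """Stands in for the visual Web page shown next to the voice dialog."""

    def __init__(self, form: DialogForm, click_map: Dict[str, Pairs]) -> None:
        self.form = form
        self.click_map = click_map  # element id -> attribute-value pairs
        self.displayed: Pairs = {}
        form.listeners.append(self.on_form_update)

    def click(self, element_id: str) -> None:
        # A click is translated into pairs, exactly like a spoken response.
        self.form.fill(self.click_map.get(element_id, {}), source="visual")

    def on_form_update(self, pairs: Pairs, source: str) -> None:
        # Fields filled by any modality are reflected on the page.
        self.displayed.update(pairs)
```

For example, after `page.click("map-paris")` and a spoken answer delivered as `form.fill({"date": "tomorrow"}, source="speech")`, both the form and the page hold the city and the date.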

The DialogML browser needs a URL for the visual Web page and a Web browser to show it on. The Web browser could run on the client device itself, such as a third-generation cell phone, or on a different machine, for instance when the user has a regular telephone and a desktop browser connected to a LAN. A simple way to demonstrate multimodality is to use HTML and scripting, so that the user's interaction with the HTML page is intercepted by the script and forwarded to the DialogML interpreter as commands to fill out fields in the DialogML form. One way to do this is via a Java applet called by the HTML script code.

This approach makes it easy to combine different modalities. It doesn't matter whether you speak, click on an image map or type into a visual form field: all of these have the same effect, that of filling out fields in the DialogML form. There are some practical issues for the DialogML browser, such as how long to wait for input (regardless of modality). It is no longer always sufficient to wait only for the first input before moving on to the next step in the dialog, because a further input in another modality may belong with it. A multimodal integration module combines the information (semantics) provided by the individual modalities. The integration follows temporal constraints, i.e. events from different modalities have to fall within a certain time window in order to be combined into one semantic entity. Other extensions are needed to give a higher degree of control over the synchronization of speech output with other outputs, such as visual or audio output; SMIL elements can be used to provide this.
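The temporal constraint can be illustrated with a minimal sketch, under stated assumptions. The `InputEvent` type, the two-second default window and the rule that a window is measured from the first event of a group are choices made for this example and are not taken from Maverick. Events that fall within the window are merged into one set of attribute-value pairs; a later event starts a new semantic entity.

```python
# Illustrative sketch only: the event type, the window length and the
# grouping rule are assumptions made for this example.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InputEvent:
    modality: str          # e.g. "speech", "pointer", "dtmf"
    time: float            # seconds since the start of the dialog turn
    pairs: Dict[str, str]  # attribute-value pairs produced by this input


def integrate(events: List[InputEvent], window: float = 2.0) -> List[Dict[str, str]]:
    """Merge events that fall within `window` seconds of a group's first event."""
    groups: List[Dict[str, str]] = []
    group_start: Optional[float] = None
    for event in sorted(events, key=lambda e: e.time):
        if group_start is not None and event.time - group_start <= window:
            groups[-1].update(event.pairs)   # same semantic entity
        else:
            groups.append(dict(event.pairs))  # start a new semantic entity
            group_start = event.time
    return groups
```

With a two-second window, a spoken "zoom" at 0.0 s and a pointer click at 0.8 s are combined into one entity, while a spoken "book" at 5.0 s is treated separately. Measuring the window from the previous event rather than the first one would be an equally plausible design.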

Position Paper for W3C/WAP Workshop on the Multimodal Web