Multimodal Application Developer Feedback

W3C Working Group Note 14 April 2006

This version:: http://www.w3.org/TR/2006/NOTE-mmi-dev-feedback-20060414/
Latest version:: http://www.w3.org/TR/mmi-dev-feedback/
Previous version:: This is the first publication.
Editors:: Andrew Wahbe, VoiceGenie Technologies; Gerald McCobb, IBM; Klaus Reifenrath, Nuance; Raj Tumuluri, Openstream; Sunil Kumar, V-Enable

Abstract

Several years of multimodal application development in various business areas and on various device platforms have given developers enough experience to offer detailed feedback about what they like, what they dislike, and what they want to see continued and improved. That feedback is provided here as input to the specifications under development in the W3C Multimodal Interaction and Voice Browser Activities.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is a W3C Working Group Note. It represents the views of the W3C Multimodal Interaction Working Group at the time of publication. The document may be updated as new technologies emerge or mature. Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Comments on this document can be sent to www-multimodal@w3.org, the public forum for discussion of the W3C's work on Multimodal Interaction. To subscribe, send an email to www-multimodal-request@w3.org with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. This document is informative only. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1 Introduction

IBM, VoiceGenie Technologies, Nuance, V-Enable, and Openstream customers have been developing multimodal applications in a broad range of business areas, including Field-Force Productivity, Health Care and Life Sciences, Warehouse and Distribution, Industrial Plant Floor, Financial and Information Services, Directory Assistance, and the Mobile Web. Customer device platforms have included PCs (desktops, laptops, and tablets), PDAs, kiosks, appliances, equipment consoles, and web browser-based smart phones. The multimodal applications primarily extended the traditional GUI mode of interaction with speech, with the speech services located either locally on the device or remotely on a server. Several XML markup languages were used to develop these applications, including XHTML+Voice (X+V) and xHMI.

During the process of developing these applications, developers found features they liked about the development environment they were using and found features they thought were lacking. Their experiences were collected and are summarized here as feedback for the W3C Multimodal Interaction and Voice Browser Working Groups to consider when specifying future multimodal and voice authoring capabilities. We also solicit comments from the wider multimodal development community on the extent to which these observations are consistent with their own development experiences.

The developers surveyed were expert in various programming languages and application environments. Developers expert in C/C++ and Java generally speech-enabled native applications on small devices. Device platforms included Windows Mobile, BREW, embedded Linux, Symbian, and J2ME. Developers expert in the Web generally speech-enabled browser-based applications. Web browser platforms included Opera, Access' NetFront, Windows Mobile Internet Explorer, and the Nokia Series 60. Web developers understood the web programming model very well but generally were new to speech. They liked XHTML, XML namespaces, XML Events, CSS, JavaScript, and VoiceXML with its ability to hide platform details. Developers expert in VoiceXML and dictation had backgrounds in speech and telephony and generally worked on adding GUI to voice and dictation applications.

2 What developers liked

2.1 Reusable and pluggable modality components

Developers preferred to develop modality components that are reusable and pluggable.

Use Case: VoiceXML modality component

A VoiceXML modality component is reused without modification in different multimodal applications.

2.2 Modular modality components

Modular modality components are preferred because they can be authored separately by the modality experts.

Use Case: XHTML and VoiceXML modality components

A VoiceXML expert authors the voice modality component and an XHTML expert authors the GUI component. Modality component coordination is handled independently, for example, by X+V <sync> and <cancel> elements.

2.3 Declarative synchronization between modalities

Use Case: X+V <sync> element

The X+V <sync> element provides declarative synchronization of XHTML form control elements and the VoiceXML <field> element. The <sync> element allows input in one modality (speech or visual) to set the field in the other modality. Also, setting the focus of an <input> element that is synchronized with a VoiceXML field updates the Form Interpretation Algorithm (FIA) to visit that VoiceXML field.
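
The behavior described above can be modeled in a few lines of Python. This is only an illustrative sketch, not X+V markup: the class name SyncedForm, its methods, and the field names are hypothetical and stand in for what the <sync> element does declaratively.

```python
class SyncedForm:
    """Toy model of visual controls kept in sync with voice fields (hypothetical)."""

    def __init__(self, field_names):
        # Each logical field has one value per modality.
        self.visual = {name: None for name in field_names}
        self.voice = {name: None for name in field_names}
        self.voice_focus = None  # field the voice dialog will visit next

    def input(self, modality, name, value):
        """Accept input in one modality and mirror it to the other."""
        if modality not in ("visual", "voice"):
            raise ValueError(f"unknown modality: {modality}")
        if name not in self.visual:
            raise KeyError(name)
        self.visual[name] = value
        self.voice[name] = value

    def focus(self, name):
        """Visual focus change; the voice dialog follows to the same field."""
        if name not in self.visual:
            raise KeyError(name)
        self.voice_focus = name


form = SyncedForm(["city", "date"])
form.input("voice", "city", "Boston")  # spoken input also fills the visual control
form.focus("date")                     # focusing "date" redirects the voice dialog
```

The point of the declarative form is that an author gets this mirroring without writing any of the code above.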

2.4 Scripting and semantic interpretation

Developers liked support for modality component integration via scripting and semantic interpretation.

Use Case: Timed notifications of an operating room medical procedure

A timed notification changes dynamically as time progresses. The notification depends on the current state of the application as well as the notification state. For a GUI+speech multimodal application a notification may be a TTS output and a new GUI page, corresponding to the next step of an operating room medical procedure.

Use Case: Integrated pen and speech interaction with a map

The user says "zoom in here" while drawing an area on a map. The application responds by enlarging the detail of the area within the boundary drawn by the user.
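
A minimal sketch of this kind of integration, assuming the pen gesture arrives as a list of (x, y) points and the recognizer delivers the utterance as text. The function names and the returned action format are assumptions made for illustration.

```python
def bounding_box(stroke):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) pen points."""
    if not stroke:
        raise ValueError("empty pen stroke")
    xs = [x for x, _ in stroke]
    ys = [y for _, y in stroke]
    return (min(xs), min(ys), max(xs), max(ys))


def integrate(utterance, stroke):
    """Combine a spoken command with a pen gesture into one map action."""
    command = " ".join(utterance.lower().split())
    if command == "zoom in here":
        # "here" is resolved by the area the user drew.
        return {"action": "zoom", "region": bounding_box(stroke)}
    return {"action": "unknown"}


action = integrate("Zoom in here", [(2, 3), (5, 1), (4, 6)])
```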

2.5 Styling

Developers liked CSS for styling each modality. For example, the CSS3 module for styling speech based on SSML was useful for styling the voice modality.

Use Case: TTS rendering of a news article on the web

The computer reads the news article aloud in realistic voices, using different-sounding voices for headlines, section headings, and text. There are also pauses between paragraphs and before article headlines.

3 What developers would like to see

3.1 Global grammars

Developers would like support for top-level ("global") grammars that are active across multiple windows (e.g., HTML frames or portlets) of the application.

Use Case: Top-level menus

An application has top level menus "buy", "sell", and "trade". At any time while involved in the "buy" dialog, a user can say "trade" and be switched to the "trade" multimodal dialog.

3.2 Speech grammars for HTML links and controls

Developers would like support for explicitly adding speech grammars to activate HTML links and controls. An automatically created speech grammar may not capture everything the user may say.

Use Case: Hotel booking application: get list of hotels

Before booking a hotel reservation the user looks up a list of available hotels. On the page along with the reservation is a link labeled "Available Hotels." The developer anticipates that besides "available hotels", the user may say "show me the available hotels" or ask "what hotels are available", and adds these two phrases to the grammar for activating the link.

Use Case: Hotel booking application: submit reservation

The reservation form's submit button says "submit reservation", but the developer anticipates that a user might say "submit booking" instead, and adds "submit booking" to the grammar for activating the button.
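
The two hotel booking use cases can be sketched as a phrase set attached to each link or control. This is a simplified illustration in Python, with a plain set of phrases standing in for a real speech grammar; the helper names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so phrases compare reliably."""
    return " ".join(text.lower().split())


def link_grammar(label, extra_phrases=()):
    """Phrases that activate a control: its visible label plus author additions."""
    return {normalize(p) for p in (label, *extra_phrases)}


def activates(grammar, utterance):
    return normalize(utterance) in grammar


hotels_link = link_grammar(
    "Available Hotels",
    ["show me the available hotels", "what hotels are available"],
)
submit_button = link_grammar("submit reservation", ["submit booking"])
```

An automatically generated grammar would contain only the label; the second argument is where the author adds what users are actually expected to say.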

3.3 Speech prompts for voice-enabled HTML links and controls

Developers would like support for explicitly adding speech prompts to voice-enabled HTML hyperlinks and controls. The prompts can provide more information than the visual labels attached to the HTML hyperlinks and input fields.

Use Case: Hotel booking application: enter Hotel name

The user is prompted to enter a hotel name with the following TTS: "please enter a hotel name. You can get a list of available hotels by saying 'show me available hotels.'"

3.4 Speech-enabled widgets

Developers would like to see speech-enabled UI widgets that contain a simple dialog flow (e.g., widgets that contain confirmation or disambiguation steps). This allows an author to configure the dialog properties (prompts, grammars, confirmation mode, confidence thresholds, etc.) of an HTML control or hyperlink.

Use Case: Hotel booking application: confirm hotel

The user says the name of one of the available hotels. The application repeats the name of the hotel back to the user and asks if it is correct. If the user says 'yes' then the application fills in the HTML field with the user's input.
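
The confirmation step in this use case amounts to a small two-state dialog. A minimal sketch follows, assuming recognized utterances arrive as text; the class name, prompts, and hotel names are invented for illustration.

```python
class ConfirmedField:
    """A field that is only filled after the user confirms the recognized value."""

    def __init__(self, choices):
        self.choices = {c.lower(): c for c in choices}
        self.pending = None  # value waiting for confirmation
        self.value = None    # confirmed value shown in the HTML field

    def hear(self, utterance):
        """Process one utterance and return the next prompt."""
        said = utterance.strip().lower()
        if self.pending is None:
            if said in self.choices:
                self.pending = self.choices[said]
                return f"Did you say {self.pending}?"
            return "Sorry, please say a hotel name."
        if said == "yes":
            self.value, self.pending = self.pending, None
            return f"{self.value} selected."
        # Anything other than "yes" discards the pending value.
        self.pending = None
        return "Please say the hotel name again."


field = ConfirmedField(["Grand Plaza", "Sea View"])  # hypothetical hotel names
first_prompt = field.hear("grand plaza")
second_prompt = field.hear("yes")
```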

3.5 Use speech to activate links and change focus

It should be easy to use speech to do more than fill in HTML form controls. For example, there should be declarative support for activating an HTML link or changing focus within an HTML page.

Use Case: Speech enabled bookmark page

A page that displays the user's bookmarks is speech-enabled such that each bookmark has an associated grammar for moving the browser to the bookmarked page.

3.6 Back functionality

Developers would like to see support for consistent and intuitive "back" handling across modalities. The browser's multimodal "Back" functionality should be built in and not require custom code.

Use Case: Browser "back" button

The user can either press the browser back button or say "browser go back" to return to the previous multimodal page. All spoken commands which control the browser are preceded by "browser" so there is no collision with an application grammar.
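
The prefix convention can be sketched as a simple router that checks for the reserved word before consulting the application grammar. The command table and return values below are assumptions for illustration only.

```python
BROWSER_PREFIX = "browser"
BROWSER_COMMANDS = {"go back": "history_back"}  # hypothetical command table


def route(utterance, app_grammar):
    """Send prefixed utterances to the browser, everything else to the application."""
    words = utterance.lower().split()
    if words and words[0] == BROWSER_PREFIX:
        action = BROWSER_COMMANDS.get(" ".join(words[1:]))
        return ("browser", action) if action else ("nomatch", None)
    phrase = " ".join(words)
    if phrase in app_grammar:
        return ("application", phrase)
    return ("nomatch", None)
```

Because the first word decides the target, an application is free to define its own "go back" phrase without colliding with the browser command.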

4 What developers would like to see continue and improve

4.1 Support for both off-line and on-line multimodal interaction

Multimodal interaction should be supported both for applications that are on-line, that is, connected to the network, and for off-line applications. If the multimodal application goes from an on-line to an off-line state, multimodal interaction should still be supported by the modality components that run locally on the device.

Use Case: Access of medical information while walking down a hallway

A doctor carrying a wireless tablet accesses patient medical information while walking down a hallway. Loss of wireless connectivity does not prevent the multimodal application from interacting with the doctor or presenting information it has stored on the doctor's tablet.

Use Case: Multimodal application in hospital operating room

An off-line multimodal application in an operating room delivers timely instructions to the doctor.

4.2 Support for events distributed over the network

Because a modality may be distributed on a remote server, there must be support for distributed events between a modality and the interaction manager.

Use Case: Driving directions

A user accesses a multimodal driving directions application using a cell phone. The application tells the user to turn right at the next intersection. An arrow pointing right pops up over a map. The arrow appears because the application received an event from the server telling it to display the arrow.

4.3 Support for implicit events

Implicit event support includes both implicit event generation and implicit event handling. At different stages in the operation of the modality component, there will be either event generation or event handling by the component itself. For example, the VoiceXML modality component could implicitly generate a focus event when the FIA selects a new form input item.

Use Case: Hotel booking application: name, address, phone number

A hotel booking application has a form with separate HTML input fields for entering name, street address, city, state and phone number. When the user selects one of the fields the user hears a prompt for entering the correct information into the field. The visual input focus is coordinated with the speech input focus.

4.4 VoiceXML tag and feature support

VoiceXML support should include, for example, the <object> and <mark> tags and the "record while recognition is in progress" feature.

Use Case: Windows program for calculating stock purchase totals

The <object> element can be used to load a reusable platform-specific plug-in. For example, the application would load a Windows program which calculates stock purchase totals using the <object> element.

Use Case: Read part of an e-mail message

The <mark> tag can be used to mark how much of the text was actually read before the user left the page. When the user returns to the page the rest of the text can be read beginning where the user left off.
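
The resume behavior can be illustrated with a small Python model in which each piece of text is followed by a named mark. This is a sketch of the idea only, not VoiceXML: the function, the tuple layout, and the mark names are hypothetical.

```python
def read(segments, resume_after=None, limit=None):
    """Speak segments and report the last mark reached.

    segments: list of (mark_name, text); the mark is reached once its text is spoken.
    resume_after: mark reached in an earlier session, or None to start at the top.
    limit: number of segments spoken before the user leaves (None = all).
    """
    names = [name for name, _ in segments]
    start = names.index(resume_after) + 1 if resume_after is not None else 0
    end = len(segments) if limit is None else min(len(segments), start + limit)
    spoken = [text for _, text in segments[start:end]]
    last_mark = segments[end - 1][0] if end > start else resume_after
    return spoken, last_mark


message = [
    ("p1", "Hi Ann,"),
    ("p2", "The meeting has moved."),
    ("p3", "See you Friday."),
]
spoken, last_mark = read(message, limit=2)             # user leaves after two parts
rest, final_mark = read(message, resume_after=last_mark)  # user returns later
```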

Use Case: Unrecognized user input

The recording of an unrecognized user input can be logged by the speech recognizer.

4.5 Support for both directed and user-initiated dialogs

There must be arbitrary as well as procedural speech access to the visual application. For a dialog mechanism used in conjunction with a visual form there should be support for user-initiated dialogs. For example, the user should be able to jump to arbitrary points in the dialog by changing the visual focus (e.g., by clicking on a text box).

Use Case: Form filling for air travel reservation

The air travel reservation application takes the user step by step through making a reservation, beginning with the origin and destination of the flight. After the user has been given a selection of flights, the user clicks on the visual departure date field to change the departure date.

Use Case: Application with two HTML forms

The user is taken step-by-step through filling out a set of HTML fields in a form. Before all the fields have been filled, the user clicks on a field belonging to the other form.

4.6 Mixed-initiative interaction

Dialog mechanisms that combine speech and text input must support mixed-initiative interaction.

Use Case: Flight reservation application

A flight reservation application has separate HTML input fields for entering destination airport, date of travel and seating class. With a single utterance "I'd like to go to San Francisco on April 20th, business class" the user fills in all the fields at one time.
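
As a rough illustration of how one utterance can fill several fields, the sketch below extracts three slots with simple string and regular-expression matching. A real system would use a speech grammar with semantic interpretation; the city and class lists and the slot names here are assumptions.

```python
import re

CITIES = ["San Francisco", "New York", "Boston"]  # hypothetical destinations
CLASSES = ["economy", "business", "first"]
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")


def fill_slots(utterance):
    """Return whichever of destination, date, and seating_class the utterance mentions."""
    slots = {}
    lowered = utterance.lower()
    for city in CITIES:
        if city.lower() in lowered:
            slots["destination"] = city
            break
    date = re.search(rf"\b({MONTHS})\s+(\d{{1,2}})(?:st|nd|rd|th)?\b",
                     utterance, re.IGNORECASE)
    if date:
        slots["date"] = f"{date.group(1).capitalize()} {int(date.group(2))}"
    for seating in CLASSES:
        if re.search(rf"\b{seating}\s+class\b", lowered):
            slots["seating_class"] = seating
            break
    return slots


slots = fill_slots("I'd like to go to San Francisco on April 20th, business class")
```

Any slot the utterance leaves out is simply absent from the result, so the dialog can fall back to prompting for it field by field.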

4.7 Access to speech confidence scores and n-best list by the application

Confidence scores and n-best lists are useful, for example, to let the user pick from a set of results supplied by an input recognizer.

Use Case: Select a football player

A user says the name of a favorite football player. A number of players match the user's input with the same low confidence score. Instead of asking the user to repeat the name, the application displays a visual list of the player names that were matched. The user selects a name from the list.
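
One way an application might act on confidence scores is sketched below: accept a clear winner, otherwise show the near-tied candidates. The thresholds, the result format (name, confidence), and the player names are assumptions, not values taken from any recognizer.

```python
def next_step(nbest, accept_at=0.8, tie_margin=0.05):
    """Decide what to do with an n-best list of (name, confidence) pairs."""
    if not nbest:
        return ("reprompt", [])
    ranked = sorted(nbest, key=lambda r: r[1], reverse=True)
    top_name, top_score = ranked[0]
    if top_score >= accept_at:
        return ("accept", [top_name])
    # Low confidence: offer every candidate scoring close to the best one.
    candidates = [name for name, score in ranked if top_score - score <= tie_margin]
    return ("show_list", candidates)


decision = next_step([
    ("John Smith", 0.41),   # hypothetical player names and scores
    ("Jon Smyth", 0.40),
    ("Joan Smith", 0.40),
    ("Bob Jones", 0.10),
])
```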

4.8 Access to device details

The developer would like access to device information such as the cell phone number, phone model, and display screen size. Typically in any mobile application the content is very specific to the device and at times personalized for the user. Access to device-specific details such as the device model (e.g., Nokia 6680) helps the application reduce the grammar size and render device-specific content. Access to user information such as the phone number allows the application to personalize the content for the user.

Use Case: Mobile appointment application

When user 'George' accesses the appointment application the application says "Welcome 'George'" and presents a list of appointments for the day. The user can select any of his appointments by saying an appointment label shown on his phone. Each label is short enough to fit entirely on George's display.

4.9 Choice of ASR

The developer would like to have more control over ASR. An example is the capability of a multimodal application to choose between local ASR and network-based ASR depending on the location of the grammar. The developer should be allowed to pick the ASR depending on the application logic.

Use Case: Music search mobile application

In a music search mobile application, the application uses network-based ASR to search for a particular artist or album, such as 'Green Day' or '50 Cent'. For network-based recognition the grammar changes dynamically and is large. The same music application may use local ASR for navigating through the application using commands such as 'Home' and 'Next Page'.
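
The selection logic in this use case could look something like the following sketch. The Grammar record, the 500-phrase limit, and the returned labels are illustrative assumptions; the actual criteria would be up to the application.

```python
from dataclasses import dataclass


@dataclass
class Grammar:
    name: str
    phrase_count: int
    dynamic: bool = False  # True if the grammar changes at run time


def choose_asr(grammar, network_available=True, local_limit=500):
    """Pick an ASR engine for a grammar (hypothetical application policy)."""
    needs_network = grammar.dynamic or grammar.phrase_count > local_limit
    if not needs_network:
        return "local"
    return "network" if network_available else "unavailable"


music_search = Grammar("artists_and_albums", phrase_count=200_000, dynamic=True)
navigation = Grammar("navigation_commands", phrase_count=12)
```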

4.10 Controlling N-Best choice of ASR

The application should be able to control the number of results it wants from ASR, based either on a count N (for example, return the top 5 matches) or on a confidence score (for example, return only matches scoring above 0.8). The developer should be able to author this N-best list control.

Use Case: Select a football player mobile application

As in the previous football player selection use case, the list of players is displayed visually and the user makes a selection from it. The ASR may return more than 10 results as part of its N-best response mechanism. However, depending on the screen size, the application may choose to display only the top 5 entries. The application therefore requests only the top 5 players in the N-best result, instead of receiving 10 results and then ignoring the last 5.
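
The two controls described in this section, a maximum count and a confidence cutoff, can be sketched as a single filter. Here the filter runs over a list already in hand, whereas the requirement is for the recognizer to apply the limits itself; the function name and parameters are hypothetical.

```python
def limit_nbest(results, max_results=None, min_confidence=None):
    """Trim (name, confidence) results by count and/or confidence."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    if min_confidence is not None:
        # Keep only results scoring strictly above the cutoff.
        ranked = [r for r in ranked if r[1] > min_confidence]
    if max_results is not None:
        ranked = ranked[:max_results]
    return ranked


# Ten hypothetical results with scores 1.0, 0.95, 0.9, ... 0.55.
results = [(f"player {i}", round(1.0 - i * 0.05, 2)) for i in range(10)]
top_five = limit_nbest(results, max_results=5)
confident = limit_nbest(results, min_confidence=0.8)
```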