Multimodal Interaction Activity Proposal


This briefing package was created in conformance with the W3C Process Document and Guidebook for Working Group Chairs.

1. Executive Summary

The Web at its outset focused on visual interaction, using keyboards and pointing devices to interact with Web pages based upon HTML. More recently, work has been underway to allow any telephone to be used to access appropriately designed Web services, using touch-tone (DTMF) and spoken commands for input, and prerecorded speech, synthetic speech and music for output.

The next step will be to develop specifications that allow multiple modes of interaction, offering users the choice of using their voice, a keypad, a stylus or another input device. This is expected to be of considerable value to mobile users, where speech recognition can be used to overcome the difficulties of entering text on small keypads, particularly for ideographic scripts. Spoken interaction is also a boon when there is a need for hands-free operation. Complementing speech, ink entered with a stylus can be used for handwriting, gestures, drawings, and specific notations for mathematics, music, chemistry and other fields. Ink is expected to be popular for instant messaging.

This document proposes the creation of a Multimodal Interaction Activity whose goal is to ensure development of a suite of specifications that together cover all necessary aspects of multimodal interaction with the Web. This work will build on top of W3C's existing specifications, for instance by combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, or alternatively by providing mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML. Additional work will focus on a means to provide the ink component of Web-based multimodal applications.
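
To illustrate the kind of combination envisaged, the following is a purely hypothetical sketch, not an existing W3C specification: an XHTML form field is paired with a speech grammar and prompt using placeholder "mmi:" elements of the sort the proposed working group might define. The namespace, element and attribute names are invented for illustration only.

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:mmi="http://example.org/2002/multimodal"> <!-- hypothetical namespace -->
      <body>
        <form action="/book-flight">
          <!-- visual input: the user may type the destination -->
          <label for="city">Destination city:</label>
          <input type="text" name="city" id="city"/>
          <!-- hypothetical speech binding: the same field may instead be
               filled from the result of recognition against a grammar -->
          <mmi:listen target="city" grammar="cities.grxml"/>
          <mmi:prompt>Say or type the city you wish to fly to.</mmi:prompt>
        </form>
      </body>
    </html>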

2. Background

What are multimodal user interfaces?

Traditional Web browsers present a visual rendering of Web pages written in HTML, and allow you to interact through the keyboard and a pointing device such as a mouse, roller ball, touch pad or stylus. Telephone-based voice user interfaces, by contrast, present information using a combination of synthetic speech and pre-recorded audio, and allow you to interact via spoken commands or phrases. You may also be able to use touch-tone (DTMF) keypads.

Multimodal user interfaces support multiple modes of interaction:

Ink is used here for information that describes the motion of a stylus in terms of position, velocity and pressure. It can be used for handwriting and gesture recognition.
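
As a rough illustration of the kind of ink data involved, the sketch below shows stylus samples expressed in a hypothetical XML vocabulary; the "ink:" namespace, element names and units are placeholders rather than an agreed format.

    <ink:trace xmlns:ink="http://example.org/2002/ink"> <!-- hypothetical namespace -->
      <!-- each point records stylus position (device units), velocity and pen pressure -->
      <ink:point x="103" y="248" vx="1.2" vy="-0.4" pressure="0.42"/>
      <ink:point x="108" y="246" vx="1.5" vy="-0.5" pressure="0.47"/>
      <ink:point x="114" y="243" vx="1.7" vy="-0.6" pressure="0.51"/>
    </ink:trace>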

Here are just a few ideas for ways to exploit multimodal user interfaces:

What has been done already?

The W3C Voice Browser working group published a set of requirements for multimodal interaction in July 2000. The working group also invited participants to demonstrate proof-of-concept examples of multimodal applications. A number of such demonstrations were shown at the working group's face-to-face meeting held in Paris in May 2000.

To get a feel for future work, W3C, together with the WAP Forum, held a joint workshop on the Multimodal Web in Hong Kong on 5-6 September 2000. This workshop addressed the convergence of W3C and WAP standards, and the emerging importance of speech recognition and synthesis for the Mobile Web. The workshop's recommendations encouraged W3C to set up a multimodal working group to develop standards for multimodal user interfaces for the Web.

Why isn't multimodal interaction being addressed in the Voice Browser working group?

Although the Voice Browser working group developed requirements for multimodal interaction, the pressure of work on spoken dialogs and related specifications has made it impractical to devote further time to work on multimodal interaction. Many of the Member organizations involved are now eager for W3C to launch a separate multimodal working group.

What factors are considered important?

The following points have emerged from discussions within the Voice Browser working group and the Multimodal workshop:

What is the market within the area of the proposal? Who or what group wants this (providers, users, etc.)?

Multimodal interaction will enrich the user experience for mobile services, and is seen as an important stimulus for usage of 3G mobile networks. Additional market opportunities exist for desktop and kiosk-based systems.

The organizations driving work on multimodal interaction include vendors of speech technology, vendors of mobile devices, browser companies, carriers and infrastructure solution providers.

What community will benefit from this activity?

The ubiquity of mobile devices makes it likely that the benefits of this activity will be very widespread indeed, reaching a great number of people. Although the initial focus is likely to be on mobile services, multimodal interaction will also be of value on desktops, kiosks and other systems.

Are members of this community part of W3C now?

Yes.

Will they join the effort?

Yes. Over half of the organizations participating in the Voice Browser working group have indicated a desire to join the proposed Multimodal Interaction working group. The Multimodal workshop, with some 50 participants, strongly recommended that W3C start work in this area. There have also been repeated requests from individual Member organizations.

Who or what currently exists in the market?

There are no open standards for multimodal interaction over the Web.

Is the market mature/growing/developing a niche?

Mobile Web services are expected to take off, given recent agreements based upon W3C's work on mobile profiles for XHTML and CSS, and the worldwide deployment of packet wireless technology (e.g. 2.5G and 3G networks).

Pure telephony solutions based upon VoiceXML are now being deployed on a large scale, bringing much-awaited usability benefits compared with traditional touch-tone interactive voice response systems.

The mobile industry is now eager to embrace the benefits of multimodal interaction.

What competing technologies exist? What competing organizations exist?

The SALT Forum was launched on 15 October 2001 with a mission to develop standards for speech-enabling HTML and XHTML. The announcement states their intention to submit specifications to a standards body during 2002. Their work may provide contributions to W3C work on multimodal interaction, which would call for liaison between the two organizations.

What Team resources will be consumed (technical and administrative)?

See Section "Resources" below.

What is the scope of the work?

See the scope section in the proposed charter.

What are initial timetables?

See the timeline in the proposed charter.

Is there a window of opportunity that cannot be missed?

Yes. The announcement of the SALT Forum, proprietary multimodal solutions from several companies, and research projects in other companies, plus the repeated requests for W3C to launch work on multimodal interaction combine to show that it is clearly time for W3C to act. Failure to act now might lead to the fragmentation and reduced interoperability of the Web.

What intellectual property (for example, an implementation) must be available for licensing and is this intellectual property available for a reasonable fee and in a non-discriminatory manner?

See the section about Intellectual Property in the proposed charter.

How might a potential Recommendation interact and overlap with existing international standards and Recommendations?

Any work done within the Multimodal Interaction Activity should take advantage of existing W3C specifications, such as XHTML, SMIL and XForms, as well as the specifications under development in the Voice Browser Activity, for instance, speech synthesis, speech recognition, and VoiceXML.

The IETF is expected to work on protocols for control of remote speech recognition and speech synthesis resources.

There is an opportunity to exploit work being done in ETSI on distributed speech recognition (STQ-Aurora DSR working group), and standardized spoken command, control and editing vocabularies (STF 182).

What organizations are likely to be affected by potential overlap?

We do not anticipate overlap with any other organization. 3GPP has recently started a distributed speech recognition activity based on input from ETSI, but this work is expected to be complementary to W3C's work on multimodal interaction. More details are given in the proposed charter.

Is this activity likely to fall within the dominion of an existing working group?

No. The Voice Browser Working Group has suspended its work on multimodal interaction in anticipation of a separate working group being set up for this purpose. Some work items previously undertaken in the Voice Browser working group may serve as input to the Multimodal Interaction working group.

Should new groups be created?

Yes. An Activity with a single chartered Working Group, namely the Multimodal Interaction Working Group, should be created. See the proposal section for more details.

How should this area be coordinated with related W3C Activities?

It is essential that there be close cooperation with the Voice Browser working group. This could be achieved in a variety of ways: encouraging joint participation in both groups, shared face-to-face meetings, and joint teleconferences. Coordination will also be needed with other W3C working groups, for instance, XForms, HTML, and SYMM. The Multimodal Interaction Working Group's chair would be expected to participate in the Hypertext Coordination Group to deal with such issues as arranging reviews of other Working Groups' specifications.

3. Current W3C Status

The charter of the Voice Browser working group included provision for work on multimodal interaction. This led to the publication of draft requirements, and to demonstrations by several companies of proof-of-concept implementations of multimodal systems. Nearly 50 Member organizations are participating in the Voice Browser working group. In a straw poll, over half of these have indicated an interest in participating in further work on multimodal interaction. The September 2000 Multimodal workshop, jointly organized by W3C and the WAP Forum, made a strong recommendation to W3C to set up a Multimodal working group. The workshop adopted the "Hong Kong Manifesto", which states that a new W3C working group should be created to address this area and to coordinate activities to develop multimodal specifications supporting both visual and verbal user interfaces. The W3C Voice Browser Working Group has also approved the "Hong Kong Manifesto" and, as a result, has suspended its work on multimodal interaction in anticipation of the new activity. Since then, there have been repeated requests by Member organizations for W3C to launch this work.

4. Proposal: Multimodal Interaction Activity

4.1 Proposed Working Group

The Multimodal Interaction working group will be tasked with the development of a suite of specifications that together cover all necessary aspects of multimodal interaction with the Web. This work will build on top of W3C's existing specifications, for instance by combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, or alternatively by providing mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML. Additional work will focus on a means to provide the ink component of Web-based multimodal applications.

The proposed work should take place under the Interaction Domain. The proposed charter is as follows:

4.2 Proposed Timeline

It is proposed that the Multimodal Interaction Activity and associated working group are created in February 2002. The Multimodal Interaction Working Group charter provides details on deliverables and milestones.

4.3 Resource statement

See section 13 in the proposed charter.

4.4 Intellectual Property

See section 14 in the proposed charter.