What is W3C doing with regard to standards for multimodal user interfaces to the Web? This page sets out what has already been done and what W3C plans to do next.
Traditional Web browsers present a visual rendering of Web pages written in HTML, and allow you to interact through the keyboard and a pointing device such as a mouse, roller ball, touch pad or stylus. Voice user interfaces, by contrast, present information using a combination of synthetic speech and pre-recorded audio, and allow you to interact via spoken commands or phrases. You may also be able to use touch-tone (DTMF) keypads.
Multimodal user interfaces support multiple modes of interaction, for example visual displays, keyboards and pointing devices, spoken input and output, touch-tone (DTMF) keypads, and electronic ink. Electronic ink is the term for information that describes the motion of a stylus in terms of position, velocity and pressure. It can be used for handwriting and gesture recognition.
Here are just a few ideas for ways to exploit multimodal user interfaces:
Presenting complementary information on different output modes:
When using a cellphone to ask a voice portal for information about the local weather forecast, a picture could be sent to the cellphone to complement the spoken forecast. When asking for walking directions to a nearby restaurant, a map could be displayed. For an incoming call, the display could show a photograph of the caller.
Allowing you to switch between different modes depending on the context:
It could be too noisy for speech recognition to work, or you may be unable or simply not allowed to speak. Under these circumstances, you may want to use the keypad or pointing device instead of speech input. You may be comfortable looking at a form on the display, but choose to use speech to fill in text fields, rather than struggling with the cellphone keypad.
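As a purely illustrative sketch, and not any published W3C specification, a markup language for this kind of interaction might bind a visual form field to a speech grammar, so that the same field can be filled in either from the keypad or by voice. All element names, attributes and the grammar file below are invented for illustration:

```xml
<!-- Hypothetical multimodal markup: every name here is invented -->
<form id="weather">
  <field name="city">
    <visual>
      <label>City:</label>
      <input type="text"/>
    </visual>
    <voice>
      <prompt>Which city would you like the forecast for?</prompt>
      <grammar src="cities.grxml"/>
    </voice>
    <!-- Synchronization: whichever mode supplies input first fills the field -->
    <sync policy="first-input"/>
  </field>
</form>
```

Under a sketch like this, whichever mode supplies the input first would fill the field, letting you switch freely between keypad and speech depending on context.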
The W3C Voice Browser working group published a set of requirements for multimodal interaction in July 2000. The working group also invited participants to demonstrate proof-of-concept multimodal applications; a number of such demonstrations were shown at the group's face-to-face meeting held in Paris in May 2000.
To get a feel for future work, W3C and the WAP Forum held a joint workshop on the Multimodal Web in Hong Kong on 5-6 September 2000. The workshop addressed the convergence of W3C and WAP standards, and the emerging importance of speech recognition and synthesis for the Mobile Web. Its recommendations encouraged W3C to set up a multimodal working group to develop standards for multimodal user interfaces for the Web.
Although the Voice Browser working group developed requirements for multimodal interaction, the pressure of work on spoken dialogs and related specifications has made it impractical to devote time to further work on multimodal standards. As a result, W3C now expects to create a new multimodal working group later this year.
To ensure that the new multimodal working group can act swiftly to fulfil commercial requirements, W3C member organizations are invited to submit detailed proposals to W3C for the markup language and synchronization protocols needed to support multimodal interaction. Submissions should consider the following points:
W3C Members are encouraged to collaborate on proposals, as this will make it easier to ascertain broad industry support. In late Summer 2001, a charter for a Multimodal working group will be drawn up based upon the proposals that get the broadest industry backing.
Some ideas that have been suggested include:
Information on how to make a submission to W3C is available on the W3C Web site.