Multimodal Interaction Working Group Charter

This charter is written in accordance with the W3C Process, section 4.2.2 (Working Group and Interest Group Charters).

Table of Contents

  1. Mission Statement
  2. Scope
  3. Deliverables
  4. Duration
  5. Success Criteria
  6. Release Policy
  7. Milestones
  8. Confidentiality
  9. Relationship with other W3C Activities
  10. Coordination with External Groups
  11. Communication Mechanisms
  12. Voting Mechanisms
  13. Participation
  14. Intellectual Property

1. Mission Statement

Executive Summary

A new W3C Working Group on multimodal interaction is proposed with the goal of extending the user interface to Web applications to support multimodal interaction, by developing markup specifications for synchronization across multiple modalities and devices. Multimodal user interfaces can be used in two ways: to present the same or complementary information on different output modes, and to enable switching between different input modes depending on the current context and physical environment. The Working Group's specifications should be implementable on a royalty-free basis; see section 14 for details. Scope and deliverables are identified in sections 2 and 3.


— Web pages you can speak to and gesture at

The Web at its outset focussed on visual interaction, using keyboards and pointing devices to interact with Web pages based upon HTML. More recently, work has been underway to allow any telephone to be used to access appropriately designed Web services, using touch-tone (DTMF) and spoken commands for input, and prerecorded speech, synthetic speech and music for output.

The next step will be to develop specifications that allow multiple modes of interaction, offering users the choice of using their voice or a key pad, stylus or other input device. For output, users will be able to listen to spoken prompts and audio, and to view information on graphical displays. This is expected to be of considerable value to mobile users, where speech recognition can be used to overcome difficulties arising from the use of small key pads for text input, particularly for ideographic alphabets. Spoken interaction is also a boon when there is a need for hands-free operation. Complementing speech, ink entered with a stylus or imaging device can be used for handwriting, gestures, drawings, and specific notations for mathematics, music, chemistry and other fields. Ink is expected to be popular for instant messaging.

The Multimodal Interaction working group is tasked with the development of a suite of specifications that together cover all necessary aspects of multimodal interaction with the Web. This work will build on top of W3C's existing specifications, for instance, combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, or alternatively by the provision of mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML. Additional work will focus on a means to provide the ink component of Web-based, multimodal applications.

What has been done already?

The W3C Voice Browser working group published a set of requirements for multimodal interaction in July 2000. The working group also invited participants to demonstrate proof of concept examples of multimodal applications. A number of such demonstrations were shown at the working group's face to face meeting held in Paris in May 2000.

To get a feeling for future work, the W3C together with the WAP Forum held a joint workshop on the Multimodal Web in Hong Kong on 5-6 September 2000. This workshop addressed the convergence of W3C and WAP standards, and the emerging importance of speech recognition and synthesis for the Mobile Web. The workshop's recommendations encouraged W3C to set up a multimodal working group to develop standards for multimodal user interfaces for the Web.


Recent years have seen a tremendous growth in interest in using speech as a means to interact with Web-based services over the telephone. W3C responded to this by establishing the Voice Browser activity and working group. This Group developed requirements and specifications for the W3C Speech Interface Framework. There is now an emerging interest in combining speech interaction with other modes of interaction. The Multimodal Interaction working group is chartered to work on developing standards for such multimodal interaction.

Multimodal interaction will enable the user to speak, write, and type as well as hear and see, using a more natural user interface than either today's screen-oriented browsers or voice-oriented browsers. Either individually or in combination, these telephone, handheld, and laptop devices will support input modalities including speech, telephone keypads (DTMF), keyboards, touch pads, and mouse/stylus input (pointing, ink and handwriting), and output modalities including sound and display.

The different modalities may be supported on a single device or on separate devices working in tandem; for example, you could be talking into your cellphone and seeing the results on a PDA. Voice may also be offered as an adjunct to browsers with high resolution graphical displays, providing an accessible alternative to using the keyboard or screen. This can be especially important in automobiles or other situations where hands- and eyes-free operation is essential. Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller. It is much easier to say a few words than it is to thumb them in on a keypad where multiple key presses may be needed for each character.

Mobile devices working in isolation generally lack the power to recognize more than a few hundred spoken commands. Storage limitations restrict the use of prerecorded speech prompts. Small speech synthesisers are possible, but tend to produce robotic sounding speech that many users find tiring to listen to. A solution is to process speech recognition and synthesis remotely on more powerful platforms. Multimodal applications can then offer the best of both worlds, connected and standalone: the ability to use the full gamut of modalities when online, and to fall back to a subset when offline. For instance, an application could use speech input when online, and fall back on keypads and pointing devices when offline, or when the situation precludes speaking (either for social reasons or because of high background noise).

Multimodal Interaction

Here are just a few ideas for ways to exploit multimodal user interfaces:

2. Scope

The Multimodal Interaction working group is tasked with the development of specifications covering the following goals:

  1. To support a combination of input modes, including speech, keyboards, pointing devices, touch pads and electronic pens
  2. To support the combination of aural and visual output modes
  3. To support a combination of local and remote speech recognition services
  4. To support the use of servers for speech synthesis and for streaming prerecorded speech and music
  5. To support a combination of local and remote processing of ink entered using a stylus, electronic pen or imaging device
  6. To support varying degrees of coupling between visual and aural interaction modes
  7. To support the use of a remote voice dialog engine, e.g. a voice gateway running VoiceXML
  8. To support the coordination of interaction across more than one device, e.g. cellphone and wall mounted display

The Working Group is free to prioritize these goals as appropriate, and to drop individual goals, e.g. if there is insufficient interest or not enough resources to meet them in the timeframe set out in Section 7.

This work will build on top of W3C's existing specifications, for instance, combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, and by the provision of mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML.

The Working Group will also serve as a coordination body with existing industry groups working on related specifications, and will provide a pool of experts on multimodal interaction, some of whom will participate in the other W3C working groups relevant to multimodal interaction.

3. Deliverables

This Section describes an initial set of deliverables for achieving the goals stated in Section 2. At the discretion of the Chair, the Working Group can adapt this set as needed during the course of its work. However, all deliverables must fall within the scope of this charter, and sufficient resources to address them need to be available within the Working Group.

The Multimodal Interaction working group is expected to advance specifications to W3C Recommendation status covering the following functional areas:

These specifications will be developed following investigations into the areas described immediately below in sections 3.1 through 3.5. These investigations will cover use cases, requirements and potential solutions. It is anticipated that these investigations will be published as W3C Notes.

The timescales for deliverables (see section 7) will be refined at the initial face to face meeting proposed for February 2002, and subsequently made publicly available on the W3C website. At the discretion of the Chair, additional work items may be added during the lifetime of the Working Group, provided they fall within the scope of this charter, and there are sufficient resources available within the Working Group.

The first face to face meeting of the Multimodal Interaction working group is scheduled to take place on 25 February - 1st March 2002 at the W3C Technical Plenary in Cannes, France. This will provide an opportunity for direct liaison with other W3C working groups.

3.1 An investigation into a means for integrating multimodal interaction with XForms

XForms provides a means to separate user interaction from the underlying form instance data. Another advance is the support for structured data, represented as XML — hitherto, forms were limited to flat lists of name/value pairs. Work to date has focussed on visual interaction with displays, keyboards and pointing devices. Further work is now needed to investigate the appropriate means to integrate additional modalities, minimizing the need for work on modality specific technologies.

Limited natural language processing will be appropriate for language input whether this comes from speech, ink, or keystrokes. XML will be used to express the semantic results of the user's input, based upon a means for application developers to specify how this XML is to be generated from the analysis produced by the natural language processor. Having obtained the semantic results, the next step is to interpret the meaning in the context of the application, for instance to fill out fields in a form, follow hypertext links, shift the focus of interaction and so on. It is desirable to give application developers flexibility in how they handle this step, e.g. via a scripting interface. Developers should be able to design for mixed initiative applications, where users can give a command or question that shifts the interaction away from the current focus. Natural language processing may also be valuable for dynamically generating visual and aural responses.
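As a rough illustration of expressing semantic results in XML, here is a sketch loosely modelled on the NLSML drafts from the Voice Browser working group. The element names and attributes shown are illustrative assumptions, not a published specification:

```xml
<!-- Hypothetical sketch: the semantic result of the spoken input
     "I want to fly from Boston to Denver", expressed as
     application-specific XML inside an interpretation wrapper. -->
<result grammar="flight-query">
  <interpretation confidence="0.85">
    <input mode="speech">I want to fly from Boston to Denver</input>
    <instance>
      <flight>
        <origin>Boston</origin>
        <destination>Denver</destination>
      </flight>
    </instance>
  </interpretation>
</result>
```

The point is that the `<instance>` content is under the application developer's control, so the same result format could serve speech, ink, or keystroke input.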

It is proposed that the Multimodal Interaction working group take over work on the natural language semantics markup language (NLSML) from the Voice Browser working group, where it has been suspended to free up time for work on other items.

3.2 An investigation into the issues arising from the integration of multimodal interaction with XHTML and SMIL

Speech offers a means to enliven the user experience when interacting with visual Web pages, bringing the possibility of playing aural commentaries when a page is loaded, and the use of speech for filling out form fields, following links and other kinds of actions. The process of playing a given prompt or listening using a given grammar can be triggered through the XHTML event model. For example, clicking on a form field would fire an onFocus event that initiates an aural prompt and activates a speech grammar designed for that field. The events triggered by recognition could be used to interpret the results, or to reprompt in the case of an error.

An investigation will be undertaken with a view to making proposals for combining XHTML with the W3C speech synthesis and speech grammar markup languages, plus additional markup as appropriate. This work would be expected to specify a scripting model for added flexibility, and to be aligned with other deliverables of the Multimodal Interaction Activity. Some issues to be considered are the integration with XForms (see section 3.1), the simultaneous activation of local (field) grammars and global (navigation) grammars, the use of scripting for mixed initiative, and context dependent error handling.
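The event-driven pattern described above might look something like the following. This is purely a hypothetical sketch: the `<prompt>` and `<grammar>` elements and the `playPrompt`/`activateGrammar` script functions are assumptions for illustration, not part of any published specification:

```xml
<!-- Hypothetical XHTML fragment: focusing the field triggers an
     aural prompt and activates a field-specific speech grammar. -->
<input type="text" name="city"
       onfocus="playPrompt('city-prompt'); activateGrammar('city-grammar');"/>
<prompt id="city-prompt">Which city are you flying to?</prompt>
<grammar id="city-grammar" src="cities.grxml"/>
```

A global navigation grammar could remain active alongside the field grammar, which is one of the simultaneous-activation issues noted above.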

SMIL (synchronized multimedia integration language) is an XML language for synchronizing multiple activities occurring in parallel or in sequence. Activities can be started, paused, and stopped. The timing of these events can be made to depend on other events. For instance you could cause an audio stream to stop 2 seconds after a button press event occurs. SMIL includes the means for skipping forward and backwards in a presentation, according to synchronization points. You can also define hypertext links into the middle of a presentation.

SMIL has a rich potential for fine grained temporal control of multimodal interaction. Further work is needed to investigate the implications for speech synthesis and speech recognition, in particular the event model, and markup for embedding speech interaction as an integral part of a SMIL application. Other work at W3C is looking into the combination of XHTML and SMIL as a declarative means for authoring dynamic presentations, and an alternative to using scripting (dynamic HTML).
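The timing behaviour mentioned above (stopping an audio stream 2 seconds after a button press) can be sketched in SMIL 2.0 event-based timing, where begin and end values may reference events on other elements with an offset:

```xml
<!-- SMIL 2.0 sketch: the commentary starts immediately and ends
     2 seconds after the element with id "stopBtn" is activated. -->
<par>
  <audio src="commentary.wav" begin="0s" end="stopBtn.activateEvent+2s"/>
  <img id="stopBtn" src="stop.png"/>
</par>
```

Embedding speech recognition into such a timeline (e.g. activating a grammar for the duration of a `<par>`) is exactly the kind of question the proposed investigation would address.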

Note: the aim is to develop a single specification that can be applied to both XHTML and SMIL for simple kinds of dialog. When richer dialog structure is needed, application developers will be able to loosely couple XHTML or SMIL with a voice browser based upon VoiceXML or other means, such as Java. The basis for achieving this is the subject of section 3.4.

3.3 A framework for the utilization of local and remote resources for speech and ink

Mobile devices have limited computational and memory resources, and as a result, may benefit from access to a remote high quality speech synthesis engine. Speech synthesis is taken here to include the combination of synthetic speech and prerecorded audio.

Speech recognition on hand-held devices is currently limited to vocabularies of no more than a few hundred words. For larger vocabularies, network access is needed to dedicated speech recognition engines. Network access is also needed to connect to servers offering speaker verification services.

While small devices can support limited character and gesture recognition, server based processing of ink will enable the provision of richer capabilities without the need for changes to the device itself. Examples include more flexible handwriting recognition, user authentication, and specialized input modes for mathematics, music, and other application specific notations. Direct support for ink will avoid the loss of information when converting to bitmap or vector formats.

An investigation will be undertaken to recommend the basis for a client device to identify and make use of local and remote resources. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C, for instance, work in the IETF on MRCP.
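As a sketch of what such a markup-level declaration might look like, consider the following. The `<resources>` and `<resource>` elements, their attributes, and the URIs are all hypothetical, invented here to illustrate the idea of declaring local and remote engines independently of the underlying protocol:

```xml
<!-- Hypothetical sketch: a device declares a small local recognizer
     plus remote recognition and synthesis engines to fall back on. -->
<resources>
  <resource type="asr" location="local"  vocabulary="small"/>
  <resource type="asr" location="remote" uri="rtsp://asr.example.com/reco"/>
  <resource type="tts" location="remote" uri="rtsp://tts.example.com/synth"/>
</resources>
```

The markup names the resources; how the client actually reaches the remote engines would be left to protocols such as those being discussed in the IETF.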

3.4 A framework for coupling disparate components of a multimodal system

A multimodal system may involve several devices, for instance the combination of a cell phone and a wall mounted display. These will need to be coordinated to form a coherent multimodal application. Another example is a PDA providing the user interface, and coupled to a voice gateway running VoiceXML. The ability to couple devices provides a way to escape the limitations of each device.

An investigation will be undertaken to recommend the basis for coupling such disparate components. Some of the issues to be considered include the set up and termination of sessions, the use of XML to represent events, the asynchronous nature of such events, the ability to recover from temporary breaks in network connectivity, and potential change of IP addresses. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C. For instance, the SIP working group in the IETF is specifying how to use SIP to transfer events. This work item may lead to proposals for extensions to VoiceXML for review by the Voice Browser Activity.
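To make the idea of XML-encoded events concrete, here is a hypothetical sketch of an event passed between loosely coupled components, such as a PDA and a voice gateway. All element and attribute names are illustrative assumptions:

```xml
<!-- Hypothetical sketch: the voice gateway notifies the visual
     client that a form field was filled by speech, within an
     established session. Events like this are asynchronous and
     must survive temporary breaks in connectivity. -->
<event name="fieldFilled" session="a31f">
  <source device="voice-gateway"/>
  <target device="pda" field="destination"/>
  <value>Denver</value>
</event>
```

A transport such as SIP could carry such events, which is why liaison with the IETF SIP working group is proposed.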

3.5 Coordination

The Multimodal Interaction working group will coordinate its work with other W3C working groups and with relevant external groups, see sections 9 and 10 below.

4. Duration

This Working Group is scheduled to last for two years, from February 25th, 2002 to February 25th, 2004.

5. Success Criteria

The Working Group has fulfilled its mission if it succeeds in unifying the efforts of vendors and content providers to stimulate the development and widespread use of multimodal systems conforming to the W3C specifications developed by the Working Group. See section 7 for the timeline for each of the planned specifications.

6. Release Policy

By default, all documents under development by the Working Group are available to W3C Members from the group's web page. Selected documents will be made publicly available via the W3C's technical reports page after approval from W3C management. The types of documents (Notes, Working Drafts etc.) are defined by the W3C Process.

Documents must have at least one editor and one or more contributors. Documents should have a date by which they will be declared stable. Any remaining issues at this date will be described in the document to avoid delaying its wider release.

7. Milestones

This is a provisional list of milestones for the deliverables identified in section 3, and liable to change. The Multimodal Interaction working group will be tasked with maintaining publicly accessible information describing the documents under development and the schedule for their standardization.

25th February - 1st March, 2002
First Face to Face meeting in Cannes, France, hosted by W3C. This is a plenary session with the opportunity to coordinate with other working groups.
December 2002
Complete investigations in all areas. Start work on draft specifications.
September 2003
Last Call reviews of draft specifications. Development of test suites.
October-December 2003
Respond to Last Call reviews.
January 2004
Advance specs to Candidate Recommendations. Interoperability demonstrations.
25th February, 2004
Seek charter extension if needed, otherwise close Working Group.

It is anticipated that a charter extension will be sought to see the specifications through to Recommendation status, and to prepare errata as needed, based upon subsequent experience.

8. Confidentiality

Access to email discussions and to documents developed by the working group will be limited to W3C Members and Invited Experts, until released for publication by the joint agreement of the working group and the W3C management team. Working group members are required to honor the confidentiality of the group's discussions and working documents, until such time that the work is publicly released. Invited experts are bound by the W3C Invited Expert and Collaborators Agreement. Participants working for W3C Member organizations are bound by their contract with W3C.

9. Relationship with other W3C Activities

The Multimodal Interaction Working Group will need to take into account technologies developed by other groups within W3C, advise those groups about the requirements for multimodal interaction, and ask them to review specifications prepared by the Working Group, including proposals for extensions to existing or future Web standards. At the time this charter was written, the following ongoing W3C activities were concerned (listed in alphabetical order):

10. Coordination with External Groups

The following is a list of groups that are known or presumed to be working on, or interested in, standards relating to multimodal browsers, with pointers to the respective projects. The W3C Multimodal Interaction working group will need to liaise with these groups.

3GPP (Third Generation Partnership Project)
3GPP is studying different ways to include speech-enabled services comprising both speech-only and multimodal services in 3G networks. One option for distributed speech recognition is based on the ETSI's STQ Aurora developments. Other options are dependent on the general study on speech enabled services. 3GPP may be interested in working on integrating remote access to speech synthesis resources. W3C should keep a watching brief. There is a possible connection to proposals (e.g. MRCP) for the IETF to develop protocols for accessing remote speech synthesis and speech recognition resources.
DARPA Communicator program
The program carries out research on the next generation of intelligent conversational interfaces to distributed information. The goal is to support the creation of speech-enabled interfaces that scale gracefully across modalities, from speech-only to interfaces that include graphics, maps, pointing and gesture.
Enterprise Computer Telephony Forum
ECTF works to remove obstacles to interoperability for computer telephony systems. Its specifications impact voice mail, unified messaging, media gateways, voice activated services and more. See the ECTF Solutions FAQ.
European Telecommunications Standards Institute (ETSI)
A non-profit organization whose mission is "to determine and produce the telecommunications standards that will be used for decades to come". ETSI's work is complementary to W3C's. Of particular note is ETSI STQ Aurora work on Distributed Speech Recognition, and ETSI STF 182, which is working on a standard spoken vocabulary for command, control and editing.
IETF SIP working group
The SIP working group in the IETF is developing the means to exchange events over SIP, and this may be of value to a multimodal synchronization framework. SIP provides a framework for initiating and controlling sessions involving multiple devices and server resources. Preliminary discussions have also started on using the Media Resource Control Protocol (MRCP) in the body of SIP messages for the purpose of controlling Prompt Players, Text to Speech, and Speech Recognition Engines. It is suggested that the Multimodal Interaction Working Group evaluate these protocols and provide feedback as appropriate. This liaison should also be tracked in the regular W3C IETF coordination teleconferences.
International Telecommunication Union (ITU)
The International Telecommunication Union's Study Group 16 is working on distributed speech recognition and verification.
SALT Forum
The SALT Forum was launched on 15th October 2001 with a mission to develop standards for speech enabling HTML and XHTML. The announcement states their intention to submit specifications to a standards body during 2002. Their work may provide contributions for W3C work on multimodal interaction, involving a need for liaison between the two organizations.
SIP Forum
The SIP Forum is a non profit association whose mission is to promote awareness and provide information about the benefits and capabilities that are enabled by SIP.
VoiceXML Forum
The VoiceXML Forum is an industry organization providing educational, marketing, and conformance testing services for VoiceXML. The Forum originally developed VoiceXML, but the specification is now maintained by W3C. Both organizations have signed a memorandum of understanding setting out the goals of both parties.
WAP Forum
The Wireless Application Protocol Forum aims to provide standards for Internet communications and advanced telephony services on digital mobile phones, pagers, personal digital assistants and other wireless terminals.
Daisy Consortium
Publishes talking books for people with visual impairments.
National Library Service for the Blind and Physically Handicapped & NISO Digital Talking Book Committee
Concerned with standards relating to players for digital talking books.

11. Communication Mechanisms

11.1 Email

The archived member-only mailing list is the primary means of discussion within the group.

Certain topics need coordination with external groups. The Chair and the Working Group can agree to discuss these topics on a public mailing list. The archived mailing list is used for public discussion of W3C proposals for Multimodal Interaction, and Working Group members are encouraged to subscribe. As a precaution against spam, you must be subscribed in order to send a message to the list. To subscribe, send a message with the word subscribe in the subject line to

For discussions relating purely to speech, there is a public mailing list; the archive is available online.

11.2 Phone

A weekly one-hour phone conference will be held. The exact details, dates and times will be published in advance on the working group page. Additional phone conferences may be scheduled as necessary on specific topics.

11.3 Meetings

Face to face meetings will be arranged 3 to 4 times a year. Meeting details are made available on the W3C Member Calendar and from the Working Group page. The Chair is responsible for providing publicly accessible summaries of Working Group face to face meetings, which will be announced on

11.4 Public Web pages

The Multimodal Interaction Activity will maintain public pages on the W3C website to describe the status of work and pointers to the working group, charter, Activity statement, and email archives.

12. Voting Mechanisms

The Group works by consensus. In the event of failure to achieve consensus, the Chair may resort to a vote as described in the Process Document. Each Member company which has at least one Group member in good standing may vote. There is one vote per W3C Member company. Votes are held by email to allow all participants a chance to vote; there is a two week voting period followed by a period of two working days for the announcement of the result. W3C staff and invited experts do not vote; however in the event of a tie the chair has a casting vote. If the issue is solved by consensus during the voting period, the vote is cancelled.

Note: the term good standing is defined in the W3C Process.

13. Participation

The Multimodal Interaction Working Group will be chaired by (to be announced). The W3C staff contact and activity lead will be Dave Raggett. Resources of additional W3C team members will be required for some of the deliverables, should the conditions for starting these deliverables be met.

by W3C Members

Requirements for meeting attendance and timely response are described in the Process document. Participation (meetings, reviewing, and writing drafts) is expected to consume time amounting to one day per week for the lifetime of the group. Working group participants are required not to disclose information obtained during participation, until that information is publicly available.

W3C Members may also offer to review one or more working drafts from the group for clarity, consistency, technical merit, fitness for purpose and conformance with other W3C specifications. The only participation requirement is to provide the review comments by the agreed-to date.

by invited experts

As decided on a case by case basis, invited experts may attend a single meeting or a series; they may in some cases be subscribed to the Group mailing list. For the duration of their participation, invited experts are encouraged to adopt the same requirements for meeting attendance and timely response as are required of W3C Members. Invited experts are subject to the same requirement for information disclosure as are required of W3C Members.

by W3C Team

The W3C team will be responsible for the mailing lists, public and working group pages, for the posting of meeting minutes, and for liaison with the W3C communications staff for the publication of working drafts. W3C team members are expected to adopt the same requirements for meeting attendance, timely response and information disclosure as are required of W3C Members. The W3C staff contact will be expected to devote 40% of his time to this Activity.

14. Intellectual Property

W3C promotes an open working environment. Whenever possible, technical decisions should be made unencumbered by intellectual property right (IPR) claims. W3C's policy for intellectual property is set out in section 2.2 of the W3C Process document.

Members of the Working Group are expected to disclose any intellectual property they have in the area. This WG will work on a royalty-free basis, as defined in the W3C Current Patent Practice document. The Working Group is thus obliged to produce a specification which relies only on intellectual property available on a royalty-free basis.

If it proves impossible to produce specifications implementable on a royalty-free basis, then a Patent Advisory Group will be launched as described in the W3C Current Patent Practice document.

Members disclose patent and other IPR claims by sending email to <>, an archived mailing list that is readable by Members and the W3C Team. Members must disclose all IPR claims to this mailing list but they may also copy other recipients. IPR disclosures are expected to be made public.