Multimodal Interaction Working Group Charter

This charter is written in accordance with the W3C Process, section 4.2.2 (Working Group and Interest Group Charters).

Mission Statement
Scope
Deliverables
Duration
Success Criteria
Release Policy
Milestones
Confidentiality
Relationship with other W3C Activities
Coordination with External Groups
Communication Mechanisms
Voting Mechanisms
Participation
Intellectual Property

1. Mission Statement

Executive Summary

A new W3C Working Group on multimodal interaction is proposed with the goal of developing specifications to cover all aspects of multimodal interaction with the Web. Multimodal user interfaces can be used in two ways: to present complementary information on different output modes, and to enable switching between different modes depending on the current context and physical environment. Scope and deliverables are identified. The Working Group's specifications should be implementable on a royalty-free basis, see section 14 for details.

Introduction

— Web pages you can speak to and gesture at

The Web at its outset focussed on visual interaction, using keyboards and pointing devices to interact with Web pages based upon HTML. More recently, work has been underway to allow any telephone to be used to access appropriately designed Web services using touch-tone (DTMF), spoken commands, listening to prerecorded speech, synthetic speech and music.

The next step will be to develop specifications that allow multiple modes of interaction, offering users the choice of using their voice, or the use of a key pad, stylus or other input device. For output, users will be able to listen to spoken prompts and audio, and to view information on graphical displays. This is expected to be of considerable value to mobile users, where speech recognition can be used to overcome difficulties arising from the use of small key pads for text input, particularly for ideographic alphabets. Spoken interaction is also a boon when there is a need for hands-free operation. Complementing speech, ink entered with a stylus can be used for handwriting, gestures, drawings, and specific notations for mathematics, music, chemistry and other fields. Ink is expected to be popular for instant messaging.

The Multimodal Interaction working group is tasked with the development of a suite of specifications that together cover all necessary aspects of multimodal interaction with the Web. This work will build on top of W3C's existing specifications, for instance, combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, or alternatively by the provision of mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML. Additional work will focus on a means to provide the ink component of Web-based, multimodal applications.

What has been done already?

The W3C Voice Browser working group published a set of requirements for multimodal interaction in July 2000. The working group also invited participants to demonstrate proof of concept examples of multimodal applications. A number of such demonstrations were shown at the working group's face to face meeting held in Paris in May 2000.

To get a feeling for future work, the W3C together with the WAP Forum held a joint workshop on the Multimodal Web in Hong Kong on 5-6 September 2000. This workshop addressed the convergence of W3C and WAP standards, and the emerging importance of speech recognition and synthesis for the Mobile Web. The workshop's recommendations encouraged W3C to set up a multimodal working group to develop standards for multimodal user interfaces for the Web.

Background

Recent years have seen a tremendous growth in interest in using speech as a means to interact with Web-based services over the telephone. W3C responded to this by establishing the Voice Browser activity and working group. This Group developed requirements and specifications for the W3C Speech Interface Framework. There is now an emerging interest in combining speech interaction with other modes of interaction. The Multimodal Interaction working group is chartered to work on developing standards for such multimodal interaction.

Multimodal interaction will enable the user to speak, write, and type as well as hear and see using a more natural user interface than either of today's screen-oriented browsers and voice-oriented browsers. Either individually or in combination, these telephone, handheld, and laptop devices will support input modalities including speech, telephone keypads (DTMF), keyboards, touch pads, and mouse/stylus input (pointing, ink and handwriting), and output modalities including sound and display.

The different modalities may be supported on a single device or on separate devices working in tandem, for example, you could be talking into your cellphone and seeing the results on a PDA. Voice may also be offered as an adjunct to browsers with high resolution graphical displays, providing an accessible alternative to using the keyboard or screen. This can be especially important in automobiles or other situations where hands and eyes free operation is essential. Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller. It is much easier to say a few words than it is to thumb them in on a keypad where multiple key presses may be needed for each character.

Mobile devices working in isolation generally lack the power to recognize more than a few hundreds of spoken commands. The storage limitations restrict the use of prerecorded speech prompts. Small speech synthesisers are possible, but tend to produce robotic sounding speech that many users find tiring to listen to. A solution is to process speech recognition and synthesis remotely on more powerful platforms. Multimodal applications can offer the best of two worlds: connected and standalone. The ability to use the full gamut of modalities when online, and to fallback to a subset when offline. For instance, to use speech input when online and the ability to fall back on keypads and pointing devices when offline, or when the situation precludes speaking (either for social reasons or because of high background noise).

Multimodal Interaction

Here are just a few ideas for ways to exploit multimodal user interfaces:

Presenting complementary or matching information on different output modes:

When using a cellphone to ask a voice portal for information about the local weather forecast, a picture could be sent to the cellphone to complement the spoken forecast. When asking for walking directions to a nearby restaurant, a map could be displayed. For an incoming call, the display could show a photograph of the caller.
Allowing you to switch between different modes depending on the context:

It could be too noisy for speech recognition to work, or you may be unable or simply not allowed to speak. Under these circumstances, you may want to use the keypad or pointing device instead of speech input. You may be comfortable looking at a form on the display, but choose to use speech to fill in text fields, rather than struggling with the cellphone keypad.

2. Scope

The Multimodal Interaction working group is tasked with the development of specifications covering the following goals:

To support a combination of input modes, including speech, keyboards, pointing devices, touch pads and electronic pens
To support the combination of aural and visual output modes
To support a combination of local and remote speech recognition services
To support the use of servers for speech synthesis and for streaming prerecorded speech and music
To enable the server-side processing of ink entered using a stylus or electronic pen.
To support varying degrees of coupling between visual and aural interaction modes
To support the use of a remote voice dialog engine, e.g. a voice gateway running VoiceXML
To support the coordination of interaction across more than one device, e.g. cellphone and wall mounted display

This work will build on top of W3C's existing specifications, for instance, combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, and by the provision of mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML.

The Working Group will also serve as a coordination body with existing industry groups working on related specifications, and to provide a pool of experts on multimodal interaction, some of which will participate in the other W3C working groups relevant to multimodal interaction.

3. Deliverables

The Multimodal Interaction working group is expected to independently advance specifications to W3C Recommendation status covering each of the following functional areas:

Speech enabling XHTML and SMIL using XForms for data binding
Loose coupling of disparate multimodal components
Natural Language Processing

These specifications will be developed following investigations into the areas described immediately below in sections 3.1 through 3.5. These investigations will cover both requirements and potential solutions. It is anticipated that these investigations will be published as W3C Notes.

The timescales for deliverables (see section 7) will be refined at the initial face to face meeting proposed for February 2002, and subsequently made publically available on the W3C website. At the discretion of the Chair, additional work items may be added during the lifetime of the Working Group, provided they fall within the scope of this charter, and there are sufficient resources available within the Working Group.

The first face to face meeting of the Multimodal Interaction working group is scheduled to take place on 25 February - 1st March 2002 at the W3C Technical Plenary in Cannes, France. This will provide an opportunity for direct liaison with other W3C working groups.

3.1 An investigation into a means for integrating speech interaction with XForms

XForms provides a means to separate user interaction from the underlying form instance data. Work to date has focussed on visual interaction. Further work is now needed to investigate the appropriate means to integrate support for speech as a basis for multimodal interaction. XML will be used to express the semantic results of recognition, based upon a means for application developers to specify how this XML is to be generated from the syntactic output of the natural language processor.

Having obtained the semantic results, the next step is to apply them to the XForms instance data. A model of speech acts may be appropriate to cater for different types of operations on the instance data, for instance, queries versus assertions, and operations such as addition, and deletion. Additional work may be needed to determine how to support dialog mechanisms that do not fit into the XForms model. It is proposed that the Multimodal Interaction working group take over work on the natural language semantics markup language (NLSML) from the Voice Browser working group, where it has been suspended to free up time for work on other items.

3.2 An investigation into the issues arising from the integration of speech interaction with XHTML and SMIL

Speech offers a means to enliven the user experience when interacting with visual Web pages, bringing the possibility of playing aural commentaries when a page is loaded, the use of speech for filling out form fields, for following links and other kinds of actions. The process of playing a given prompt or listening using a given grammar can be triggered through the XHTML event model. For example, clicking on a form field, would fire an onFocus event that initiates an aural prompt and activates a speech grammar designed for that field. The events triggered by recognition could be used to interpret the results, or to reprompt in the case of an error.

An investigation will be undertaken with a view to making proposals for combining XHTML with the W3C speech synthesis and speech grammar markup languages, plus additional markup as appropriate. This work would be expected to specify a scripting model for added flexibility, and to be aligned with other deliverables of the Multimodal Interaction Activity. Some issues to be considered are the integration with XForms (see section 3.1), the simultaneous activation of local (field) grammars and global (navigation) grammars, the use of scripting for mixed initiative, and context dependent error handling.

SMIL (synchronized multimedia integration language) is an XML language for synchronizing multiple activities occurring in parallel or in sequence. Activities can be started, paused, and stopped. The timing of these events can be made to depend on other events. For instance you could cause an audio stream to stop 2 seconds after a button press event occurs. SMIL includes the means for skipping forward and backwards in a presentation, according to synchronization points. You can also define hypertext links into the middle of a presentation.

SMIL has a rich potential for fine grained temporal control of multimodal interaction. Further work is needed to investigate the implications for speech synthesis and speech recognition, in particular the event model, and markup for embedding speech interaction as an integral part of a SMIL application. Other work at W3C is looking into the combination of XHTML and SMIL as a declarative means for authoring dynamic presentations, and an alternative to using scripting (dynamic HTML).

Note: the aim is to develop a single specification that can be applied to both XHTML and SMIL for simple kinds of dialog. When richer dialog structure is needed, application developers will be able to loosely couple XHTML or SMIL with a voice browser based upon VoiceXML or other means, such as Java. The basis for achieving this is the subject of section 3.5.

3.3 A mechanism for the utilization of local and remote resources for speech synthesis

Mobile devices have limited computational and memory resources, and as a result, may benefit from access to a remote high quality speech synthesis engine. An investigation will be undertaken to recommend the basis for a client device to identify and make use of local and remote synthesis resources. Speech synthesis is taken here to include the combination of synthetic speech and prerecorded audio. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C.

3.4 A mechanism for the utilization of local and remote resources for speech recognition and speaker verification

Speech recognition on hand-held devices is currently limited to vocabularies of no more than a few hundred words. For larger vocabularies, network access is needed to dedicated speech recognition engines. An investigation will be undertaken to recommend the basis for a client device to identify and make use of local and remote recognition resources. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C.

3.5 A mechanism for coupling disparate components of a multimodal system

A multimodal system may involve several devices, for instance the combination of a cell phone and a wall mounted display. These will need to be coordinated to form a coherent multimodal application. Another example is a PDA providing the user interface, and coupled to a voice gateway running VoiceXML.

An investigation will be undertaken to recommend the basis for coupling such disparate components. Some of the issues to be considered include the set up and termination of sessions, the use of XML to represent events, the asynchronous nature of such events, the ability to recover from temporary breaks in network connectivity, and potential change of IP addresses. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C. This work item may lead to proposals for extensions to VoiceXML for review by the Voice Browser Activity.

3.6 Coordination

The Multimodal Interaction working group will coordinate its work with other W3C working groups and with relevant external groups, see sections 9 and 10 below.

4. Duration

This Working Group is scheduled to last for two years, from February 25th, 2002 to February 25th, 2004.

5. Success Criteria

The Working Group has fulfilled its mission if it succeeds in unifying the efforts of vendors and content providers to stimulate the development and widespread use of multimodal systems conforming to the W3C specifications developed by the Working Group. See section 7 for the timeline for each of the planned specifications.

6. Release Policy

By default, all documents under development by the Working Group are available to W3C Members from the group's web page. Selected documents will be made publically available via the W3C's technical reports page after approval from W3C management. The types of documents (Notes, Working Drafts etc.) are defined by the W3C Process.

Documents must have at least one editor and one or more contributors. Documents should have a date by which they will be declared stable. Any remaining issues at this date will be described in the document to avoid delaying its wider release.

7. Milestones

This is a provisional list of milestones for the deliverables identified in section 3, and liable to change. The Multimodal Interaction working group will be tasked with maintaining publically accessible information describing the documents under development and the schedule for their standardization.

25th February - 1st March, 2002: First Face to Face meeting in Cannes, France, hosted by W3C. This is a plenary session with the opportunity to coordinate with other working groups
December 2002: Complete investigations in all areas. Start work on draft specifications.
September 2003: Last Call reviews of draft specifications. Development of test suites.
October-December 2003: Respond to Last Call reviews.
January 2004: Advance specs to Candidate Recommendations. Interoperability demonstrations.
25th February, 2004: Seek charter extension if needed, otherwise close Working Group.

It is anticipated that a charter extension will be sought to see the specifications through to Recommendation status, and to prepare errata as needed, based upon subsequent experience.

8. Confidentiality

Access to email discussions and to documents developed by the working group will be limited to W3C Members and Invited Experts, until released for publication by the joint agreement of the working group and the W3C management team. Working group members are required to honor the confidentiality of the group's discussions and working documents, until such time that the work is publically released. Invited experts are bound by the W3C Invited Expert and Collaborators Agreement. Participants working for W3C Member organizations are bound by their contract with W3C.

9. Relationship with other W3C Activities

The Multimodal Interaction Working Group will have to take into account technologies developed by other groups within W3C, and to advise them about the requirements for Multimodal Interaction and to ask them to review specifications prepared by the Working Group, covering proposals for extensions to existing or future Web standards. At the time the charter was written, the following ongoing W3C activities are concerned: (listed in alphabetical order)

CSS WG, XSL WG: Style sheets are a very important means to achieve reusability of Web content on different devices. CSS already includes some features for aural rendering and additional features may be needed for Multimodal Interaction. Proposals for new or revised style properties will need to be reviewed by the CSS working group.
Device Independence Working Group: This is chartered to review W3C specifications with the aim of minimizing barriers to re-use of content on different devices. Future work of this group may include the development of technologies that facilitate handling devices with different capabilities on the Web. The Multimodal Interaction working group is chartered to develop technologies for specific types of devices which allow multimodal access to web content. As such, its specifications should be subject to review by the Device Independence working group. The Multimodal working group is also a potential customer for the technologies developed by the Device Independence working group (if any), and should review whether they are suitable for the specific case of devices allowing multimodal access.
DOM Working Group: This is responsible for application programming interfaces to XML documents, the DOM working group may be asked to review proposals for additional programing interfaces for Multimodal Dialogs, e.g. for controlling the dialog, for setting up phone calls, and for controlling streaming audio.
Hypertext Coordination Group: this has the responsibility for ensuring that reviews between working groups are planned and carried out so as to meet requirements for deliverables and deadlines. The Multimodal Interaction working group will be represented in the Coordination Group by its chair.
Internationalization: The specifications for Multimodal Interaction are expected to satisfy W3C requirements for internationalization, and will need to be reviewed by the I18N working group.
Mobile Access Interest Group: Multimodal Interaction are likely to be of considerable interest to the Mobile Access IG.
Platform for Privacy Preferences (P3P): P3P defines a means for associating privacy policies and preferences with personally-identifiable data collected from Web pages. To ensure that P3P can be used with multimodal applications, the Multimodal Interaction working group will be expected to take advantage of W3C work on the integration of P3P and XForms. The P3P specification working group will be asked to review specifications produced by the Multimodal Interaction working group.
Scalable Vector Graphics: W3C's SVG specification is likely to be valuable for displaying vector graphics in multimodal applications.
Synchronized Multimedia: W3C's SMIL specification is likely to be valuable for fine grained temporal control of multimodal interaction. The Multimodal Interaction Working Group should coordinate its work in this area with the Synchronized Multimedia Interest Group and Working Group. Examples of items of mutual interest include the set of events needed to control prompts, and how streaming media can be controlled from scripts.
Voice Browser working group: This group is responsible for the development and maintenance of the VoiceXML, speech synthesis, speech grammar and related specifications. There will be a need for close cooperation between the Multimodal Interaction and Voice Browser working groups. The circumstance may arise where it becomes appropriate for one or more voice related work items to be transferred from one group to the other.
Web Accessibility Initiative (WAI): Any specifications produced by the Multimodal Interaction working group will need to be reviewed for accessibility by the Protocols and Data Formats working group. For instance, accessibility requirements for markup and for multimodal user agents.
XForms: Dedicated to reducing the effort needed to create applications based upon forms. XForms separates the user interface, logic and data. The Multimodal Interaction working group will be expected to adhere to this architecture, and to collaborate with the XForms working group over any additional requirements for multimodal interaction.
XHTML: The display of Web content on multimodal browsers will be based upon XHTML, and may result in new requirements for work on future versions of XHTML.
XML Schema: The Multimodal Interaction working group will be expected to provide schemas for all markup language specifications developed by the Group.

10. Coordination with External Groups

The following is a list of groups that are known or presumed to be working on, or interested in, standards relating to multimodal browsers, with pointers to the respective projects. The W3C Multimodal Dialog working group will need to liaise with these groups.

3GPP: 3GPP is working on distributed speech recognition with a view to exploiting ETSI's STQ Aurora results. 3GPP may be interested in working on integrating remote access to speech synthesis resources. W3C should keep a watching brief. There is a possible connection to proposals for the IETF to develop protocols for accessing remote speech synthesis and speech recognition resources.
DARPA Communicator program: The program carries out research on the next generation of intelligent conversational interfaces to distributed information. The goal is to support the creation of speech-enabled interfaces that scale gracefully across modalities, from speech-only to interfaces that include graphics, maps, pointing and gesture.
Enterprise Computer Telephony Forum: ECTF works to remove obstacles to interoperability for computer telephony systems. Its specifications impact: voice mail, unified messaging, media gateways, voice activated services and more. See the ECTF Solutions FAQ.
European Telecommunications Standards Institute (ETSI): A non-profit organization whose mission is "to determine and produce the telecommunications standards that will be used for decades to come". ETSI's work is complementary to W3C's. Of particular note is ETSI STQ Aurora work on Distributed Speech Recognition, and ETSI STF 182, which is working on a standard spoken vocabulary for command, control and editing.
IETF: The IETF may become involved in work on accessing remote speech synthesis and recognition resources. If this occurs then close liaison will be needed. One model would be for the W3C Multimodal Interaction Working Group to develop detailed requirements as input to the IETF specification work, and for W3C to subsequently track progress in the IETF and review the resultant specifications against these requirements. This liaison should also be tracked in the regular W3C IETF coordination teleconferences.
SALT Forum: The SALT Forum was launched on 15th October 2001 with a mission to develop standards for speech enabling HTML and XHTML. The announcement states their intention to submit specifications to a standards body during 2002. Their work may provide contributions for W3C work on multimodal interaction, involving a need for liaison between the two organizations.
VoiceXML Forum: The VoiceXML Forum is an industry organization providing educational, marketing, and conformance testing services for VoiceXML. The Forum originally developed VoiceXML, but the specification is now maintained by W3C. Both organizations have agreed to a memorandum of understanding setting out the goals of both parties.
WAP Forum: The Wireless Application Protocol Forum aims to provide standards for Internet communications and advanced telephony services on digital mobile phones, pagers, personal digital assistants and other wireless terminals.
Daisy Consortium: Publishes talking books for people with visual impairments.
National Library Service for the Blind and Physically Handicapped & NISO Digital Talking Book Committee: Concerned with standards relating to players for digital talking books.

11. Communication Mechanisms

11.1 Email

The archived member-only mailing list w3c-multimodal-wg@w3.org is the primary means of discussion within the group.

Certain topics need coordination with external groups. The Chair and the Working Group can agree to discuss these topics on a public mailing list. The archived mailing list www-multimodal@w3.org is used for public discussion of W3C proposals for Multimodal Interaction, and Working Group members are encouraged to subscribe. As a precaution against spam you must be subscribed in order to send a message to the list. To subscribe send a message with the word subscribe in the subject line to www-multimodal-request@w3.org.

For discussions relating purely to speech, there is the public mailing list www-voice@w3.org. The archive is available online.

11.2 Phone

A weekly one-hour phone conference will be held. The exact details, dates and times will be published in advance on the working group page. Additional phone conferences may be scheduled as necessary on specific topics.

11.3 Meetings

Face to face meetings will be arranged 3 to 4 times a year. Meeting details are made available on the W3C Member Calendar and from the Working Group page. The Chair is responsible for providing publically accessible summaries of Working Group face to face meetings, which will be announced on www-multimodal@w3.org.

11.4 Public Web pages

The Multimodal Interaction Activity will maintain public pages on the W3C website to describe the status of work and pointers to the working group, charter, Activity statement, and email archives.

12. Voting Mechanisms

The Group works by consensus. In the event of failure to achieve consensus, the Chair may resort to a vote as described in the Process Document. Each Member company which has at least one Group member in good standing may vote. There is one vote per W3C Member company. Votes are held by email to allow all participants a chance to vote; there is a two week voting period followed by a period of two working days for the announcement of the result. W3C staff and invited experts do not vote; however in the event of a tie the chair has a casting vote. If the issue is solved by consensus during the voting period, the vote is cancelled.

Note: the term good standing is defined in the W3C Process.

13. Participation

The Multimodal Interaction Working Group will be chaired by (to be announced). The W3C staff contact, and activity lead will be Dave Raggett. Resources of additional W3C team members will be required for some of the deliverables, should the conditions for starting these deliverables be met.

by W3C Members

Requirements for meeting attendance and timely response are described in the Process document. Participation (meetings, reviewing, and writing drafts) is expected to consume time amounting to one day per week for the lifetime of the group. Working group participants are required not to disclose information obtained during participation, until that information is publically available.

W3C Members may also offer to review one or more working drafts from the group for clarity, consistency, technical merit, fitness for purpose and conformance with other W3C specifications. The only participation requirement is to provide the review comments by the agreed-to date.

by invited experts

As decided on a case by case basis, invited experts may attend a single meeting or a series; they may in some cases be subscribed to the Group mailing list. For the duration of their participation, invited experts are encouraged to adopt the same requirements for meeting attendance and timely response as are required of W3C Members. Invited experts are subject to the same requirement for information disclosure as are required of W3C Members.

by W3C Team

The W3C team will be responsible for the mailing lists, public and working group pages, for the posting of meeting minutes, and for liaison with the W3C communications staff for the publication of working drafts. W3C team members are expected to adopt the same requirements for meeting attendance, timely response and information disclosure as are required of W3C Members. The W3C staff contact will be expected to devote 40% of his time to this Activity.

14. Intellectual Property

W3C promotes an open working environment. Whenever possible, technical decisions should be made unencumbered by intellectual property right (IPR) claims. W3C's policy for intellectual property is set out in section 2.2 of the W3C Process document.

Members of the Multimodal Interaction Working Group and any other Working Group constituted within the Multimodal Interaction Activity are expected to disclose any intellectual property they have in this area. Any intellectual property essential to implement specifications produced by this Activity must be at least available for licensing on a royalty-free basis. At the suggestion of the Working Group, and at the discretion of the Director of W3C, technologies may be accepted if they are licensed on reasonable, non-discriminatory terms.

Members disclose patent and other IPR claims by sending email to <patent-issues@w3.org>, an archived mailing list that is readable by Members and the W3C Team. Members must disclose all IPR claims to this mailing list but they may also copy other recipients.