This charter is written in accordance with the W3C Process, section 4.2.2 (Working Group and Interest Group Charters).
A new W3C Working Group on multimodal interaction is proposed with the goal of developing specifications to cover all aspects of multimodal interaction with the Web. Multimodal user interfaces can be used in two ways: to present complementary information on different output modes, and to enable switching between different modes depending on the current context and physical environment. Scope and deliverables are identified. The Working Group's specifications should be implementable on a royalty-free basis, see section 14 for details.
Web pages you can speak to and gesture at
The Web at its outset focussed on visual interaction, using keyboards and pointing devices to interact with Web pages based upon HTML. More recently, work has been underway to allow any telephone to be used to access appropriately designed Web services, using touch-tone (DTMF) and spoken commands for input, and prerecorded speech, synthetic speech and music for output.
The next step will be to develop specifications that allow multiple modes of interaction, offering users the choice of using their voice, or a key pad, stylus or other input device. For output, users will be able to listen to spoken prompts and audio, and to view information on graphical displays. This is expected to be of considerable value to mobile users, where speech recognition can be used to overcome difficulties arising from the use of small key pads for text input, particularly for ideographic scripts. Spoken interaction is also a boon when there is a need for hands-free operation. Complementing speech, ink entered with a stylus can be used for handwriting, gestures, drawings, and specific notations for mathematics, music, chemistry and other fields. Ink is expected to be popular for instant messaging.
The Multimodal Interaction working group is tasked with the development of a suite of specifications that together cover all necessary aspects of multimodal interaction with the Web. This work will build on top of W3C's existing specifications, for instance, combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, or alternatively by the provision of mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML. Additional work will focus on a means to provide the ink component of Web-based, multimodal applications.
The W3C Voice Browser working group published a set of requirements for multimodal interaction in July 2000. The working group also invited participants to demonstrate proof of concept examples of multimodal applications. A number of such demonstrations were shown at the working group's face to face meeting held in Paris in May 2001.
To get a feeling for future work, the W3C together with the WAP Forum held a joint workshop on the Multimodal Web in Hong Kong on 5-6 September 2000. This workshop addressed the convergence of W3C and WAP standards, and the emerging importance of speech recognition and synthesis for the Mobile Web. The workshop's recommendations encouraged W3C to set up a multimodal working group to develop standards for multimodal user interfaces for the Web.
Recent years have seen a tremendous growth in interest in using speech as a means to interact with Web-based services over the telephone. W3C responded to this by establishing the Voice Browser activity and working group. This Group developed requirements and specifications for the W3C Speech Interface Framework. There is now an emerging interest in combining speech interaction with other modes of interaction. The Multimodal Interaction working group is chartered to work on developing standards for such multimodal interaction.
Multimodal interaction will enable the user to speak, write, and type, as well as hear and see, using a more natural user interface than either of today's screen-oriented or voice-oriented browsers. Either individually or in combination, these telephone, handheld, and laptop devices will support input modalities including speech, telephone keypads (DTMF), keyboards, touch pads, and mouse/stylus input (pointing, ink and handwriting), and output modalities including sound and display.
The different modalities may be supported on a single device or on separate devices working in tandem; for example, you could be talking into your cellphone and seeing the results on a PDA. Voice may also be offered as an adjunct to browsers with high resolution graphical displays, providing an accessible alternative to using the keyboard or screen. This can be especially important in automobiles or other situations where hands- and eyes-free operation is essential. Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller. It is much easier to say a few words than it is to thumb them in on a keypad where multiple key presses may be needed for each character.
Mobile devices working in isolation generally lack the power to recognize more than a few hundred spoken commands. Storage limitations restrict the use of prerecorded speech prompts. Small speech synthesisers are possible, but tend to produce robotic-sounding speech that many users find tiring to listen to. A solution is to process speech recognition and synthesis remotely on more powerful platforms. Multimodal applications can offer the best of both worlds, connected and standalone: the ability to use the full gamut of modalities when online, and to fall back to a subset when offline. For instance, speech input can be used when online, with keypads and pointing devices as a fallback when offline, or when the situation precludes speaking (either for social reasons or because of high background noise).
Here are just a few ideas for ways to exploit multimodal user interfaces:
Presenting complementary or matching information on different output modes:
When using a cellphone to ask a voice portal for information about the local weather forecast, a picture could be sent to the cellphone to complement the spoken forecast. When asking for walking directions to a nearby restaurant, a map could be displayed. For an incoming call, the display could show a photograph of the caller.
Allowing you to switch between different modes depending on the context:
It could be too noisy for speech recognition to work, or you may be unable or simply not allowed to speak. Under these circumstances, you may want to use the keypad or pointing device instead of speech input. You may be comfortable looking at a form on the display, but choose to use speech to fill in text fields, rather than struggling with the cellphone keypad.
The Multimodal Interaction working group is tasked with the development of specifications covering the following goals:
This work will build on top of W3C's existing specifications, for instance, combining XHTML, SMIL and XForms with markup for speech synthesis and speech recognition, and by the provision of mechanisms for loosely coupling visual interaction with voice dialogs represented in VoiceXML.
The Working Group will also serve as a coordination body with existing industry groups working on related specifications, and will provide a pool of experts on multimodal interaction, some of whom will participate in the other W3C working groups relevant to multimodal interaction.
The Multimodal Interaction working group is expected to independently advance specifications to W3C Recommendation status covering each of the following functional areas:
These specifications will be developed following investigations into the areas described immediately below in sections 3.1 through 3.5. These investigations will cover both requirements and potential solutions. It is anticipated that these investigations will be published as W3C Notes.
The timescales for deliverables (see section 7) will be refined at the initial face to face meeting proposed for February 2002, and subsequently made publicly available on the W3C website. At the discretion of the Chair, additional work items may be added during the lifetime of the Working Group, provided they fall within the scope of this charter, and there are sufficient resources available within the Working Group.
The first face to face meeting of the Multimodal Interaction working group is scheduled to take place from 25 February to 1 March 2002 at the W3C Technical Plenary in Cannes, France. This will provide an opportunity for direct liaison with other W3C working groups.
XForms provides a means to separate user interaction from the underlying form instance data. Work to date has focussed on visual interaction. Further work is now needed to investigate the appropriate means to integrate support for speech as a basis for multimodal interaction. XML will be used to express the semantic results of recognition, based upon a means for application developers to specify how this XML is to be generated from the syntactic output of the natural language processor.
Having obtained the semantic results, the next step is to apply them to the XForms instance data. A model of speech acts may be appropriate to cater for different types of operations on the instance data, for instance, queries versus assertions, and operations such as addition and deletion. Additional work may be needed to determine how to support dialog mechanisms that do not fit into the XForms model. It is proposed that the Multimodal Interaction working group take over work on the natural language semantics markup language (NLSML) from the Voice Browser working group, where it has been suspended to free up time for work on other items.
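As a purely illustrative sketch of how this might fit together (the <result> and <interpretation> element names below are hypothetical placeholders, not the NLSML syntax), a recognizer's semantic output could be expressed as XML whose structure mirrors the form's instance data, so that it can be mapped directly onto it:

    <!-- XForms instance data for a simple flight query
         (namespace declarations omitted for brevity) -->
    <xforms:model>
      <xforms:instance>
        <flight>
          <origin/>
          <destination/>
          <date/>
        </flight>
      </xforms:instance>
    </xforms:model>

    <!-- Hypothetical semantic result for the utterance
         "I want to fly from London to Paris on March the first" -->
    <result confidence="0.85">
      <interpretation>
        <flight>
          <origin>London</origin>
          <destination>Paris</destination>
          <date>2002-03-01</date>
        </flight>
      </interpretation>
    </result>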
Speech offers a means to enliven the user experience when interacting with visual Web pages, bringing the possibility of playing aural commentaries when a page is loaded, and of using speech to fill out form fields, follow links, and trigger other kinds of actions. The process of playing a given prompt or listening using a given grammar can be triggered through the XHTML event model. For example, clicking on a form field would fire an onFocus event that initiates an aural prompt and activates a speech grammar designed for that field. The events triggered by recognition could be used to interpret the results, or to reprompt in the case of an error.
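To illustrate the flavour of such a combination, here is a minimal sketch; the speech: namespace, its prompt and grammar elements, and the speechStart() scripting function are hypothetical and are not defined by any W3C specification:

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:speech="http://example.org/2002/speech">
      <body>
        <form action="search">
          <!-- hypothetical speech markup: a prompt and a grammar for the field -->
          <speech:prompt id="cityPrompt">Which city?</speech:prompt>
          <speech:grammar id="cityGrammar" src="cities.grxml"/>
          <!-- focusing the field plays the prompt and activates the grammar -->
          <input type="text" name="city"
                 onfocus="speechStart('cityPrompt', 'cityGrammar')"/>
        </form>
      </body>
    </html>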
An investigation will be undertaken with a view to making proposals for combining XHTML with the W3C speech synthesis and speech grammar markup languages, plus additional markup as appropriate. This work would be expected to specify a scripting model for added flexibility, and to be aligned with other deliverables of the Multimodal Interaction Activity. Some issues to be considered are the integration with XForms (see section 3.1), the simultaneous activation of local (field) grammars and global (navigation) grammars, the use of scripting for mixed initiative, and context dependent error handling.
SMIL (Synchronized Multimedia Integration Language) is an XML language for synchronizing multiple activities occurring in parallel or in sequence. Activities can be started, paused, and stopped. The timing of these events can be made to depend on other events. For instance, you could cause an audio stream to stop 2 seconds after a button press event occurs. SMIL includes the means for skipping forward and backwards in a presentation, according to synchronization points. You can also define hypertext links into the middle of a presentation.
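A minimal sketch of the audio example above in SMIL 2.0 terms (the element ids and media files are illustrative only):

    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <body>
        <par>
          <!-- an image acting as a clickable "stop" control -->
          <img id="stopBtn" src="stop.png" dur="indefinite"/>
          <!-- the audio stream ends 2 seconds after the image is activated -->
          <audio src="commentary.wav" end="stopBtn.activateEvent+2s"/>
        </par>
      </body>
    </smil>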
SMIL has a rich potential for fine grained temporal control of multimodal interaction. Further work is needed to investigate the implications for speech synthesis and speech recognition, in particular the event model, and markup for embedding speech interaction as an integral part of a SMIL application. Other work at W3C is looking into the combination of XHTML and SMIL as a declarative means for authoring dynamic presentations, and an alternative to using scripting (dynamic HTML).
Note: the aim is to develop a single specification that can be applied to both XHTML and SMIL for simple kinds of dialog. When richer dialog structure is needed, application developers will be able to loosely couple XHTML or SMIL with a voice browser based upon VoiceXML or other means, such as Java. The basis for achieving this is the subject of section 3.5.
Mobile devices have limited computational and memory resources, and as a result, may benefit from access to a remote high quality speech synthesis engine. An investigation will be undertaken to recommend the basis for a client device to identify and make use of local and remote synthesis resources. Speech synthesis is taken here to include the combination of synthetic speech and prerecorded audio. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C.
Speech recognition on hand-held devices is currently limited to vocabularies of no more than a few hundred words. For larger vocabularies, network access is needed to dedicated speech recognition engines. An investigation will be undertaken to recommend the basis for a client device to identify and make use of local and remote recognition resources. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C.
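Purely to illustrate the kind of markup interface such an investigation might consider (the recognizer attribute and its values below are hypothetical and not defined by any W3C specification), an author might name local and remote recognition resources when referencing a grammar:

    <!-- hypothetical: the "recognizer" attribute is illustrative only -->
    <grammar src="commands.grxml" recognizer="local"/>
    <grammar src="http://asr.example.com/grammars/dictation.grxml"
             recognizer="remote"/>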
A multimodal system may involve several devices, for instance the combination of a cell phone and a wall mounted display. These will need to be coordinated to form a coherent multimodal application. Another example is a PDA providing the user interface, and coupled to a voice gateway running VoiceXML.
An investigation will be undertaken to recommend the basis for coupling such disparate components. Some of the issues to be considered include the set up and termination of sessions, the use of XML to represent events, the asynchronous nature of such events, the ability to recover from temporary breaks in network connectivity, and potential change of IP addresses. This work will focus on the markup and scripting interface, and not the protocols, which are likely to be developed outside of the W3C. This work item may lead to proposals for extensions to VoiceXML for review by the Voice Browser Activity.
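As a purely illustrative sketch of how XML might be used to represent such events (the element names and attributes below are hypothetical), a voice gateway could notify the visual client that a field has been filled by speech:

    <!-- hypothetical event message from a voice gateway to a visual client -->
    <event name="field-filled" session="a1b2c3" timestamp="2002-02-25T10:30:00Z">
      <field name="destination">Paris</field>
    </event>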
The Multimodal Interaction working group will coordinate its work with other W3C working groups and with relevant external groups, see sections 9 and 10 below.
This Working Group is scheduled to last for two years, from February 25th, 2002 to February 25th, 2004.
The Working Group has fulfilled its mission if it succeeds in unifying the efforts of vendors and content providers to stimulate the development and widespread use of multimodal systems conforming to the W3C specifications developed by the Working Group. See section 7 for the timeline for each of the planned specifications.
By default, all documents under development by the Working Group are available to W3C Members from the group's web page. Selected documents will be made publicly available via the W3C's technical reports page after approval from W3C management. The types of documents (Notes, Working Drafts etc.) are defined by the W3C Process.
Documents must have at least one editor and one or more contributors. Documents should have a date by which they will be declared stable. Any remaining issues at this date will be described in the document to avoid delaying its wider release.
This is a provisional list of milestones for the deliverables identified in section 3, and liable to change. The Multimodal Interaction working group will be tasked with maintaining publicly accessible information describing the documents under development and the schedule for their standardization.
It is anticipated that a charter extension will be sought to see the specifications through to Recommendation status, and to prepare errata as needed, based upon subsequent experience.
Access to email discussions and to documents developed by the working group will be limited to W3C Members and Invited Experts, until released for publication by the joint agreement of the working group and the W3C management team. Working group members are required to honor the confidentiality of the group's discussions and working documents until such time as the work is publicly released. Invited experts are bound by the W3C Invited Expert and Collaborators Agreement. Participants working for W3C Member organizations are bound by their contract with W3C.
The Multimodal Interaction Working Group will need to take into account technologies developed by other groups within W3C, to advise those groups of the requirements for multimodal interaction, and to ask them to review specifications prepared by the Working Group that propose extensions to existing or future Web standards. At the time this charter was written, the following ongoing W3C Activities were relevant (listed in alphabetical order):
The following is a list of groups that are known or presumed to be working on, or interested in, standards relating to multimodal browsers, with pointers to the respective projects. The W3C Multimodal Interaction working group will need to liaise with these groups.
The archived member-only mailing list w3c-multimodal-wg@w3.org is the primary means of discussion within the group.
Certain topics need coordination with external groups. The Chair and the Working Group can agree to discuss these topics on a public mailing list. The archived mailing list www-multimodal@w3.org is used for public discussion of W3C proposals for Multimodal Interaction, and Working Group members are encouraged to subscribe. As a precaution against spam you must be subscribed in order to send a message to the list. To subscribe send a message with the word subscribe in the subject line to www-multimodal-request@w3.org.
For discussions relating purely to speech, there is the public mailing list www-voice@w3.org. The archive is available online.
A weekly one-hour phone conference will be held. The exact details, dates and times will be published in advance on the working group page. Additional phone conferences may be scheduled as necessary on specific topics.
Face to face meetings will be arranged 3 to 4 times a year. Meeting details are made available on the W3C Member Calendar and from the Working Group page. The Chair is responsible for providing publicly accessible summaries of Working Group face to face meetings, which will be announced on www-multimodal@w3.org.
The Multimodal Interaction Activity will maintain public pages on the W3C website describing the status of the work, with pointers to the working group page, charter, Activity statement, and email archives.
The Group works by consensus. In the event of failure to achieve consensus, the Chair may resort to a vote as described in the Process Document. Each Member company that has at least one Group member in good standing may vote. There is one vote per W3C Member company. Votes are held by email to allow all participants a chance to vote; there is a two week voting period followed by a period of two working days for the announcement of the result. W3C staff and invited experts do not vote; however, in the event of a tie the Chair has a casting vote. If the issue is resolved by consensus during the voting period, the vote is cancelled.
Note: the term good standing is defined in the W3C Process.
The Multimodal Interaction Working Group will be chaired by (to be announced). The W3C staff contact and Activity Lead will be Dave Raggett. Resources of additional W3C team members will be required for some of the deliverables, should the conditions for starting these deliverables be met.
Requirements for meeting attendance and timely response are described in the Process document. Participation (meetings, reviewing, and writing drafts) is expected to consume time amounting to one day per week for the lifetime of the group. Working group participants are required not to disclose information obtained during participation, until that information is publicly available.
W3C Members may also offer to review one or more working drafts from the group for clarity, consistency, technical merit, fitness for purpose and conformance with other W3C specifications. The only participation requirement is to provide the review comments by the agreed-to date.
As decided on a case by case basis, invited experts may attend a single meeting or a series; they may in some cases be subscribed to the Group mailing list. For the duration of their participation, invited experts are encouraged to adopt the same requirements for meeting attendance and timely response as are required of W3C Members. Invited experts are subject to the same requirement for information disclosure as are required of W3C Members.
The W3C team will be responsible for the mailing lists, public and working group pages, for the posting of meeting minutes, and for liaison with the W3C communications staff for the publication of working drafts. W3C team members are expected to adopt the same requirements for meeting attendance, timely response and information disclosure as are required of W3C Members. The W3C staff contact will be expected to devote 40% of his time to this Activity.
W3C promotes an open working environment. Whenever possible, technical decisions should be made unencumbered by intellectual property right (IPR) claims. W3C's policy for intellectual property is set out in section 2.2 of the W3C Process document.
Members of the Multimodal Interaction Working Group and any other Working Group constituted within the Multimodal Interaction Activity are expected to disclose any intellectual property they have in this area. Any intellectual property essential to implement specifications produced by this Activity must be at least available for licensing on a royalty-free basis. At the suggestion of the Working Group, and at the discretion of the Director of W3C, technologies may be accepted if they are licensed on reasonable, non-discriminatory terms.
Members disclose patent and other IPR claims by sending email to <patent-issues@w3.org>, an archived mailing list that is readable by Members and the W3C Team. Members must disclose all IPR claims to this mailing list but they may also copy other recipients.