Multimodal Interaction Working Group Charter

The mission of the Multimodal Interaction Working Group, part of the Multimodal Interaction Activity, is to develop open standards that enable the following vision:

Join the Multimodal Interaction Working Group.

End date: 31 March 2011
Confidentiality: Proceedings are Member-only, but the group sends regular summaries of ongoing work to the public mailing list.
Initial Chair: Deborah Dahl
Initial Team Contact: Kazuyuki Ashimura (FTE %: 30)
Usual Meeting Schedule:
  Teleconferences: weekly (one main group call and three task force calls)
  Face-to-face: as required, up to three per year



The primary goal of this group is to develop W3C Recommendations that enable multimodal interaction with various devices including desktop PCs, mobile phones and less traditional platforms such as cars and intelligent home environments. For rapid adoption on a global scale, it should be possible to add simple multimodal capabilities to existing markup languages in a way that is backwards compatible with widely deployed devices, and which builds upon widespread familiarity with existing Web technologies. The standards should be scalable to enable richer capabilities for subsequent generations of multimodal devices.

Users will be able to provide input via speech, handwriting, motion or keystrokes, with output presented via displays, pre-recorded and synthetic speech, audio, and tactile mechanisms such as mobile phone vibrators and Braille strips. Application developers will be able to provide an effective user interface for whichever modes the user selects. To encourage rapid adoption, the same content can be designed for use on both old and new devices. The multimodal capabilities available depend on the device: users of devices that include not only keypads but also a touch panel, microphone and motion sensor can take advantage of all of these modalities, while users of more restricted devices may prefer simpler, lighter modalities such as keypads and voice.

Target Audience

The target audience of the Multimodal Interaction Working Group consists of vendors and service providers of multimodal applications, and includes a range of organizations across industry sectors such as:

Mobile and hand-held devices
As a result of increasingly capable networks, devices, and speech recognition technology, the number of multimodal applications, especially mobile applications, is growing rapidly. Multimodal voice search in particular is a relatively new and compelling use case, and has been implemented in applications by a number of companies, including Google, Microsoft, Yahoo, Vlingo, V-Enable, SpeechCycle, Novauris, AT&T, Openstream, Vocalia, Metaphor Solutions and Melodis. Speech offers a welcome means of interacting with smaller devices, allowing one-handed and hands-free operation. Users benefit from being able to choose whichever modalities they find convenient in a given situation. The Working Group should be of interest to companies developing smart phones and personal digital assistants, or those interested in providing tools and technology to support the delivery of multimodal services to such devices.
Home appliances and home networks
Multimodal interfaces are expected to add value to remote control of home entertainment systems, as well as finding a role for other systems around the home. Companies involved in developing embedded systems and consumer electronics should be interested in W3C's work on multimodal interaction.
Enterprise office applications and devices
Multimodal interaction benefits desktops, wall-mounted interactive displays, multi-function copiers and other office equipment, offering a richer user experience and the opportunity to add modalities such as speech and pen input to existing ones such as keyboards and mice. W3C's standardization work in this area should be of interest to companies developing client software and application authoring technologies who wish to ensure that the resulting standards meet their needs.
Intelligent IT ready cars
With the emergence of dashboard-integrated high-resolution color displays for navigation, communication and entertainment services, W3C's work on open standards for multimodal interaction should be of interest to companies developing the next generation of in-car systems.

Completed work

Under previous charters, going back to 2002, the Multimodal Interaction Working Group has created the following W3C Technical Reports:

The following suite of specifications published by the group is known as the W3C Multimodal Interaction Framework.

In addition to the above, here is a list of documents published by the group.

Work to do

Multimodal Architecture and Interfaces (MMI Architecture)
Bring the MMI Architecture to Recommendation: finalize the generic architecture and define profiles for using the MMI Architecture with specific use cases.

To assist with realizing this goal, the Multimodal Interaction Working Group is tasked with providing a loosely coupled architecture for multimodal user interfaces which allows for both co-resident and distributed implementations, and which focuses on the role of markup and scripting and on well-defined interfaces between its constituents. The framework is motivated by several basic design goals: (1) encapsulation, (2) distribution, (3) extensibility, (4) recursiveness and (5) modularity.
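As an illustration of the life cycle events at the heart of this architecture, the fragment below sketches an Interaction Manager asking a Modality Component to start a dialog. This is a hypothetical sketch: the element and attribute names (StartRequest, Context, RequestID, ContentURL) and the example identifiers follow the working drafts and may differ in the final specification.

```xml
<!-- Hypothetical sketch of an MMI life cycle event: an Interaction
     Manager instructs a voice Modality Component to begin a dialog.
     Element/attribute names follow the working drafts and may change. -->
<mmi:mmi xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" version="1.0">
  <mmi:StartRequest mmi:Source="someIM"
                    mmi:Target="someVoiceMC"
                    mmi:Context="ctx-1"
                    mmi:RequestID="req-1">
    <!-- Markup for the component to run, here a VoiceXML dialog -->
    <mmi:ContentURL mmi:href="http://example.com/dialog.vxml"/>
  </mmi:StartRequest>
</mmi:mmi>
```

The Modality Component would answer with a StartResponse and, when the dialog finishes, a DoneNotification, which is what keeps the Interaction Manager loosely coupled from the component's internals.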

Where practical, this work should leverage existing W3C work. See the list of dependencies for information about related groups.

The Working Group may expand the Multimodal Architecture and Interfaces document to explore a language definition that would make it easier to adapt existing I/O devices and software (e.g., an MRCP-enabled ASR engine) to interact with the framework of life cycle events defined in the architecture.

Multimodal Authoring
Publish examples of the MMI Architecture in use.

The Working Group will investigate and recommend how various W3C languages can be extended for use in a multimodal environment using the multimodal life cycle events. The Working Group may prepare W3C Notes on how the following languages can participate in multimodal applications by incorporating the life cycle events from the multimodal architecture: XHTML, VoiceXML, MathML, SMIL, SVG, InkML and other languages that can be used in a multimodal environment. The Working Group is also interested in investigating how CSS and Delivery Context: Client Interfaces (DCCI) can be used to support multimodal interaction applications and, if appropriate, may write a W3C Note.

Modality Components Definition

Define the Modality Components in the MMI Architecture that are responsible for controlling the various input and output modalities on various devices. Possible examples of Modality Components include ink capture and playback, biometrics, media capture and playback (audio, video, images, sensor data), speech recognition, text to speech, SVG, geolocation, voice dialog and emotion.

The group will generate a document that describes (1) the detailed definition of Modality Components and (2) how to build concrete Modality Components. This document is not expected to be on the Recommendation track; it should instead be folded into an informative appendix of the Multimodal Architecture and Interfaces specification. Although the group will provide several concrete Modality Components as examples in the document, its main purpose is not to supply code for every possible Modality Component but to define clearly what Modality Components are and how to build them.

The Working Group will also create separate documents for selected ink and voice modalities. Those documents will first be published as WG Notes and then incorporated into the Multimodal Architecture and Interfaces specification. The first public WG Notes are expected in 4Q 2009.

Extensible Multi-Modal Annotations (EMMA)

EMMA is a data exchange format for the interface between different levels of input processing and interaction management in multimodal and voice-enabled systems. It provides the means for input processing components, such as speech recognizers, to annotate application-specific data with information such as confidence scores, time stamps, and input mode classification (e.g. keystrokes, touch, speech, or pen). EMMA also provides mechanisms for representing alternative recognition hypotheses and groups and sequences of inputs. EMMA is a target data format for the Semantic Interpretation specification developed in the W3C Voice Browser Activity, which defines augmentations to speech grammars that allow extraction of application-specific data as a result of speech recognition. EMMA supersedes earlier work on the Natural Language Semantics Markup Language (NLSML) in the Voice Browser Activity.
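For illustration, the fragment below sketches how a speech recognizer might report two competing interpretations of a spoken travel query, each annotated with a confidence score and the recognized tokens. The emma: elements and annotation attributes are drawn from the EMMA specification; the application-specific payload elements (<origin>, <destination>) and the utterance itself are hypothetical.

```xml
<!-- Illustrative EMMA result: two competing hypotheses from a speech
     recognizer inside an emma:one-of container. The <origin> and
     <destination> elements are hypothetical application data. -->
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
                         emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68"
                         emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
```

An interaction manager receiving this document can choose the highest-confidence interpretation or ask the user to disambiguate, without needing to know anything about the recognizer that produced it.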

EMMA became a W3C Recommendation in February 2009 and is the first specification from the W3C Multimodal Interaction Working Group to reach Recommendation status. The EMMA specification has been implemented by more than 10 different companies and institutions.

In the period defined by the new charter, the group will actively maintain and address any issues that arise with the existing EMMA specification, possibly resulting in the publication of an interim draft. The group will also investigate potential extensions of the EMMA language to support new features such as:

  • Representation of multimodal outputs for information presentation and embodied conversational agents
  • Extended support for multimodal corpora annotation and logging
  • Biometric processing, emotion detection and representation
  • Other new features that arise through the adoption and use of the EMMA standard

The group may publish an updated version of the EMMA specification incorporating the above capabilities.

InkML - an XML language for ink traces

Bring InkML to Recommendation.

InkML provides a range of features to support real-time ink streaming, multi-party interactions and richly annotated ink archival. Applications may make use of as much or as little information as required, from minimalist applications using only simple traces to more complex problems, such as signature verification or calligraphic animation, requiring full dynamic information. As a platform-neutral format for digital ink, InkML can support collaborative or distributed applications in heterogeneous environments, such as courier signature verification and distance education. The specification is the product of several years of work by a cross-sector working group with input from Apple, Corel, HP, IBM, Maplesoft, Microsoft and Motorola as well as invited experts from academia and other sources.
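As a small illustration of the format, the fragment below sketches a digital ink document containing a single pen stroke; each comma-separated entry in a <trace> is an "x y" coordinate pair sampled along the stroke. The coordinate values are invented for the example, and a real archival document would typically add capture-device and channel metadata.

```xml
<!-- Illustrative InkML: one pen stroke captured as a sequence of
     "x y" points; the values here are invented for the example. -->
<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>
    10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126
  </trace>
</ink>
```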

This work includes defining Modality Components for various possible ink applications, e.g., ink capture, ink playback, handwriting recognition and gesture recognition. The work on defining Modality Components will be carried out in collaboration with the "Modality Components Definition" work described above.


Emotion Markup Language (EmotionML)

Bring Emotion Markup Language to Candidate Recommendation.

EmotionML will provide representations of emotions and related states for technological applications. The possible use cases include:

  • Opinion mining / sentiment analysis in Web 2.0, to automatically track customer's attitude regarding a product across blogs (e.g., Sentimine, Jodange)
  • Affective monitoring, such as "lie detection" using a polygraph, fear detection for surveillance purposes or using wearable sensors to test customer satisfaction
  • Character design and control for games and virtual worlds (e.g., Emotion AI Engine, MystiTool for Second Life)
  • Social robots, such as guide robots engaging with visitors (e.g., Fujitsu "enon", BlueBotics RoboX)
  • Expressive speech synthesis, generating synthetic speech with different emotions, such as happy or sad, friendly or apologetic (e.g., Loquendo TTS, IBM Research TTS)
  • Emotion recognition (e.g., for spotting angry customers in speech dialog systems)
  • Support for people with disabilities, such as educational programs for people with autism

Some of these applications already exist on the market, while others exist only as research prototypes. However, development in this area is very fast.

A standardized notation for emotions is needed: emotions are conceptually clear in the scientific literature, but engineers tend to get them wrong when building actual applications. W3C can help avoid fragmentation of emotion-related technology by providing a scientifically well-founded format for general use.

Naturalistic, interactive multimodal applications need to account for emotions and related human factors. EmotionML will serve as a "plug-in" language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.
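To illustrate the intended "plug-in" usage, the fragment below sketches how a detected emotional state might be annotated, for instance by an affect recognizer feeding an interaction manager. Since EmotionML was still being drafted at the time of this charter, the element names, attributes and the vocabulary URI shown here are assumptions based on the incubator groups' drafts, not final syntax.

```xml
<!-- Hypothetical EmotionML sketch: annotating a detected state with an
     emotion category and an intensity. The names and the vocabulary URI
     are assumptions based on the incubator drafts, not final syntax. -->
<emotionml xmlns="http://www.w3.org/2009/10/emotionml">
  <emotion category-set="http://example.com/vocab/everyday-emotions">
    <category name="satisfaction"/>
    <intensity value="0.6"/>
  </emotion>
</emotionml>
```

The category-set mechanism reflects the "controlled extensibility" design goal below: documents keep a fixed structure while applications plug in the emotion vocabulary appropriate to their domain.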

The specification of EmotionML can build on the previous work of the Emotion Incubator Group and the Emotion Markup Language Incubator Group. These groups have identified use cases and requirements, and have drafted elements of a specification for a core set of requirements. The work of those groups has shown that there is already a high degree of consensus on how to represent emotions, so the Multimodal Interaction Working Group considers standardization feasible.

The following design goals motivate the specification:

  1. Plug-in language. It should be possible to use EmotionML markup in different contexts where emotions and related states need to be represented.
  2. Scientific validity. Representations should reflect the state of knowledge in the affective sciences to the extent that these are practically suitable and relevant, and provide support for state-of-the-art emotion models.
  3. Controlled extensibility. As there are no agreed vocabularies for representing emotions, it must be possible to use custom emotion vocabularies. Independently of the vocabularies used, the structure of EmotionML documents should stay the same. It should be possible to validate documents independently of the choice of vocabularies, including the verification that vocabularies are correctly used.

The EmotionML task in the Multimodal Interaction Working Group will continue the work of the Emotion Markup Language Incubator Group in the Recommendation Track, and aims to produce a W3C Recommendation. Due to the previous work, we expect to have a First Public Working Draft within the first three months of this charter, and to make further rapid progress after that.

Maintenance work

Following the publication of EMMA and InkML as Recommendations, the Working Group will maintain the specifications; that is, it will respond to questions and requests on the public mailing list and issue errata as needed. The Working Group will also consider publishing additional versions of the specifications, depending on such factors as feedback from the user community and any requirements generated for EMMA and InkML by the Multimodal Architecture and Interfaces work and the Multimodal Authoring work.

Success Criteria

For each document to advance to Proposed Recommendation, the group will produce a technical report with at least two independent and interoperable implementations of each required feature. The Working Group anticipates two interoperable implementations of each optional feature, but may reduce this criterion for specific optional features.


The following documents are expected to become W3C Recommendations:

The following documents are either notes or are not expected to advance toward Recommendation:


This Working Group is chartered to last until 31 March 2011. The first face-to-face meeting after re-chartering is planned in association with the Technical Plenary.

Here is a list of milestones identified at the time of re-chartering. Others may be added later at the discretion of the Working Group. The dates are for guidance only and subject to change.

Note: The group will document significant changes from this initial schedule on the group home page.
Specification                                      | FPWD          | LC                                        | CR            | PR         | Rec
Multimodal Architecture and Interfaces             | Completed     | March 2010                                | June 2010     | 4Q 2010    | 1Q 2011
EMMA 2.0                                           | 4Q 2009       | January 2011                              | TBD           | TBD        | TBD
InkML                                              | Completed     | 1st LC: Completed; 2nd LC: September 2009 | November 2009 | April 2010 | June 2010
EmotionML                                          | October 2009  | July 2010                                 | March 2011    | TBD        | TBD
Ink Modality Component Definition (as a WG Note)   | December 2009 | -                                         | -             | -          | -
Voice Modality Component Definition (as a WG Note) | December 2009 | -                                         | -             | -          | -


W3C Groups

These are W3C activities that may be asked to review documents produced by the Multimodal Interaction Working Group, or which may be involved in closer collaboration as appropriate to achieving the goals of the Charter.

Cascading Style Sheets
styling for multimodal applications
Compound Document Formats or its successor
mixed markup such as XHTML+SVG
XHTML and XML Events
Hypertext Coordination Group
the "backplane" framework; The Working Group will also liaise with the Web Application Formats Working Group via the Hypertext Coordination Group
Internationalization
support for human languages across the world
Mobile Web Initiative
browsing the Web from mobile devices
Scalable Vector Graphics
graphical user interfaces
Semantic Web
the role of metadata
Synchronized Multimedia or its successor
timing model for multimedia presentations
Timed Text
synchronized text and video

The MMI Architecture may include a video content server as a Modality Component, so collaboration on how to handle a Modality Component for video service would be beneficial to both groups.

The Working Group may cooperate with two other Working Groups ( Media Fragments, Media Annotations) in the Video in the Web activity as well.

Voice Browsers
voice interfaces
WAI Protocols and Formats
ensuring accessibility
Web Services
application messaging framework
XForms
separating forms into data models, logic and presentation
XML Protocol
XML based messaging
Ubiquitous Web Applications
collaboration on Personalization and the Delivery Context and Geolocation
Math
InkML includes a subset of MathML functionality via the <mapping> element

Furthermore, the Multimodal Interaction Working Group expects to follow these W3C Recommendations:

External Groups

This is an indication of external groups with goals complementary to the Multimodal Interaction Activity. W3C has formal liaison agreements with some of them, e.g. the VoiceXML Forum.

protocols and codecs relevant to multimodal applications
open standards and widely available industry specifications for entertainment devices and home network
work on human factors and command vocabularies
The IETF SpeechSC working group has a dependency on EMMA
ISO/IEC JTC 1/SC 37 Biometrics
user authentication in multimodal applications
OASIS BIAS Integration TC
defining methods for using biometric identity assurance in transactional Web services and SOAs
having initial discussion on how EMMA could be used as a data format for biometrics
standardizing a small set of key interfaces from Web services to mobile devices
VoiceXML Forum
an industry association for VoiceXML


To be successful, the Multimodal Interaction Working Group is expected to have 10 or more active participants for its duration. Effective participation in the Multimodal Interaction Working Group is expected to consume one work day per week for each participant; two days per week for editors. The Multimodal Interaction Working Group will also allocate the necessary resources for building Test Suites for each specification.

In order to make rapid progress, the Multimodal Interaction Working Group consists of several subgroups, each working on a separate document. Members may participate in one or more subgroups.

Participants are reminded of the Good Standing requirements of the W3C Process.

Experts from appropriate communities may also be invited to join the working group, following the provisions for this in the W3C Process.

Working Group participants are not obligated to participate in every work item; however, the Working Group as a whole is responsible for reviewing and accepting all work items.

For budgeting purposes, we may hold up to three full-group face-to-face meetings per year if we believe them to be beneficial. The Working Group anticipates holding a face-to-face meeting in association with the Technical Plenary. The Chair will make Working Group meeting dates and locations available to the group in a timely manner, according to the W3C Process. The Chair is also responsible for providing publicly accessible summaries of Working Group face-to-face meetings, which will be announced on www-multimodal@w3.org.


This group primarily conducts its work on the Member-only mailing list w3c-mmi-wg@w3.org (archive). Certain topics need coordination with external groups; the Chair and the Working Group can agree to discuss these topics on a public mailing list. The archived mailing list www-multimodal@w3.org is used for public discussion of W3C proposals from the Multimodal Interaction Working Group and for public feedback on the group's deliverables.

Information about the group (deliverables, participants, face-to-face meetings, teleconferences, etc.) is available from the Multimodal Interaction Working Group home page.

All proceedings of the Working Group (mail archives, teleconference minutes, face-to-face minutes) will be available to W3C Members. Summaries of face-to-face meetings will be sent to the public list.

Decision Policy

As explained in the Process Document (section 3.3), this group will seek to make decisions when there is consensus. When the Chair puts a question and observes dissent, after due consideration of different opinions, the Chair should record a decision (possibly after a formal vote) and any objections, and move on.

This charter is written in accordance with Section 3.4, Votes of the W3C Process Document and includes no voting procedures beyond what the Process Document requires.

Patent Policy

This Working Group operates under the W3C Patent Policy (5 February 2004 Version). To promote the widest adoption of Web standards, W3C seeks to issue Recommendations that can be implemented, according to this policy, on a Royalty-Free basis.

For more information about disclosure obligations for this group, please see the W3C Patent Policy Implementation.

About this Charter

This charter for the Multimodal Interaction Working Group has been created according to section 6.2 of the Process Document. In the event of a conflict between this document or the provisions of any charter and the W3C Process, the W3C Process shall take precedence.

The most important changes from the previous charter are:

Deborah Dahl, Chair, Multimodal Interaction Working Group
Kazuyuki Ashimura, Multimodal Interaction Activity Lead

$Date: 2009/08/11 14:43:22 $