Multimodal Interaction Specifications and Notes

This page provides a brief summary of each of the Multimodal Interaction Working Group's major work items.


This suite of specifications is known as the W3C Multimodal Interaction Framework.


The following lists current and completed specifications. Additional work is expected on topics described in the Scope section of the charter.

Multimodal Architecture

Main Architecture specification

The MMI Architecture provides a loosely coupled architecture for multimodal user interfaces, which allows for co-resident and distributed implementations, and focuses on the use of well-defined interfaces between its constituents. The framework is motivated by several basic design goals including (1) Encapsulation, (2) Distribution, (3) Extensibility, (4) Recursiveness and (5) Modularity. The MMI Architecture includes Modality Components, which process specific modalities such as speech or handwriting, an Interaction Manager, which coordinates processing among the Modality Components, and the Life Cycle events, which support communication between the Interaction Manager and the Modality Components.
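
As a rough illustration of the roles described above, the following Python sketch models an Interaction Manager that sends a life cycle request to each Modality Component and collects the responses. The event names StartRequest and StartResponse come from the MMI Architecture specification. The classes, field names, and in-process message passing are assumptions made for illustration only and do not reproduce the specification's XML event format or any transport.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class LifeCycleEvent:
    """Simplified life cycle event (hypothetical in-memory form, not the spec's XML)."""
    name: str         # e.g. "StartRequest" or "StartResponse"
    context: str      # identifies the interaction the event belongs to
    source: str       # sender of the event
    target: str       # receiver of the event
    request_id: str   # pairs a response with its request
    status: str = ""  # filled in on responses only


class ModalityComponent:
    """Encapsulates one modality; reachable only through life cycle events."""

    def __init__(self, name):
        self.name = name

    def handle(self, event):
        # Answer a "...Request" with the matching "...Response",
        # echoing the context and request id and swapping source and target.
        return LifeCycleEvent(
            name=event.name.replace("Request", "Response"),
            context=event.context,
            source=self.name,
            target=event.source,
            request_id=event.request_id,
            status="success",
        )


class InteractionManager:
    """Coordinates the registered Modality Components."""

    def __init__(self, name="im"):
        self.name = name
        self.components = {}
        self._ids = count(1)

    def register(self, component):
        self.components[component.name] = component

    def start_all(self, context):
        """Send a StartRequest to every component and return the responses."""
        responses = []
        for component in self.components.values():
            request = LifeCycleEvent(
                name="StartRequest",
                context=context,
                source=self.name,
                target=component.name,
                request_id=f"req-{next(self._ids)}",
            )
            responses.append(component.handle(request))
        return responses


if __name__ == "__main__":
    im = InteractionManager()
    im.register(ModalityComponent("speech"))
    im.register(ModalityComponent("handwriting"))
    for response in im.start_all("ctx-1"):
        print(response.source, response.name, response.request_id)
```

Because the Interaction Manager only ever calls handle, a Modality Component could be replaced by a remote proxy without changing the coordinating code, which is the kind of loose coupling the architecture aims for.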

Discovery & Registration of Multimodal Modality Components

This document is addressed to people who want to develop Modality Components for Multimodal Applications distributed over a local network or “in the cloud”. In a multimodal system implemented according to the Multimodal Architecture Specification, the system must discover and register its Modality Components in order to preserve the overall state of the distributed elements. In this way, Modality Components can be composed with automation mechanisms in order to adapt the Application to the state of the surrounding environment.

MMI Authoring
MMI Best Practices

Extensible Multi-Modal Annotations (EMMA)

EMMA 1.0

EMMA is a data exchange format for the interface between different levels of input processing and interaction management in multimodal and voice-enabled systems. It provides the means for input processing components, such as speech recognizers, to annotate application-specific data with information such as confidence scores, time stamps, and input mode classification (e.g. key strokes, touch, speech, or pen). EMMA also provides mechanisms for representing alternative recognition hypotheses, including lattices, as well as groups and sequences of inputs. EMMA 1.0 has been completed. The group will publish a new EMMA 1.1 version of the specification which incorporates new features that address issues brought up through EMMA implementations.
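
To make the annotation idea concrete, the sketch below parses a small EMMA 1.0 document with two alternative speech hypotheses and picks the one with the higher confidence. The emma:one-of, emma:interpretation, emma:confidence, emma:tokens, emma:medium and emma:mode names are from EMMA 1.0; the origin and destination payload elements, the sample values, and the best_interpretation helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Two competing hypotheses for one spoken input (illustrative values).
SAMPLE = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
                         emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68"
                         emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def best_interpretation(emma_xml):
    """Return (id, confidence, payload) of the most confident interpretation.

    Returns None when the document contains no emma:interpretation element.
    """
    root = ET.fromstring(emma_xml)
    confidence_attr = f"{{{EMMA_NS}}}confidence"
    best = max(
        root.iter(f"{{{EMMA_NS}}}interpretation"),
        key=lambda el: float(el.get(confidence_attr, "0")),
        default=None,
    )
    if best is None:
        return None
    # The application-specific payload is whatever the interpretation wraps.
    payload = {child.tag: child.text for child in best}
    return best.get("id"), float(best.get(confidence_attr, "0")), payload


if __name__ == "__main__":
    print(best_interpretation(SAMPLE))
```

An interaction manager would typically do more than take the top hypothesis, for example asking the user to confirm when the two confidence scores are close, but the annotations it needs for that decision are the same ones read here.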

EMMA 1.1

EMMA 1.1 includes a set of new features based on feedback from implementers as well as added clarification text in a number of places throughout the specification. The new features include: support for adding human annotations (emma:annotation, emma:annotated-tokens), support for inline specification of process parameters (emma:parameters, emma:parameter, emma:parameter-ref), support for specification of models used in processing beyond grammars (emma:process-model, emma:process-model-ref), extensions to emma:grammar to enable inline specification of grammars, a new mechanism for indicating which grammars are active (emma:grammar-active, emma:active), support for non-XML semantic payloads (emma:result-format), support for multiple emma:info elements and reference to the emma:info relevant to an interpretation (emma:info-ref), and a new attribute to complement the emma:medium and emma:mode attributes that enables specification of the modality used to express an input (emma:expressed-through).

EMMA 2.0

The W3C Multimodal Interaction Working Group aims to develop specifications to enable access to the Web using multimodal interaction. This document is part of a set of specifications for multimodal systems, and provides details of an XML markup language for containing and annotating the interpretation of user input and the production of system output. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing its meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors, for use by components that act on the user's inputs, such as interaction managers. Examples of stages in the production of a system output are the creation of a semantic representation, the assignment of that representation to a particular modality or modalities, and the generation of a surface string for realization by, for example, a text-to-speech engine. The production of the system's output is expected to be generated by output production processes, such as a dialog manager, multimodal presentation planner, content planner, and other types of processors such as surface generation.

InkML - an XML language for digital ink traces

InkML provides a range of features to support real-time ink streaming, multi-party interactions and richly annotated ink archival. Applications may make use of as much or as little information as required, from minimalist applications using only simple traces to more complex problems, such as signature verification or calligraphic animation, requiring full dynamic information. As a platform-neutral format for digital ink, InkML can support collaborative or distributed applications in heterogeneous environments, such as courier signature verification and distance education. This work is complete as InkML has reached the Recommendation stage. However, the Multimodal Interaction Working Group welcomes feedback on the InkML standard.
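
The "simple traces" end of that range can be sketched as follows: a minimal InkML document holds trace elements whose text is a comma-separated list of sample points. The sketch assumes the default trace format (an X and a Y value per point) and explicit values only; InkML also permits more compact encodings, which this illustration does not handle. The sample coordinates and the read_traces helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

INKML_NS = "http://www.w3.org/2003/InkML"

# Two pen strokes, each a sequence of "x y" points (illustrative values).
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 0, 9 14, 8 28, 7 42, 6 56</trace>
  <trace>25 13, 26 27, 27 41</trace>
</ink>"""


def read_traces(inkml):
    """Return one list of (x, y) tuples per trace element.

    Assumes explicit numeric values in the default X, Y trace format.
    """
    root = ET.fromstring(inkml)
    traces = []
    for trace in root.iter(f"{{{INKML_NS}}}trace"):
        points = []
        for chunk in (trace.text or "").split(","):
            values = chunk.split()
            if values:  # skip empty chunks such as a trailing comma
                points.append(tuple(float(v) for v in values))
        traces.append(points)
    return traces


if __name__ == "__main__":
    for number, points in enumerate(read_traces(SAMPLE), start=1):
        print(f"trace {number}: {len(points)} points")
```

Applications that need full dynamic information, such as signature verification, would declare additional channels (for example pen force or time) rather than rely on the two default ones.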

Emotion Markup Language (EmotionML) 1.0

EmotionML provides representations of emotions and related states for technological applications. As the web is becoming ubiquitous, interactive, and multimodal, technology needs to deal increasingly with human factors, including emotions. The language is conceived as a "plug-in" language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.
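
For the second use, automatic recognition, a recognizer might emit a small EmotionML document naming the detected emotion category and how confident it is. The sketch below builds such a document in Python. The emotionml, emotion and category elements, the category-set and confidence attributes, and the "big6" vocabulary URI are taken from EmotionML 1.0 and its companion vocabularies document as best understood here; the build_annotation helper and its values are hypothetical, and the output has not been validated against the EmotionML schema.

```python
import xml.etree.ElementTree as ET

EMOTIONML_NS = "http://www.w3.org/2009/10/emotionml"
# Category vocabulary assumed for this sketch ("big six" basic emotions).
BIG6 = "http://www.w3.org/TR/emotion-voc/xml#big6"


def build_annotation(category, confidence):
    """Serialize a single recognized emotion as an EmotionML string.

    `category` should be a name from the chosen vocabulary, and
    `confidence` a number between 0 and 1.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    # Emit EmotionML as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", EMOTIONML_NS)
    root = ET.Element(f"{{{EMOTIONML_NS}}}emotionml", {"version": "1.0"})
    emotion = ET.SubElement(
        root, f"{{{EMOTIONML_NS}}}emotion", {"category-set": BIG6}
    )
    ET.SubElement(
        emotion,
        f"{{{EMOTIONML_NS}}}category",
        {"name": category, "confidence": f"{confidence:.2f}"},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_annotation("happiness", 0.8))
```

Because the language is designed as a "plug-in", a fragment like this would usually be embedded in another format, such as an EMMA interpretation, rather than exchanged on its own.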

Working Group Notes

Working Group Notes are non standards-track documents that support, clarify, or otherwise provide additional information about the specifications.


The group has published several Notes that provide additional information about the Multimodal Architecture and Interfaces specification.



Since EMMA 1.0 became a W3C Recommendation, a number of new possible use cases for the EMMA language have emerged. These include the use of EMMA to represent multimodal output, biometrics, emotion, sensor data, multi-stage dialogs, and interactions with multiple users. The Working Group therefore decided to work on a document capturing use cases and issues for a series of possible extensions to EMMA, and published a Working Group Note to seek feedback on the various use cases.