Multimodal Web Applications Current Status

This page summarizes the relationships among specifications, whether they are finished standards or drafts. Below, each title links to the most recent version of a document.

Completed Work

W3C Recommendations have been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and are endorsed by the Director as Web Standards. Learn more about the W3C Recommendation Track.

Group Notes are not standards and do not have the same level of W3C endorsement.



Emotion Markup Language (EmotionML) 1.0

As the Web is becoming ubiquitous, interactive, and multimodal, technology needs to deal increasingly with human factors, including emotions. EmotionML provides mechanisms to represent emotions in terms of scientifically valid descriptors: categories, dimensions, appraisals, and action tendencies. It is conceived as a "plug-in" language suitable for use in three different areas: (1) manual annotation of data; (2) automatic recognition of emotion-related states from user behavior; and (3) generation of emotion-related system behavior.
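As a rough illustration of the category descriptor mentioned above, the following sketch reads emotion categories from a small EmotionML document using only the Python standard library. The sample document, the function name read_categories, and the choice of the "big6" category vocabulary are assumptions made for this example and are not taken from this page; the EmotionML 1.0 Recommendation remains the authority on the markup.

```python
import xml.etree.ElementTree as ET

# Namespace used by EmotionML 1.0 documents.
EMOTIONML_NS = "http://www.w3.org/2009/10/emotionml"

# Hypothetical sample document: two annotated emotions drawn from the
# "big6" category vocabulary published with the EmotionML vocabularies.
SAMPLE = """\
<emotionml version="1.0"
           xmlns="http://www.w3.org/2009/10/emotionml"
           category-set="http://www.w3.org/TR/emotion-voc/xml#big6">
  <emotion>
    <category name="happiness" value="0.8"/>
  </emotion>
  <emotion>
    <category name="surprise" value="0.3"/>
  </emotion>
</emotionml>
"""


def read_categories(document):
    """Return (name, value) pairs for every <category> inside an <emotion>.

    The value is None when the optional "value" attribute is absent.
    """
    root = ET.fromstring(document)
    ns = {"e": EMOTIONML_NS}
    results = []
    for category in root.findall("e:emotion/e:category", ns):
        raw = category.get("value")
        results.append((category.get("name"), float(raw) if raw is not None else None))
    return results


if __name__ == "__main__":
    for name, value in read_categories(SAMPLE):
        print(name, value)
```

The same pattern would apply to the other descriptor types (dimensions, appraisals, action tendencies), each of which has its own element in the language.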


Multimodal Architecture and Interfaces

This document describes a loosely coupled architecture for multimodal user interfaces, which allows for co-resident and distributed implementations, and focuses on the role of markup and scripting, and the use of well-defined interfaces between its constituents.


EMMA: Extensible MultiModal Annotation markup language

The W3C Multimodal Interaction Working Group aims to develop specifications to enable access to the Web using multimodal interaction. This document is part of a set of specifications for multimodal systems, and provides details of an XML markup language for containing and annotating the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors for use by components that act on the user's inputs such as interaction managers.
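To make the idea of an annotated interpretation more concrete, here is a minimal sketch of how a consuming component, such as an interaction manager, might read one EMMA interpretation in Python. The sample document, its application payload elements (origin, destination), and the helper name summarize_interpretation are hypothetical; only the emma: namespace and the annotation attributes shown are meant to follow EMMA 1.0, and the Recommendation should be consulted for the normative details.

```python
import xml.etree.ElementTree as ET

# Namespace used by EMMA 1.0 elements and annotation attributes.
EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical recognizer output: one spoken request with its annotations.
SAMPLE = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.75"
      emma:tokens="flights from boston to denver">
    <origin>Boston</origin>
    <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>
"""


def summarize_interpretation(document):
    """Collect the annotations and application payload of the first interpretation."""
    root = ET.fromstring(document)
    interp = root.find(f"{{{EMMA_NS}}}interpretation")

    def emma_attr(name):
        # Namespaced attributes are keyed as "{namespace}local-name" in ElementTree.
        return interp.get(f"{{{EMMA_NS}}}{name}")

    return {
        "id": interp.get("id"),
        "medium": emma_attr("medium"),
        "mode": emma_attr("mode"),
        "confidence": float(emma_attr("confidence")),
        "tokens": emma_attr("tokens"),
        # Child elements carry the application-specific attribute/value pairs.
        "payload": {child.tag: child.text for child in interp},
    }


if __name__ == "__main__":
    print(summarize_interpretation(SAMPLE))
```

Keeping the annotations (confidence, mode, tokens) separate from the payload mirrors the split the abstract describes between the interpretation itself and the information about how it was obtained.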

Group Notes


Vocabularies for EmotionML

This document represents a public collection of emotion vocabularies that can be used with EmotionML. It was originally part of an earlier draft of the EmotionML specification, but was moved out of it so that we can easily update, extend and correct the list of vocabularies as required.


Registration & Discovery of Multimodal Modality Components in Multimodal Systems: Use Cases and Requirements


MMI interoperability test report


Best practices for creating MMI Modality Components


Use Cases for Possible Future EMMA Features


Authoring Applications for the Multimodal Architecture

This document describes a multimodal system which implements the W3C Multimodal Architecture and gives an example of a simple multimodal application authored using various W3C markup languages, including SCXML, CCXML, VoiceXML 2.1 and HTML.


Common Sense Suggestions for Developing Multimodal User Interfaces

This document is based on the accumulated experience of several years of developing multimodal applications. It provides a collection of common sense advice for developers of multimodal user interfaces.


Multimodal Application Developer Feedback

Several years of multimodal application development in various business areas and on various device platforms have given developers enough experience to provide detailed feedback about what they like, what they dislike, and what they want to see improved or continued. This experience is provided here as input to the specifications under development in the W3C Multimodal Interaction and Voice Browser Activities.


Modality Component to Host Environment DOM Requirements and Capabilities Assessment

This document describes the DOM capabilities needed to support a heterogeneous multimodal environment and the current state of DOM interfaces supporting those capabilities. These DOM interfaces are used between modality components and their host environment in the W3C Multimodal Interaction Framework as proposed by the W3C Multimodal Interaction Activity.

The Multimodal Interaction Framework separates multimodal systems into a set of functional units, including Input and Output components, an Interaction Manager, Session Components, System and Environment, and Application Functions. In order for those functional components to interact with each other to form an application interpreter, the browser implementation must allow for communication and coordination between those components. This DOM interface identifies the DOM APIs used to communicate and coordinate at the browser implementation level. Multimodal browsers can be stand-alone or distributed systems.


W3C Multimodal Interaction Framework

This document introduces the W3C Multimodal Interaction Framework, and identifies the major components for multimodal systems. Each component represents a set of related functions. The framework identifies the markup languages used to describe information required by components and for data flowing among components. The W3C Multimodal Interaction Framework describes input and output modes widely used today and can be extended to include additional modes of user input and output as they become available.


Requirements for EMMA

This document describes requirements for the Extensible MultiModal Annotation language (EMMA) specification under development in the W3C Multimodal Interaction Activity. EMMA is intended as a data format for the interface between input processors and interaction management systems. It will define the means for recognizers to annotate application-specific data with information such as confidence scores, time stamps, input mode (e.g. keystrokes, speech or pen), alternative recognition hypotheses, and partial recognition results. EMMA is a target data format for the semantic interpretation specification being developed in the Voice Browser Activity, which describes annotations to speech grammars for extracting application-specific data as a result of speech recognition. EMMA supersedes earlier work on the natural language semantics markup language in the Voice Browser Activity.
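The requirement for alternative recognition hypotheses with confidence scores can be pictured with a short Python sketch. It assumes the emma:one-of container and emma:confidence attribute as they appear in the later EMMA 1.0 Recommendation, not anything defined in the requirements document itself; the sample input and the function name best_hypothesis are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by EMMA 1.0 elements and annotation attributes.
EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical n-best list: two competing readings of the same utterance.
SAMPLE_NBEST = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
        emma:tokens="flights to boston"/>
    <emma:interpretation id="int2" emma:confidence="0.68"
        emma:tokens="flights to austin"/>
  </emma:one-of>
</emma:emma>
"""


def best_hypothesis(document):
    """Return (id, confidence, tokens) for the highest-confidence interpretation."""
    root = ET.fromstring(document)
    candidates = []
    for interp in root.iter(f"{{{EMMA_NS}}}interpretation"):
        confidence = float(interp.get(f"{{{EMMA_NS}}}confidence"))
        tokens = interp.get(f"{{{EMMA_NS}}}tokens")
        candidates.append((interp.get("id"), confidence, tokens))
    # Pick the alternative the recognizer scored most highly.
    return max(candidates, key=lambda item: item[1])


if __name__ == "__main__":
    print(best_hypothesis(SAMPLE_NBEST))
```

An interaction manager need not always take the top-scoring alternative; it could, for instance, ask the user to confirm when the two confidence values are close.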


Multimodal Interaction Requirements

This document describes fundamental requirements for the specifications under development in the W3C Multimodal Interaction Activity. These requirements were derived from use case studies as discussed in Appendix A. They have been developed for use by the Multimodal Interaction Working Group (W3C Members only), but may also be relevant to other W3C working groups and related external standard activities.

The requirements cover general issues, inputs, outputs, architecture, integration, synchronization points, runtimes and deployments, but this document does not address application or deployment conformance rules.


Multimodal Interaction Use Cases

The W3C Multimodal Interaction Activity is developing specifications as a basis for a new breed of Web applications in which you can interact using multiple modes of interaction, for instance, using speech, handwriting, and key presses for input, and spoken prompts, audio and visual displays for output. This document describes several use cases for multimodal interaction and presents them in terms of varying device capabilities and the events needed by each use case to couple different components of a multimodal application.


Below are draft documents: Candidate Recommendations and other Working Drafts. Some of these may become Web Standards through the W3C Recommendation Track process. Others may be published as Group Notes or become obsolete specifications.

Candidate Recommendations



Accelerometer

This specification defines an accelerometer sensor interface for obtaining information about acceleration applied to the X, Y and Z axes of a device that hosts the sensor.



Magnetometer

This specification defines a concrete sensor interface to measure the magnetic field along the X, Y and Z axes.



Gyroscope

This specification defines a concrete sensor interface to monitor the rate of rotation around the device’s local three primary axes.

Other Working Drafts


EMMA: Extensible MultiModal Annotation markup language Version 1.1

EMMA is an XML markup language for containing and annotating the interpretation of user input, such as a transcription into words of a raw signal derived from speech, pen or keystroke input. EMMA 1.0 was published as a W3C Recommendation in February 2009. Since then there have been numerous implementations of the standard, and extensive feedback has come in regarding desired new features and clarifications of existing features. The Multimodal Interaction Working Group examined a range of different use cases for extensions and published a W3C Note on Use Cases for Possible Future EMMA Features. This Version 1.1 document describes a set of new features based on feedback from implementers.

Obsolete Specifications

These specifications have either been superseded by others, or have been abandoned. They remain available for archival purposes, but are not intended to be used.



Discovery & Registration of Multimodal Modality Components

This document is addressed to people who want to develop Modality Components for Multimodal Applications distributed either over a local network or "in the cloud". To this end, a multimodal system implemented according to the Multimodal Architecture Specification must discover and register its Modality Components in order to preserve the overall state of the distributed elements. In this way, Modality Components can be composed with automation mechanisms in order to adapt the Application to the state of the surrounding environment.


EMMA: Extensible MultiModal Annotation markup language Version 2.0

This specification describes markup for representing interpretations of user input (speech, keystrokes, pen input etc.) and productions of system output together with annotations for confidence scores, timestamps, medium etc., and forms part of the proposals for the W3C Multimodal Interaction Framework.