EMMA: Extensible MultiModal Annotation 1.0:
Implementation Report

Version: 29 October 2008


Michael Johnston, AT&T (Editor in chief)
Paolo Baggia, Loquendo
Daniel C. Burnett, Voxeo (formerly of Vocalocity and Nuance)
Jerry Carter, Nuance
Deborah Dahl, Invited expert
Kazuyuki Ashimura, W3C

Table of Contents

1. Introduction

The EMMA Specification entered the Candidate Recommendation period on 11 December 2007.

The planned date for entering Proposed Recommendation is XX November 2008. This document summarizes the results from the EMMA implementation reports received and describes the process that the Multimodal Working Group followed in preparing the report.

1.1 Implementation report objectives

  1. The report must verify that the specification is implementable.

1.2 Implementation report non-objectives

  1. The IR does not attempt conformance testing of implementations.

2. Work during the Candidate Recommendation period

During the CR period, the Working Group carried out the following activities:

  1. Clarification and improvement of the exposition of the specification (http://www.w3.org/TR/2008/PR-emma-20081215/).
  2. Disposing of comments that were communicated to the WG during the CR period. These comments
    and their resolution are detailed in the Disposition of Comments document for the CR period.
  3. Preparation of this Implementation Report.

3. Participating in the implementation report

Implementors were invited to contribute to the assessment of the W3C EMMA 1.0 Specification by submitting implementation reports describing the coverage of their EMMA implementations with respect to the test assertions outlined in the table below.

Implementation reports, comments on this document, or requests made for further information were posted to the Working Group's public mailing list www-multimodal@w3.org (archive).

4. Entrance criteria for the Proposed Recommendation phase

The Multimodal Working Group established the following entrance criteria for the Proposed Recommendation phase in the Request for CR:

  1. Sufficient reports of implementation experience have been gathered to demonstrate that EMMA processors based on the specification are implementable and have compatible behavior.
  2. Specific Implementation Report Requirements (outlined below) have been met.
  3. The Working Group has formally addressed and responded to all public comments received by the Working Group.

All three of these criteria have been met. A total of 11 implementations were received from 10 different companies and universities. The testimonials below indicate the broad base of support for the specification. All of the required features of EMMA had at least two implementations, and many had ten or eleven. All of the optional features received at least two implementations, except for the annotation attributes on emma:endpoint, which received one implementation. However, these optional features do not have conformance requirements that have an impact on interoperability.

5. Implementation report requirements

5.1 Detailed requirements for implementation report

  1. Testimonials from implementers, when provided, are included in the IR to document the utility and implementability of the specification.
  2. The IR must cover all features specified in the specification. For each feature the IR should indicate:
  3. Feature status is a factor in test coverage in the report:

5.2 Notes on testing

  1. An implementation report must indicate the outcome of evaluating the implementation with respect to each of the test assertions. Possible outcomes are "pass", "fail" or "not-impl".
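As an illustration of how the per-assertion outcome counts in Section 7 are derived from these three values, the tallying can be sketched in a few lines. This is a hypothetical helper, not the Working Group's actual tooling, and the function name `tally` is invented:

```python
from collections import Counter

# Allowed outcomes for each test assertion in an implementation report.
OUTCOMES = {"pass", "fail", "not-impl"}

def tally(results):
    """Count outcomes for one assertion across all implementation reports."""
    counts = Counter()
    for outcome in results:
        if outcome not in OUTCOMES:
            raise ValueError(f"invalid outcome: {outcome!r}")
        counts[outcome] += 1
    # Return in the P/F/NI order used in the Results column of Section 7.
    return counts["pass"], counts["fail"], counts["not-impl"]

# Example: assertion 201 received 7 pass, 1 fail, and 3 not-impl outcomes.
print(tally(["pass"] * 7 + ["fail"] + ["not-impl"] * 3))  # (7, 1, 3)
```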

5.3 Out of scope

The EMMA Implementation Report does not cover:

6. Systems

This section contains testimonials on EMMA from the 10 companies and universities that submitted EMMA implementation reports.


AT&T, USA

Exec Summary

AT&T recognizes the crucial role of standards in the creation and deployment of next generation services supporting more natural and effective interaction through spoken and multimodal interfaces, and continues to be a firm supporter of W3C's activities in the area of spoken and multimodal standards. As a participating member of the W3C Multimodal Interaction working group, AT&T welcomes the Extensible Multimodal Annotation (EMMA) 1.0 Candidate Recommendation.

EMMA 1.0 provides a detailed language for capturing the range of possible interpretations of multimodal inputs and their associated metadata through a full range of input processing stages, from recognition, through understanding and integration, to dialog management. The creation of a common standard for the representation of multimodal inputs is critical in enabling rapid prototyping of multimodal applications, facilitating interoperation of components from different vendors, and enabling effective logging and archiving of multimodal interactions.

AT&T is very happy to contribute to the further progress of the emerging EMMA standard by submitting an EMMA 1.0 implementation report. EMMA 1.0 results are already available from an AT&T EMMA server which is currently being used in the development of numerous multimodal prototypes and trial services.

Avaya, USA

Exec Summary

Avaya Labs Research has been using EMMA in its prototype multimodal dialogue system and is pleased with the contributions EMMA brings to multimodal interactions.

As a common language for representing multimodal input, EMMA lays the cornerstone on which more advanced architectures and technologies can be developed to enable natural multimodal interactions.

Conversational Technologies, USA

Exec Summary

Conversational Technologies strongly supports the Extensible MultiModal Annotation 1.0 (EMMA) standard. By providing a standardized yet extensible and flexible basis for representing user input, we believe EMMA has tremendous potential for making possible a wide variety of innovative multimodal applications. In particular, EMMA provides strong support for interoperable applications based on user inputs in human languages in many modalities, including speech, text and handwriting as well as visual modalities such as sign languages. EMMA also supports composite multimodal interactions in which several user inputs in two or more modalities are integrated to represent a single user intent.

The Conversational Technologies EMMA implementations are used in tutorials on commercial applications of natural language processing and spoken dialog systems. We report on two implementations. The first is an EMMA producer (NLWorkbench) which is used to illustrate statistical and grammar-based semantic analysis of speech and text inputs. The second implementation is an EMMA consumer, specifically a viewer for EMMA documents. The viewer can be used in the classroom to simplify examination of EMMA results as well as potentially in commercial applications for debugging spoken dialog systems. In addition, the viewer could also become the basis of an editor which would support such applications as human annotation of EMMA documents to be used as input to machine learning applications. For most of the EMMA structural elements the viewer simply provides a tree structure mirroring the XML markup. The most useful aspects of the viewer are probably the graphical representation for EMMA lattices, the ability to see timestamps as standard dates and the computed durations from EMMA timestamps. The two implementations have been made available as open source software (http://www.sourceforge.net/projects/NLWorkbench).
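The timestamp handling described above can be sketched as follows. This is a minimal illustration, not the NLWorkbench viewer code, assuming absolute EMMA timestamps are expressed in milliseconds since the Unix epoch as in the EMMA timestamp model:

```python
from datetime import datetime, timezone

def describe_interval(start_ms, end_ms):
    """Render an emma:start/emma:end pair as an ISO date plus a duration."""
    start = datetime.fromtimestamp(start_ms / 1000, tz=timezone.utc)
    # emma:duration is likewise expressed in milliseconds.
    return start.isoformat(timespec="milliseconds"), end_ms - start_ms

# Hypothetical input: emma:start="1241035886246" emma:end="1241035889246"
date, duration_ms = describe_interval(1241035886246, 1241035889246)
print(date, duration_ms)  # a readable UTC date and a 3000 ms duration
```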

Deutsche Telekom, Germany

Exec Summary

Deutsche Telekom AG is pleased to see the W3C Extensible Multimodal Annotation markup language 1.0 (EMMA) recommendation moving forward and is happy to support the process by providing the following implementation report.

We made use of EMMA within various multimodal prototype applications. Among others, EMMA documents have been generated from within VoiceXML scripts using ECMAScript and sent to a server-based Interaction Manager (see MMI architecture for more details http://www.w3.org/TR/mmi-arch). The Interaction Manager - implemented using an early version of an SCXML interpreter (see http://www.w3.org/TR/SCXML) - acted as an EMMA consumer and integrated input (represented using EMMA) from various modalities. EMMA documents have also been used for communication between various dialog management modules.

From our implementation experience, we note that EMMA has proven to be a valuable specification for the representation of (multimodal) user input.

DFKI, Germany

Exec Summary

Central topics of research at the DFKI IUI department are multimodal interaction systems and individualised mobile network services. DFKI is pleased to contribute to the activities of the W3C Multimodal Interaction Working Group, and regards the Extensible MultiModal Annotation (EMMA) 1.0 Candidate Recommendation as a useful instrument for data representation in multimodal interaction systems.

EMMA 1.0 can also be used to represent the data exchanged between the components of dialog systems in a dialog server.

DFKI submits an EMMA 1.0 Implementation Report that accounts for research results in two different systems, SmartWeb (natural multimodal access to the Semantic Web) and OMDIP (integration of dialog management components). In both projects, results from multimodal recognition have been encoded directly as EMMA structures. EMMA was basically conceived for the representation of user input. In order to also represent the output side while using a common interface language for communication among all system components, we decided to introduce an extension to the EMMA standard, SW-EMMA (SmartWeb EMMA), with specific output representation constructs. For the communication between the multimodal dialog manager and the application core, which provides semantic access to content from the World Wide Web, the XML-based SW-EMMA constructs have been reproduced within the given ontology language RDFS and incorporated into a dedicated discourse ontology.

KIT, Japan

Exec Summary

Kyoto Institute of Technology (KIT) strongly supports the Extensible MultiModal Annotation 1.0 (EMMA) specification. We have been using EMMA within our multimodal human-robot interaction system. EMMA documents are dynamically generated by (1) the Automatic Speech Recognition (ASR) component and (2) the Face Detection/Behavior Recognition component in our implementation.

In addition, the Information Technology Standards Commission of Japan (ITSCJ), which includes KIT as a member, also plans to use EMMA as a data format for its own multimodal interaction architecture specification. ITSCJ believes EMMA is very useful for both uni-modal recognition components, e.g., ASR, and multimodal integration components, e.g., speech combined with pointing gestures.

Loquendo, Italy

Exec Summary

Loquendo is a strong believer in the considerable advantages that speech and multimodal standards can bring to speech markets, and continues to actively support their development and deployment. As a participating member of the W3C Multimodal Interaction working group, Loquendo welcomes the Extensible MultiModal Annotation (EMMA) 1.0 Candidate Recommendation.

EMMA 1.0 makes it possible to create rich annotations for inputs of different modalities within a multimodal application. For instance, EMMA 1.0 is used as an annotation format for speech and DTMF input within Media Resource Control Protocol version 2 (MRCPv2). However, EMMA can also be used for gesture or pen modalities, and it offers interesting features for representing complex semantic information within an Interaction Manager.

Loquendo is very pleased to be able to contribute by submitting an EMMA 1.0 Implementation Report which covers the relevant features for an EMMA producer of voice and DTMF results. EMMA 1.0 results are already available for the Loquendo MRCP Server (a.k.a. Loquendo Speech Suite) to promote its quick adoption for the benefit of the speech market, especially for the integration of advanced speech technologies by means of the MRCPv2 protocol in present and future platforms, both in speech/DTMF contexts and, more generally, in multimodal application contexts.

Loquendo is continuing to give its complete and wholehearted support to the work of the W3C Multimodal Interaction and Voice Browser working groups, as well as to the IETF and the Voice XML Forum, as part of its continuing commitment and participation in the evolution of this and other standards.

Microsoft, USA

Exec Summary

Microsoft is committed to incorporating support for natural input into its products. Input methods such as speech and handwriting expand the types of content that can be created and stored in digital form. The usefulness of such information relies on the ability to create accurate interpretations of the data and the ability to persist these interpretations in an openly accessible format. As a member of the W3C Multimodal Interaction working group, Microsoft supports the Extensible Multimodal Annotation (EMMA) 1.0 Candidate Recommendation as an effective approach to representing interpretations of natural input.

The proposed EMMA 1.0 standard provides a rich language for expressing interpretations produced by handwriting recognizers, speech recognizers and natural language understanding algorithms. A key concern for Microsoft as the company moves to support open file and data exchange formats is that the standards which are ultimately adopted allow for the accurate representation of existing data, typically represented in proprietary binary formats. The proposed EMMA standard, along with the related InkML standard, provides the desired compatibility with existing Microsoft formats. As such, Microsoft believes that the adoption of these specifications will make natural input data available to a broader range of clients, ultimately making such data more useful and valuable.

Nuance, USA

Exec Summary

Nuance is pleased to see the W3C Extensible MultiModal Annotation markup language (EMMA) moving toward final Recommendation. Nuance has implemented EMMA in both prototype and commercial systems such as our high performance network-based speech recognition engine. We have found EMMA to be a richly expressive specification. Our experience suggests it is broadly applicable for representing user input from a variety of sources and describing the subsequent processing steps. We believe that EMMA has a promising future at Nuance, within our industry, and as a linchpin in the next generation of multimodal systems.

University of Trento, Italy

Exec Summary

The Voice Multi Modal Application Framework developed at the University of Trento, Italy (UNITN), supports the Extensible MultiModal Annotation 1.0 (EMMA) standard. UNITN recognizes the crucial role of the EMMA standard in the creation and deployment of multimodal applications. We believe that EMMA covers a wide variety of innovative multimodal applications. The EMMA standard provides strong support for interoperable applications based on user input modalities such as speech, text, mouse clicks and pen gestures.

We believe that EMMA facilitates interoperability of different components (GUI and VUI), and enables effective logging and archiving of multimodal interactions.

7. Test results

The aim of this section is to describe the range of test assertions developed for the EMMA 1.0 Specification and summarize the results from the implementation reports received in the CR period. The table lists all the assertions that were derived from the EMMA 1.0 Specification.

The Assert ID column uniquely identifies the assertion. The Feature column indicates the specific elements or attributes to which the test assertion applies. The Spec column identifies the section of the EMMA 1.0 Specification from which the assertion was derived. The Req column is a Y/N value indicating whether the test assertion is for a required feature. The Sub column is a Y/N value indicating whether the test assertion is a subconstraint, which is dependent on the implementation of the preceding non-subconstraint feature test assertion. The Semantics column specifies the semantics of the feature or the constraint which must be met. The Results column lists the number of 'pass', 'fail', and 'not implemented' outcomes (P/F/NI) across the set of implementation reports.

7.1 Classification of test assertions

Test assertions are classified into two types: basic test assertions, which test for the presence of each feature, and subconstraints, which apply only if that particular feature is implemented. Generally, subconstraints encode structural constraints that could not be expressed in the EMMA schema. Subconstraints are marked with 'SUB CONSTRAINT:' in the Semantics field and 'Y' in the Sub field.

7.2 EMMA Test assertion results

Assert ID | Feature | Spec | Req | Sub | Semantics | Results (P/F/NI)

Structural Elements

100 | emma:emma | [3.1] | Y | N | EMMA documents MUST have emma:emma as the root element. | 11/0/0
200 | emma:interpretation | [3.2] | Y | N | Application namespace markup representing the interpretation of inputs MUST be contained within the emma:interpretation element. | 11/0/0
201 | | [3.2] | Y | Y | SUB CONSTRAINT: If the emma:interpretation element is empty, then it MUST be annotated as emma:uninterpreted="true" or emma:no-input="true". | 7/1/3
300 | emma:one-of | [3.3.1] | Y | N | EMMA N-best interpretations MUST be contained within an emma:one-of element. | 10/0/1
301 | | [3.3.1] | Y | Y | SUB CONSTRAINT: EMMA N-best interpretations contained within an emma:one-of element MUST be ordered best-first in document order, where the measure is emma:confidence if present; otherwise the quality metric is platform specific. | 9/1/1
310 | | [3.3.1] | Y | Y | SUB CONSTRAINT: If an emma:one-of element contains other emma:one-of elements (embedded one-of), emma:one-of elements MUST contain a disjunction-type attribute indicating the reason for the multiple interpretations. | 2/0/9
400 | emma:group | [3.3.2] | N | N | Interpretations of distinct inputs MAY be represented in a single EMMA document within the emma:group element. | 5/0/6
402 | emma:group-info | [] | N | N | Information describing the criteria used in determining the grouping of interpretations within an emma:group MAY be indicated within an emma:group-info child element within emma:group. | 2/0/9
500 | emma:sequence | [3.3.3] | N | N | A temporally ordered sequence of distinct inputs MAY be contained within an emma:sequence element, with document order corresponding to temporal order. | 2/0/9
600 | emma:lattice | [3.4] | N | N | Lattices SHOULD be represented as an emma:lattice element containing a series of emma:arc and emma:node elements. | 6/0/5
602 | | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of the initial attribute on emma:lattice MUST be the value of the from attribute on at least one of the emma:arc elements that it contains. | 6/0/5
603 | | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of each space-separated number within the final attribute on emma:lattice MUST be the value of the to attribute on at least one of the emma:arc elements that it contains. | 6/0/5
604 | | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of a to attribute on emma:arc MUST be the value of a from attribute on at least one of its emma:arc siblings or be one of the final values on the parent emma:lattice element. | 5/0/6
605 | | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of a from attribute on emma:arc MUST be the value of a to attribute on at least one of its emma:arc siblings or be the initial value on the parent emma:lattice element. | 5/0/6
606 | | [3.4.1] | Y | Y | SUB CONSTRAINT: Any epsilon transitions in the lattice MUST be represented as emma:arc elements with no content other than emma:info. | 2/0/9
607 | | [3.4.1] | Y | Y | SUB CONSTRAINT: The content of the emma:arc element MUST be either an application namespace XML element, well-formed character content, an application namespace element and an emma:info element, or well-formed character content and an emma:info element. | 5/0/6
608 | | [3.4.1] | Y | Y | SUB CONSTRAINT: There MUST be no more than one emma:node specification for each numbered node in the lattice. | 3/0/8
609 | | [3.4.1] | Y | Y | SUB CONSTRAINT: For each emma:node specification, the value of the node-number attribute MUST appear as the value of some from or to attribute in one of the sibling emma:arc elements. | 3/0/8
700 | emma:literal | [3.5] | Y | N | String literal semantic interpretations MUST be wrapped with the tag emma:literal. | 5/0/6

Annotation Elements

800 | emma:model | [4.1.1] | N | N | The data model of the application semantic markup MAY appear within the emma:model element. | 5/0/6
810 | | [4.2.16] | Y | Y | SUB CONSTRAINT: If emma:model appears as a child of emma:emma, there MUST be emma:one-of or emma:interpretation elements within emma:emma with emma:model-ref attributes that refer to the id of the emma:model child of emma:emma. | 3/0/8
811 | | [4.2.16] | Y | Y | SUB CONSTRAINT: If emma:model-ref appears on an emma:interpretation or emma:one-of, there MUST be an emma:model child of the dominating emma:emma whose id value equals the value of the emma:model-ref attribute. | 3/0/8
901 | emma:derived-from | [4.1.2] | N | N | The interpretation of all but the first stage of processing MAY contain an emma:derived-from element with an attribute resource which refers to the interpretation from which that interpretation was derived. | 3/0/8
904 | | [4.1.2] | Y | Y | SUB CONSTRAINT: If there is more than one emma:derived-from within an emma:interpretation or emma:one-of, then all of the emma:derived-from elements MUST have the attribute composite="true". | 2/0/9
910 | emma:derivation | [4.1.2] | Y | N | The interpretation resulting from the most recent processing step MUST appear as a child of emma:emma. | 4/0/7
911 | emma:derivation | [4.1.2] | N | N | If interpretations resulting from earlier processing steps are included, they SHOULD appear as children of a single emma:derivation element within emma:emma. | 4/0/7
1000 | emma:grammar | [4.1.3] | N | N | The grammar used MAY be referenced using an emma:grammar element within emma:emma. | 5/0/6
1001 | | [4.2.15] | Y | Y | SUB CONSTRAINT: If an emma:interpretation or emma:one-of is annotated with the emma:grammar-ref attribute, then the value of the emma:grammar-ref MUST reference the id of an emma:grammar element child of emma:emma in the same document. | 5/0/6
1002 | | [4.2.15] | Y | Y | SUB CONSTRAINT: If an emma:grammar element appears as a child of emma:emma, then there MUST be an emma:interpretation or emma:one-of element dominated by that emma:emma whose emma:grammar-ref attribute references the id of the emma:grammar element. | 5/0/6
1100 | emma:info | [4.1.4] | N | N | Application and vendor specific annotations SHOULD appear within emma:info. | 7/0/4

Annotation Attributes

1201 | emma:tokens | [4.2.1] | N | N | emma:tokens MAY be used to indicate the sequence of tokens recognized in the input by the recognizer. | 8/0/3
1300 | emma:process | [4.2.2] | N | N | emma:process MAY be used to identify the process used to assign the interpretation. | 2/0/9
1400 | emma:no-input | [4.2.3] | Y | N | emma:no-input="true" MUST be used to designate lack of expected input. | 7/1/3
1500 | emma:uninterpreted | [4.2.4] | Y | N | Input for which the processor is unable to assign an interpretation MUST be marked as emma:uninterpreted="true". | 6/0/5
1600 | emma:lang | [4.2.5] | N | N | emma:lang MAY be used to designate the human language of the input. | 8/0/3
1700 | emma:signal | [4.2.6] | N | N | emma:signal (with a URI value) MAY be used to designate the location of the signal processed by an EMMA processor. | 3/0/8
1701 | emma:signal-size | [4.2.6] | N | N | emma:signal-size MAY be used to indicate the file size in 8-bit octets of the input signal. | 2/0/9
1800 | emma:media-type | [4.2.7] | N | N | emma:media-type MAY be used to designate the MIME type of the processed signal. | 2/0/9
1900 | emma:confidence | [4.2.8] | N | N | emma:confidence MAY be used to designate the confidence score of an input. | 10/0/1
2000 | emma:source | [4.2.9] | N | N | emma:source (with a URI value) MAY be used to designate the source device. | 2/0/9
2100 | emma:start | [] | N | N | emma:start with a millisecond value MAY be used to indicate the absolute start time of an input. | 4/0/7
2101 | emma:end | [] | N | N | emma:end with a millisecond value MAY be used to indicate the absolute end time of an input. | 4/0/7
2201 | emma:time-ref-uri | [] | N | N | The emma:time-ref-uri attribute MAY be used to indicate the URI used as a reference point for relative timestamps. | 2/0/9
2202 | emma:time-ref-anchor | [] | Y | Y | SUB CONSTRAINT: If the emma:time-ref-uri points to an interval, emma:time-ref-anchor MUST appear on the interpretation with a value of start or end, indicating whether to use the start or end of that interval as the reference point. | 2/0/9
2203 | emma:offset-to-start | [] | N | N | The value of emma:offset-to-start MAY be used to indicate the number of milliseconds from the reference point to the start of the current input in a relative timestamp. | 2/0/9
2204 | emma:duration | [] | N | N | The emma:duration attribute value MAY be used to indicate the duration in milliseconds of the current input. | 9/0/2
2300 | emma:medium | [4.2.11] | Y | N | emma:medium MUST be included on all EMMA inputs, indicating the medium of input. | 10/0/1
2301 | emma:medium | [4.1.2] | Y | Y | SUB CONSTRAINT: When EMMA inputs from different modes are combined to make a composite input, if the combining inputs have different values for emma:medium, then the value on the result MUST be a space-separated list of the emma:medium values from the combining modes. | 2/0/9
2310 | emma:mode | [4.2.11] | Y | N | emma:mode MUST be included on all EMMA inputs, indicating the mode of input. | 10/0/1
2311 | emma:mode | [4.1.2] | Y | Y | SUB CONSTRAINT: When EMMA inputs from different modes are combined to make a composite input, the value of emma:mode on the result MUST be a space-separated list of the values on the combining inputs. | 2/0/9
2320 | emma:function | [4.2.11] | N | N | emma:function MAY be included on EMMA inputs, providing a classification of the function of the input. | 4/0/7
2330 | emma:verbal | [4.2.11] | N | N | emma:verbal MAY be included on EMMA inputs, indicating whether the input is verbal or non-verbal. | 3/0/8
2401 | emma:hook | [4.2.12] | N | N | emma:hook MAY be used on application namespace markup to indicate where content from another mode needs to be integrated. | 2/0/9
2500 | emma:cost | [4.2.13] | N | N | emma:cost MAY be used to indicate the cost or weight of an interpretation or lattice arc. | 3/0/8
2510 | emma:dialog-turn | [4.2.17] | N | N | emma:dialog-turn MAY be used to indicate an identifier for the current dialog turn. | 5/0/6

Scope and Inheritance of Annotations

2600 | emma:one-of | [3.3.1] | Y | N | EMMA annotations appearing on emma:one-of MUST be true of all of the container elements (emma:one-of, emma:interpretation, emma:sequence, emma:group) within the emma:one-of. | 5/0/6
2601 | emma:derived-from | [4.3] | Y | N | Each EMMA annotation from the previous stage of a sequential derivation (indicated using emma:derived-from) MUST be true of the interpretations in the current stage of the derivation, unless the annotation is explicitly re-stated. | 2/0/9

Endpoint Elements and Attributes

2700 | emma:endpoint-info | [4.1.5] | N | N | The emma:endpoint-info element MAY be used to contain emma:endpoint elements describing the characteristics of the endpoints. | 2/0/9
2701 | emma:endpoint | [4.1.5] | N | N | The emma:endpoint element MAY be used to provide annotations regarding a communication endpoint. | 2/0/9
2710 | emma:endpoint-role | [4.2.14] | N | N | The emma:endpoint-role attribute MAY be used to indicate the role of an endpoint. | 1/0/10
2711 | emma:endpoint-address | [4.2.14] | N | N | The emma:endpoint-address attribute MAY be used to indicate the address of an endpoint. | 1/0/10
2713 | emma:port-type | [4.2.14] | N | N | The emma:port-type attribute MAY be used to indicate the type of port of an endpoint. | 1/0/10
2714 | emma:port-num | [4.2.14] | N | N | The emma:port-num attribute MAY be used to indicate the port number of an endpoint. | 1/0/10
2715 | emma:message-id | [4.2.14] | N | N | The emma:message-id attribute MAY be used to indicate the message id of an endpoint. | 1/0/10
2716 | emma:service-name | [4.2.14] | N | N | The emma:service-name attribute MAY be used to indicate the name of the service that the system provides. | 1/0/10
2717 | emma:endpoint-pair-ref | [4.2.14] | N | N | The emma:endpoint-pair-ref attribute MAY be used to indicate the pairing between sink and source endpoints. | 1/0/10
2718 | emma:endpoint-info-ref | [4.2.14] | N | N | The emma:endpoint-info-ref attribute MAY be used to reference the endpoint-info element associated with an interpretation. | 1/0/10
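To make a few of the assertions above concrete, the following sketch builds a small EMMA document and checks assertions 100 (emma:emma root), 2300/2310 (emma:medium and emma:mode), and 301 (best-first ordering by emma:confidence). The document content and values are invented for illustration; this is a minimal sketch, not a conformance tool:

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

doc = f"""<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <answer>Boston</answer>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.60">
      <answer>Austin</answer>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""

root = ET.fromstring(doc)

# Assertion 100: the root element MUST be emma:emma.
assert root.tag == f"{{{EMMA_NS}}}emma"

# Assertions 2300/2310: emma:medium and emma:mode MUST annotate the input.
one_of = root.find(f"{{{EMMA_NS}}}one-of")
assert one_of.get(f"{{{EMMA_NS}}}medium") == "acoustic"
assert one_of.get(f"{{{EMMA_NS}}}mode") == "voice"

# Assertion 301: N-best interpretations MUST be ordered best-first.
scores = [float(i.get(f"{{{EMMA_NS}}}confidence"))
          for i in one_of.findall(f"{{{EMMA_NS}}}interpretation")]
assert scores == sorted(scores, reverse=True)
print("checks passed")
```

Note that ElementTree expands prefixed attribute names to the {namespace-URI}local-name form, which is why the lookups above use expanded keys rather than the emma: prefix.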


Appendix A - Acknowledgements

The Multimodal Working Group would like to acknowledge the contributions of several individuals: