Version: 29 October 2008
Contributors:
Michael Johnston, AT&T (Editor in chief)
Paolo Baggia, Loquendo
Daniel C. Burnett, Voxeo (formerly of Vocalocity and Nuance)
Jerry Carter, Nuance
Deborah Dahl, Invited expert
Kazuyuki Ashimura, W3C
Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
The EMMA Specification entered the Candidate Recommendation period on 11 December 2007.
The planned date for entering Proposed Recommendation is XX November 2008. This document summarizes the results from the EMMA implementation reports received and describes the process that the Multimodal Working Group followed in preparing the report.
During the CR period, the Working Group carried out the following activities:
Implementors were invited to contribute to the assessment of the W3C EMMA 1.0 Specification by submitting implementation reports describing the coverage of their EMMA implementations with respect to the test assertions outlined in the table below.
Implementation reports, comments on this document, and requests for further information were posted to the Working Group's public mailing list www-multimodal@w3.org (archive).
The Multimodal Working Group established the following entrance criteria for the Proposed Recommendation phase in the Request for CR:
All three of these criteria have been met. A total of 11 implementation reports were received from 10 different companies and universities. The testimonials below indicate the broad base of support for the specification. All of the required features of EMMA had at least two implementations, and many had ten or eleven. All of the optional features received at least two implementations, except for the annotation attributes on emma:endpoint, which received one implementation. However, these optional features do not have conformance requirements that affect interoperability.
For each test assertion, implementers reported a result of "pass", "fail", or "not-impl" (not implemented).
This section contains testimonials on EMMA from the 10 companies and universities that submitted EMMA implementation reports.
AT&T recognizes the crucial role of standards in the creation and deployment of next generation services supporting more natural and effective interaction through spoken and multimodal interfaces, and continues to be a firm supporter of W3C's activities in the area of spoken and multimodal standards. As a participating member of the W3C Multimodal Interaction working group, AT&T welcomes the Extensible Multimodal Annotation (EMMA) 1.0 Candidate Recommendation.
EMMA 1.0 provides a detailed language for capturing the range of possible interpretations of multimodal inputs and their associated metadata through a full range of input processing stages, from recognition, through understanding and integration, to dialog management. The creation of a common standard for the representation of multimodal inputs is critical in enabling rapid prototyping of multimodal applications, facilitating interoperation of components from different vendors, and enabling effective logging and archiving of multimodal interactions.
AT&T is very happy to contribute to the further progress of the emerging EMMA standard by submitting an EMMA 1.0 implementation report. EMMA 1.0 results are already available from an AT&T EMMA server which is currently being used in the development of numerous multimodal prototypes and trial services.
Avaya Labs Research has been using EMMA in its prototype multimodal dialogue system and is pleased with the contributions EMMA brings to multimodal interaction.
As a common language for representing multimodal input, EMMA lays the cornerstone on which more advanced architectures and technologies can be developed to enable natural multimodal interactions.
Conversational Technologies strongly supports the Extensible MultiModal Annotation 1.0 (EMMA) standard. By providing a standardized yet extensible and flexible basis for representing user input, we believe EMMA has tremendous potential for making possible a wide variety of innovative multimodal applications. In particular, EMMA provides strong support for interoperable applications based on user inputs in human languages in many modalities, including speech, text and handwriting as well as visual modalities such as sign languages. EMMA also supports composite multimodal interactions in which several user inputs in two or more modalities are integrated to represent a single user intent.
The Conversational Technologies EMMA implementations are used in tutorials on commercial applications of natural language processing and spoken dialog systems. We report on two implementations. The first is an EMMA producer (NLWorkbench), which is used to illustrate statistical and grammar-based semantic analysis of speech and text inputs. The second is an EMMA consumer, specifically a viewer for EMMA documents. The viewer can be used in the classroom to simplify examination of EMMA results, as well as potentially in commercial applications for debugging spoken dialog systems. In addition, the viewer could become the basis of an editor supporting such applications as human annotation of EMMA documents for use as input to machine learning applications. For most of the EMMA structural elements the viewer simply provides a tree structure mirroring the XML markup. The most useful aspects of the viewer are probably the graphical representation of EMMA lattices, the ability to display timestamps as standard dates, and the durations computed from EMMA timestamps. The two implementations have been made available as open source software (http://www.sourceforge.net/projects/NLWorkbench).
Deutsche Telekom AG is pleased to see the W3C Extensible Multimodal Annotation markup language 1.0 (EMMA) recommendation moving forward and is happy to support the process by providing the following implementation report.
We made use of EMMA within various multimodal prototype applications. Among others, EMMA documents have been generated from within VoiceXML scripts using ECMAScript and sent to a server-based Interaction Manager (see the MMI architecture, http://www.w3.org/TR/mmi-arch, for more details). The Interaction Manager - implemented using an early version of an SCXML interpreter (see http://www.w3.org/TR/SCXML) - acted as an EMMA consumer and integrated input (represented using EMMA) from various modalities. EMMA documents have also been used for communication between various dialog management modules.
From our implementation experience we note that EMMA has proven to be a valuable specification for the representation of (multimodal) user input.
Central topics of the research at the DFKI IUI department are multimodal interaction systems and individualised mobile network services. DFKI is pleased to contribute to the activities of the W3C Multimodal Interaction Working Group, and regards the Extensible MultiModal Annotation (EMMA) 1.0 Candidate Recommendation as a useful instrument for data representation in multimodal interaction systems.
EMMA 1.0 can also be used to represent the data exchanged between the components of dialog systems within a dialog server.
DFKI submits an EMMA 1.0 Implementation Report that accounts for research results in two different systems, SmartWeb (natural multimodal access to the Semantic Web) and OMDIP (integration of dialog management components). In both projects, results from multimodal recognition have been encoded directly as EMMA structures. EMMA was conceived primarily for the representation of user input. In order to also represent the output side, and at the same time use a common interface language for communication among all system components, we decided to introduce an extension to the EMMA standard, SW-EMMA (SmartWeb EMMA), with specific output representation constructs. For the communication between the multimodal dialog manager and the application core, which provides semantic access to content from the World Wide Web, the XML-based SW-EMMA constructs have been reproduced within the given ontology language RDFS and incorporated into a dedicated discourse ontology.
Kyoto Institute of Technology (KIT) strongly supports the Extensible MultiModal Annotation 1.0 (EMMA) specification. We have been using EMMA within our multimodal human-robot interaction system. EMMA documents are dynamically generated by (1) the Automatic Speech Recognition (ASR) component and (2) the Face Detection/Behavior Recognition component in our implementation.
In addition, the Information Technology Standards Commission of Japan (ITSCJ), which includes KIT as a member, also plans to use EMMA as a data format for its own multimodal interaction architecture specification. ITSCJ believes EMMA is very useful both for uni-modal recognition components, e.g., ASR, and for multimodal integration components, e.g., speech combined with pointing gestures.
Loquendo is a strong believer in the considerable advantages that speech and multimodal standards can bring to speech markets, and continues to actively support their development and deployment. As a participating member of the W3C Multimodal Interaction working group, Loquendo welcomes the Extensible MultiModal Annotation (EMMA) 1.0 Candidate Recommendation.
EMMA 1.0 makes it possible to create rich annotations for inputs of different modalities within a multimodal application. For instance, EMMA 1.0 is used as an annotation format for speech and DTMF input within the Media Resource Control Protocol version 2 (MRCPv2). However, EMMA can also be used for gesture or pen modalities, and it offers interesting features for representing complex semantic information within an Interaction Manager.
Loquendo is very pleased to contribute by submitting an EMMA 1.0 Implementation Report which covers the features relevant to an EMMA producer of voice and DTMF results. EMMA 1.0 results are already available from the Loquendo MRCP Server (a.k.a. Loquendo Speech Suite), promoting quick adoption of the standard for the benefit of the speech market, especially for the integration of advanced speech technologies by means of the MRCPv2 protocol in present and future platforms, both in speech/DTMF contexts and, more generally, in multimodal application contexts.
Loquendo is continuing to give its complete and wholehearted support to the work of the W3C Multimodal Interaction and Voice Browser working groups, as well as to the IETF and the VoiceXML Forum, as part of its continuing commitment and participation in the evolution of this and other standards.
Microsoft is committed to incorporating support for natural input into its products. Input methods such as speech and handwriting expand the types of content that can be created and stored in digital form. The usefulness of such information relies on the ability to create accurate interpretations of the data and the ability to persist these interpretations in an openly accessible format. As a member of the W3C Multimodal Interaction working group, Microsoft supports the Extensible Multimodal Annotation (EMMA) 1.0 Candidate Recommendation as an effective approach to representing interpretations of natural input.
The proposed EMMA 1.0 standard provides a rich language for expressing interpretations produced by handwriting recognizers, speech recognizers and natural language understanding algorithms. A key concern for Microsoft as the company moves to support open file and data exchange formats is that the standards which are ultimately adopted allow for the accurate representation of existing data, typically represented in proprietary binary formats. The proposed EMMA standard, along with the related InkML standard, provides the desired compatibility with existing Microsoft formats. As such, Microsoft believes that the adoption of these specifications will make natural input data available to a broader range of clients, ultimately making such data more useful and valuable.
Nuance is pleased to see the W3C Extensible MultiModal Annotation markup language (EMMA) moving toward final Recommendation. Nuance has implemented EMMA in both prototype and commercial systems such as our high performance network-based speech recognition engine. We have found EMMA to be a richly expressive specification. Our experience suggests it is broadly applicable for representing user input from a variety of sources and describing the subsequent processing steps. We believe that EMMA has a promising future at Nuance, within our industry, and as a linchpin in the next generation of multimodal systems.
The Voice Multi Modal Application Framework developed at the University of Trento, Italy (UNITN), supports the Extensible MultiModal Annotation 1.0 (EMMA) standard. UNITN recognizes the crucial role of the EMMA standard in the creation and deployment of multimodal applications. We believe that EMMA covers a wide variety of innovative multimodal applications, and that it provides strong support for interoperable applications based on user input modalities such as speech, text, mouse clicks, and pen gestures.
We believe that EMMA facilitates interoperability of different components (GUI and VUI), and enables effective logging and archiving of multimodal interactions.
The aim of this section is to describe the range of test assertions developed for the EMMA 1.0 Specification and summarize the results from the implementation reports received in the CR period. The table lists all the assertions that were derived from the EMMA 1.0 Specification.
The Assert ID column uniquely identifies the assertion. The Feature column indicates the specific elements or attributes to which the test assertion applies. The Spec column identifies the section of the EMMA 1.0 Specification from which the assertion was derived. The Req column is a Y/N value indicating whether the test assertion is for a required feature. The Sub column is a Y/N value indicating whether the test assertion is a subconstraint, dependent on the implementation of the preceding non-subconstraint feature test assertion. The Semantics column specifies the semantics of the feature or the constraint which must be met. The Results column gives the counts of 'pass', 'fail', and 'not implemented' (P/F/NI) results across the set of implementation reports.
Test assertions are classified into two types: basic test assertions, which test for the presence of each feature, and subconstraints, which apply only if that particular feature is implemented. Generally, subconstraints encode structural constraints that could not be expressed in the EMMA schema. Subconstraints are marked with 'SUB CONSTRAINT:' in the Semantics field and 'Y' in the Sub field.
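As a non-normative illustration of the basic structural assertions (100, 200, 300, and 301), a minimal EMMA document might look as follows. This is only a sketch: the application namespace, element names, tokens, and confidence values are invented for illustration.

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://example.com/app">
  <!-- Assertions 300/301: N-best interpretations inside emma:one-of,
       ordered best-first by emma:confidence.
       Assertions 2300/2310: emma:medium and emma:mode annotate the input;
       placed on emma:one-of, they apply to the contained
       interpretations (assertion 2600). -->
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <!-- Assertion 200: application namespace markup is contained
         within emma:interpretation -->
    <emma:interpretation id="int1" emma:confidence="0.75"
        emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.60"
        emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
```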
Assert ID | Feature | Spec | Req | Sub | Semantics | Results | ||
---|---|---|---|---|---|---|---|---|
P | F | NI | ||||||
Structural Elements | ||||||||
100 | emma:emma | [3.1] | Y | N | EMMA documents MUST have emma:emma as the root element. | 11 | 0 | 0 |
200 | emma:interpretation | [3.2] | Y | N | Application namespace markup representing the interpretation of inputs MUST be contained within the emma:interpretation element. | 11 | 0 | 0 |
201 | [3.2] | Y | Y | SUB CONSTRAINT: If the emma:interpretation element is empty, then it MUST be annotated as emma:uninterpreted="true" or emma:no-input="true". | 7 | 1 | 3 |
300 | emma:one-of | [3.3.1] | Y | N | EMMA N-best interpretations MUST be contained within an emma:one-of element. | 10 | 0 | 1 |
301 | [3.3.1] | Y | Y | SUB CONSTRAINT: EMMA N-best interpretations contained within an emma:one-of element MUST be ordered best-first in document order where the measure is emma:confidence if present, otherwise the quality metric is platform specific. | 9 | 1 | 1 | |
310 | [3.3.1] | Y | Y | SUB CONSTRAINT: If an emma:one-of element contains other emma:one-of elements, (embedded one-of), emma:one-of elements MUST contain a disjunction-type attribute indicating the reason for the multiple interpretations. | 2 | 0 | 9 | |
400 | emma:group | [3.3.2] | N | N | Interpretations of distinct inputs MAY be represented in a single EMMA document within the emma:group element. | 5 | 0 | 6 |
402 | emma:group-info | [3.3.2.1] | N | N | Information describing the criteria used in determining the grouping of interpretations within an emma:group MAY be indicated within an emma:group-info child element within emma:group. | 2 | 0 | 9 |
500 | emma:sequence | [3.3.3] | N | N | A temporally ordered sequence of distinct inputs MAY be contained within an emma:sequence element with document order corresponding to temporal order. | 2 | 0 | 9 |
600 | emma:lattice | [3.4] | N | N | Lattices SHOULD be represented as an emma:lattice element containing a series of emma:arc and emma:node elements. | 6 | 0 | 5 |
602 | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of the initial attribute on emma:lattice MUST be the value of the from attribute on at least one of the emma:arc elements that it contains. | 6 | 0 | 5 | |
603 | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of each space separated number within the final attribute on emma:lattice MUST be the value of the to attribute on at least one of the emma:arc elements that it contains. | 6 | 0 | 5 | |
604 | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of a to attribute on emma:arc MUST be the value of a from attribute on at least one of its emma:arc siblings or be one of the final values on the parent emma:lattice element. | 5 | 0 | 6 | |
605 | [3.4.1] | Y | Y | SUB CONSTRAINT: The value of a from attribute on emma:arc MUST be the value of a to attribute on at least one of its emma:arc siblings or be the initial value on the parent emma:lattice element. | 5 | 0 | 6 | |
606 | [3.4.1] | Y | Y | SUB CONSTRAINT: Any epsilon transitions in the lattice MUST be represented as emma:arc elements with no content other than emma:info. | 2 | 0 | 9 | |
607 | [3.4.1] | Y | Y | SUB CONSTRAINT: The content of the emma:arc element MUST be either an application namespace XML element, well-formed character content, an application namespace element and an emma:info element, or well-formed character content and an emma:info element. | 5 | 0 | 6 | |
608 | [3.4.1] | Y | Y | SUB CONSTRAINT: There MUST be no more than one emma:node specification for each numbered node in the lattice. | 3 | 0 | 8 | |
609 | [3.4.1] | Y | Y | SUB CONSTRAINT: For each emma:node specification, the value of the node-number attribute MUST appear as the value of some from or to attribute in one of the sibling emma:arc elements. | 3 | 0 | 8 |
700 | emma:literal | [3.5] | Y | N | String literal semantic interpretations MUST be wrapped with the tag emma:literal. | 5 | 0 | 6 |
Annotation Elements | ||||||||
800 | emma:model | [4.1.1] | N | N | The data model of the application semantic markup MAY appear within the emma:model element. | 5 | 0 | 6 |
810 | [4.2.16] | Y | Y | SUB CONSTRAINT: If emma:model appears as a child of emma:emma, there MUST be emma:one-of or emma:interpretation elements within emma:emma with emma:model-ref attributes that refer to the id of the emma:model child of emma:emma. | 3 | 0 | 8 | |
811 | [4.2.16] | Y | Y | SUB CONSTRAINT: If emma:model-ref appears on an emma:interpretation or emma:one-of, there MUST be an emma:model child of the dominating emma:emma whose id value equals the value of the emma:model-ref attribute. | 3 | 0 | 8 | |
901 | emma:derived-from | [4.1.2] | N | N | The interpretation of all but the first stage of processing MAY contain an emma:derived-from element with an attribute resource which refers to the interpretation from which that interpretation was derived. | 3 | 0 | 8 |
904 | [4.1.2] | Y | Y | SUB CONSTRAINT: If there is more than one emma:derived-from within an emma:interpretation or emma:one-of, then all of the emma:derived-from elements MUST have the attribute composite="true". | 2 | 0 | 9 |
910 | emma:derivation | [4.1.2] | Y | N | The interpretation resulting from the most recent processing step MUST appear as a child of emma:emma. | 4 | 0 | 7 |
911 | emma:derivation | [4.1.2] | N | N | If interpretations resulting from earlier processing steps are included they SHOULD appear as children of a single emma:derivation element within emma:emma. | 4 | 0 | 7 |
1000 | emma:grammar | [4.1.3] | N | N | The grammar used MAY be referenced using an emma:grammar element within emma:emma. | 5 | 0 | 6 |
1001 | [4.2.15] | Y | Y | SUB CONSTRAINT: If an emma:interpretation or emma:one-of is annotated with the emma:grammar-ref attribute, then the value of the emma:grammar-ref MUST reference the id of an emma:grammar element child of emma:emma in the same document. | 5 | 0 | 6 | |
1002 | [4.2.15] | Y | Y | SUB CONSTRAINT: If an emma:grammar element appears as a child of emma:emma, then there MUST be an emma:interpretation or emma:one-of element dominated by that emma:emma, whose emma:grammar-ref attribute references the id of the emma:grammar element. | 5 | 0 | 6 | |
1100 | emma:info | [4.1.4] | N | N | Application and vendor specific annotations SHOULD appear within emma:info. | 7 | 0 | 4 |
Annotation attributes | ||||||||
1201 | emma:tokens | [4.2.1] | N | N | emma:tokens MAY be used to indicate the sequence of tokens recognized in the input by the recognizer. | 8 | 0 | 3 |
1300 | emma:process | [4.2.2] | N | N | emma:process MAY be used to identify the process used to assign the interpretation. | 2 | 0 | 9 |
1400 | emma:no-input | [4.2.3] | Y | N | emma:no-input="true" MUST be used to designate lack of expected input. | 7 | 1 | 3 |
1500 | emma:uninterpreted | [4.2.4] | Y | N | Input for which the processor is unable to assign an interpretation MUST be marked as emma:uninterpreted="true". | 6 | 0 | 5 |
1600 | emma:lang | [4.2.5] | N | N | emma:lang MAY be used to designate the human language of the input. | 8 | 0 | 3 |
1700 | emma:signal | [4.2.6] | N | N | emma:signal (with a URI value) MAY be used to designate the location of the signal processed by an EMMA processor. | 3 | 0 | 8 |
1701 | emma:signal-size | [4.2.6] | N | N | emma:signal-size MAY be used to indicate the file size in 8-bit octets of the input signal. | 2 | 0 | 9 |
1800 | emma:media-type | [4.2.7] | N | N | emma:media-type MAY be used to designate the MIME type of the processed signal. | 2 | 0 | 9 |
1900 | emma:confidence | [4.2.8] | N | N | emma:confidence MAY be used to designate the confidence score of an input. | 10 | 0 | 1 |
2000 | emma:source | [4.2.9] | N | N | emma:source (with a URI value) MAY be used to designate the source device. | 2 | 0 | 9 |
2100 | emma:start | [4.2.10.1] | N | N | emma:start with a millisecond value MAY be used to indicate the absolute start time of an input. | 4 | 0 | 7 |
2101 | emma:end | [4.2.10.1] | N | N | emma:end with a millisecond value MAY be used to indicate the absolute end time of an input. | 4 | 0 | 7 |
2201 | emma:time-ref-uri | [4.2.10.2] | N | N | The emma:time-ref-uri attribute MAY be used to indicate the URI used as a reference point for relative timestamps. | 2 | 0 | 9 |
2202 | emma:time-ref-anchor | [4.2.10.2] | Y | Y | SUB CONSTRAINT: If the emma:time-ref-uri points to an interval, emma:time-ref-anchor MUST appear on the interpretation with a value of start or end, indicating whether to use the start or end of that interval as the reference point. | 2 | 0 | 9 |
2203 | emma:offset-to-start | [4.2.10.2] | N | N | The value of emma:offset-to-start MAY be used to indicate the number of milliseconds from the reference point to the start of the current input in a relative timestamp. | 2 | 0 | 9 |
2204 | emma:duration | [4.2.10.3] | N | N | The emma:duration attribute value MAY be used to indicate the duration in milliseconds of the current input. | 9 | 0 | 2 |
2300 | emma:medium | [4.2.11] | Y | N | emma:medium MUST be included on all EMMA inputs, indicating the medium of input. | 10 | 0 | 1 |
2301 | emma:medium | [4.1.2] | Y | Y | SUB CONSTRAINT: When EMMA inputs from different modes are combined to make a composite input, if the combining inputs have different values for emma:medium, then the value on the result MUST be a space separated list of the emma:medium values from the combining modes. | 2 | 0 | 9 |
2310 | emma:mode | [4.2.11] | Y | N | emma:mode MUST be included on all EMMA inputs, indicating the mode of input. | 10 | 0 | 1 |
2311 | emma:mode | [4.1.2] | Y | Y | SUB CONSTRAINT: When EMMA inputs from different modes are combined to make a composite input, the value of emma:mode on the result MUST be a space separated list of the values on the combining inputs. | 2 | 0 | 9 |
2320 | emma:function | [4.2.11] | N | N | emma:function MAY be included on EMMA inputs, providing a classification of the function of the input. | 4 | 0 | 7 |
2330 | emma:verbal | [4.2.11] | N | N | emma:verbal MAY be included on EMMA inputs, indicating whether the input is verbal or non-verbal. | 3 | 0 | 8 |
2401 | emma:hook | [4.2.12] | N | N | emma:hook MAY be used on application namespace markup to indicate where content from another mode needs to be integrated. | 2 | 0 | 9 |
2500 | emma:cost | [4.2.13] | N | N | emma:cost MAY be used to indicate the cost or weight of an interpretation or lattice arc. | 3 | 0 | 8 |
2510 | emma:dialog-turn | [4.2.17] | N | N | emma:dialog-turn MAY be used to indicate an identifier for the current dialog turn. | 5 | 0 | 6 |
Scope and inheritance of annotations | ||||||||
2600 | emma:one-of | [3.3.1] | Y | N | EMMA annotations appearing on emma:one-of MUST be true of all of the container elements (emma:one-of,emma:interpretation,emma:sequence,emma:group) within the emma:one-of. | 5 | 0 | 6 |
2601 | emma:derived-from | [4.3] | Y | N | Each EMMA annotation from the previous stage of a sequential derivation (indicated using emma:derived-from) MUST be true of the interpretations in the current stage of the derivation, unless the annotation is explicitly re-stated. | 2 | 0 | 9 |
Endpoint elements and attributes | ||||||||
2700 | emma:endpoint-info | [4.1.5] | N | N | The emma:endpoint-info element MAY be used to contain emma:endpoint elements describing the characteristics of the endpoints. | 2 | 0 | 9 |
2701 | emma:endpoint | [4.1.5] | N | N | The emma:endpoint element MAY be used to provide annotations regarding a communication endpoint. | 2 | 0 | 9 |
2710 | emma:endpoint-role | [4.2.14] | N | N | The emma:endpoint-role attribute MAY be used to indicate the role of an endpoint. | 1 | 0 | 10 |
2711 | emma:endpoint-address | [4.2.14] | N | N | The emma:endpoint-address attribute MAY be used to indicate the address of an endpoint. | 1 | 0 | 10 |
2713 | emma:port-type | [4.2.14] | N | N | The emma:port-type attribute MAY be used to indicate the type of port of an endpoint. | 1 | 0 | 10 |
2714 | emma:port-num | [4.2.14] | N | N | The emma:port-num attribute MAY be used to indicate the port number of an endpoint. | 1 | 0 | 10 |
2715 | emma:message-id | [4.2.14] | N | N | The emma:message-id attribute MAY be used to indicate the message id of an endpoint. | 1 | 0 | 10 |
2716 | emma:service-name | [4.2.14] | N | N | The emma:service-name attribute MAY be used to indicate the name of the service that the system provides. | 1 | 0 | 10 |
2717 | emma:endpoint-pair-ref | [4.2.14] | N | N | The emma:endpoint-pair-ref attribute MAY be used to indicate the pairing between sink and source endpoints. | 1 | 0 | 10 |
2718 | emma:endpoint-info-ref | [4.2.14] | N | N | The emma:endpoint-info-ref attribute MAY be used to reference the endpoint-info element associated with an interpretation. | 1 | 0 | 10 |
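The lattice subconstraints (602 through 609) can be illustrated with a small, hypothetical word lattice; the node numbers and token content below are invented for illustration and are not taken from any implementation report.

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="lat1"
      emma:medium="acoustic" emma:mode="voice">
    <!-- Assertion 602: initial ("1") is the from value of some arc.
         Assertion 603: each number in final ("4") is the to value of
         some arc. -->
    <emma:lattice initial="1" final="4">
      <emma:arc from="1" to="2">flights</emma:arc>
      <!-- Assertions 604/605: every to value is a sibling's from value
           or a final value; every from value is a sibling's to value
           or the initial value. -->
      <emma:arc from="2" to="3">to</emma:arc>
      <!-- Assertion 606: an epsilon transition is an emma:arc with no
           content other than emma:info -->
      <emma:arc from="2" to="3"/>
      <emma:arc from="3" to="4">boston</emma:arc>
      <!-- Assertions 608/609: at most one emma:node per node number,
           and each node-number value appears in a sibling arc's
           from or to attribute -->
      <emma:node node-number="3"/>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
```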
The Multimodal Working Group would like to acknowledge the contributions of several individuals: