EMMA: Extensible MultiModal Annotation markup language

W3C Candidate Recommendation 11 December 2007

This version:
http://www.w3.org/TR/2007/CR-emma-20071211/
Latest version:
http://www.w3.org/TR/emma/
Previous version:
http://www.w3.org/TR/2007/WD-emma-20070409/
Editor:
Michael Johnston, AT&T
Authors:
Paolo Baggia, Loquendo
Daniel C. Burnett, Nuance
Jerry Carter, Nuance
Deborah A. Dahl, Invited Expert
Gerry McCobb, IBM
Dave Raggett, W3C

Abstract

The W3C Multimodal Interaction working group aims to develop specifications to enable access to the Web using multimodal interaction. This document is part of a set of specifications for multimodal systems, and provides details of an XML markup language for containing and annotating the interpretation of user input. Examples of interpretation of user input are a transcription into words of a raw signal, for instance derived from speech, pen or keystroke input, a set of attribute/value pairs describing their meaning, or a set of attribute/value pairs describing a gesture. The interpretation of the user's input is expected to be generated by signal interpretation processes, such as speech and ink recognition, semantic interpreters, and other types of processors for use by components that act on the user's inputs such as interaction managers.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 11 December 2007 W3C Candidate Recommendation of "EMMA: Extensible MultiModal Annotation markup language". W3C publishes a technical report as a Candidate Recommendation to indicate that the document is believed to be stable, and to encourage implementation by the developer community.

This specification describes markup for representing interpretations of user input (speech, keystrokes, pen input etc.) together with annotations for confidence scores, timestamps, input medium etc., and forms part of the proposals for the W3C Multimodal Interaction Framework.

This document has been produced as part of the W3C Multimodal Interaction Activity, following the procedures set out for the W3C Process, with the intention of advancing it along the W3C Recommendation track. The authors of this document are members of the W3C Multimodal Interaction Working Group.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Publication as a Candidate Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

Since the Second last call working draft in April 2007, a number of clarifications and examples have been added to the text of the specification in order to address detailed feedback on the Second last call. Changes from the previous Working Draft can be found in Appendix F. Please check the Disposition of Comments received during the Last Call period.

The entrance criteria to the Proposed Recommendation phase require at least two independently developed interoperable implementations of each required feature, and at least one or two implementations of each optional feature depending on whether the feature's conformance requirements have an impact on interoperability. Detailed implementation requirements and the invitation for participation in the Implementation Report are provided in the Implementation Report Plan. We expect to meet all requirements of that report within the Candidate Recommendation period closing 14 April 2008. The Multimodal Interaction Working Group will advance EMMA to Proposed Recommendation no sooner than 14 April 2008.

Several of the features in the current draft specification are considered to be at risk of removal due to potential lack of implementations.

Your feedback is welcomed until 14 April 2008. Please send feedback to the public mailing list: www-multimodal@w3.org (public archives). See W3C mailing list and archive usage guidelines.

Conventions of this Document

All sections in this specification are normative, unless otherwise indicated. The informative parts of this specification are identified by "Informative" labels within sections.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

1. Introduction

This section is Informative.

This document presents an XML specification for EMMA, an Extensible MultiModal Annotation markup language, responding to the requirements documented in Requirements for EMMA [EMMA Requirements]. This markup language is intended for use by systems that provide semantic interpretations for a variety of inputs, including but not necessarily limited to, speech, natural language text, GUI and ink input.

It is expected that this markup will be used primarily as a standard data interchange format between the components of a multimodal system; in particular, it will normally be automatically generated by interpretation components to represent the semantics of users' inputs, not directly authored by developers.

The language is focused on annotating single inputs from users, which may be either from a single mode or a composite input combining information from multiple modes, as opposed to information that might have been collected over multiple turns of a dialog. The language provides a set of elements and attributes that are focused on enabling annotations on user inputs and interpretations of those inputs.

An EMMA document can be considered to hold three types of data: instance data, an optional data model, and metadata annotations.

Given the assumptions above about the nature of data represented in an EMMA document, the following general principles apply to the design of EMMA:

The annotations of EMMA should be considered 'normative' in the sense that if an EMMA component produces annotations as described in Section 3 and Section 4, these annotations must be represented using the EMMA syntax. The Multimodal Interaction Working Group may address in later drafts the issues of modularization and profiling; that is, which sets of annotations are to be supported by which classes of EMMA component.

1.1 Uses of EMMA

The general purpose of EMMA is to represent information automatically extracted from a user's input by an interpretation component, where input is to be taken in the general sense of a meaningful user input in any modality supported by the platform. The reader should refer to the sample architecture in W3C Multimodal Interaction Framework [MMI Framework], which shows EMMA conveying content between user input modality components and an interaction manager.

Components that generate EMMA markup:

  1. Speech recognizers
  2. Handwriting recognizers
  3. Natural language understanding engines
  4. Other input media interpreters (e.g. DTMF, pointing, keyboard)
  5. Multimodal integration component

Components that use EMMA include:

  1. Interaction manager
  2. Multimodal integration component

Although not a primary goal of EMMA, a platform may also choose to use this general format as the basis of a general semantic result that is carried along and filled out during each stage of processing. In addition, future systems may also potentially make use of this markup to convey abstract semantic content to be rendered into natural language by a natural language generation component.

1.2 Terminology

anchor point
When referencing an input interval with emma:time-ref-uri, emma:time-ref-anchor-point allows you to specify whether the referenced anchor is the start or end of the interval.
annotation
Information about the interpreted input, for example, timestamps, confidence scores, links to raw input, etc.
composite input
An input formed from several pieces, often in different modes, for example, a combination of speech and pen gesture, such as saying "zoom in here" and circling a region on a map.
confidence
A numerical score describing the degree of certainty in a particular interpretation of user input.
data model
For EMMA, a data model defines a set of constraints on possible interpretations of user input.
derivation
Interpretations of user input are said to be derived from that input, and higher level interpretations may be derived from lower level ones. EMMA allows you to reference the user input or interpretation a given interpretation was derived from, see semantic interpretation.
dialog
For EMMA, dialog can be considered as a sequence of interactions between the users and the application.
endpoint
In EMMA, this refers to a network location which is the source or recipient of an EMMA document. It should be noted that the usage of the term "endpoint" in this context is different from the way that the term is used in speech processing, where it refers to the end of a speech input.
gestures
In multimodal applications gestures are communicative acts made by the user or application. An example is circling an area on a map to indicate a region of interest. Users may be able to gesture with a pen, keystrokes, hand movements or sound. Gestures often form part of composite input. Application gestures are typically animations and/or sound effects.
grammar
A set of rules that describe a sequence of tokens expected in a given input. These can be used by speech and handwriting recognizers to increase recognition accuracy.
handwriting recognition
The process of converting pen strokes into text.
ink recognition
This includes the recognition of handwriting and pen gestures.
input cost
In EMMA, this refers to a numerical measure indicating the weight or processing cost associated with a user's input or part of their input.
input device
The device providing a particular input, for example, a microphone, a pen, a mouse, a camera, or a keyboard.
input function
In EMMA, this refers to the use a particular input is serving, for example, as part of a recording or transcription, as part of a dialog, or as a means to verify the user's identity.
input medium
Whether the input is acoustic, visual, or tactile. For instance, a spoken utterance is an example of an acoustic input, a hand gesture as seen by a camera is an example of a visual input, and pointing with a mouse or pen is an example of a tactile input.
input mode
This distinguishes a particular means of providing an input within a general input medium, for example, speech, DTMF, ink, key strokes, video, photograph, etc.
input source
This is the device that provided the input, for example a particular microphone or camera. EMMA allows you to identify these with a URI.
input tokens
In EMMA, this refers to a sequence of characters, words or other discrete units of input.
instance data
A representation in XML of an interpretation of user input.
interaction manager
A processor that determines how an application interacts with a user. This can be at multiple levels of abstraction, for example, at a detailed level, determining what prompts to present to the user and what actions to take in response to user input, versus a higher level treatment in terms of goals and tasks for achieving those goals. Interaction managers are frequently event driven.
interpretation
In EMMA, an interpretation of user input refers to information derived from the user input that is meaningful to the application.
keystroke input
Input provided by the user pressing on a sequence of keys (buttons), such as a computer keyboard or keypad.
lattice
A set of nodes interconnected with directed arcs such that by following an arc, you can never find yourself back at a node you have already visited (i.e. a directed acyclic graph). Lattices provide a flexible means to represent the results of speech and handwriting recognition, in terms of arcs representing words or character sequences. Different arcs from the same node represent different local hypotheses as to what the user said or wrote.
metadata
Information describing another set of data, for instance, a library catalog card with information on the author, title and location of a book. EMMA is designed to support input processors in providing metadata for interpretations of user input.
multimodal integration
The process of combining inputs from different modes to create an interpretation of composite input. This is also sometimes referred to as multimodal fusion.
multimodal interaction
The means for a user to interact with an application using more than one mode of interaction, for instance, offering the user the choice of speaking or typing, or in some cases, allowing the user to provide a composite input involving multiple modes.
natural language understanding
The process of interpreting text in terms that are useful for an application.
N-best list
An N-best list is a list of the most likely hypotheses for what the user actually said or wrote, where N stands for an integral number such as 5 for the 5 most likely hypotheses.
raw signal
An uninterpreted input, such as an audio waveform captured from a microphone.
semantic interpretation
A normalized representation of the meaning of a user input, for instance, mapping the speech for "San Francisco" into the airport code "SFO".
semantic processor
In EMMA, this refers to systems that can derive interpretations of user input, for instance, mapping the speech for "San Francisco" into the airport code "SFO".
signal interpretation
The process of mapping a discrete or continuous signal into a symbolic representation that can be used by an application, for instance, transforming the audio waveform corresponding to someone saying "2005" into the number 2005.
speech recognition
The process of determining the textual transcription of a piece of speech.
speech synthesis
The process of rendering a piece of text into the corresponding speech, i.e. synthesizing speech from text.
text to speech
The process of rendering a piece of text into the corresponding speech.
time stamp
The time that a particular input or part of an input began or ended.
URI: Uniform Resource Identifier
A URI is a unifying syntax for the expression of names and addresses of objects on the network as used in the World Wide Web. Within this specification, the term URI refers to a Uniform Resource Identifier as defined in [RFC3986] and extended in [RFC3987] with the new name IRI. The term URI has been retained in preference to IRI to avoid introducing new names for concepts such as "Base URI" that are defined or referenced across the whole family of XML specifications. A URI is defined as any legal anyURI primitive as defined in XML Schema Part 2: Datatypes Second Edition Section 3.2.17 [SCHEMA2].
user input
An input provided by a user as opposed to something generated automatically.

2. Structure of EMMA documents

This section is Informative.

As noted above, the main components of an interpreted user input in EMMA are the instance data, an optional data model, and the metadata annotations that may be applied to that input. The realization of these components in EMMA is as follows:

An EMMA interpretation is the primary unit for holding user input as interpreted by an EMMA processor. As will be seen below, multiple interpretations of a single input are possible.

EMMA provides a simple structural syntax for the organization of interpretations and instances, and an annotative syntax to apply the annotation to the input data at different levels.

An outline of the structural syntax and annotations found in EMMA documents is as follows. A fuller definition may be found in the description of individual elements and attributes in Section 3 and Section 4.

From the defined root node emma:emma, the structure of an EMMA document consists of a tree of EMMA container elements (emma:one-of, emma:sequence, emma:group) terminating in a number of interpretation elements (emma:interpretation). The emma:interpretation elements serve as wrappers for either application namespace markup describing the interpretation of the user's input, an emma:lattice element, or an emma:literal element. A single emma:interpretation may also appear directly under the root node.

To illustrate this, here is an example EMMA document for input to a flight reservation application. In this example there are two speech recognition results and associated semantic representations of the input. The system is uncertain whether the user meant "flights from Boston to Denver" or "flights from Austin to Denver". The annotations to be captured are timestamps and confidence scores for the two inputs.

Example:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542"
     emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
    emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>

    <emma:interpretation id="int2" emma:confidence="0.68"
    emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

Attributes on the root emma:emma element indicate the version and namespace. The emma:emma element contains an emma:one-of element which contains a disjunctive list of possible interpretations of the input. The actual semantic representation of each interpretation is within the application namespace. In the example here the application specific semantics involves elements origin and destination indicating the origin and destination cities for looking up a flight. The timestamp is the same for both interpretations and it is annotated using values in milliseconds in the emma:start and emma:end attributes on the emma:one-of. The confidence scores and tokens associated with each of the inputs are annotated using the EMMA annotation attributes emma:confidence and emma:tokens on each of the emma:interpretation elements.
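As an informal illustration of how a consuming component might read the annotations described above, the following Python sketch parses a copy of the example document (with the schema location omitted) using the standard library. The function name `summarize` and the dictionary keys are assumptions made for this sketch and are not defined by EMMA.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
APP_NS = "http://www.example.com/example"  # application namespace used in the example


def q(name):
    """Return the namespace-qualified name used by ElementTree for an EMMA name."""
    return f"{{{EMMA_NS}}}{name}"


DOC = """\
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542"
     emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
    emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68"
    emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
"""


def summarize(xml_text):
    """Collect per-interpretation annotations plus the shared duration in ms."""
    root = ET.fromstring(xml_text)
    one_of = root.find(q("one-of"))
    # emma:start and emma:end are millisecond timestamps on the emma:one-of.
    duration_ms = int(one_of.get(q("end"))) - int(one_of.get(q("start")))
    rows = []
    for interp in one_of.findall(q("interpretation")):
        rows.append({
            "id": interp.get("id"),  # id is not namespace-qualified
            "confidence": float(interp.get(q("confidence"))),
            "tokens": interp.get(q("tokens")),
            "origin": interp.find(f"{{{APP_NS}}}origin").text,
        })
    return duration_ms, rows


duration_ms, rows = summarize(DOC)
print(duration_ms)
for row in rows:
    print(row)
```

Note that the EMMA annotation attributes are looked up with their namespace-qualified names, while id is read as an unqualified attribute.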

2.1 Data model

An EMMA data model expresses the constraints on the structure and content of instance data, for the purposes of validation. As such, the data model may be considered as a particular kind of annotation (although, unlike other EMMA annotations, it is not a feature pertaining to a specific user input at a specific moment in time, it is rather a static and, by its very definition, application-specific structure). The specification of a data model in EMMA is optional.

Since Web applications today use different formats to specify data models, e.g. XML Schema Part 1: Structures Second Edition [XML Schema Structures], XForms 1.0 (Second Edition) [XFORMS], RELAX NG Specification [RELAX-NG], etc., EMMA itself is agnostic to the format of data model used.

Data model definition and reference is defined in Section 4.1.1.

2.2 EMMA namespace prefixes

An EMMA attribute is qualified with the EMMA namespace prefix if the attribute can also be used as an in-line annotation on elements in the application's namespace. Most of the EMMA annotation attributes in Section 4.2 are in this category. An EMMA attribute is not qualified with the EMMA namespace prefix if the attribute only appears on an EMMA element. This rule ensures consistent usage of the attributes across all examples.

Attributes from other namespaces are permissible on all EMMA elements. As an example xml:lang may be used to annotate the human language of character data content.

3. EMMA structural elements

This section defines elements in the EMMA namespace which provide the structural syntax of EMMA documents.

3.1 Root element: emma:emma

Annotation emma:emma
Definition The root element of an EMMA document.
Children The emma:emma element MUST immediately contain a single emma:interpretation element or EMMA container element: emma:one-of, emma:group, emma:sequence. It MAY also contain an optional single emma:derivation element and an optional single emma:info annotation element. It MAY also contain multiple optional emma:grammar annotation elements, emma:model annotation elements, and emma:endpoint-info annotation elements.
Attributes
  • Required:
    • version: the version of EMMA used for the interpretation(s). Interpretations expressed using this specification MUST use 1.0 for the value.
    • Namespace declaration for EMMA, see below.
  • Optional:
    • any other namespace declarations for application specific namespaces.
Applies to None

The root element of an EMMA document is named emma:emma. It holds a single emma:interpretation or EMMA container element (emma:one-of, emma:sequence, emma:group). It MAY also contain a single emma:derivation element containing earlier stages of the processing of the input (See Section 4.1.2). It MAY also contain an optional single annotation element: emma:info and multiple optional emma:grammar, emma:model, and emma:endpoint-info elements.

It MAY hold attributes for information pertaining to EMMA itself, along with any namespaces which are declared for the entire document, and any other EMMA annotative data. The emma:emma element and other elements and attributes defined in this specification belong to the XML namespace identified by the URI "http://www.w3.org/2003/04/emma". In the examples, the EMMA namespace is generally declared using the attribute xmlns:emma on the root emma:emma element. EMMA processors MUST support the full range of ways of declaring XML namespaces as defined by the Namespaces in XML 1.1 (Second Edition) [XMLNS]. Application markup MAY be declared in an explicit application namespace, or an undefined namespace (equivalent to setting xmlns="").

For example:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
    ....
</emma:emma>

or


<emma version="1.0" xmlns="http://www.w3.org/2003/04/emma">
    ....
</emma>
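The two declarations above differ only in syntax. As a rough check, the following Python sketch parses a prefixed and a default-namespace root element and compares the resulting qualified names. The helper name `root_name` is an assumption of this sketch, and the roots are written as empty elements purely for brevity.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Root element declared with an explicit emma: prefix.
PREFIXED = '<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"/>'
# Root element declared with EMMA as the default namespace.
DEFAULT = '<emma version="1.0" xmlns="http://www.w3.org/2003/04/emma"/>'


def root_name(xml_text):
    """Return the namespace-qualified tag of the document's root element."""
    return ET.fromstring(xml_text).tag


prefixed_tag = root_name(PREFIXED)
default_tag = root_name(DEFAULT)
print(prefixed_tag == default_tag)
```

A namespace-aware parser reports the same qualified name in both cases, so a processor that compares qualified names rather than literal prefixes handles either form.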

3.2 Interpretation element: emma:interpretation

Annotation emma:interpretation
Definition The emma:interpretation element acts as a wrapper for application instance data or lattices.
Children The emma:interpretation element MUST immediately contain either application instance data, or a single emma:lattice element, or a single emma:literal element, or in the case of uninterpreted input or no input emma:interpretation MUST be empty. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element.
Attributes
  • Required: Attribute id of type xsd:ID that uniquely identifies the interpretation within the EMMA document.
  • Optional: The annotation attributes: emma:tokens, emma:process, emma:no-input, emma:uninterpreted, emma:lang, emma:signal, emma:signal-size, emma:media-type, emma:confidence, emma:source, emma:start, emma:end, emma:time-ref-uri, emma:time-ref-anchor-point, emma:offset-to-start, emma:duration, emma:medium, emma:mode, emma:function, emma:verbal, emma:cost, emma:grammar-ref, emma:endpoint-info-ref, emma:model-ref, emma:dialog-turn.
Applies to The emma:interpretation element is legal only as a child of emma:emma, emma:group, emma:one-of, emma:sequence, or emma:derivation.

The emma:interpretation element holds a single interpretation represented in application specific markup, or a single emma:lattice element, or a single emma:literal element.

The emma:interpretation element MUST be empty if it is marked with emma:no-input="true" (Section 4.2.3). The emma:interpretation element MUST be empty if it has been annotated with emma:uninterpreted="true" (Section 4.2.4) or emma:function="recording" (Section 4.2.11).
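The emptiness constraint can be sketched as a simple check in Python. This is an illustration only: it assumes that "empty" means no child elements and no non-whitespace character data, which is a strict reading, and the names `must_be_empty`, `is_empty`, and `find_violations` are hypothetical.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"


def q(name):
    """Return the namespace-qualified name used by ElementTree for an EMMA name."""
    return f"{{{EMMA_NS}}}{name}"


VALID = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:no-input="true"
      emma:medium="acoustic" emma:mode="voice"/>
</emma:emma>
"""

INVALID = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="int1" emma:uninterpreted="true"
      emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
  </emma:interpretation>
</emma:emma>
"""


def must_be_empty(interp):
    """True if one of the three annotations requires an empty interpretation."""
    return (interp.get(q("no-input")) == "true"
            or interp.get(q("uninterpreted")) == "true"
            or interp.get(q("function")) == "recording")


def is_empty(interp):
    """Assumption: empty means no children and no non-whitespace text."""
    return len(interp) == 0 and not (interp.text or "").strip()


def find_violations(xml_text):
    """Return ids of interpretations that must be empty but are not."""
    root = ET.fromstring(xml_text)
    return [i.get("id") for i in root.iter(q("interpretation"))
            if must_be_empty(i) and not is_empty(i)]


print(find_violations(VALID))
print(find_violations(INVALID))
```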

Attributes:

  1. id: a REQUIRED xsd:ID value that uniquely identifies the interpretation within the EMMA document.

For example:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="r1" emma:medium="acoustic" emma:mode="voice">
    ...
  </emma:interpretation>
</emma:emma>

While emma:medium and emma:mode are optional on emma:interpretation, note that all EMMA interpretations must be annotated for emma:medium and emma:mode, so either these attributes must appear directly on emma:interpretation or they must appear on an ancestor emma:one-of node or they must appear on an earlier stage of the derivation listed in emma:derivation.
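A consumer could resolve these two annotations by looking first at the interpretation and then at its ancestor emma:one-of elements. The Python sketch below illustrates that lookup under stated assumptions: it does not follow emma:derivation, and the function name `resolve_annotation` is hypothetical.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"


def q(name):
    """Return the namespace-qualified name used by ElementTree for an EMMA name."""
    return f"{{{EMMA_NS}}}{name}"


DOC = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1"><origin>Boston</origin></emma:interpretation>
    <emma:interpretation id="int2"><origin>Austin</origin></emma:interpretation>
  </emma:one-of>
</emma:emma>
"""


def resolve_annotation(root, interp, name):
    """Look up an EMMA annotation on interp, then on ancestor emma:one-of nodes.

    Assumption: earlier derivation stages (emma:derivation) are not consulted.
    """
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = interp
    while node is not None:
        if node is interp or node.tag == q("one-of"):
            value = node.get(q(name))
            if value is not None:
                return value
        node = parents.get(node)
    return None


root = ET.fromstring(DOC)
int1 = root.find(f".//{q('interpretation')}[@id='int1']")
medium = resolve_annotation(root, int1, "medium")
mode = resolve_annotation(root, int1, "mode")
missing = resolve_annotation(root, int1, "function")
print(medium, mode, missing)
```

A return value of None for emma:medium or emma:mode would indicate, within the limits of this sketch, that the annotation has to come from an earlier derivation stage or is absent.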

3.3 Container elements

3.3.1 emma:one-of element

Annotation emma:one-of
Definition A container element indicating a disjunction among a collection of mutually exclusive interpretations of the input.
Children The emma:one-of element MUST immediately contain a collection of one or more emma:interpretation elements or container elements: emma:one-of, emma:group, emma:sequence. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element.
Attributes
  • Required:
    • Attribute id of type xsd:ID
    • The attribute disjunction-type MUST be present if emma:one-of is embedded within emma:one-of. The possible values of disjunction-type are {recognition, understanding, multi-device, and multi-process}.
  • Optional:
    • On a single non-embedded emma:one-of the attribute disjunction-type is optional.
    • The following annotation attributes are optional: emma:tokens, emma:process, emma:lang, emma:signal, emma:signal-size, emma:media-type, emma:confidence, emma:source, emma:start, emma:end, emma:time-ref-uri, emma:time-ref-anchor-point, emma:offset-to-start, emma:duration, emma:medium, emma:mode, emma:function, emma:verbal, emma:cost, emma:grammar-ref, emma:endpoint-info-ref, emma:model-ref, emma:dialog-turn.
Applies to The emma:one-of element MAY only appear as a child of emma:emma, emma:one-of, emma:group, emma:sequence, or emma:derivation.

The emma:one-of element acts as a container for a collection of one or more interpretation (emma:interpretation) or container elements (emma:one-of, emma:group, emma:sequence), and denotes that these are mutually exclusive interpretations.

An N-best list of choices in EMMA MUST be represented as a set of emma:interpretation elements contained within an emma:one-of element. For instance, a series of different recognition results in speech recognition might be represented in this way.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>

    <emma:interpretation id="int2">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

The function of the emma:one-of element is to represent a disjunctive list of possible interpretations of a user input. A disjunction of possible interpretations of an input can be the result of different kinds of processing or ambiguity. One source is multiple results from a recognition technology such as speech or handwriting recognition. Multiple results can also occur from parsing or understanding natural language. Another possible source of ambiguity is the application of multiple different kinds of recognition or understanding components to the same input signal. For example, a single ink input signal might be processed by both handwriting recognition and gesture recognition. Another is the use of more than one recording device for the same input (for example, multiple microphones).

In order to make explicit these different kinds of multiple interpretations and allow for concise statement of the annotations associated with each, the emma:one-of element MAY appear within another emma:one-of element. If emma:one-of elements are nested then they MUST indicate the kind of disjunction using the attribute disjunction-type. The values of disjunction-type are {recognition, understanding, multi-device, and multi-process}. For the most common use case, where there are multiple recognition results and some of them have multiple interpretations, the top-level emma:one-of is disjunction-type="recognition" and the embedded emma:one-of has the attribute disjunction-type="understanding".
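The nesting rule can be sketched as a small validation pass in Python. The sketch assumes that "embedded" covers any emma:one-of with an emma:one-of ancestor, not only a direct child, and the function name `check_disjunction_types` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
ONE_OF = f"{{{EMMA_NS}}}one-of"
ALLOWED = {"recognition", "understanding", "multi-device", "multi-process"}

DOC = """\
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" disjunction-type="recognition"
      emma:medium="acoustic" emma:mode="voice">
    <emma:one-of id="r2" disjunction-type="understanding">
      <emma:interpretation id="int1"><city>boston</city></emma:interpretation>
    </emma:one-of>
    <emma:one-of id="r3">
      <emma:interpretation id="int2"><city>austin</city></emma:interpretation>
    </emma:one-of>
  </emma:one-of>
</emma:emma>
"""


def check_disjunction_types(root):
    """Return ids of emma:one-of elements that break the disjunction-type rule."""
    problems = []

    def visit(node, inside_one_of):
        is_one_of = node.tag == ONE_OF
        if is_one_of:
            dtype = node.get("disjunction-type")  # unqualified attribute
            if dtype is None:
                # Only an embedded emma:one-of is required to carry the attribute.
                if inside_one_of:
                    problems.append(node.get("id"))
            elif dtype not in ALLOWED:
                problems.append(node.get("id"))
        for child in node:
            visit(child, inside_one_of or is_one_of)

    visit(root, False)
    return problems


problems = check_disjunction_types(ET.fromstring(DOC))
print(problems)
```

Here r3 is reported because it is embedded in r1 but carries no disjunction-type, while a single non-embedded emma:one-of without the attribute would pass.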

As an example, suppose that in an interactive flight reservation application, recognition yielded 'Boston' or 'Austin', and each had a semantic interpretation as either the assertion of a city name or the specification of a flight query with the city as the destination. This would be represented as follows in EMMA:


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" disjunction-type="recognition"
      emma:start="12457990" emma:end="12457995"
      emma:medium="acoustic" emma:mode="voice">
     <emma:one-of id="r2" disjunction-type="understanding"
         emma:tokens="boston">
       <emma:interpretation id="int1">
          <assert><city>boston</city></assert>
       </emma:interpretation>
       <emma:interpretation id="int2">
          <flight><dest><city>boston</city></dest></flight>
       </emma:interpretation>
     </emma:one-of>
     <emma:one-of id="r3" disjunction-type="understanding"
         emma:tokens="austin">
       <emma:interpretation id="int3">
          <assert><city>austin</city></assert>
       </emma:interpretation>
       <emma:interpretation id="int4">
          <flight><dest><city>austin</city></dest></flight>
       </emma:interpretation>
     </emma:one-of>
  </emma:one-of>
</emma:emma>

EMMA MAY explicitly represent ambiguity resulting from different processes, devices, or sources using embedded emma:one-of and the disjunction-type attribute. Multiple different interpretations resulting from different factors MAY also be listed within a single unstructured emma:one-of though in this case it is more complex or impossible to uncover the sources of the ambiguity if required by later stages of processing. If there is no embedding in emma:one-of, then the disjunction-type attribute is not required. If the disjunction-type attribute is missing then by default the source of disjunction is unspecified.

The example case above could also be represented as:


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:start="12457990" emma:end="12457995"
         emma:medium="acoustic" emma:mode="voice">
     <emma:interpretation id="int1" emma:tokens="boston">
        <assert><city>boston</city></assert>
     </emma:interpretation>
     <emma:interpretation id="int2" emma:tokens="boston">
        <flight><dest><city>boston</city></dest></flight>
     </emma:interpretation>
     <emma:interpretation id="int3" emma:tokens="austin">
        <assert><city>austin</city></assert>
     </emma:interpretation>
     <emma:interpretation id="int4" emma:tokens="austin">
        <flight><dest><city>austin</city></dest></flight>
     </emma:interpretation>
  </emma:one-of>
</emma:emma>

But in this case information about which interpretations resulted from speech recognition and which resulted from language understanding is lost.

A list of emma:interpretation elements within an emma:one-of MUST be sorted best-first by some measure of quality. The quality measure is emma:confidence if present; otherwise, the quality metric is platform-specific.

With embedded emma:one-of structures there is no requirement for the confidence scores within different emma:one-of to be on the same scale. For example, the scores assigned by handwriting recognition might not be comparable to those assigned by gesture recognition. Similarly, if multiple recognizers are used there is no guarantee that their confidence scores will be comparable. For this reason the ordering requirement on emma:interpretation within emma:one-of only applies locally to sister emma:interpretation elements within each emma:one-of. There is no requirement on the ordering of embedded emma:one-of elements within a higher emma:one-of element.
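
As an illustration of the local ordering rule, the following Python sketch checks each emma:one-of on its own, comparing only sister emma:interpretation elements. The function name unordered_one_ofs and the sample document are hypothetical and not part of EMMA. Interpretations without emma:confidence are skipped, because a platform-specific quality measure cannot be checked generically.

```python
import xml.etree.ElementTree as ET

E = "{http://www.w3.org/2003/04/emma}"  # EMMA namespace in ElementTree notation

def unordered_one_ofs(root):
    """Return every emma:one-of whose sister emma:interpretation elements
    are not sorted best-first by emma:confidence."""
    bad = []
    for one_of in root.iter(E + "one-of"):
        # Only direct children are compared; embedded containers are checked
        # separately, since their scores may be on a different scale.
        scores = [float(i.get(E + "confidence"))
                  for i in one_of.findall(E + "interpretation")
                  if i.get(E + "confidence") is not None]
        if any(a < b for a, b in zip(scores, scores[1:])):
            bad.append(one_of)
    return bad

# Hypothetical document: "a" is sorted, "b" is not. The fact that the best
# score in "a" is lower than the scores in "b" is not a violation.
SAMPLE = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="outer">
    <emma:one-of id="a">
      <emma:interpretation id="a1" emma:confidence="0.4"/>
      <emma:interpretation id="a2" emma:confidence="0.2"/>
    </emma:one-of>
    <emma:one-of id="b">
      <emma:interpretation id="b1" emma:confidence="0.6"/>
      <emma:interpretation id="b2" emma:confidence="0.9"/>
    </emma:one-of>
  </emma:one-of>
</emma:emma>"""

print([o.get("id") for o in unordered_one_ofs(ET.fromstring(SAMPLE))])  # prints ['b']
```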

While emma:medium and emma:mode are optional on emma:one-of, note that all EMMA interpretations must be annotated for emma:medium and emma:mode, so either these annotations must appear directly on all of the contained emma:interpretation elements within the emma:one-of, or they must appear on the emma:one-of element itself, or they must appear on an ancestor emma:one-of element, or they must appear on an earlier stage of the derivation listed in emma:derivation.
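
The lookup order described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper effective_annotation is hypothetical, it inspects only the interpretation itself and its ancestor emma:one-of elements, and it does not follow earlier stages listed in emma:derivation.

```python
import xml.etree.ElementTree as ET

E = "{http://www.w3.org/2003/04/emma}"  # EMMA namespace in ElementTree notation

def effective_annotation(root, target, name):
    """Return the value of emma:<name> that applies to target, looking first
    at target itself and then at each ancestor emma:one-of, or None."""
    # ElementTree has no parent pointers, so build a child-to-parent map.
    parent = {child: p for p in root.iter() for child in p}
    node = target
    while node is not None:
        if node is target or node.tag == E + "one-of":
            value = node.get(E + name)
            if value is not None:
                return value
        node = parent.get(node)
    return None

# Hypothetical document: the annotations sit on the outer emma:one-of only.
DOC = ET.fromstring("""<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:one-of id="r2">
      <emma:interpretation id="int1"/>
    </emma:one-of>
  </emma:one-of>
</emma:emma>""")

INT1 = DOC.find(".//" + E + "interpretation")
print(effective_annotation(DOC, INT1, "medium"))  # prints acoustic
print(effective_annotation(DOC, INT1, "mode"))    # prints voice
```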

3.3.2 emma:group element

Annotation emma:group
Definition A container element indicating that a number of interpretations of distinct user inputs are grouped according to some criteria.
Children The emma:group element MUST immediately contain a collection of one or more emma:interpretation elements or container elements: emma:one-of, emma:group, emma:sequence. It MAY also contain an optional single emma:group-info element. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element.
Attributes
  • Required: Attribute id of type xsd:ID
  • Optional: The annotation attributes: emma:tokens, emma:process, emma:lang, emma:signal, emma:signal-size, emma:media-type, emma:confidence, emma:source, emma:start, emma:end, emma:time-ref-uri, emma:time-ref-anchor-point, emma:offset-to-start, emma:duration, emma:medium, emma:mode, emma:function, emma:verbal, emma:cost, emma:grammar-ref, emma:endpoint-info-ref, emma:model-ref, emma:dialog-turn.
Applies to The emma:group element is legal only as a child of emma:emma, emma:one-of, emma:group, emma:sequence, or emma:derivation.

The emma:group element is used to indicate that the contained interpretations are from distinct user inputs that are related in some manner. emma:group MUST NOT be used for containing the multiple stages of processing of a single user input. Those MUST be contained in the emma:derivation element instead (Section 4.1.2). For groups of inputs in temporal order the more specialized container emma:sequence MUST be used (Section 3.3.3). The following example shows three interpretations derived from the speech input "Move this ambulance here" and the tactile input related to two consecutive points on a map.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:group id="grp"
      emma:start="1087995961542"
      emma:end="1087995964542">
    <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice">
      <action>move</action>
      <object>ambulance</object>
      <destination>here</destination>
    </emma:interpretation>

    <emma:interpretation id="int2"
      emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>

    <emma:interpretation id="int3"
      emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:group>
</emma:emma>

The emma:one-of and emma:group containers MAY be nested arbitrarily.

3.3.2.1 Indirect grouping criteria: emma:group-info element

Annotation emma:group-info
Definition The emma:group-info element contains or references criteria used in establishing the grouping of interpretations in an emma:group element.
Children The emma:group-info element MUST either immediately contain inline instance data specifying grouping criteria or have the attribute ref referencing the criteria.
Attributes
  • Optional: ref of type xsd:anyURI referencing the grouping criteria; alternatively the criteria MAY be provided inline as the content of the emma:group-info element.
Applies to The emma:group-info element is legal only as a child of emma:group.

Sometimes it may be convenient to indirectly associate a given group with information, such as grouping criteria. The emma:group-info element might be used to make explicit the criteria by which members of a group are associated. In the following example, a group of two points is associated with a description of grouping criteria based upon a sliding temporal window of two seconds duration.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:ex="http://www.example.com/ns/group">
  <emma:group id="grp">
    <emma:group-info>
      <ex:mode>temporal</ex:mode>
      <ex:duration>2s</ex:duration>
    </emma:group-info>

    <emma:interpretation id="int1"
      emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>

    <emma:interpretation id="int2"
      emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:group>
</emma:emma>

You might also use emma:group-info to refer to a named grouping criterion using an external reference, for instance:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:ex="http://www.example.com/ns/group">
  <emma:group id="grp">
    <emma:group-info ref="http://www.example.com/criterion42"/>
    <emma:interpretation id="int1"
      emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>

    <emma:interpretation id="int2"
      emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:group>
</emma:emma>

3.3.3 emma:sequence element

Annotation emma:sequence
Definition A container element indicating that a number of interpretations of distinct user inputs are in temporal sequence.
Children The emma:sequence element MUST immediately contain a collection of one or more emma:interpretation elements or container elements: emma:one-of, emma:group, emma:sequence. It MAY also contain multiple optional emma:derived-from elements and an optional single emma:info element.
Attributes
  • Required: Attribute id of type xsd:ID
  • Optional: The annotation attributes: emma:tokens, emma:process, emma:lang, emma:signal, emma:signal-size, emma:media-type, emma:confidence, emma:source, emma:start, emma:end, emma:time-ref-uri, emma:time-ref-anchor-point, emma:offset-to-start, emma:duration, emma:medium, emma:mode, emma:function, emma:verbal, emma:cost, emma:grammar-ref, emma:endpoint-info-ref, emma:model-ref, emma:dialog-turn.
Applies to The emma:sequence element is legal only as a child of emma:emma, emma:one-of, emma:group, emma:sequence, or emma:derivation.

The emma:sequence element is used to indicate that the contained interpretations are sequential in time, as in the following example, which indicates that two points made with a pen are in temporal order.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:sequence id="seq1">
    <emma:interpretation id="int1"
        emma:medium="tactile" emma:mode="ink">
      <x>0.253</x>
      <y>0.124</y>
    </emma:interpretation>

    <emma:interpretation id="int2"
        emma:medium="tactile" emma:mode="ink">
      <x>0.866</x>
      <y>0.724</y>
    </emma:interpretation>
  </emma:sequence>
</emma:emma>

The emma:sequence container MAY be combined with emma:one-of and emma:group in arbitrary nesting structures. The order of children in the content of the emma:sequence element corresponds to a sequence of interpretations. This ordering does not imply any particular definition of sequentiality. EMMA processors are therefore expected to use the emma:sequence element to hold interpretations which are either strictly sequential in nature (e.g. the end-time of an interpretation precedes the start-time of its follower), or which overlap in some manner (e.g. the start-time of a follower interpretation precedes the end-time of its predecessor). It is possible to use timestamps to provide fine-grained annotation for the sequence of interpretations that are sequential in time (see Section 4.2.10).
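
The two cases named above can be told apart from absolute timestamps. The following Python sketch is illustrative only: the function temporal_relations, the label "adjacent" for the boundary case where the two times are equal, and the timestamp values are all assumptions, and the sketch expects every direct child to be an emma:interpretation carrying both emma:start and emma:end.

```python
import xml.etree.ElementTree as ET

E = "{http://www.w3.org/2003/04/emma}"  # EMMA namespace in ElementTree notation

def temporal_relations(sequence):
    """Classify each pair of neighbouring interpretations in an emma:sequence."""
    spans = [(int(i.get(E + "start")), int(i.get(E + "end")))
             for i in sequence.findall(E + "interpretation")]
    relations = []
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if prev_end < next_start:
            relations.append("sequential")   # end-time precedes follower's start-time
        elif next_start < prev_end:
            relations.append("overlapping")  # follower starts before predecessor ends
        else:
            relations.append("adjacent")     # equal times: not covered by the text
    return relations

# Hypothetical timestamps in milliseconds.
SEQ = ET.fromstring("""<emma:sequence id="seq1"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:start="1000" emma:end="1500"/>
  <emma:interpretation id="int2" emma:start="1600" emma:end="2000"/>
  <emma:interpretation id="int3" emma:start="1900" emma:end="2400"/>
</emma:sequence>""")

print(temporal_relations(SEQ))  # prints ['sequential', 'overlapping']
```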

In the following more complex example, a sequence of two pen gestures in emma:sequence and a speech input in emma:interpretation is contained in an emma:group.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:group id="grp">
     <emma:interpretation id="int1" emma:medium="acoustic"
         emma:mode="voice">
       <action>move</action>
       <object>this-battleship</object>
       <destination>here</destination>
     </emma:interpretation>

     <emma:sequence id="seq1">
       <emma:interpretation id="int2" emma:medium="tactile"
           emma:mode="ink">
         <x>0.253</x>
         <y>0.124</y>
       </emma:interpretation>

       <emma:interpretation id="int3" emma:medium="tactile"
           emma:mode="ink">
         <x>0.866</x>
         <y>0.724</y>
       </emma:interpretation>
     </emma:sequence>
  </emma:group>
</emma:emma>

3.4 Lattice element

In addition to providing the ability to represent N-best lists of interpretations using emma:one-of, EMMA also provides the capability to represent lattices of words or other symbols using the emma:lattice element. Lattices provide a compact representation of large lists of possible recognition results or interpretations for speech, pen, or multimodal inputs.

In addition to providing a representation for lattice output from speech recognition, another important use case for lattices is for representation of the results of gesture and handwriting recognition from a pen modality component. Lattices can also be used to compactly represent multiple possible meaning representations. Another use case for the lattice representation is for associating confidence scores and other annotations with individual words within a speech recognition result string.

Lattices are compactly described by a list of transitions between nodes. For each transition the start and end nodes MUST be defined, along with the label for the transition. Initial and final nodes MUST also be indicated. The following figure provides a graphical representation of a speech recognition lattice which compactly represents eight different sequences of words.

[Figure: speech lattice]

which expands to:

a. flights to boston from portland today please
b. flights to austin from portland today please
c. flights to boston from oakland today please
d. flights to austin from oakland today please
e. flights to boston from portland tomorrow
f. flights to austin from portland tomorrow
g. flights to boston from oakland tomorrow
h. flights to austin from oakland tomorrow

3.4.1 Lattice markup: emma:lattice, emma:arc, emma:node elements

Annotation emma:lattice
Definition An element which encodes a lattice representation of user input.
Children The emma:lattice element MUST immediately contain one or more emma:arc elements and zero or more emma:node elements.
Attributes
  • Required:
    • initial of type xsd:nonNegativeInteger indicating the number of the initial node of the lattice.
    • final contains a space-separated list of xsd:nonNegativeInteger indicating the numbers of the final nodes in the lattice.
  • Optional: emma:time-ref-uri, emma:time-ref-anchor-point.
Applies to The emma:lattice element is legal only as a child of the emma:interpretation element.
Annotation emma:arc
Definition An element which encodes a transition between two nodes in a lattice. The label associated with the arc in the lattice is represented in the content of emma:arc.
Children The emma:arc element MUST immediately contain either character data or a single application namespace element or be empty, in the case of epsilon transitions. It MAY contain an emma:info element containing application or vendor specific annotations.
Attributes
  • Required:
    • from of type xsd:nonNegativeInteger indicating the number of the starting node for the arc.
    • to of type xsd:nonNegativeInteger indicating the number of the ending node for the arc.
  • Optional: emma:start, emma:end, emma:offset-to-start, emma:duration, emma:confidence, emma:cost, emma:lang, emma:medium, emma:mode, emma:source.
Applies to The emma:arc element is legal only as a child of the emma:lattice element.
Annotation emma:node
Definition An element which represents a node in the lattice. The emma:node elements are not required to describe a lattice but might be added to provide a location for annotations on nodes in a lattice. There MUST be at most one emma:node specification for each numbered node in the lattice.
Children An OPTIONAL emma:info element for application or vendor specific annotations on the node.
Attributes
  • Required:
    • node-number of type xsd:nonNegativeInteger indicating the node number in the lattice.
  • Optional: emma:confidence, emma:cost.
Applies to The emma:node element is legal only as a child of the emma:lattice element.

In EMMA, a lattice is represented using an element emma:lattice, which has attributes initial and final for indicating the initial and final nodes of the lattice. For the lattice below, this will be: <emma:lattice initial="1" final="8"/>. The nodes are numbered with integers. If there is more than one distinct final node in the lattice the nodes MUST be represented as a space-separated list in the value of the final attribute e.g. <emma:lattice initial="1" final="9 10 23"/>. There MUST only be one initial node in an EMMA lattice. Each transition in the lattice is represented as an element emma:arc with attributes from and to which indicate the nodes where the transition starts and ends. The arc's label is represented as the content of the emma:arc element and MUST be well-formed character data or XML content. In the example here the contents are words. Empty (epsilon) transitions in a lattice MUST be represented in the emma:lattice representation as emma:arc empty elements, e.g. <emma:arc from="1" to="8"/>.

The example speech lattice above would be represented in EMMA markup as follows:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
    emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2">flights</emma:arc>

      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
      <emma:arc from="4" to="5">from</emma:arc>

      <emma:arc from="5" to="6">portland</emma:arc>
      <emma:arc from="5" to="6">oakland</emma:arc>
      <emma:arc from="6" to="7">today</emma:arc>
      <emma:arc from="7" to="8">please</emma:arc>

      <emma:arc from="6" to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

Alternatively, if we wish to represent the same information as an N-best list using emma:one-of, we would have the more verbose representation:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="nbest1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="interp1">
      <text>flights to boston from portland today please</text>
    </emma:interpretation>

    <emma:interpretation id="interp2">
      <text>flights to boston from portland tomorrow</text>
    </emma:interpretation>

    <emma:interpretation id="interp3">
      <text>flights to austin from portland today please</text>
    </emma:interpretation>

    <emma:interpretation id="interp4">
      <text>flights to austin from portland tomorrow</text>
    </emma:interpretation>

    <emma:interpretation id="interp5">
      <text>flights to boston from oakland today please</text>
    </emma:interpretation>

    <emma:interpretation id="interp6">
      <text>flights to boston from oakland tomorrow</text>
    </emma:interpretation>

    <emma:interpretation id="interp7">
      <text>flights to austin from oakland today please</text>
    </emma:interpretation>

    <emma:interpretation id="interp8">
      <text>flights to austin from oakland tomorrow</text>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

The lattice representation avoids the need to enumerate all of the possible word sequences. Also, as detailed below, the emma:lattice representation enables placement of annotations on individual words in the input.
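
To make the relationship between the two representations concrete, the following Python sketch expands the example lattice into its word sequences. It is a minimal sketch, assuming an acyclic lattice with character-data labels; the function name lattice_paths is hypothetical and not defined by EMMA.

```python
import xml.etree.ElementTree as ET

E = "{http://www.w3.org/2003/04/emma}"  # EMMA namespace in ElementTree notation

LATTICE = ET.fromstring("""<emma:lattice initial="1" final="8"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:arc from="1" to="2">flights</emma:arc>
  <emma:arc from="2" to="3">to</emma:arc>
  <emma:arc from="3" to="4">boston</emma:arc>
  <emma:arc from="3" to="4">austin</emma:arc>
  <emma:arc from="4" to="5">from</emma:arc>
  <emma:arc from="5" to="6">portland</emma:arc>
  <emma:arc from="5" to="6">oakland</emma:arc>
  <emma:arc from="6" to="7">today</emma:arc>
  <emma:arc from="7" to="8">please</emma:arc>
  <emma:arc from="6" to="8">tomorrow</emma:arc>
</emma:lattice>""")

def lattice_paths(lattice):
    """Enumerate the label sequences of all paths from the initial node
    to any final node (assumes the lattice has no cycles)."""
    finals = set(lattice.get("final").split())  # space-separated list
    outgoing = {}
    for arc in lattice.findall(E + "arc"):
        label = (arc.text or "").strip()
        outgoing.setdefault(arc.get("from"), []).append((arc.get("to"), label))
    paths = []

    def walk(node, labels):
        if node in finals:
            paths.append(" ".join(labels))
        for to, label in outgoing.get(node, []):
            # Epsilon transitions have an empty label and add no word.
            walk(to, labels + [label] if label else labels)

    walk(lattice.get("initial"), [])
    return paths

PATHS = lattice_paths(LATTICE)
print(len(PATHS))  # prints 8
print(PATHS[0])    # prints flights to boston from portland today please
```

Ten arcs are enough to encode the eight sequences, and each additional alternative adds one arc rather than multiplying the number of listed interpretations.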

For use cases involving the representation of gesture/ink lattices and use cases involving lattices of semantic interpretations, EMMA allows for application namespace elements to appear within emma:arc.

For example a sequence of two gestures, each of which is recognized as either a line or a circle, might be represented as follows:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
    emma:medium="tactile" emma:mode="ink">
    <emma:lattice initial="1" final="3">
      <emma:arc from="1" to="2">
        <circle radius="100"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <line length="628"/>
      </emma:arc>
      <emma:arc from="1" to="2">
        <circle radius="200"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <line length="1256"/>
      </emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

As an example of a lattice of semantic interpretations, in a travel application where the source is either "Boston" or "Austin" and the destination is either "Newark" or "New York", the possibilities might be represented in a lattice as follows:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
    emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="3">
      <emma:arc from="1" to="2">
        <source city="boston"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <destination city="newark"/>
      </emma:arc>
      <emma:arc from="1" to="2">
        <source city="austin"/>
      </emma:arc>
      <emma:arc from="2" to="3">
        <destination city="new york"/>
      </emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

The emma:arc element MAY contain either an application namespace element or character data. It MUST NOT contain combinations of application namespace elements and character data. However, an emma:info element MAY appear within an emma:arc element alongside character data, in order to allow for the association of vendor or application specific annotations on a single word or symbol in a lattice.
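
The content rules for emma:arc can be checked mechanically. The following Python sketch is one possible reading of the rules stated above, not a normative validator: the function arc_content_kind, its return labels, and the helper parse_arc are hypothetical, and any child element outside the EMMA namespace is treated as an application namespace element.

```python
import xml.etree.ElementTree as ET

E = "{http://www.w3.org/2003/04/emma}"  # EMMA namespace in ElementTree notation

def arc_content_kind(arc):
    """Classify the content of an emma:arc as 'empty', 'text' or 'element',
    with '+info' appended when an emma:info child is present, or 'invalid'."""
    children = list(arc)
    info = [c for c in children if c.tag == E + "info"]
    app = [c for c in children if not c.tag.startswith(E)]
    # Character data may sit before the first child or after any child.
    text = (arc.text or "").strip() + "".join((c.tail or "").strip() for c in children)
    if len(info) > 1 or len(app) > 1 or len(info) + len(app) != len(children):
        return "invalid"
    if app and text:
        return "invalid"  # application element mixed with character data
    base = "element" if app else "text" if text else "empty"
    return base + "+info" if info else base

def parse_arc(body):
    """Wrap body in an emma:arc start and end tag (illustration only)."""
    return ET.fromstring(
        '<emma:arc xmlns:emma="http://www.w3.org/2003/04/emma" from="1" to="2">'
        + body + "</emma:arc>")

print(arc_content_kind(parse_arc("flights")))                # prints text
print(arc_content_kind(parse_arc('<circle radius="100"/>')))  # prints element
print(arc_content_kind(parse_arc("")))                        # prints empty
print(arc_content_kind(parse_arc("boston<city/>")))           # prints invalid
```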

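So, in summary, there are four groupings of content that can appear within emma:arc:

  • Application namespace element
  • Character data
  • Character data and a single emma:info element
  • Application namespace element and a single emma:info element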

3.4.2 Annotations on lattices

The encoding of lattice arcs as XML elements (emma:arc) enables arcs to be annotated with metadata such as timestamps, costs, or confidence scores:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
    emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc
       from="1"
       to="2"
       emma:start="1087995961542"
       emma:end="1087995962042"
       emma:cost="30">
         flights
      </emma:arc>

      <emma:arc
       from="2"
       to="3"
       emma:start="1087995962042"
       emma:end="1087995962542"
       emma:cost="20">
         to
      </emma:arc>

      <emma:arc
       from="3"
       to="4"
       emma:start="1087995962542"
       emma:end="1087995963042"
       emma:cost="50">
         boston
      </emma:arc>

      <emma:arc
       from="3"
       to="4"
       emma:start="1087995963042"
       emma:end="1087995963742"
       emma:cost="60">
         austin
      </emma:arc>
      ...
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

The following EMMA attributes MAY be placed on emma:arc elements: absolute timestamps (emma:start, emma:end), relative timestamps (emma:offset-to-start, emma:duration), emma:confidence, emma:cost, the human language of the input (emma:lang), emma:medium, emma:mode, and emma:source. The use case for emma:medium, emma:mode, and emma:source is for lattices which contain content from different input modes. The emma:arc element MAY also contain an emma:info element for specification of vendor and application specific annotations on the arc.

The timestamps that appear on emma:arc elements do not necessarily indicate the start and end of the arc itself. They MAY indicate the start and end of the signal corresponding to the label on the arc. As a result there is no requirement that the emma:end timestamp on an arc going into a node should be equivalent to the emma:start of all arcs going out of that node. Furthermore there is no guarantee that the left to right order of arcs in a lattice will correspond to the temporal order of the input signal. The lattice representation is an abstraction that represents a range of possible interpretations of a user's input and is not intended to necessarily be a representation of temporal order.

Costs are typically application and device dependent. There are a variety of ways that individual arc costs might be combined to produce costs for specific paths through the lattice. This specification does not standardize the way for these costs to be combined; it is up to the applications and devices to determine how such derived costs would be computed and used.
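
Purely as an illustration of one possible combination rule, the following Python sketch sums the arc costs along each path. Additive combination is an assumption of this sketch and is not prescribed by this specification; the function name path_costs is hypothetical, and the lattice is a shortened version of the example above with node 4 treated as final.

```python
import xml.etree.ElementTree as ET

E = "{http://www.w3.org/2003/04/emma}"  # EMMA namespace in ElementTree notation

LATTICE = ET.fromstring("""<emma:lattice initial="1" final="4"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:arc from="1" to="2" emma:cost="30">flights</emma:arc>
  <emma:arc from="2" to="3" emma:cost="20">to</emma:arc>
  <emma:arc from="3" to="4" emma:cost="50">boston</emma:arc>
  <emma:arc from="3" to="4" emma:cost="60">austin</emma:arc>
</emma:lattice>""")

def path_costs(lattice):
    """Map each initial-to-final label sequence to the sum of its arc costs.
    ASSUMPTION: costs combine by addition; arcs without emma:cost count as 0.
    The lattice is assumed to be acyclic."""
    finals = set(lattice.get("final").split())
    outgoing = {}
    for arc in lattice.findall(E + "arc"):
        outgoing.setdefault(arc.get("from"), []).append(arc)
    totals = {}

    def walk(node, labels, cost):
        if node in finals:
            totals[" ".join(labels)] = cost
        for arc in outgoing.get(node, []):
            walk(arc.get("to"),
                 labels + [(arc.text or "").strip()],
                 cost + float(arc.get(E + "cost", "0")))

    walk(lattice.get("initial"), [], 0.0)
    return totals

COSTS = path_costs(LATTICE)
print(COSTS)                         # prints {'flights to boston': 100.0, 'flights to austin': 110.0}
print(min(COSTS, key=COSTS.get))     # prints flights to boston
```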

For some lattice formats, it is also desirable to annotate the nodes in the lattice themselves with information such as costs. For example in speech recognition, costs might be placed on nodes as a result of word penalties or redistribution of costs. For this purpose EMMA also provides an emma:node element which can host annotations such as emma:cost. The emma:node element MUST have an attribute node-number which indicates the number of the node. There MUST be at most one emma:node specification for a given numbered node in the lattice. In our example, if there was a cost of 100 on the final state this could be represented as follows:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1" 
    emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc
       from="1"
       to="2"
       emma:start="1087995961542"
       emma:end="1087995962042"
       emma:cost="30">
         flights
      </emma:arc>
      <emma:arc
       from="2"
       to="3"
       emma:start="1087995962042"
       emma:end="1087995962542"
       emma:cost="20">
         to
      </emma:arc>

      <emma:arc
       from="3"
       to="4"
       emma:start="1087995962542"
       emma:end="1087995963042"
       emma:cost="50">
         boston
      </emma:arc>
      <emma:arc
       from="3"
       to="4"
       emma:start="1087995963042"
       emma:end="1087995963742"
       emma:cost="60">
         austin
      </emma:arc>
        ...
      <emma:node node-number="8" emma:cost="100"/>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

3.4.3 Relative timestamps on lattices

The relative timestamp mechanism in EMMA is intended to provide temporal information about arcs in a lattice in relative terms using offsets in milliseconds. To do this, the absolute time MAY be specified on emma:interpretation. Both emma:time-ref-uri and emma:time-ref-anchor-point apply to emma:lattice and MAY be used there to set the anchor point for offsets to the start of the absolute time specified on emma:interpretation. The offset in milliseconds to the beginning of each arc MAY then be indicated on each emma:arc in the emma:offset-to-start attribute.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">

  <emma:interpretation id="interp1"
          emma:start="1087995961542" emma:end="1087995963042"
          emma:medium="acoustic" emma:mode="voice">
    <emma:lattice emma:time-ref-uri="#interp1"
        emma:time-ref-anchor-point="start"
        initial="1" final="4">
      <emma:arc
       from="1"
       to="2"
       emma:offset-to-start="0">
         flights
      </emma:arc>
      <emma:arc
       from="2"
       to="3"
       emma:offset-to-start="500">
         to
      </emma:arc>

      <emma:arc
       from="3"
       to="4"
       emma:offset-to-start="1000">
         boston
      </emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

Note that the offset for the first emma:arc MUST always be zero since the EMMA attribute emma:offset-to-start indicates the number of milliseconds from the anchor point to the start of the piece of input associated with the emma:arc, in this case the word "flights".
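
The arithmetic implied by the example can be sketched in Python as follows. The sketch assumes that emma:time-ref-uri points at the enclosing emma:interpretation, as it does in the example, and does not resolve the URI; treating a missing emma:time-ref-anchor-point as "start" is also an assumption of the sketch. The function name arc_start_times is hypothetical.

```python
import xml.etree.ElementTree as ET

E = "{http://www.w3.org/2003/04/emma}"  # EMMA namespace in ElementTree notation

INTERP = ET.fromstring("""<emma:interpretation id="interp1"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    emma:start="1087995961542" emma:end="1087995963042"
    emma:medium="acoustic" emma:mode="voice">
  <emma:lattice emma:time-ref-uri="#interp1"
      emma:time-ref-anchor-point="start"
      initial="1" final="4">
    <emma:arc from="1" to="2" emma:offset-to-start="0">flights</emma:arc>
    <emma:arc from="2" to="3" emma:offset-to-start="500">to</emma:arc>
    <emma:arc from="3" to="4" emma:offset-to-start="1000">boston</emma:arc>
  </emma:lattice>
</emma:interpretation>""")

def arc_start_times(interp):
    """Return (label, absolute start in ms) for each arc of the lattice,
    computed as anchor time plus emma:offset-to-start."""
    lattice = interp.find(E + "lattice")
    # The anchor point selects emma:start or emma:end of the interpretation.
    anchor = lattice.get(E + "time-ref-anchor-point", "start")
    base = int(interp.get(E + anchor))
    return [((arc.text or "").strip(),
             base + int(arc.get(E + "offset-to-start")))
            for arc in lattice.findall(E + "arc")]

for label, start in arc_start_times(INTERP):
    print(label, start)
# prints:
# flights 1087995961542
# to 1087995962042
# boston 1087995962542
```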

3.5 Literal semantics: emma:literal element

Annotation emma:literal
Definition An element that contains string literal output.
Children String literal
Attributes None.
Applies to The emma:literal is a child of emma:interpretation.

Certain EMMA processing components produce semantic results in the form of string literals without any surrounding application namespace markup. These MUST be placed within the EMMA element emma:literal inside emma:interpretation. For example, if a semantic interpreter simply returned "boston" this could be represented in EMMA as:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="r1"
    emma:medium="acoustic" emma:mode="voice">
    <emma:literal>boston</emma:literal>
  </emma:interpretation>
</emma:emma>

4. EMMA annotations

This section defines annotations in the EMMA namespace including both attributes and elements. The values are specified in terms of the data types defined by XML Schema Part 2: Datatypes Second Edition [XML Schema Datatypes].

4.1 EMMA annotation elements

4.1.1 Data model: emma:model element

Annotation emma:model
Definition The emma:model either references or provides inline the data model for the instance data.
Children If a ref attribute is not specified then this element contains the data model inline.
Attributes
  • Required:
    • id of type xsd:ID.
  • Optional:
    • ref of type xsd:anyURI that references the data model. Note that either a ref attribute or an in-line data model (but not both) MUST be specified.
Applies to The emma:model element MAY appear only as a child of emma:emma.

The data model that may be used to express constraints on the structure and content of instance data is specified as one of the annotations of the instance. Specifying the data model is OPTIONAL, in which case the data model can be said to be implicit. Typically the data model is pre-established by the application.

The data model is specified with the emma:model annotation defined as an element in the EMMA namespace. If the data model for the contents of an emma:interpretation, container element, or application namespace element is to be specified in EMMA, the attribute emma:model-ref MUST be specified on the emma:interpretation, container element, or application namespace element. Note that since multiple emma:model elements might be specified under emma:emma, it is possible to refer to multiple data models within a single EMMA document. For example, different alternative interpretations under an emma:one-of might have different data models. In this case, an emma:model-ref attribute would appear on each emma:interpretation element in the N-best list with its value being the id of the emma:model element for that particular interpretation.

The data model is closely related to the interpretation data, and is typically specified as the annotation related to the emma:interpretation or emma:one-of elements.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:model id="model1" ref="http://example.com/models/city.xml"/>
  <emma:interpretation id="int1" emma:model-ref="model1"
    emma:medium="acoustic" emma:mode="voice">
    <city> London </city>
    <country> UK </country>
  </emma:interpretation>
</emma:emma>

The emma:model annotation MAY reference any element or attribute in the application instance data, as well as any EMMA container element (emma:one-of, emma:group, or emma:sequence).

The data model annotation MAY be used to either reference an external data model with the ref attribute or provide a data model as in-line content. Either a ref attribute or in-line data model (but not both) MUST be specified.
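The either/or rule for emma:model, together with the requirement that emma:model-ref point at an emma:model id, can be checked mechanically. The following is a minimal Python sketch, not part of this specification. The function name check_models is hypothetical, the sample document omits xsi:schemaLocation for brevity, and "in-line content" is assumed to mean at least one child element.

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # ElementTree's qualified-name prefix

DOC = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:model id="model1" ref="http://example.com/models/city.xml"/>
  <emma:interpretation id="int1" emma:model-ref="model1"
      emma:medium="acoustic" emma:mode="voice">
    <city> London </city>
    <country> UK </country>
  </emma:interpretation>
</emma:emma>"""


def check_models(root):
    """Return a list of problems with emma:model usage (empty if none found)."""
    problems = []
    model_ids = set()
    for model in root.iter(EMMA + "model"):
        has_ref = "ref" in model.attrib
        has_inline = len(model) > 0  # assumption: in-line content = child elements
        if has_ref == has_inline:
            problems.append(
                f"emma:model {model.get('id')!r}: exactly one of ref or "
                "in-line content is expected"
            )
        model_ids.add(model.get("id"))
    # Every emma:model-ref attribute should name an existing emma:model id.
    for element in root.iter():
        target = element.get(EMMA + "model-ref")
        if target is not None and target not in model_ids:
            problems.append(f"emma:model-ref {target!r} matches no emma:model id")
    return problems


root = ET.fromstring(DOC)
problems = check_models(root)
print(problems)
```

A schema-validating processor would normally enforce these constraints. The sketch only illustrates the rule in a form that is easy to read.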

4.1.2 Interpretation derivation: emma:derived-from element and emma:derivation element

Annotation emma:derived-from
Definition An empty element which provides a reference to the interpretation which the element it appears on was derived from.
Children None
Attributes
  • Required:
    • resource of type xsd:anyURI that references the interpretation from which the current interpretation is derived.
  • Optional:
    • composite of type xsd:boolean that is "true" if the derivation step combines multiple inputs and "false" if not. If composite is not specified the value is "false" by default.
Applies to The emma:derived-from element is legal only as a child of emma:interpretation, emma:one-of, emma:group, or emma:sequence.

Annotation emma:derivation
Definition An element which contains interpretation and container elements representing earlier stages in the processing of the input.
Children One or more emma:interpretation, emma:one-of, emma:sequence, or emma:group elements.
Attributes None
Applies to The emma:derivation MAY appear only as a child of the emma:emma element.

Instances of interpretations are in general derived from other instances of interpretation in a process that goes from raw data to increasingly refined representations of the input. The derivation annotation is used to link any two interpretations that are related by representing the source and the outcome of an interpretation process. For instance, a speech recognition process can return the following result in the form of raw text:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="raw"
    emma:medium="acoustic" emma:mode="voice">
    <answer>From Boston to Denver tomorrow</answer>
  </emma:interpretation>
</emma:emma>

A first interpretation process will produce:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="better"
    emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>tomorrow</date>
  </emma:interpretation>
</emma:emma>

A second interpretation process, aware of the current date, will be able to produce a more refined instance, such as:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="best"
    emma:medium="acoustic" emma:mode="voice">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>20030315</date>
  </emma:interpretation>
</emma:emma>

The interaction manager might need to have access to the three levels of interpretation. The emma:derived-from annotation element can be used to establish a chain of derivation relationships as in the following example:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="raw"
      emma:medium="acoustic" emma:mode="voice">
      <answer>From Boston to Denver tomorrow</answer>
    </emma:interpretation>
    <emma:interpretation id="better">
      <emma:derived-from resource="#raw" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>tomorrow</date>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation id="best">
    <emma:derived-from resource="#better" composite="false"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>20030315</date>
  </emma:interpretation>
</emma:emma>

The emma:derivation element MAY be used as a container for representations of the earlier stages in the interpretation of the input. The latest stage of processing MUST be a direct child of emma:emma.

The resource attribute on emma:derived-from is a URI which can reference IDs in the current or other EMMA documents.
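A consumer such as an interaction manager can recover the chain of stages by following resource references from the latest interpretation back to the earliest. The following is a minimal Python sketch, not part of this specification. The function name derivation_chain is hypothetical, the sample document omits xsi:schemaLocation for brevity, only same-document fragment references ("#id") are followed, and only the first emma:derived-from child is considered (a sequential derivation).

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # ElementTree's qualified-name prefix

DOC = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="raw"
      emma:medium="acoustic" emma:mode="voice">
      <answer>From Boston to Denver tomorrow</answer>
    </emma:interpretation>
    <emma:interpretation id="better">
      <emma:derived-from resource="#raw" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>tomorrow</date>
    </emma:interpretation>
  </emma:derivation>
  <emma:interpretation id="best">
    <emma:derived-from resource="#better" composite="false"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>20030315</date>
  </emma:interpretation>
</emma:emma>"""


def derivation_chain(root, start_id):
    """List ids from start_id back to the earliest stage in this document."""
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    chain = [start_id]
    current = by_id[start_id]
    while True:
        link = current.find(EMMA + "derived-from")
        if link is None:
            return chain  # no earlier stage recorded
        resource = link.get("resource", "")
        target = resource[1:] if resource.startswith("#") else None
        if target is None or target not in by_id:
            return chain  # reference to another document, or unresolved
        if target in chain:
            raise ValueError("cyclic emma:derived-from references")
        chain.append(target)
        current = by_id[target]


root = ET.fromstring(DOC)
chain = derivation_chain(root, "best")
print(chain)
```

For the sample document the chain runs from "best" through "better" to "raw". References into other EMMA documents would need a URI resolver, which is outside the scope of this sketch.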

In addition to representing sequential derivations, the EMMA emma:derived-from element can also be used to capture composite derivations. Composite derivations involve combination of inputs from different modes.

In order to indicate whether an emma:derived-from element describes a sequential derivation step or a composite derivation step, the emma:derived-from element has an attribute composite which has a boolean value. A composite emma:derived-from MUST be marked as composite="true" while a sequential emma:derived-from element is marked as composite="false". If this attribute is not specified the value is false by default.

In the following composite derivation example the user said "destination" using the voice mode and circled Boston on a map using the ink mode:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:interpretation id="voice1"
        emma:start="1087995961500"
        emma:end="1087995962542"
        emma:process="http://example.com/myasr.xml"
        emma:source="http://example.com/microphone/NC-61"
        emma:signal="http://example.com/signals/sg23.wav"
        emma:confidence="0.6"
        emma:medium="acoustic"
        emma:mode="voice"
        emma:function="dialog"
        emma:verbal="true"
        emma:lang="en-US"
        emma:tokens="destination">
      <rawinput>destination</rawinput>
    </emma:interpretation>

    <emma:interpretation id="ink1"
        emma:start="1087995961600"
        emma:end="1087995964000"
        emma:process="http://example.com/mygesturereco.xml"
        emma:source="http://example.com/pen/wacom123"
        emma:signal="http://example.com/signals/ink5.inkml"
        emma:confidence="0.5"
        emma:medium="tactile"
        emma:mode="ink"
        emma:function="dialog"
        emma:verbal="false">
      <rawinput>Boston</rawinput>
    </emma:interpretation>
  </emma:derivation>

  <emma:interpretation id="multimodal1"
      emma:confidence="0.3"
      emma:start="1087995961500"
      emma:end="1087995964000"
      emma:medium="acoustic tactile"
      emma:mode="voice ink"
      emma:function="dialog"
      emma:verbal="true"
      emma:lang="en-US"
      emma:tokens="destination">
    <emma:derived-from resource="#voice1" composite="true"/>
    <emma:derived-from resource="#ink1" composite="true"/>
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>

In this example, annotations on the multimodal interpretation indicate the process used for the integration and there are two emma:derived-from elements, one pointing to the speech and one pointing to the pen gesture.

The only constraints the EMMA specification places on the annotations that appear on a composite input concern emma:medium and emma:mode. The emma:medium attribute MUST contain the union of the emma:medium attributes on the combining inputs, and the emma:mode attribute MUST contain the union of the emma:mode attributes on the combining inputs; each is represented as a space delimited set of nmtokens as defined in Section 4.2.11. In the example above this means that the emma:medium value is "acoustic tactile" and the emma:mode value is "voice ink". How all other annotations are handled is author defined. The following paragraph gives informative examples of how specific annotations might be handled.

With reference to the illustrative example above, this paragraph provides informative guidance regarding the determination of annotations (beyond emma:medium and emma:mode) on a composite multimodal interpretation. Generally the timestamp on a combined input should contain the intervals indicated by the combining inputs. For the absolute timestamps emma:start and emma:end this can be achieved by taking the earlier of the emma:start values (emma:start="1087995961500" in our example) and the later of the emma:end values (emma:end="1087995964000" in the example). The determination of relative timestamps for composite inputs is more complex; informative guidance is given in Section 4.2.10.4. Generally speaking the emma:confidence value will be some numerical combination of the confidence scores assigned to the combining inputs. In our example, it is the result of multiplying the voice and ink confidence scores (0.6 × 0.5 = 0.3). In other cases there may not be a confidence score for one of the combining inputs and the author may choose to copy the confidence score from the input which does have one. Generally, for emma:verbal, if either of the inputs has the value true then the multimodal interpretation will also be emma:verbal="true" as in the example. In other words the annotation for the composite input is the result of an inclusive OR of the boolean values of the annotations on the inputs. If an annotation is only specified on one of the combining inputs then it may in some cases be assumed to apply to the multimodal interpretation of the composite input. In the example, emma:lang="en-US" is only specified for the speech input, and this annotation appears on the composite result also. Similarly in our example, only the voice input has emma:tokens and the author has chosen to annotate the combined input with the same emma:tokens value. In this example, the emma:function is the same on both combining inputs and the author has chosen to use the same annotation on the composite interpretation.
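The guidance above can be sketched as a small combination function. The following is an illustrative Python sketch, not part of this specification. The function name combine_annotations and the use of plain dictionaries are assumptions. Only the emma:medium and emma:mode unions are required by EMMA; the timestamp, confidence, and emma:verbal policies shown are just the informative options described above.

```python
import math


def combine_annotations(inputs):
    """Combine annotation dictionaries of the inputs to a composite derivation."""

    def token_union(name):
        # Space-delimited union of nmtokens, kept in first-seen order.
        tokens = []
        for annotations in inputs:
            for token in annotations.get(name, "").split():
                if token not in tokens:
                    tokens.append(token)
        return " ".join(tokens)

    # Required for composite inputs: union of media and union of modes.
    combined = {
        "emma:medium": token_union("emma:medium"),
        "emma:mode": token_union("emma:mode"),
    }

    # Everything below is author-defined policy, shown as one possible choice.
    starts = [int(a["emma:start"]) for a in inputs if "emma:start" in a]
    ends = [int(a["emma:end"]) for a in inputs if "emma:end" in a]
    if starts:
        combined["emma:start"] = str(min(starts))  # earliest start
    if ends:
        combined["emma:end"] = str(max(ends))  # latest end

    scores = [float(a["emma:confidence"]) for a in inputs if "emma:confidence" in a]
    if scores:
        # Product of the available scores; a single score is copied unchanged.
        combined["emma:confidence"] = math.prod(scores)

    verbal = [a["emma:verbal"] == "true" for a in inputs if "emma:verbal" in a]
    if verbal:
        combined["emma:verbal"] = "true" if any(verbal) else "false"  # inclusive OR
    return combined


voice1 = {
    "emma:start": "1087995961500", "emma:end": "1087995962542",
    "emma:confidence": "0.6", "emma:medium": "acoustic",
    "emma:mode": "voice", "emma:verbal": "true",
}
ink1 = {
    "emma:start": "1087995961600", "emma:end": "1087995964000",
    "emma:confidence": "0.5", "emma:medium": "tactile",
    "emma:mode": "ink", "emma:verbal": "false",
}
multimodal1 = combine_annotations([voice1, ink1])
print(multimodal1)
```

Annotations such as emma:lang, emma:tokens, and emma:function are deliberately left out of the sketch, since whether to copy them onto the composite interpretation is a choice for the author.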

In annotating derivations of the processing of the input, EMMA provides the flexibility of either coarse-grained or fine-grained annotation of relations among interpretations. For example, when relating two N-best lists within emma:one-of elements, either there can be a single emma:derived-from element under emma:one-of referring to the ID of the emma:one-of for the earlier processing stage:


<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:one-of id="nbest1"
      emma:medium="acoustic" emma:mode="voice">
      <emma:interpretation id="int1">
        <res>from boston to denver on march eleven two thousand three</res>
      </emma:interpretation>

      <emma:interpretation id="int2">
        <res>from austin to denver on march eleven two thousand three</res>
      </emma:interpretation>
    </emma:one-of>
  </emma:derivation>

  <emma:one-of id="nbest2">
    <emma:derived-from resource="#nbest1" composite="false"/>
    <emma:interpretation id="int1b">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>

    <emma:interpretation id="int2b">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

Or there can be a separate emma:derived-from element on each emma:interpretation element referring to the specific emma:interpretation element it was derived from.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="nbest2">
    <emma:interpretation id="int1b">
      <emma:derived-from resource="#int1" composite="false"/>
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>

    <emma:interpretation id="int2b">
      <emma:derived-from resource="#int2" composite="false"/>
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
  <emma:derivation>
    <emma:one-of id="nbest1"
      emma:medium="acoustic" emma:mode="voice">
      <emma:interpretation id="int1">
        <res>from boston to denver on march eleven two thousand three</res>
      </emma:interpretation>
      <emma:interpretation id="int2">
        <res>from austin to denver on march eleven two thousand three</res>
      </emma:interpretation>
    </emma:one-of>
  </emma:derivation>
</emma:emma>

Section 4.3 provides further examples of the use of emma:derived-from to represent sequential derivations and addresses the issue of the scope of EMMA annotations across derivations of user input.
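Because derivations may be annotated at either granularity, a consumer may want a single lookup that handles both. The following is a minimal Python sketch, not part of this specification. The function name derivation_source is hypothetical, the sample document omits xsi:schemaLocation for brevity, and the fallback to the nearest enclosing container is an assumption about how an application might read a coarse-grained annotation, not a rule stated by EMMA.

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # ElementTree's qualified-name prefix

COARSE_DOC = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:derivation>
    <emma:one-of id="nbest1" emma:medium="acoustic" emma:mode="voice">
      <emma:interpretation id="int1">
        <res>from boston to denver on march eleven two thousand three</res>
      </emma:interpretation>
      <emma:interpretation id="int2">
        <res>from austin to denver on march eleven two thousand three</res>
      </emma:interpretation>
    </emma:one-of>
  </emma:derivation>
  <emma:one-of id="nbest2">
    <emma:derived-from resource="#nbest1" composite="false"/>
    <emma:interpretation id="int1b">
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation id="int2b">
      <origin>Austin</origin>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def derivation_source(root, element_id):
    """Return the resource of the nearest emma:derived-from, or None."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = next(e for e in root.iter() if e.get("id") == element_id)
    while node is not None:
        link = node.find(EMMA + "derived-from")  # direct children only
        if link is not None:
            return link.get("resource")
        node = parent_of.get(node)  # fall back to the enclosing element
    return None


root = ET.fromstring(COARSE_DOC)
source = derivation_source(root, "int1b")
print(source)
```

With the coarse-grained document the lookup for "int1b" yields "#nbest1". If each emma:interpretation carried its own emma:derived-from, the same function would return that element's own resource value instead.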

4.1.3 Reference to grammar used: emma:grammar element

Annotation emma:grammar
Definition An element used to provide a reference to the grammar used in processing the input.
Children None
Attributes
  • Required:
    • ref of type xsd:anyURI that references a grammar used in processing the input.
    • id of type xsd:ID.
Applies to The emma:grammar is legal only as a child of the emma:emma element.

The grammar that was used to derive the EMMA result MAY be specified with the emma:grammar annotation defined as an element in the EMMA namespace.

Example:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1" ref="someURI"/>
  <emma:grammar id="gram2" ref="anotherURI"/>
  <emma:one-of id="r1"
    emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:grammar-ref="gram1">
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:grammar-ref="gram1">
      <origin>Austin</origin>
    </emma:interpretation>
    <emma:interpretation id="int3" emma:grammar-ref="gram2">
      <command>help</command>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

The emma:grammar annotation is a child of emma:emma.
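A consumer can resolve each emma:grammar-ref attribute to the URI given on the matching emma:grammar element. The following is a minimal Python sketch, not part of this specification. The function name grammar_uris is hypothetical, and the sample document omits xsi:schemaLocation for brevity.

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # ElementTree's qualified-name prefix

DOC = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1" ref="someURI"/>
  <emma:grammar id="gram2" ref="anotherURI"/>
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:grammar-ref="gram1">
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:grammar-ref="gram1">
      <origin>Austin</origin>
    </emma:interpretation>
    <emma:interpretation id="int3" emma:grammar-ref="gram2">
      <command>help</command>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def grammar_uris(root):
    """Map each interpretation id to the ref URI of the grammar it names."""
    # emma:grammar elements are children of emma:emma, the document root.
    grammars = {g.get("id"): g.get("ref") for g in root.findall(EMMA + "grammar")}
    resolved = {}
    for interpretation in root.iter(EMMA + "interpretation"):
        grammar_id = interpretation.get(EMMA + "grammar-ref")
        if grammar_id is not None:
            # None marks a grammar-ref that matches no emma:grammar id.
            resolved[interpretation.get("id")] = grammars.get(grammar_id)
    return resolved


root = ET.fromstring(DOC)
uris = grammar_uris(root)
print(uris)
```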

4.1.4 Extensibility to application/vendor specific annotations: emma:info element

Annotation emma:info
Definition The emma:info element acts as a container for vendor and/or application specific metadata regarding a user's input.
Children One or more elements in the application namespace providing metadata about the input.
Attributes
  • Optional:
    • id of type xsd:ID.
Applies to The emma:info element is legal only as a child of the EMMA elements emma:emma, emma:interpretation, emma:group, emma:one-of, emma:sequence, emma:arc, or emma:node.

In Section 4.2, a series of attributes are defined for representation of metadata about user inputs in a standardized form. EMMA also provides an extensibility mechanism for annotation of user inputs with vendor or application specific metadata not covered by the standard set of EMMA annotations. The element emma:info MUST be used as a container for these annotations, UNLESS they are explicitly covered by emma:endpoint-info. For example, if an input to a dialog system needed to be annotated with the number that the call originated from, the caller's state, some indication of the type of customer, and the name of the service, these pieces of information could be represented within emma:info as in the following example:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:info>
    <caller_id>
      <phone_number>2121234567</phone_number>
      <state>NY</state>
    </caller_id>

    <customer_type>residential</customer_type>
    <service_name>acme_travel_service</service_name>
  </emma:info>

  <emma:one-of id="r1" emma:start="1087995961542"
      emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>

    <emma:interpretation id="int2" emma:confidence="0.68">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

It is important to have an EMMA container element for application/vendor specific annotations since EMMA elements provide a structure for representation of multiple possible interpretations of the input. As a result it is cumbersome to state application/vendor specific metadata as part of the application data within each emma:interpretation. An element is used rather than an attribute so that internal structure can be given to the annotations within emma:info.

In addition to emma:emma, emma:info MAY also appear as a child of other structural elements such as emma:interpretation, emma:one-of and so on. When emma:info appears as a child of one of these elements the application/vendor specific annotations contained within emma:info are assumed to apply to all of the emma:interpretation elements within the containing element. The semantics of conflicting annotations in emma:info, for example when different values are found within emma:emma and emma:interpretation, are left to the developer of the vendor/application specific annotations.
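One way to apply this scoping is to collect, for a given emma:interpretation, every emma:info found on the interpretation itself and on its ancestors. The following is a minimal Python sketch, not part of this specification. The function name applicable_info is hypothetical, the sample document omits xsi:schemaLocation for brevity, and returning the nearest emma:info first is only a convenience, since the handling of conflicting values is left to the developer.

```python
import xml.etree.ElementTree as ET

EMMA = "{http://www.w3.org/2003/04/emma}"  # ElementTree's qualified-name prefix
APP = "{http://www.example.com/example}"   # application namespace of the sample

DOC = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:info>
    <caller_id>
      <phone_number>2121234567</phone_number>
      <state>NY</state>
    </caller_id>
    <customer_type>residential</customer_type>
    <service_name>acme_travel_service</service_name>
  </emma:info>
  <emma:one-of id="r1" emma:start="1087995961542"
      emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <origin>Boston</origin>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68">
      <origin>Austin</origin>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def applicable_info(root, interpretation_id):
    """Return emma:info elements in scope for an interpretation, nearest first."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = next(
        e for e in root.iter(EMMA + "interpretation")
        if e.get("id") == interpretation_id
    )
    found = []
    while node is not None:
        found.extend(node.findall(EMMA + "info"))  # direct children only
        node = parent_of.get(node)  # the root has no parent, which ends the walk
    return found


root = ET.fromstring(DOC)
infos = applicable_info(root, "int2")
service = infos[0].findtext(APP + "service_name")
print(service)
```

Here the only emma:info is the one under emma:emma, so it is in scope for both interpretations in the N-best list.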

4.1.5 Endpoint reference: emma:endpoint-info element and emma:endpoint element

Annotation emma:endpoint-info
Definition The emma:endpoint-info element acts as a container for all application specific annotation regarding the communication environment.
Children One or more emma:endpoint elements.
Attributes
  • Required:
    • id of type xsd:ID.
Applies to The emma:endpoint-info element is legal only as a child of emma:emma.

Annotation emma:endpoint
Definition The element acts as a container for application specific endpoint information.
Children Elements in the application namespace providing metadata about the input.
Attributes
  • Required:
    • id of type xsd:ID
  • Optional: emma:endpoint-role, emma:endpoint-address, emma:message-id, emma:port-num, emma:port-type, emma:endpoint-pair-ref, emma:service-name, emma:media-type, emma:medium, emma:mode.
Applies to emma:endpoint-info

In order to conduct multimodal interaction, there is a need in EMMA to specify the properties of the endpoint that receives the input which leads to the EMMA annotation. This allows subsequent components to utilize the endpoint properties as well as the annotated inputs to conduct meaningful multimodal interaction. The EMMA element emma:endpoint can be used for this purpose. It can specify the endpoint properties based on a set of common endpoint property attributes in EMMA, such as emma:endpoint-address, emma:port-num, emma:port-type, etc. (Section 4.2.14). Moreover, it provides an extensible annotation structure that allows the inclusion of application and vendor specific endpoint properties.

Note that the usage of the term "endpoint" in this context is different from the way that the term is used in speech processing, where it refers to the end of a speech input. As used here, "endpoint" refers to a network location which is the source or recipient of an EMMA document.

In multimodal interaction, multiple devices can be used and each device can open multiple communication endpoints at the same time. These endpoints are used to transmit and receive data, such as raw input, EMMA documents, etc. The EMMA element emma:endpoint provides a generic representation of endpoint information which is relevant to multimodal interaction. It allows the annotation to be interoperable, and it eliminates the need for EMMA processors to create their own specialized annotations for existing protocols, potential protocols or yet undefined private protocols that they may use.

Moreover, emma:endpoint-info provides a container to hold all annotations regarding the endpoint information, including emma:endpoint and other application and vendor specific annotations that are related to the communication, allowing the same communication environment to be referenced and used in multiple interpretations.

Note that EMMA provides two locations (i.e. emma:info and emma:endpoint-info) for specifying vendor/application specific annotations. If the annotation is specifically related to the description of the endpoint, then the vendor/application specific annotation SHOULD be placed within emma:endpoint-info, otherwise it SHOULD be placed within emma:info.

The following example illustrates the annotation of endpoint reference properties in EMMA.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2007/CR-emma-20071211/emma.xsd"
    xmlns="http://www.example.com/example"
    xmlns:ex="http://www.example.com/emma/port">
  <emma:endpoint-info id="audio-channel-1">
    <emma:endpoint id="endpoint1"
        emma:endpoint-role="sink"
        emma:endpoint-address="135.61.71.103"
        emma:port-num="50204"
        emma:port-type="rtp"
        emma:endpoint-pair-ref="endpoint2"
        emma:media-type="audio/dsr-202212; rate:8000; maxptime:40"
        emma:service-name="travel"
        emma:mode="voice">
      <ex:app-protocol>SIP</ex:app-protocol>
    </emma:endpoint>

    <emma:endpoint id="endpoint2"
        emma:endpoint-role="source"
        emma:endpoint-address="136.62.72.104"
        emma:port-num="50204"
        emma:port-type="rtp"
        emma:endpoint-pair-ref="endpoint1"
        emma:media-type="audio/dsr-202212; rate:8000; maxptime:40"
        emma:service-name="travel"
        emma:mode="voice">
      <ex:app-protocol>SIP</ex:app-protocol>
    </emma:endpoint>
  </emma:endpoint-info>

  <emma:interpretation id="int1"
      emma:start="1087995961542" emma:end="1087995963542"
      emma:endpoint-info-ref="audio-channel-1"
emma:me