W3C

Use Cases for Possible Future EMMA Features

W3C Working Group Note 15 December 2009

This version:
http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215
Latest version:
http://www.w3.org/TR/emma-usecases
Previous version:
This is the first publication.
Editor:
Michael Johnston, AT&T
Authors:
Deborah A. Dahl, Invited Expert
Ingmar Kliche, Deutsche Telekom AG
Paolo Baggia, Loquendo
Daniel C. Burnett, Voxeo
Felix Burkhardt, Deutsche Telekom AG
Kazuyuki Ashimura, W3C

Abstract

The EMMA: Extensible MultiModal Annotation specification defines an XML markup language for capturing and providing metadata on the interpretation of inputs to multimodal systems. Throughout the implementation report process and discussion since EMMA 1.0 became a W3C Recommendation, a number of new possible use cases for the EMMA language have emerged. These include the use of EMMA to represent multimodal output, biometrics, emotion, sensor data, multi-stage dialogs, and interactions with multiple users. In this document, we describe these use cases and illustrate how the EMMA language could be extended to support them.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is a W3C Working Group Note published on 15 December 2009. This is the first publication of this document and it represents the views of the W3C Multimodal Interaction Working Group at the time of publication. The document may be updated as new technologies emerge or mature. Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document is one of a series produced by the Multimodal Interaction Working Group, part of the W3C Multimodal Interaction Activity. Since EMMA 1.0 became a W3C Recommendation, a number of new possible use cases for the EMMA language have emerged, e.g., the use of EMMA to represent multimodal output, biometrics, emotion, sensor data, multi-stage dialogs, and interactions with multiple users. The Working Group has therefore been working on a document capturing use cases and issues for a series of possible extensions to EMMA. The intention of publishing this Working Group Note is to seek feedback on the various different use cases.

Comments on this document can be sent to www-multimodal@w3.org, the public forum for discussion of the W3C's work on Multimodal Interaction. To subscribe, send an email to www-multimodal-request@w3.org with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1. Introduction
2. EMMA use cases
    2.1 Incremental results for streaming modalities such as haptic, ink, monologues, dictation
    2.2 Representing biometric information
    2.3 Representing emotion in EMMA
    2.4 Richer semantic representations in EMMA
    2.5 Representing system output in EMMA
        2.5.1 Abstracting output from specific modality or output language
        2.5.2 Coordination of outputs distributed over multiple different modalities
    2.6 Representation of dialogs in EMMA
    2.7 Logging, analysis, and annotation
        2.7.1 Log analysis
        2.7.2 Log annotation
    2.8 Multisentence Inputs
    2.9 Multi-participant interactions
    2.10 Capturing sensor data such as GPS in EMMA
    2.11 Extending EMMA from NLU to also represent search or database retrieval results
    2.12 Supporting other semantic representation forms in EMMA
General References
Acknowledgements

1. Introduction

This document presents a set of use cases for possible new features of the Extensible MultiModal Annotation (EMMA) markup language. EMMA 1.0 was designed primarily to be used as a data interchange format by systems that provide semantic interpretations for a variety of inputs, including, but not necessarily limited to, speech, natural language text, GUI, and ink input. EMMA 1.0 provides a set of elements for containing the various stages of processing of a user's input and a set of elements and attributes for specifying various kinds of metadata such as confidence scores and timestamps. EMMA 1.0 became a W3C Recommendation on February 10, 2009.

A number of possible extensions to EMMA 1.0 have been identified through discussions with other standards organizations, implementers of EMMA, and internal discussions within the W3C Multimodal Interaction Working Group. This document focusses on the following use cases:

  1. Representing incremental results for streaming modalities such as haptics, ink, monologues, and dictation, where it is desirable to have partial results available before the full input finishes.
  2. Representing biometric results such as the results of speaker verification or speaker identification (briefly covered in EMMA 1.0).
  3. Representing emotion, for example, as conveyed by intonation patterns, facial expression, or lexical choice.
  4. Richer semantic representations, for example, integrating EMMA application semantics with ontologies.
  5. Representing system output in addition to user input, including topics such as:
    1. Isolating presentation logic from dialog/interaction management.
    2. Coordination of outputs distributed over multiple different modalities.
  6. Support for archival functions such as logging, human annotation of inputs, and data analysis.
  7. Representing full dialogs and multi-sentence inputs in addition to single inputs.
  8. Representing multi-participant interactions.
  9. Representing sensor data such as GPS input.
  10. Representing the results of database queries or search.
  11. Support for forms of representation of application semantics other than XML, such as JSON.

It may be possible to achieve support for some of these features without modifying the language, through the use of the extensibility mechanisms of EMMA 1.0, such as the <emma:info> element and application-specific semantics; however, this would significantly reduce interoperability among EMMA implementations. If features are of general value then it would be beneficial to define standard ways of implementing them within the EMMA language. Additionally, extensions may be needed to support additional new kinds of input modalities such as multi-touch and accelerometer input.
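
As a rough illustration of this trade-off, a processor could already carry non-standard metadata, such as streaming status for the use case in Section 2.1, inside <emma:info> using application-specific markup. The <stream> element and its attributes below are purely illustrative and not part of any standard; a processor that does not know about them would simply ignore them, which is precisely the interoperability limitation noted above.

<emma:emma
  version="1.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:tokens="to friday at four">
      <emma:literal>to friday at four</emma:literal>
    <!-- illustrative application-specific workaround using emma:info;
         the stream element and its attributes are not standardized -->
    <emma:info>
      <stream id="id1" seqNr="1" progress="end"/>
    </emma:info>
  </emma:interpretation>
</emma:emma>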

The W3C Membership and other interested parties are invited to review this document and send comments to the Working Group's public mailing list www-multimodal@w3.org (archive) .

2. EMMA use cases

2.1 Incremental results for streaming modalities such as haptic, ink, monologues, dictation

In EMMA 1.0, EMMA documents were assumed to be created for completed inputs within a given modality. However, there are important use cases where it would be beneficial to represent some level of interpretation of partial results before the input is complete. For example, in a dictation application, where inputs can be lengthy it is often desirable to show partial results to give feedback to the user while they are speaking. In this case, each new word is appended to the previous sequence of words. Another use case would be incremental ASR, either for dictation or dialog applications, where previous results might be replaced as more evidence is collected. As more words are recognized and provide more context, earlier word hypotheses may be updated. In this scenario it may be necessary to replace the previous hypothesis with a revised one.

In this section, we discuss how the EMMA standard could be extended to support incremental or streaming results in the processing of a single input. Some key considerations and areas for discussion are:

  1. Do we need an identifier for a particular stream? Or is emma:source sufficient? Subsequent messages (carrying information for a particular stream) may need to have the same identifier.
  2. Do we need a sequence number to indicate order? Or are timestamps sufficient (though optional)?
  3. Do we need to mark "begin", "in progress" and "end" of a stream? There are streams with a particular start and end, like a dictation. Note that sensors may never explicitly end a stream.
  4. Do we always append information? Or do we also replace previous data? A dictation application will probably append new text. But do we consider sensor data (such as GPS position or device tilt) as streaming or as "final" data?

In the example below for dictation, we show how three new attributes emma:streamId, emma:streamSeqNr, and emma:streamProgress could be used to annotate each result with metadata regarding its position and status within a stream of input. In this example, emma:streamId is an identifier which can be used to show that different emma:interpretation elements are members of the same stream. The emma:streamSeqNr attribute provides a numerical order for elements in the stream, while emma:streamProgress indicates the start of the stream (and whether to expect more interpretations within the same stream) and the end of the stream. This is an instance of the 'append' scenario for partial results in EMMA.

Participant Input EMMA
User Hi Joe the meeting has moved
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:confidence="0.75" 
    emma:tokens="Hi Joe the meeting has moved" 
    emma:streamId="id1" 
    emma:streamSeqNr="0" 
    emma:streamProgress="begin">
      <emma:literal>
      Hi Joe the meeting has moved
      </emma:literal>
  </emma:interpretation>
</emma:emma>
    
User to friday at four
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int2"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:confidence="0.75" 
    emma:tokens="to friday at four" 
    emma:streamId="id1" 
    emma:streamSeqNr="1" 
    emma:streamProgress="end">
      <emma:literal>
      to friday at four
      </emma:literal>
  </emma:interpretation>
</emma:emma>
    

In the example below, a speech recognition hypothesis for the whole string is updated once more words have been recognized. This is an instance of the 'replace' scenario for partial results in EMMA. Note that the emma:streamSeqNr is the same for each interpretation in this case.

Participant Input EMMA
User Is there a Pisa
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="dialog"
    emma:confidence="0.7"
    emma:tokens="is there a pisa" 
    emma:streamId="id2" 
    emma:streamSeqNr="0" 
    emma:streamProgress="begin">
      <emma:literal>
      is there a pisa
      </emma:literal>
  </emma:interpretation>
</emma:emma>
    
User Is there a pizza restaurant
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int2" 
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="dialog"
    emma:confidence="0.9"
    emma:tokens="is there a pizza restaurant" 
    emma:streamId="id2" 
    emma:streamSeqNr="0" 
    emma:streamProgress="end"> 
      <emma:literal>
      is there a pizza restaurant
      </emma:literal>
  </emma:interpretation>
</emma:emma>
    

One issue for the 'replace' case of incremental results is how to specify that a result replaces several of the previously received results. For example, a system could receive partial results consisting of each word of an utterance in turn, and then a final result which is the final recognition for the whole sequence of words. One approach to this problem would be to allow emma:streamSeqNr to specify a range of inputs to be replaced. For example, if the emma:streamSeqNr values for three single-word results were 1, 2, and 3, a final revised result could be marked as emma:streamSeqNr="1-3", indicating that it is a revised result for those three words.
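
A minimal sketch of how such a range-valued replacement might look, assuming the hypothetical range syntax for emma:streamSeqNr discussed above (the revised result replaces the interpretations previously sent with sequence numbers 1 through 3):

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <!-- hypothetical revised result replacing the single-word results
       sent earlier with emma:streamSeqNr 1, 2, and 3 -->
  <emma:interpretation id="int4"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="dialog"
    emma:tokens="is there a pizza restaurant"
    emma:streamId="id3"
    emma:streamSeqNr="1-3"
    emma:streamProgress="end">
      <emma:literal>
      is there a pizza restaurant
      </emma:literal>
  </emma:interpretation>
</emma:emma>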

One issue is whether timestamps might be used to track ordering instead of introducing new attributes. One problem is that timestamp attributes are not required and may not always be available. Also, as shown in the example, chunks of input in a stream may not always arrive in sequential order. Even with timestamps providing an order, some kind of 'begin' and 'end' flag (like emma:streamProgress) is needed to indicate the beginning and end of transmission of streamed input. Moreover, timestamps do not provide sufficient information to detect whether a message has been lost.

Another possibility to explore for representation of incremental results would be to use an <emma:sequence> element containing the interim results and a derived result which contains the combination.
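
A rough sketch of this alternative, assuming the interim results are collected in an <emma:sequence> within <emma:derivation> and the combined result points back to that sequence via <emma:derived-from>:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="combined1"
    emma:tokens="hi joe the meeting has moved to friday at four">
      <emma:literal>
      hi joe the meeting has moved to friday at four
      </emma:literal>
    <emma:derived-from resource="#seq1" composite="false"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:sequence id="seq1">
      <emma:interpretation id="part1"
        emma:tokens="hi joe the meeting has moved">
          <emma:literal>hi joe the meeting has moved</emma:literal>
      </emma:interpretation>
      <emma:interpretation id="part2"
        emma:tokens="to friday at four">
          <emma:literal>to friday at four</emma:literal>
      </emma:interpretation>
    </emma:sequence>
  </emma:derivation>
</emma:emma>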

Another issue to explore is the relationship between incremental results and the MMI lifecycle events within the MMI Architecture.

2.2 Representing biometric information

Biometric technologies include systems designed to identify someone, or verify a claim of identity, based on their physical or behavioral characteristics. These include speaker verification, speaker identification, face recognition, and iris recognition, among others. EMMA 1.0 provided some capability for representing the results of biometric analysis through values of the emma:function attribute such as "verification". However, it did not discuss the specifics of this use case in any detail. It may be worth exploring further considerations and consequences of using EMMA to represent biometric results. As one example, if different biometric results are represented in EMMA, this would simplify the process of fusing the outputs of multiple biometric technologies to obtain a more reliable overall result. It would also make it easier to take into account non-biometric claims of identity, such as the spoken statement "this is Kazuyuki", along with a speaker verification result based on the speaker's voice, since both would be represented in EMMA. In the following example, we have extended the set of values for emma:function to include "identification" for an interpretation showing the result of a biometric component that picks out an individual from a set of possible individuals (who is this person). This contrasts with "verification", which is used for verification of a particular user (are they who they say they are).

Example

Participant Input EMMA
user an image of a face
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:confidence="0.75"
    emma:medium="visual" 
    emma:mode="photograph" 
    emma:verbal="false" 
    emma:function="identification"> 
      <person>12345</person>
      <name>Mary Smith</name> 
  </emma:interpretation>
</emma:emma>
    

One direction to explore further is the relationship between work on messaging protocols for biometrics within the OASIS Biometric Identity Assurance Services (BIAS) standards committee and EMMA.
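
The fusion scenario described above, combining a non-biometric claim of identity with a biometric verification result, might be sketched as follows using the existing <emma:group> element; the application elements <claimed-identity> and <verified> are hypothetical and shown only for illustration:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:group id="identity1">
    <!-- non-biometric claim of identity from the spoken utterance -->
    <emma:interpretation id="claim1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="true"
      emma:function="dialog"
      emma:tokens="this is kazuyuki">
        <claimed-identity>kazuyuki</claimed-identity>
    </emma:interpretation>
    <!-- biometric verification result based on the same speech signal -->
    <emma:interpretation id="verif1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="false"
      emma:function="verification"
      emma:confidence="0.85">
        <verified>true</verified>
    </emma:interpretation>
    <emma:group-info>identity_claim_and_verification</emma:group-info>
  </emma:group>
</emma:emma>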

2.3 Representing emotion in EMMA

In addition to speech recognition, and other tasks such as speaker verification and identification, another kind of interpretation of speech that is of increasing importance is determination of the emotional state of the speaker, based on, for example, their prosody, lexical choice, or other features. This information can be used, for example, to make the dialog logic of an interactive system sensitive to the user's emotional state. Emotion detection can also use other modalities such as vision (facial expression, posture) and physiological sensors such as skin conductance measurement or blood pressure. Multimodal approaches where evidence is combined from multiple different modalities are also of significance for emotion classification.

The creation of a markup language for emotion has been a recent focus of attention in W3C. Work that began in the W3C Emotion Markup Language Incubator Group (EmotionML XG) has now transitioned to the W3C Multimodal Interaction Working Group, and the EmotionML language has been published as a Working Draft. One of the major use cases for that effort is: "Automatic recognition of emotions from sensors, including physiological sensors, speech recordings, facial expressions, etc., as well as from multi-modal combinations of sensors."

Given the similarities to the technologies and annotations used for other kinds of input processing (recognition, semantic classification) that are already captured in EMMA, it makes sense to explore the use of EMMA for capturing the emotional classification of inputs. Just as EMMA does not standardize the application markup for semantic results, however, it does not make sense to try to standardize emotion markup within EMMA. One promising approach is to combine the containers and metadata annotation of EMMA with the EmotionML markup, as shown in the following example.

Participant Input EMMA
user expression of boredom
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml">
  <emma:interpretation id="emo1"
    emma:start="1241035886246"
    emma:end="1241035888246"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="false"
    emma:signal="http://example.com/input345.amr"
    emma:media-type="audio/amr; rate:8000;"
    emma:process="engine:type=emo_class&vn=1.2”>
      <emo:emotion>
        <emo:intensity 
          value="0.1" 
          confidence="0.8"/>
        <emo:category 
          set="everydayEmotions" 
          name="boredom" 
          confidence="0.1"/>
      </emo:emotion>
  </emma:interpretation>
</emma:emma>

In this example, we use the capabilities of EMMA for describing the input signal, its temporal characteristics, modality, sampling rate, audio codec etc. and EmotionML is used to provide the specific representation of the emotion. Other EMMA container elements also have strong use cases for emotion recognition. For example, <emma:one-of> can be used to represent N-best lists of competing classifications of emotion. The <emma:group> element could be used to combine a semantic interpretation of a user input with an emotional classification, as illustrated in the following example. Note that all of the general properties of the signal can be specified on the <emma:group> element.

Participant Input EMMA
user spoken input "flights to boston tomorrow" to dialog system in angry voice
<emma:emma 
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml">
  <emma:group id="result1"
    emma:start="1241035886246"
    emma:end="1241035888246"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="false"
    emma:signal="http://example.com/input345.amr"
    emma:media-type="audio/amr; rate:8000;">
    <emma:interpretation id="asr1"
      emma:tokens="flights to boston tomorrow"
      emma:confidence="0.76"
      emma:process="engine:type=asr_nl&vn=5.2”>
        <flight>
          <dest>boston</dest>
          <date>tomorrow</date>
        </flight>
    </emma:interpretation>
    <emma:interpretation id="emo1"
      emma:process="engine:type=emo_class&vn=1.2”>
      <emo:emotion>
        <emo:intensity 
          value="0.3" 
          confidence="0.8"/>
        <emo:category 
          set="everydayEmotions" 
          name="anger" 
          confidence="0.8"/>
      </emo:emotion>
    </emma:interpretation>
    <emma:group-info>
    meaning_and_emotion
    </emma:group-info>
  </emma:group>
</emma:emma>

The element <emma:group> can also be used to capture groups of emotion detection results from individual modalities for combination by a multimodal fusion component or when automatic recognition results are described together with manually annotated data. This use case is inspired by Use case 2b (II) of the Emotion Incubator Group Report. The following example illustrates the grouping of three interpretations, namely: a speech analysis emotion classifier, a physiological emotion classifier measuring blood pressure, and a human annotator viewing video, for two different media files (from the same episode) that are synchronized via emma:start and emma:end attributes. In this case, the physiological reading is for a subinterval of the video and audio recording.

Participant Input EMMA
user audio, video, and physiological sensor of a test user acting with a new design.
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml">
  <emma:group id="result1">
    <emma:interpretation id="speechClassification1"      
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="false"
      emma:start="1241035884246"
      emma:end="1241035887246"
      emma:signal="http://example.com/video_345.mov"
      emma:process="engine:type=emo_voice_classifier”>
        <emo:emotion>
          <emo:category 
            set="everydayEmotions" 
            name="anger" 
            confidence="0.8"/>
        </emo:emotion>
    </emma:interpretation>
    <emma:interpretation id="bloodPressure1"               
      emma:medium="tactile"
      emma:mode="blood_pressure"
      emma:verbal="false"
      emma:start="1241035885300"
      emma:end="1241035886900"
      emma:signal="http://example.com/bp_signal_345.cvs"
      emma:process="engine:type=emo_physiological_classifier”>
        <emo:emotion>
          <emo:category 
            set="everydayEmotions" 
            name="anger" 
            confidence="0.6"/>
        </emo:emotion>
    </emma:interpretation>
    <emma:interpretation id="humanAnnotation1"               
      emma:medium="visual"
      emma:mode="video"
      emma:verbal="false"
      emma:start="1241035884246"
      emma:end="1241035887246"
      emma:signal="http://example.com/video_345.mov"
      emma:process="human:type=labeler&id=1”>
        <emo:emotion>
          <emo:category 
            set="everydayEmotions" 
            name="fear" 
            confidence="0.6"/>
        </emo:emotion>
    </emma:interpretation>
    <emma:group-info>
    several_emotion_interpretations
    </emma:group-info>
  </emma:group>
</emma:emma>

A combination of <emma:group> and <emma:derivation> could be used to represent a combined emotional analysis resulting from analysis of multiple different modalities of the user's behavior. The <emma:derived-from> and <emma:derivation> elements can be used to capture both the fused result and the contributing inputs in a single EMMA document. In the following example, visual analysis of user activity and analysis of their speech have been combined by a multimodal fusion component to provide a combined multimodal classification of the user's emotional state. The specifics of the multimodal fusion algorithm are not relevant here, or to EMMA in general. Note, though, that in this case the multimodal fusion appears to have compensated for uncertainty in the visual analysis, which gave two results with equal confidence, one for fear and one for anger. The <emma:one-of> element is used to capture the N-best list of multiple competing results from the video classifier.

Participant Input EMMA
user multimodal fusion of emotion classification of user based on analysis of voice and video
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml">
  <emma:interpretation id="multimodalClassification1" 
    emma:medium="acoustic,visual"
    emma:mode="voice,video"
    emma:verbal="false"
    emma:start="1241035884246"
    emma:end="1241035887246"
    emma:process="engine:type=multimodal_fusion”>
      <emo:emotion>
        <emo:category 
          set="everydayEmotions" 
          name="anger" 
          confidence="0.7"/>
      </emo:emotion>
    <emma:derived-from ref="mmgroup1" composite="true"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:group id="mmgroup1">
      <emma:interpretation id="speechClassification1"      
        emma:medium="acoustic"
        emma:mode="voice"
        emma:verbal="false"
        emma:start="1241035884246"
        emma:end="1241035887246"
        emma:signal="http://example.com/video_345.mov"
        emma:process="engine:type=emo_voice_classifier”>
          <emo:emotion>
            <emo:category 
              set="everydayEmotions" 
              name="anger" 
              confidence="0.8"/>
          </emo:emotion>
      </emma:interpretation>
      <emma:one-of id="video_nbest"               
        emma:medium="visual"
        emma:mode="video"
        emma:verbal="false"
        emma:start="1241035884246"
        emma:end="1241035887246"
        emma:signal="http://example.com/video_345.mov"
        emma:process="engine:type=video_classifier">
        <emma:interpretation id="video_result1"
          <emo:emotion>
            <emo:category 
              set="everydayEmotions" 
              name="anger" 
              confidence="0.5"/>
          </emo:emotion>
        </emma:interpretation>
        <emma:interpretation id="video_result2"
          <emo:emotion>
            <emo:category 
              set="everydayEmotions" 
              name="fear" 
              confidence="0.5"/>
          </emo:emotion>
        </emma:interpretation>
      </emma:one-of>
      <emma:group-info>
      emotion_interpretations
      </emma:group-info>
    </emma:group>
  </emma:derivation>
</emma:emma>

One issue which needs to be addressed is the relationship between EmotionML confidence attribute values and emma:confidence values. Could the emma:confidence value be used as an overall confidence value for the emotion result, or should confidence values appear only within the EmotionML markup, since confidence there is used for different dimensions of the result? If a series of possible emotion classifications are contained in <emma:one-of>, should they be ordered by the EmotionML confidence values?

2.4 Richer semantic representations in EMMA

Enriching the semantic information represented in EMMA would be helpful for certain use cases. For example, the concepts in an EMMA application semantics representation might include references to concepts in an ontology such as WordNet. In the following example, inputs to a machine translation system are annotated in the application semantics with specific WordNet senses, which are used to distinguish among different senses of the words. A translation system might also make use of a sense disambiguator to represent the probabilities of different senses of a word; for example, "spicy" in the example has two possible WordNet senses.

Participant Input EMMA
user I love to eat Mexican food because it is spicy
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns="http://example.com/universal_translator">
  <emma:interpretation id="spanish">
    <result xml:lang="es">
    Adoro alimento mejicano porque es picante.
    </result>
    <emma:derived-from resource="#english" composite="false"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="english"
      emma:tokens="I love to eat Mexican food
                   because it is spicy">
      <assertion>
        <interaction
          wordnet="1828736"
          wordnet-desc="love, enjoy (get pleasure from)"
          token="love">
          <experiencer
            reference="first" 
            token="I">
                <attribute quantity="single"/>
          </experiencer>
          <attribute time="present"/>
          <content>
            <interaction wordnet="1157345" 
              wordnet-desc="eat (take in solid food)"
              token="to eat">
              <object id="obj1"
                wordnet="7555863"
                wordnet-desc="food, solid food (any solid 
                              substance (as opposed to 
                              liquid) that is used as a source
                              of nourishment)"
                        token="food">
                <restriction 
                  wordnet="3026902"
                  wordnet-desc="Mexican (of or relating
                                to Mexico or its inhabitants)"
                                token="Mexican"/>
              </object>
            </interaction>
          </content>
          <reason token="because">
            <experiencer reference="third" 
              target="obj1" token="it"/>
                <attribute time="present"/>
                <one-of token="spicy">
                  <modification wordnet="2397732"
                    wordnet-desc="hot, spicy (producing a 
                                  burning sensation on 
                                  the taste nerves)" 
                    confidence="0.8"/>
                  <modification wordnet="2398378"
                    wordnet-desc="piquant, savory, 
                                  savoury, spicy, zesty
                                  (having an agreeably
                                  pungent taste)"
                    confidence="0.4"/>
                </one-of>
           </reason>
         </interaction>
       </assertion>
     </emma:interpretation>
  </emma:derivation>
</emma:emma>

In addition to sense disambiguation it could also be useful to relate concepts to superordinate concepts in some ontology. For example, it could be useful to know that O'Hare is an airport and Chicago is a city, even though they might be used interchangeably in an application. For example, in an air travel application a user might say "I want to fly to O'Hare" or "I want to fly to Chicago".
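
A minimal sketch of how such ontology references might be attached in the application semantics; the ontology and concept attributes below are hypothetical, application-level annotations rather than EMMA markup:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:tokens="I want to fly to O'Hare">
      <!-- hypothetical application-level ontology reference indicating
           that the destination value denotes an airport -->
      <destination
        ontology="http://example.com/travel-ontology"
        concept="Airport">O'Hare</destination>
  </emma:interpretation>
</emma:emma>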

2.5 Representing system output in EMMA

EMMA 1.0 was explicitly limited in scope to representation of the interpretation of user inputs. Most interactive systems also produce system output, and one of the major possible extensions of the EMMA language would be to provide support for representation of the outputs made by the system in addition to the user inputs. One advantage of having an EMMA representation for system output is that system logs can have a unified markup representation across input and output for viewing and analyzing user/system interactions. In this section, we consider two different use cases for the addition of output representation to EMMA.

2.5.1 Abstracting output from specific modality or output language

It is desirable for a multimodal dialog designer to be able to isolate dialog flow (for example, SCXML code) from the details of specific utterances produced by a system. This can be achieved by using a presentation or media planning component that takes the abstract intent from the system and creates one or more modality-specific presentations. In addition to isolating dialog logic from specific modality choice, this can also make it easier to support different technologies for the same modality. For example, in the example below the GUI technology is HTML, but abstracting output would also support using a different GUI technology such as Flash or SVG. If EMMA is extended to support output, then EMMA documents could be used for communication from the dialog manager to the presentation planning component, and also potentially for the documents generated by the presentation component, which could embed specific markup such as HTML and SSML. Just as there can be multiple different stages of processing of a user input, there may be multiple stages of processing of an output, and the mechanisms of EMMA can be used to capture and provide metadata on these various stages of output processing.

Potential benefits for this approach include:

  1. Accessibility: it would be useful for an application to be able to accommodate users who might have an assistive device or devices without requiring special logic or even special applications.
  2. Device independence: An application could separate the flow in the IM from the details of the presentation. This might be especially useful if there are a lot of target devices with different types of screens, cameras, or possibilities for haptic output.
  3. Adapting to user preferences: An application could accommodate different dynamic preferences, for example, switching to visual presentation from speech in public places without disturbing the application flow.

In the following example, we consider the introduction of a new EMMA element, <emma:presentation> which is the output equivalent of the input element <emma:interpretation>. Like <emma:interpretation> this element can take emma:medium and emma:mode attributes classifying the specific modality. It could also potentially take timestamp annotations indicating the time at which the output should be produced. One issue is whether timestamps should be used for the intended time of production or for the actual time of production and how to capture both. Relative timestamps could be used to anchor the planned time of presentation to another element of system output. In this example we show how the emma:semantic-rep attribute proposed in Section 2.12 could potentially be used to indicate the markup language of the output.

Participant Output EMMA
IM (step 1) semantics of "what would you like for lunch?"
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:presentation>
    <question>
      <topic>lunch</topic>
      <experiencer>second person</experiencer>
      <object>questioned</object>
    </question>
  </emma:presentation>
</emma:emma>
    

or, more simply, without natural language generation:

<emma:emma> 
  <emma:presentation>
    <text>what would you like for lunch?</text>
  </emma:presentation>
</emma:emma>
    
presentation manager (voice output) text "what would you like for lunch?"
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:presentation 
    emma:medium="acoustic" 
    emma:mode="voice"
    emma:verbal="true" 
    emma:function="dialog" 
    emma:semantic-rep="ssml">
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US">
          what would you like for lunch</speak> 
  </emma:presentation>
</emma:emma>
  
presentation manager (GUI output) text "what would you like for lunch?"
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:presentation 
    emma:medium="visual" 
    emma:mode="graphics"
    emma:verbal="true" 
    emma:function="dialog"  
    emma:semantic-rep="html">
      <html>
        <body>
          <p>what would you like for lunch?"</p>
          <input name="" type="text">
          <input type="submit" name="Submit" 
           value="Submit">
        </body>
      </html>
  </emma:presentation>
</emma:emma>
    

2.5.2 Coordination of outputs distributed over multiple different modalities

A critical issue in enabling effective multimodal output is the synchronization of outputs in different output media. For example, text-to-speech output or prompts may be coordinated with graphical outputs such as highlighting of items in an HTML table. EMMA markup could potentially be used to indicate that elements in each medium should be coordinated in their presentation. In the following example, a new attribute emma:sync is used to indicate the relationship between a <mark> in SSML and an element to be highlighted in HTML content. The emma:process attribute could be used to identify the presentation planning component. Again emma:semantic-rep is used to indicate the embedded markup language.

Participant Output EMMA
system Coordinated presentation of table with TTS
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:group id="gp1"
    emma:medium="acoustic,visual" 
    emma:mode="voice,graphics" 
    emma:process="http://example.com/presentation_planner">
    <emma:presentation id="pres1"
      emma:medium="acoustic" 
      emma:mode="voice" 
      emma:verbal="true" 
      emma:function="dialog" 
      emma:semantic-rep="ssml"> 
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US"> 
        Item 4 <mark emma:sync="123"/> costs fifteen dollars.
      </speak> 
    </emma:presentation> 
    <emma:presentation id="pres2"
      emma:medium="visual" 
      emma:mode="graphics" 
      emma:verbal="true" 
      emma:function="dialog" 
      emma:semantic-rep="html" 
      <table xmlns="http://www.w3.org/1999/xhtml">
        <tr>
          <td emma:sync="123">Item 4</td>
          <td>15 dollars</td>
        </tr>
      </table>
    </emma:presentation> 
  </emma:group>
</emma:emma>

One issue to be considered is the potential role of the Synchronized Multimedia Integration Language (SMIL) for capturing multimodal output synchronization. SMIL markup for multimedia presentation could potentially be embedded within EMMA markup coming from an interaction manager to a client for rendering.
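
As a rough sketch only, a SMIL fragment might be embedded within an <emma:presentation> element in much the same way as the SSML and HTML content above; the SMIL content here is purely illustrative and again assumes the proposed emma:semantic-rep annotation:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:presentation id="pres3"
    emma:medium="acoustic,visual"
    emma:mode="voice,graphics"
    emma:function="dialog"
    emma:semantic-rep="smil">
      <!-- illustrative SMIL fragment: play a prompt and display an image in parallel -->
      <smil xmlns="http://www.w3.org/ns/SMIL">
        <body>
          <par>
            <audio src="http://example.com/prompt_item4.wav"/>
            <img src="http://example.com/item4.png" dur="3s"/>
          </par>
        </body>
      </smil>
  </emma:presentation>
</emma:emma>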

2.6 Representation of dialogs in EMMA

The scope of EMMA 1.0 was explicitly limited to representation of single turns of user input. For logging, analysis, and training purposes it could be useful to be able to represent multi-stage dialogs in EMMA. The following example shows a sequence of two EMMA documents, where the first is a request from the system and the second is the user response. A new attribute emma:in-response-to is used to relate the system output to the user input. EMMA already has an attribute emma:dialog-turn used to provide an indicator of the turn of interaction.

Example

Participant Input EMMA
system where would you like to go?
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:presentation id="pres1" 
    emma:dialog-turn="turn1" 
    emma:in-response-to="initial">
      <prompt>
      where would you like to go?
      </prompt>
  </emma:presentation> 
</emma:emma> 
 
user New York
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:interpretation id="int1" 
    emma:dialog-turn="turn2"
    emma:tokens="new york"
    emma:in-response-to="pres1"> 
      <location>
      New York 
      </location>
  </emma:interpretation>
</emma:emma> 

In this case, each utterance is still a single EMMA document, and markup is being used to encode the fact that the utterances are part of an ongoing dialog. Another possibility would be to use EMMA markup to contain a whole dialog within a single EMMA document. For example, a flight query dialog could be represented as follows using <emma:sequence>:

Example

Participant Input EMMA
user flights to boston
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:sequence> 
    <emma:interpretation id="user1" 
      emma:dialog-turn="turn1" 
      emma:in-response-to="initial">
      <emma:literal>
      flights to boston
      </emma:literal>
    </emma:interpretation> 
    <emma:presentation id="sys1" 
      emma:dialog-turn="turn2" 
      emma:in-response-to="user1"> 
      <prompt>
      traveling to boston, 
      which departure city
      </prompt>
    </emma:presentation>
    <emma:interpretation id="user2" 
      emma:dialog-turn="turn3" 
      emma:in-response-to="sys1">
      <emma:literal>
      san francisco
      </emma:literal>
    </emma:interpretation> 
    <emma:presentation id="sys2" 
      emma:dialog-turn="turn4" 
      emma:in-response-to="user2"> 
      <prompt>
      departure date
      </prompt>
    </emma:presentation>
    <emma:interpretation id="user3" 
      emma:dialog-turn="turn5" 
      emma:in-response-to="sys2">
      <emma:literal>
      next thursday
      </emma:literal>
    </emma:interpretation> 
  </emma:sequence> 
</emma:emma> 
      
system traveling to Boston, which departure city?
user San Francisco
system departure date
user next thursday

Note that in this example with <emma:sequence> the emma:in-response-to attribute is still important since there is no guarantee that an utterance in a dialog is a response to the previous utterance. For example, a sequence of utterances may all be from the user.

One issue that arises with the representation of whole dialogs is that the resulting EMMA documents with full sets of metadata may become quite large. One possible extension that could help with this would be to allow the value of emma:in-response-to to be URI valued so that it can refer to another EMMA document.
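
A minimal sketch of such a URI-valued reference, where the URI and fragment identifier are hypothetical and simply point at the logged EMMA document containing the system prompt:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
    emma:dialog-turn="turn2"
    emma:tokens="new york"
    emma:in-response-to="http://example.com/logs/session42/turn1.emma#pres1">
      <location>New York</location>
  </emma:interpretation>
</emma:emma>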

2.7 Logging, analysis, and annotation

EMMA was initially designed to facilitate communication among components of an interactive system. It has become clear over time that the language can also play an important role in logging of user/system interactions. In this section, we consider possible advantages of EMMA for log analysis and illustrate how elements such as <emma:derived-from> could be used to capture and provide metadata on annotations made by human annotators.

2.7.1 Log analysis

The proposal above for representing system output in EMMA would support after-the-fact analysis of dialogs. For example, if both the system's and the user's utterances are represented in EMMA, it should be much easier to examine relationships between factors such as how the wording of prompts might affect users' responses, or even the modality that users select for their responses. It would also be easier to study timing relationships between the system prompt and the user's responses. For example, prompts that are confusing might consistently elicit longer times before the user starts speaking. This would be useful even without a presentation manager or fission component. In the following example, it might be useful to look into the relationship between the end of the prompt and the start of the user's response. We use here the emma:in-response-to attribute suggested in Section 2.6 for the representation of dialogs in EMMA.

Example

Participant Input EMMA
system where would you like to go?
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:presentation id="pres1" 
    emma:dialog-turn="turn1"
    emma:in-response-to="initial"
    emma:start="1241035886246"
    emma:end="1241035888306">
    <prompt>
    where would you like to go?
    </prompt>
  </emma:presentation>
</emma:emma>
user New York
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1" 
    emma:dialog-turn="turn2"
    emma:in-response-to="pres1"
    emma:start="1241035891246"
    emma:end="1241035893000"">
    <destination>
    New York
    </destination>
  </emma:interpretation>
</emma:emma>
    

2.7.2 Log annotation

EMMA is generally used to show the recognition, semantic interpretation, etc. assigned to inputs based on machine processing of the user input. Another potential use case is to provide a mechanism for showing the interpretation assigned to an input by a human annotator, using <emma:derived-from> to show the relationship between the input received and the annotation. The <emma:one-of> element can then be used to show multiple competing annotations for an input. The <emma:group> element could be used to contain multiple different kinds of annotation on a single input. One question here is whether emma:process can be used for identification of the labeller, and whether there is a need for any additional EMMA machinery to better support this use case. In these examples, <emma:literal> contains mixed content with text and elements. This is in keeping with the EMMA 1.0 schema.

One issue that arises concerns the meaning of an emma:confidence value on an annotated interpretation. It may be preferable to have another attribute for annotator confidence rather than overloading the current emma:confidence.

Another issue concerns mixing of system results and human annotation. Should these be grouped, or is the annotation derived from the system's interpretation? It would also be useful to capture the time of the annotation. The current timestamps are used for the time of the input itself; where should annotation timestamps be recorded?

It would also be useful to have a way to specify open-ended information about the annotator, such as their native language, profession, or experience. One approach would be to have a new attribute, e.g. emma:annotator, with a URI value that could point to a description of the annotator.

For very common annotations, it could be useful to have, in addition to emma:tokens, another dedicated annotation to indicate the annotated transcription, for example, emma:annotated-tokens or emma:transcription.
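
A speculative sketch pulling together several of the possibilities discussed above; the emma:annotator, emma:annotator-confidence, emma:annotation-start/end, and emma:annotated-tokens attributes are all hypothetical and none of them are part of EMMA 1.0:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <!-- all emma:annotator*, emma:annotation-* and emma:annotated-tokens
       attributes below are hypothetical proposals, not EMMA 1.0 markup -->
  <emma:interpretation id="annotation1"
    emma:annotator="http://example.com/annotators/michael"
    emma:annotator-confidence="0.90"
    emma:annotation-start="1241700000000"
    emma:annotation-end="1241700040000"
    emma:annotated-tokens="flights from san francisco to boston
                           on the fourth of september">
      <emma:literal>
      flights from <src>san francisco</src> to 
      <dest>boston</dest> on 
      <date>the fourth of september</date>
      </emma:literal>
    <emma:derived-from resource="#asr1"/>
  </emma:interpretation>
</emma:emma>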

In the following example, we show how emma:interpretation and emma:derived-from could be used to capture the annotation of an input.

Participant Input EMMA
user

In this example the user has said:

"flights from boston to san francisco leaving on the fourth of september"

and the semantic interpretation here is a semantic tagging of the utterance done by a human annotator. emma:process is used to provide details about the annotation

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="annotation1"
    emma:process="annotate:type=semantic&annotator=michael"
    emma:confidence="0.90">
      <emma:literal>
      flights from <src>san francisco</src> to 
      <dest>boston</dest> on 
      <date>the fourth of september</date>
      </emma:literal>
    <emma:derived-from resource="#asr1"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="asr1"
      emma:medium="acoustic" 
      emma:mode="voice"
      emma:function="dialog" 
      emma:verbal="true"
      emma:lang="en-US" 
      emma:start="1241690021513" 
      emma:end="1241690023033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&version=watson6"
      emma:confidence="0.80">
        <emma:literal>
        flights from san francisco 
        to boston on the fourth of september
        </emma:literal>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>

Taking this example a step further, <emma:group> could be used to group annotations made by multiple different annotators of the same utterance:

Participant Input EMMA
user

In this example the user has said:

"flights from boston to san francisco leaving on the fourth of september"

and the semantic interpretation here is a semantic tagging of the utterance done by two different human annotators. emma:process is used to provide details about the annotation

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:group emma:confidence="1.0">
    <emma:interpretation id="annotation1"
      emma:process="annotate:type=semantic&annotator=michael"
      emma:confidence="0.90">
        <emma:literal>
        flights from <src>san francisco</src> 
        to <dest>boston</dest> 
        on <date>the fourth of september</date>
        </emma:literal>
      <emma:derived-from resource="#asr1"/>
    </emma:interpretation>
    <emma:interpretation id="annotation2"
      emma:process="annotate:type=semantic&annotator=debbie"
      emma:confidence="0.90">
        <emma:literal>
        flights from <src>san francisco</src> 
        to <dest>boston</dest> on 
        <date>the fourth of september</date>
        </emma:literal>
      <emma:derived-from resource="#asr1"/>
    </emma:interpretation>
    <emma:group-info>semantic_annotations</emma:group-info>
  </emma:group>
  <emma:derivation>
    <emma:interpretation id="asr1"
      emma:medium="acoustic" 
      emma:mode="voice"
      emma:function="dialog" 
      emma:verbal="true"
      emma:lang="en-US" 
      emma:start="1241690021513" 
      emma:end="1241690023033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&version=watson6"
      emma:confidence="0.80">
        <emma:literal>
        flights from san francisco to boston
        on the fourth of september
        </emma:literal>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>

2.8 Multisentence Inputs

For certain applications, it is useful to be able to represent the semantics of multi-sentence inputs, which may be in one or more modalities such as speech (e.g. voicemail), text (e.g. email), or handwritten input. One application use case is summarizing a voicemail or email. We develop this example below.

There are at least two possible approaches to addressing this use case.

  1. If there is no reason to distinguish the individual sentences of the input or interpret them individually, the entire input could be included as the value of the emma:tokens attribute of an <emma:interpretation> or <emma:one-of> element, with the semantics of the whole input represented as the content of the <emma:interpretation>. Although in principle there is no upper limit on the length of an emma:tokens attribute, in practice this approach might be cumbersome for longer or more complicated texts.
  2. If more structure is required, the interpretations of the individual sentences in the input could be grouped as individual <emma:interpretation> elements under an <emma:sequence> element. A single unified semantics representing the meaning of the entire input could then be related to that sequence using <emma:derived-from>, as sketched at the end of this section.
The example below illustrates the first approach.

Example

Participant Input EMMA
user

Hi Group,

You are all invited to lunch tomorrow at Tony's Pizza at 12:00. Please let me know if you're planning to come so that I can make reservations. Also let me know if you have any dietary restrictions. Tony's Pizza is at 1234 Main Street. We will be discussing ways of using EMMA.

Debbie

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation 
    emma:tokens="Hi Group, You are all invited to 
    lunch tomorrow at Tony's Pizza at 12:00. 
    Please let me know if you're planning to
    come so that I can make reservations. 
    Also let me know if you have any dietary
    restrictions. Tony's Pizza is at 1234 
    Main Street. We will be discussing 
    ways of using EMMA." >
      <business-event>lunch</business-event>
      <host>debbie</host>
      <attendees>group</attendees>
      <location>
        <name>Tony's Pizza</name>
        <address> 1234 Main Street</address>
      </location>
      <date> tuesday, March 24</date>
      <needs-rsvp>true</needs-rsvp>
      <needs-restrictions>true</needs-restrictions>
      <topic>ways of using EMMA</topic>
  </emma:interpretation>
</emma:emma>
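
A rough sketch of the second approach, assuming each sentence receives its own interpretation inside an <emma:sequence> and a unified interpretation is derived from that sequence; only the first two sentences of the message are shown for brevity:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="summary1">
      <business-event>lunch</business-event>
      <location><name>Tony's Pizza</name></location>
      <needs-rsvp>true</needs-rsvp>
    <emma:derived-from resource="#sentences1" composite="false"/>
  </emma:interpretation>
  <emma:derivation>
    <emma:sequence id="sentences1">
      <emma:interpretation id="s1"
        emma:tokens="You are all invited to lunch tomorrow at Tony's Pizza at 12:00.">
          <business-event>lunch</business-event>
          <location><name>Tony's Pizza</name></location>
      </emma:interpretation>
      <emma:interpretation id="s2"
        emma:tokens="Please let me know if you're planning to come so that I can make reservations.">
          <needs-rsvp>true</needs-rsvp>
      </emma:interpretation>
    </emma:sequence>
  </emma:derivation>
</emma:emma>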

      

2.9 Multi-participant interactions

EMMA 1.0 primarily focussed on the interpretation of inputs from a single user. Both for annotation of human-human dialogs and for the emerging systems which support dialog or multimodal interaction with multiple participants (such as multimodal systems for meeting analysis), it is important to support annotation of interactions involving multiple different participants. The proposals above for capturing dialog can play an important role. One possible further extension would be to add specific markup for annotation of the user making a particular contribution. In the following example, we use an attribute emma:participant to identify the participant contributing each response to the prompt.

Participant Input EMMA
system Please tell me your lunch orders
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:presentation id="pres1" 
    emma:dialog-turn="turn1"
    emma:in-response-to="initial"
    emma:start="1241035886246"
    emma:end="1241035888306">
      <prompt>please tell me your lunch orders</prompt>
  </emma:presentation>
</emma:emma>
user1 I'll have a mushroom pizza
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int1" 
    emma:dialog-turn="turn2"
    emma:in-response-to="pres1"
    emma:participant="user1"
    emma:start="1241035891246"
    emma:end="1241035893000"">
      <pizza>
        <topping>
        mushroom
        </topping>
      </pizza>
  </emma:interpretation>
</emma:emma>
    
user2 I'll have a pepperoni pizza.
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:interpretation id="int2" 
    emma:dialog-turn="turn3"
    emma:in-response-to="pres1"
    emma:participant="user2"
    emma:start="1241035896246"
    emma:end="1241035899000"">
      <pizza>
        <topping>
        pepperoni
        </topping>
      </pizza>
  </emma:interpretation>
</emma:emma>
    

2.10 Capturing sensor data such as GPS in EMMA

The multimodal examples described in the EMMA 1.0 specification include the combination of spoken input with a location specified by touch or pen. With the increase in availability of GPS and other location-sensing technology such as cell tower triangulation in mobile devices, it is desirable to provide a method for annotating inputs with the device location and, in some cases, fusing the GPS information with the spoken command in order to derive a complete interpretation. GPS information could potentially be determined using the Geolocation API Specification from the Geolocation Working Group and then encoded into an EMMA result sent to a server for fusion.

One possibility using the current EMMA capabilities is to use <emma:group> to associate GPS markup with the semantics of a spoken command. For example, the user might say "where is the nearest pizza place?" and the interpretation of the spoken command is grouped with markup capturing the GPS sensor data. This example uses the existing <emma:group> element and extends the set of values of emma:medium and emma:mode to include "sensor" and "gps" respectively.

Participant Input EMMA
user where is the nearest pizza place?
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:group>
    <emma:interpretation 
      emma:tokens="where is the nearest pizza place" 
      emma:confidence="0.9" 
      emma:medium="acoustic" 
      emma:mode="voice"
      emma:start="1241035887111" 
      emma:end="1241035888200" 
      emma:process="reco:type=asr&version=asr_eng2.4" 
      emma:media-type="audio/amr; rate=8000" 
      emma:lang="en-US">
        <category>pizza</category>
    </emma:interpretation> 
    <emma:interpretation
      emma:medium="sensor" 
      emma:mode="gps" 
      emma:start="1241035886246" 
      emma:end="1241035886246">
        <lat>40.777463</lat>
        <lon>-74.410500</lon>
        <alt>0.2</alt> 
    </emma:interpretation> 
    <emma:group-info>geolocation</emma:group-info>
  </emma:group> 
</emma:emma>
    
GPS (GPS coordinates)

Another, more abbreviated, way to incorporate sensor information would be to have spatial correlates of the timestamps and allow for location stamping of user inputs, e.g. emma:lat and emma:lon attributes that could appear on EMMA container elements to indicate the location where the input was produced.
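
A minimal sketch of this more abbreviated form, assuming hypothetical emma:lat and emma:lon location-stamping attributes on the container element:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <!-- emma:lat and emma:lon are hypothetical location-stamping attributes -->
  <emma:interpretation id="int1"
    emma:tokens="where is the nearest pizza place"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:start="1241035887111"
    emma:end="1241035888200"
    emma:lat="40.777463"
    emma:lon="-74.410500">
      <category>pizza</category>
  </emma:interpretation>
</emma:emma>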

2.11 Extending EMMA from NLU to also represent search or database retrieval results

In many of the use cases considered so far, EMMA is used for representation of the results of speech recognition and then for the results of natural language understanding, and possibly multimodal fusion. In systems used for voice search, the next step is often to conduct a search and extract a set of records or documents. Strictly speaking, this stage of processing is out of scope for EMMA. It is odd, though, to have the mechanisms of EMMA such as <emma:one-of> for representing ambiguity all the way up to NLU or multimodal fusion, but not to have access to the same apparatus for the next stage of processing, which is often search or database lookup. Just as we can use <emma:one-of> and emma:confidence to represent N-best recognitions or semantic interpretations, we can use them to represent a series of search results along with their relative confidence. One issue is whether we need some measure other than confidence for relevance ranking, or whether the same confidence attribute can be used.

One issue that arises is whether it would be useful to have some recommended or standardized element to use for query results, e.g. <result> as in the following example. Another issue is how to annotate information about the database and the query that was issued. The database could be indicated as part of the emma:process value, as in the following example. For web search, the query URL could be annotated on the result, e.g. <result url="http://cnn.com"/>. For database queries, the query, SQL for example, could be annotated on the results or on the containing <emma:group>.
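
One possible sketch of such query annotation, using the existing <emma:info> extensibility element to carry an application-specific <query> element; the element name, the SQL text, and the database name are all illustrative assumptions:

<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:one-of id="db_results0"
    emma:process="db:type=mysql&amp;database=directory.db">
    <!-- application-specific record of the query that produced these results -->
    <emma:info>
      <query lang="sql">
      SELECT name, room, number FROM employees WHERE name = 'john smith'
      </query>
    </emma:info>
    <emma:interpretation id="db_result1" emma:confidence="0.80">
      <result>
        <name>John Smith</name>
        <room>dx513</room>
      </result>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>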

The following example shows the use of EMMA to represent the results of database retrieval from an employee directory. The user says "John Smith". After ASR, NLU, and then database lookup, the system returns the XML below, which shows the N-best lists associated with each of these three stages of processing. Here <emma:derived-from> is used to indicate the relations between each of the <emma:one-of> elements. However, if you want to see which specific ASR result a record is derived from, you would need to put <emma:derived-from> on the individual elements.

Participant Input EMMA
user User says "John Smith"
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example">
  <emma:one-of id="db_results1"
    emma:process="db:type=mysql&database=personel_060109.db>
    <emma:interpretation id="db_nbest1"
      emma:confidence="0.80" emma:tokens="john smith">
        <result>
          <name>John Smith</name>
          <room>dx513</room>
          <number>123-456-7890</number>
        </result>
    </emma:interpretation>
    <emma:interpretation id="db_nbest2"
      emma:confidence="0.70" emma:tokens="john smith">
        <result>
          <name>John Smith</name>
          <room>ef312</room>
          <number>123-456-7891</number>
        </result>
    </emma:interpretation>
    <emma:interpretation id="db_nbest3"
      emma:confidence="0.50" emma:tokens="jon smith">
        <result>
          <name>Jon Smith</name>
          <room>dv900</room>
          <number>123-456-7892</number>
       </result>
    </emma:interpretation>
    <emma:interpretation id="db_nbest4"
      emma:confidence="0.40" emma:tokens="joan smithe">
        <result>
          <name>Joan Smithe</name>
          <room>lt567</room>
          <number>123-456-7893</number>
        </result>
    </emma:interpretation>
    <emma:derived-from resource="#nlu_results1/>
  </emma:one-of>
  <emma:derivation>
    <emma:one-of id="nlu_results1"
      emma:process="smm:type=nlu&version=parser">
      <emma:interpretation id="nlu_nbest1"
        emma:confidence="0.99" emma:tokens="john smith">
          <fn>john</fn><ln>smith</ln>
      </emma:interpretation>
      <emma:interpretation id="nlu_nbest2"
        emma:confidence="0.97" emma:tokens="jon smith">
          <fn>jon</fn><ln>smith</ln>
      </emma:interpretation>
      <emma:interpretation id="nlu_nbest3"
        emma:confidence="0.93" emma:tokens="joan smithe">
          <fn>joan</fn><ln>smithe</ln>
      </emma:interpretation>
      <emma:derived-from resource="#asr_results1/>
    </emma:one-of>
    <emma:one-of id="asr_results1"
      emma:medium="acoustic" emma:mode="voice"
      emma:function="dialog" emma:verbal="true"
      emma:lang="en-US" emma:start="1241641821513" 
      emma:end="1241641823033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&version=watson6">
        <emma:interpretation id="asr_nbest1"
          emma:confidence="1.00">
            <emma:literal>john smith</emma:literal>
        </emma:interpretation>
        <emma:interpretation id="asr_nbest2"
          emma:confidence="0.98">
            <emma:literal>jon smith</emma:literal>
        </emma:interpretation>
        <emma:interpretation id="asr_nbest3"
          emma:confidence="0.89" >
            <emma:literal>joan smithe</emma:literal>
        </emma:interpretation>
   </emma:one-of>
  </emma:derivation>
</emma:emma>

2.12 Supporting other semantic representation forms in EMMA

In the EMMA 1.0 specification, the semantic representation of an input is represented either in XML in some application namespace or as a literal value using emma:literal. In some circumstances it could be beneficial to allow for semantic representation in other formats such as JSON. Serializations such as JSON could potentially be contained within emma:literal using CDATA, with a new EMMA annotation, e.g. emma:semantic-rep, used to indicate the semantic representation language being used.

Example

Participant Input EMMA
user semantics of spoken input
<emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"> 
  <emma:interpretation id="int1"
    emma:confidence="0.75"
    emma:medium="acoustic" 
    emma:mode="voice" 
    emma:verbal="true"
    emma:function="dialog" 
    emma:semantic-rep="json"> 
      <emma:literal> 
        <![CDATA[
              {
           drink: {
              liquid:"coke",
              drinksize:"medium"},
           pizza: {
              number: "3",
              pizzasize: "large",
              topping: [ "pepperoni", "mushrooms" ]
           }
          } 
          ]]>
      </emma:literal> 
  </emma:interpretation> 
</emma:emma> 

General References

EMMA 1.0 Requirements http://www.w3.org/TR/EMMAreqs/

EMMA Recommendation http://www.w3.org/TR/emma/

Acknowledgements

Thanks to Jim Larson (W3C Invited Expert) for his contribution to the section on EMMA for multimodal output.