Multimodal Interaction

In the 90s came multi-output (aka multimedia)...

In the 00s comes multi-input (aka multimodal)

But multimodal is also about combining input and output, not just adding input modes.

MMI at W3C

Why W3C? W3C doesn't do browsers!

A whole framework

A big working group, with many participants from the VoiceXML/SRGS/SSML groups.

The MMI Framework

Goals

Standards are as important in the MMI area as in voice applications, because software components come from very different areas of expertise.

MMI Framework

[Diagram: the MMI framework]

Looking closer: output

[Diagram: the output side of the MMI framework]

Looking closer: input

[Diagram: the input side of the MMI framework]

EMMA

Goal

[Diagram: the EMMA process]

EMMA Payload

[Diagram: the EMMA process]

EMMA Syntaxes

Open Issue

The Multimodal Interaction Working Group is currently considering the role of RDF in EMMA syntax and processing...

Example

"inline" syntax:

<emma:emma emma:version="1.0"
 xmlns:emma="http://www.w3.org/2003/04/emma#"> 
  <emma:one-of emma:id="r1" 
      emma:start="2003-03-26T0:00:00.15"
      emma:end="2003-03-26T0:00:00.2">
    <emma:interpretation emma:id="int1" emma:confidence="0.75" > 
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date> 
         <emma:absolute-timestamp
          emma:start="2003-03-26T0:00:00.15"
          emma:end="2003-03-26T0:00:00.2"/> 
         03112003 
      </date>  
    </emma:interpretation>
    <emma:interpretation emma:id="int2" emma:confidence="0.68" >
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

RDF Syntax

The same result, with the annotations factored out into an rdf:RDF block that refers back to the interpretations by id:

<emma:emma emma:version="1.0"
 xmlns:emma="http://www.w3.org/2003/04/emma#"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> 
  <emma:one-of emma:id="r1">
    <emma:interpretation emma:id="int1">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation emma:id="int2">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>

  <rdf:RDF>
    <!-- time stamp for result -->
    <rdf:Description rdf:about="#r1" 
      emma:start="2003-03-26T0:00:00.15"
      emma:end="2003-03-26T0:00:00.2"/>

    <!-- confidence score for first interpretation -->
    <rdf:Description rdf:about="#int1" emma:confidence="0.75"/>

    <!-- confidence score for second interpretation -->
    <rdf:Description rdf:about="#int2" emma:confidence="0.68"/>

    <!-- time stamps for date in first interpretation -->
     <rdf:Description rdf:about="#emma(id('int1')/date)"> 
        <emma:absolute-timestamp
         emma:start="2003-03-26T0:00:00.15"
         emma:end="2003-03-26T0:00:00.2"/> 
     </rdf:Description>  
  </rdf:RDF>
</emma:emma>

The structure of EMMA files

1. Data Model

Describes the structure of the expected results, e.g.

<xsd:element name="origin" type="xsd:string"/>

Can be XML Schema, DTD, RelaxNG, XForms, ...

Linked to from the EMMA document:

<emma:model ref="http://www.example.com/models/travel/flight.xsd"/>

2. Instance Data

<emma:emma emma:version="1.0"
 xmlns:emma="http://www.w3.org/2003/04/emma#"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> 
  <emma:one-of emma:id="r1">
    <emma:interpretation emma:id="int1">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation emma:id="int2">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>

  <rdf:RDF>
    <!-- time stamp for result -->
    <rdf:Description rdf:about="#r1" 
      emma:start="2003-03-26T0:00:00.15"
      emma:end="2003-03-26T0:00:00.2"/>

    <!-- confidence score for first interpretation -->
    <rdf:Description rdf:about="#int1" emma:confidence="0.75"/>

    <!-- confidence score for second interpretation -->
    <rdf:Description rdf:about="#int2" emma:confidence="0.68"/>

    <!-- time stamps for date in first interpretation -->
     <rdf:Description rdf:about="#emma(id('int1')/date)"> 
        <emma:absolute-timestamp
         emma:start="2003-03-26T0:00:00.15"
         emma:end="2003-03-26T0:00:00.2"/> 
     </rdf:Description>  
  </rdf:RDF>
</emma:emma>

3. Annotations

<emma:emma emma:version="1.0"
 xmlns:emma="http://www.w3.org/2003/04/emma#"
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> 
  <emma:one-of emma:id="r1">
    <emma:interpretation emma:id="int1">
      <origin>Boston</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
    <emma:interpretation emma:id="int2">
      <origin>Austin</origin>
      <destination>Denver</destination>
      <date>03112003</date>
    </emma:interpretation>
  </emma:one-of>

  <rdf:RDF>
    <!-- time stamp for result -->
    <rdf:Description rdf:about="#r1" 
      emma:start="2003-03-26T0:00:00.15"
      emma:end="2003-03-26T0:00:00.2"/>

    <!-- confidence score for first interpretation -->
    <rdf:Description rdf:about="#int1" emma:confidence="0.75"/>

    <!-- confidence score for second interpretation -->
    <rdf:Description rdf:about="#int2" emma:confidence="0.68"/>

    <!-- time stamps for date in first interpretation -->
     <rdf:Description rdf:about="#emma(id('int1')/date)"> 
        <emma:absolute-timestamp
         emma:start="2003-03-26T0:00:00.15"
         emma:end="2003-03-26T0:00:00.2"/> 
     </rdf:Description>  
  </rdf:RDF>
</emma:emma>

Other annotations: a reference to an external grammar, the input tokens, a reference to the process that produced the result, a reference to the raw signal, and time stamps. Each input is further classified along four axes (see the sketch after this list):

medium: acoustic, tactile or visual

mode: speech, dtmf, ink, gui, ...

function: recording, transcription, dialog, verification

verbal: true for input that produces text, false for input that doesn't
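
As a sketch, here is how these classifications could be attached inline to an interpretation; the attribute names (emma:medium, emma:mode, emma:function, emma:verbal, emma:tokens, emma:signal) follow later EMMA drafts and, like the signal URI, are assumptions here rather than part of this material:

<emma:emma emma:version="1.0"
 xmlns:emma="http://www.w3.org/2003/04/emma#">
  <!-- a spoken, verbal, dialog input with its tokens and raw signal -->
  <emma:interpretation emma:id="int1"
      emma:medium="acoustic"
      emma:mode="speech"
      emma:function="dialog"
      emma:verbal="true"
      emma:tokens="from boston to denver"
      emma:signal="http://example.com/signals/input1.wav"
      emma:start="2003-03-26T00:00:00.15"
      emma:end="2003-03-26T00:00:00.2">
    <origin>Boston</origin>
    <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>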

3.1 Derivation Chain

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="raw">
    <answer>From Boston to Denver tomorrow</answer>
  </emma:interpretation>

  <emma:interpretation id="better">
    <emma:derived-from resource="#raw" />
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>tomorrow</date>
  </emma:interpretation>

  <emma:interpretation id="best">
    <emma:derived-from resource="#better" />
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>20030315</date>
  </emma:interpretation>
</emma:emma>

3.2 Misc

Examples of the recording, dialog, transcription and verification functions for each medium / device / mode combination:

acoustic, microphone, speech:
  recording: audiofile (e.g. voicemail)
  dialog: spoken command / query / response (verbal = true); singing a note (verbal = false)
  transcription: dictation
  verification: speaker recognition

tactile, keypad, dtmf:
  recording: audiofile / character stream
  dialog: typed command / query / response (verbal = true); command key, "Press 9 for sales" (verbal = false)
  transcription: text entry (T9-tegic, word completion, or word grammar)
  verification: password / pin entry

tactile, keyboard, keys:
  recording: character / key-code stream
  dialog: typed command / query / response (verbal = true); command key, "Press S for sales" (verbal = false)
  transcription: typing
  verification: password / pin entry

tactile, pen, ink:
  recording: trace, sketch
  dialog: handwritten command / query / response (verbal = true); gesture, e.g. circling a building (verbal = false)
  transcription: handwritten text entry
  verification: signature, handwriter recognition

tactile, pen, gui:
  recording: N/A
  dialog: tapping on a named button (verbal = true); drag and drop, tapping on a map (verbal = false)
  transcription: soft keyboard
  verification: password / pin entry

tactile, mouse, ink:
  recording: trace, sketch
  dialog: handwritten command / query / response (verbal = true); gesture, e.g. circling a building (verbal = false)
  transcription: handwritten text entry
  verification: N/A

tactile, mouse, gui:
  recording: N/A
  dialog: clicking a named button (verbal = true); drag and drop, clicking on a map (verbal = false)
  transcription: soft keyboard
  verification: password / pin entry

tactile, joystick, ink:
  recording: trace, sketch
  dialog: gesture, e.g. circling a building (verbal = false)
  transcription: N/A
  verification: N/A

tactile, joystick, gui:
  recording: N/A
  dialog: pointing, clicking a button / menu (verbal = false)
  transcription: soft keyboard
  verification: password / pin entry

visual, page scanner, photograph:
  recording: image
  dialog: handwritten command / query / response (verbal = true); drawings and images (verbal = false)
  transcription: optical character recognition, object/scene recognition (markup, e.g. SVG)
  verification: N/A

visual, still camera, photograph:
  recording: image
  dialog: objects (verbal = false)
  transcription: visual object/scene recognition
  verification: face id, retinal scan

visual, video camera, video:
  recording: movie
  dialog: sign language (verbal = true); face / hand / arm / body gesture, e.g. pointing, facing (verbal = false)
  transcription: audio/visual recognition
  verification: face id, gait id, retinal scan

Compositing input

How does a multimodal application author specify what composition rules to apply?

No work released yet.

Two candidate approaches: procedural or declarative.
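
Since no composition markup has been released, the following is purely illustrative: a sketch of what a composed result might look like if emma:derived-from (section 3.1) were allowed to point at several inputs. The application elements (point, origin, destination) are hypothetical:

<emma:emma emma:version="1.0"
 xmlns:emma="http://www.w3.org/2003/04/emma#">
  <!-- speech input: "from here to there" (slots left empty) -->
  <emma:interpretation emma:id="speech1">
    <origin/>
    <destination/>
  </emma:interpretation>

  <!-- pen input: two taps on a map -->
  <emma:interpretation emma:id="pen1">
    <point>Boston</point>
    <point>Denver</point>
  </emma:interpretation>

  <!-- composed result, derived from both inputs -->
  <emma:interpretation emma:id="comp1">
    <emma:derived-from resource="#speech1"/>
    <emma:derived-from resource="#pen1"/>
    <origin>Boston</origin>
    <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>

How an author would declare that the two point elements fill the empty origin and destination slots is precisely the open question.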

The Big Issue: RDF or not?

Extensible Multimodal Annotation Language

RDF Syntax: pros

At W3C, annotations are expressed in RDF

RDF Syntax: cons

So?

A decision has been made...

Future work? Lattices

[Diagram: example of a lattice]

Instead of a linear n-best list:

<departure>boston</departure><destination>portland</destination><date>today</date>
<departure>austin</departure><destination>portland</destination><date>today</date>
<departure>boston</departure><destination>oakland</destination><date>today</date>
<departure>austin</departure><destination>oakland</destination><date>today</date>
<departure>boston</departure><destination>portland</destination><date>tomorrow</date>
<departure>austin</departure><destination>portland</destination><date>tomorrow</date>
<departure>boston</departure><destination>oakland</destination><date>tomorrow</date>
<departure>austin</departure><destination>oakland</destination><date>tomorrow</date>

Represent the actual graph in the EMMA instance.
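
One possible in-instance encoding, sketched below: numbered nodes with word-labelled arcs between them. The emma:lattice / emma:arc markup is an assumption at this point (similar markup appeared in later EMMA drafts):

<emma:emma emma:version="1.0"
 xmlns:emma="http://www.w3.org/2003/04/emma#">
  <emma:interpretation emma:id="lat1">
    <!-- 4 nodes, 6 arcs: 2 x 2 x 2 = the same 8 readings as the n-best list -->
    <emma:lattice initial="1" final="4">
      <emma:arc from="1" to="2">boston</emma:arc>
      <emma:arc from="1" to="2">austin</emma:arc>
      <emma:arc from="2" to="3">portland</emma:arc>
      <emma:arc from="2" to="3">oakland</emma:arc>
      <emma:arc from="3" to="4">today</emma:arc>
      <emma:arc from="3" to="4">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>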

Graph => RDF!

Difficult

What about this?

[Diagram: an alternate lattice representation]

Leading to... RDF!

[Diagram: the alternate lattice model]

What does that give us? We can write ontologies with emma:followed-by and perhaps other properties.
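
A sketch of what such an RDF encoding could look like, using the emma:followed-by property mentioned above; the word-node URIs are hypothetical:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:emma="http://www.w3.org/2003/04/emma#">
  <!-- each word hypothesis is a resource; followed-by edges form the lattice -->
  <rdf:Description rdf:about="#boston">
    <emma:followed-by rdf:resource="#portland"/>
    <emma:followed-by rdf:resource="#oakland"/>
  </rdf:Description>
  <rdf:Description rdf:about="#portland">
    <emma:followed-by rdf:resource="#today"/>
    <emma:followed-by rdf:resource="#tomorrow"/>
  </rdf:Description>
  <!-- ...and similarly for austin, oakland, today, tomorrow -->
</rdf:RDF>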

RDF/XML? Hmm...

Conclusion

EMMA is yet another example of how XML technologies can be used in many contexts, here multimodal interaction and NLP.

Still, lots of options remain: RDF, RDF/XML, XSLT/SMIL, DOM-IDL/WSDL, maybe more...

Multimodal interaction on the Web is a new problem, and it connects to many W3C activities: not just XML, but also new devices (XHTML, CC/PP, SVG/CSS), accessibility, internationalisation, Web services, etc.