W3C Voice Browsing WG - Multimodal Dialog Requirements and Specification

Marianne Hickey, HP Labs

W3C Voice Browsing Activity, Multimodal subgroup

Goal: language spec for multimodal dialog interaction

Speech + other interaction modes
Focus: speech + small screen, pointing device, buttons
Reuse other mark-up
Multiple input/output modalities available simultaneously (co-ordinating them is a lower priority)

Status

May ’00: Paris face to face – show and tell multimodal demos
July ’00: requirements for multimodal dialog interaction http://www.w3.org/TR/multimodal-reqs
Sept ’00: Review draft Language Specification, W3C f2f

Language specification – possible approaches

1: Include elements for synchronisation within the documents
2: XML data structure specifies relationship between documents

Example dialog:

Computer, C:  says "Would you like coffee, tea, milk or nothing?"
and displays pictures of the choices

User, U:   "coffee" or clicks a picture

C:  "Would you like a cookie, a cake, a sandwich, or nothing?"
and displays pictures of the choices

U:  clicks on "sandwich" or says "sandwich"

C:  "Thank you for using the food and drink service!"

Proposed elements for multimodality

| Element | Attributes | Description |
|---|---|---|
| DialogML elements: <filled>, <noinput>, <menu>, ... | | Apply to input from any modality |
| SMIL elements: <par>, <seq>, <excl> | | |
| <show> | src: URI to show; id: identifier | Show content, e.g. display the content of a URL in a browser window |
| <update> | id; expr | Update with key-value pairs; expr: key=val(;key=val)* |
| <listen> | id; wait_before | Wait for input |
| <close> | id | Close the window |
| <input-sync-excl> | | Take the first input |
| <input-synch-all> | start: start of sync window; end: end of sync window | Co-ordinate inputs from different modalities |
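
The table lists <input-synch-all>, but the examples in this deck use only <input-sync-excl>. The following is a minimal sketch of how <input-synch-all> might be used to combine a spoken phrase with a click. It is not taken from the draft specification. The attribute value format (SMIL-style clock values such as "2s"), the interpretation of the window as starting at the first input, and the prompt wording are all assumptions made for illustration.

```xml
<!-- Hypothetical sketch, not from the draft specification. -->
<form>
  <field name="drink">
    <show src="http://www.drinkfood.ex/drinkfood.html" id="drinkfood"/>
    <prompt>Say "I would like this one" and click a picture.</prompt>
    <!-- Assumed: start and end take SMIL-style clock values,
         measured from the first input that arrives. -->
    <input-synch-all start="0s" end="2s">
      <grammar src="drink.gram" type="application/x-jsgf"/>
      <listen id="drinkfood"/>
    </input-synch-all>
    <filled>
      <update namelist="drink" id="drinkfood"/>
    </filled>
  </field>
</form>
```

Under these assumptions, speech and click inputs that arrive inside the same two-second window would be co-ordinated into one result for the field, whereas <input-sync-excl> would simply take whichever input came first.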

Example with embedded synchronization

<form>
  <field name="drink">
    <show src="http://www.drinkfood.ex/drinkfood.html"
      id="drinkfood"/>
    <prompt>Would you like coffee, tea, milk,
      or nothing?</prompt>
    <input-sync-excl>
         <grammar src="drink.gram" type="application/x-jsgf"/>
         <listen id="drinkfood"/>
    </input-sync-excl>
    <filled>
       <update namelist="drink" id="drinkfood"/> 
       <par> 
           <show src="advert.html" id="advert"/> 
           <audio src="advert.au" begin="5s" dur="10s"/>
       </par> 
    </filled>
  </field>
  <field name="food">
    <prompt>Would you like a cookie, a cake, a sandwich,
    or nothing?</prompt>
    <input-sync-excl>
         <grammar src="food.gram" type="application/x-jsgf"/>
         <listen id="drinkfood"/>
    </input-sync-excl>    
    <filled>
       <update namelist="food" id="drinkfood"/>
    </filled>
  </field>
  <block>
     <prompt>Thank you for using the food and drink
       service!</prompt>
     <close id="drinkfood"/>
     <submit next="http://www.drinkfood.ex/drinkfood.asp" 
       namelist="drink food"/>
  </block> 
</form>

Example with separate synchronization

Assume there are two voice dialogs, one for drinks and one for food, each with a matching HTML file. For brevity, only the voice dialog files are listed:

drink.vxml:
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <field name="drink">
      <prompt>Would you like coffee, tea, milk,
        or nothing?</prompt>
      <grammar src="drink.gram" type="application/x-jsgf"/>
    </field>
    <block>
      <submit next="http://www.drinkfood.example/drink.asp"
        namelist="drink"/>
    </block>
  </form>
</vxml>
 
food.vxml:
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <field name="food">
      <prompt>Would you like a sandwich, cake, cookie
        or nothing?</prompt>
      <grammar src="food.gram" type="application/x-jsgf"/>
    </field>
    <block>
      <submit next="http://www.drinkfood.example/food.asp"
        namelist="food"/>
    </block>
  </form>
</vxml>

The following markup synchronizes each voice dialog with its matching HTML file:

<multimodal>
  <input-sync-excl>
    <show src="drink.vxml"/>
    <show src="drink.html"/>
  </input-sync-excl>
  <input-sync-excl>
    <show src="food.vxml"/>
    <show src="food.html"/>
  </input-sync-excl>
</multimodal>
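
The matching HTML files are not listed in this deck. As a rough illustration only, drink.html might look something like the sketch below, written as well-formed XML. The page structure, the use of radio buttons, and the choice to reuse the control name "drink" and the same submit target as drink.vxml are assumptions, not part of the original material.

```xml
<!-- drink.html: hypothetical sketch, not from the original slides. -->
<html>
  <head>
    <title>Drinks</title>
  </head>
  <body>
    <form action="http://www.drinkfood.example/drink.asp" method="post">
      <p>Would you like coffee, tea, milk, or nothing?</p>
      <!-- Assumed: the control name "drink" mirrors <field name="drink">
           in drink.vxml so both modalities fill the same value. -->
      <p>
        <input type="radio" name="drink" value="coffee"/> coffee
        <input type="radio" name="drink" value="tea"/> tea
        <input type="radio" name="drink" value="milk"/> milk
        <input type="radio" name="drink" value="nothing"/> nothing
      </p>
      <p>
        <input type="submit" value="OK"/>
      </p>
    </form>
  </body>
</html>
```

With a page of this shape, each <input-sync-excl> group in the <multimodal> document would pair a voice dialog and a visual page that collect the same value, and whichever modality answers first would supply it.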