W3C's work on voice and multimodal interaction

Giving users a chance to choose between using their ears or eyes, their voice or their fingers

Dave Raggett <dsr@w3.org>

W3C Fellow on assignment
from Openwave Systems





background photo by Gerald Saunders, see www.wetsand.com

Brief history of W3C's involvement in voice standards work

For more details, please see http://www.w3.org/Voice

Current status of Voice Browser working group

Requirements:

Specifications:

The working group is now due to be rechartered, and the W3C Advisory Committee is being asked for its input.

A very quick look at VoiceXML

VoiceXML browser

Speech Recognition

Speech Synthesis

Formant-based synthesis (Festival):

Concatenative synthesis (Rhetorical):

What does VoiceXML look like?

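<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xml:lang="en"
      xmlns="http://www.w3.org/2001/vxml">
  <form>
    <!-- Ask for the destination; valid answers come from the cities grammar -->
    <field name="city">
      <prompt>Where do you want to travel to?</prompt>
      <grammar src="travel.xml#cities"
               type="application/srgs+xml"/>
    </field>

    <!-- Built-in number grammar; the prompt echoes the recognized city -->
    <field name="travellers" type="number">
      <prompt>How many are travelling to
        <value expr="city"/>?</prompt>
    </field>

    <!-- Send both answers to the booking service -->
    <block>
      <submit next="http://www.acmetravel/bookings"
              namelist="city travellers"/>
    </block>
  </form>
</vxml>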
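
The city field above points at a rule named "cities" in travel.xml. As a rough sketch of what such a grammar file might contain, assuming the XML form of the W3C Speech Recognition Grammar Specification, it could look like the following. The city names are purely illustrative and are not taken from the original application.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical contents of travel.xml; the list of cities is an assumption -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en" mode="voice" root="cities">
  <!-- Referenced from VoiceXML as travel.xml#cities -->
  <rule id="cities" scope="public">
    <one-of>
      <item>Boston</item>
      <item>London</item>
      <item>Paris</item>
    </one-of>
  </rule>
</grammar>
```

The recognizer only accepts utterances that match one of the listed items, which is why directed prompts that elicit a single city name work well with a grammar this small.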

What is it like to use a VoiceXML application?

Here is an example:

Computer: Welcome to Joe's Pizza ordering service
Computer: Select pizza size from large, medium
 or small?
User:  large
Computer: what number of these pizzas do you want?
User: two
Computer: Select first topping from mozzarella,
 pepperoni and anchovies?
User: mozzarella
Computer: Do you want another topping, yes or no?
User: yes
Computer: Select second topping from mozzarella,
 pepperoni and anchovies?
User: pepperoni
Computer: Do you want any other pizzas, yes or no?
...   

Recovering from problems

Prompts are carefully designed to elicit simple responses. If the user fails to respond, or the response can't be understood, the application provides tapered prompts, for example:

Computer: what number of these pizzas do you want?
User: I reckon, er, two would do the job
Computer: please say the number on its own
User: two
Computer: Select first topping from mozzarella,
pepperoni and anchovies?
...   

Recognition confidence scores can be used to determine when to ask for confirmation
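
One way this could be expressed in VoiceXML 2.0 is sketched below, using the confidence shadow variable (city$.confidence, a value between 0.0 and 1.0) to decide whether a yes/no confirmation field is visited. The threshold of 0.5, the field name city_ok and the prompt wording are assumptions chosen for illustration; a real application would tune the threshold for its recognizer.

```xml
<form id="booking">
  <field name="city">
    <prompt>Where do you want to travel to?</prompt>
    <grammar src="travel.xml#cities" type="application/srgs+xml"/>
  </field>

  <!-- Hypothetical confirmation step: only visited when the
       recognizer's confidence in the city is below 0.5 (assumed value) -->
  <field name="city_ok" type="boolean"
         cond="city$.confidence &lt; 0.5">
    <prompt>Did you say <value expr="city"/>?</prompt>
    <filled>
      <!-- On "no", forget both answers so the city is asked again -->
      <if cond="!city_ok">
        <clear namelist="city city_ok"/>
      </if>
    </filled>
  </field>
</form>
```

When the recognizer is confident, the confirmation field is skipped entirely, so users only pay the cost of an extra turn when a misrecognition is likely.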

Tapered Prompts

Offer progressively more guidance on each turn:

<field name="travellers" type="number">
  <prompt count="1">
    How many are travelling to <value expr="city"/>? 
  </prompt>

  <prompt count="2">
    Please tell me the number of people travelling.
  </prompt>

  <prompt count="3">
    To book a flight, you must tell me the number
    of people travelling to <value expr="city"/>.
  </prompt>

  <nomatch>
    <prompt>Please say just a number.</prompt>
    <reprompt/>
  </nomatch>
</field>
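
The example above covers responses that can't be understood. For the other failure case, where the user says nothing at all, a similar sketch using noinput handlers might look like this. The handler wording is an assumption for illustration; the count attribute selects the handler according to how many times the event has occurred.

```xml
<field name="travellers" type="number">
  <prompt>How many are travelling to <value expr="city"/>?</prompt>

  <!-- First silence: a short nudge, then replay the field prompt -->
  <noinput count="1">
    <prompt>Sorry, I didn't hear you.</prompt>
    <reprompt/>
  </noinput>

  <!-- Second and later silences: fuller guidance with an example -->
  <noinput count="2">
    <prompt>
      Please say the number of people travelling, for example, two.
    </prompt>
  </noinput>
</field>
```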

For more examples: http://www.w3.org/Voice/Guide/

Richer use of natural language

The directed dialog shown earlier gets the job done, but is very rigid. With larger grammars, a more natural interaction style becomes possible, for example:

Computer: Welcome to Joe's Pizza
Computer: What would you like?
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
 pepperoni to the large pizzas
Computer: Is that all?
User: yes
Computer: Okay, that will be ready for you
 in 5 minutes
User: thanks

This is impractical today because of the difficulty of anticipating all the ways people could respond. More flexible natural language understanding is needed to deal with this.

Multimodal Interaction

Why?

What?

Multimodal must work on low-end phones!

A multimodal interface to the Web in every pocket!

W3C multimodal working group charter

Technical directions under consideration

Relationship with work in other organizations

Last thoughts

Questions?