W3C's work on voice and multimodal interaction
Giving users a chance to choose between using their ears or
eyes, their voice or their fingers
Dave Raggett <dsr@w3.org>
W3C Fellow on assignment from Openwave Systems
Brief history of W3C's involvement in voice standards work
- The Web started as a visual medium with HTML and images etc.
- Aural style sheets made it possible to style HTML for rendering
to speech, while keeping the keyboard for input
- W3C got further involved in voice interaction through work on
accessibility for people with visual impairments
- Voice interaction is valuable for hands and eyes free operation,
and for easier input on small devices
- W3C workshop on voice browsers in October 1998
- Voice Browser working group set up in early 1999
- Speech interface framework requirements
- Specifications for VoiceXML, speech grammars, speech synthesis,
call control, ...
For more details, please see http://www.w3.org/Voice
Current status of Voice Browser working group
Requirements:
- Voice dialog requirements - 23 December 1999
- Speech synthesis requirements - 23 December 1999
- Speech grammar requirements - 23 December 1999
- Natural language processing requirements - 23 December 1999
- Pronunciation lexicon requirements - 12 March 2001
- Call control requirements - 13 April 2001
Specifications:
- VoiceXML 2.0 specification in Last Call (24 April 2002)
- Speech Grammar specification about to enter Candidate
Recommendation
- Speech Synthesis specification about to re-enter Last Call
- CCXML call control specification - first Working Draft (21
February 2002)
The working group is now due to be rechartered, and the W3C
Advisory Committee is being asked for its input on this.
A very quick look at VoiceXML
- VoiceXML is being deployed by wireless and wireline operators,
and by companies for various kinds of call centers
- Users dial up to connect to a voice browser running a VoiceXML
interpreter, which in turn contacts a web server to request the
corresponding VoiceXML documents, speech grammars, and audio prompts
- Developers are comfortable with markup and can apply their
skills with server-side scripts and back-end systems
- VoiceXML offers navigation links and form filling, with speech
recognition and DTMF (touch-tone) for input
- Result of recognition activates link or sets value of named
variables
- Judicious blend of declarative and procedural features
Speech Recognition
- Dictation systems now offer good results with training to the
user's voice, a good microphone and quiet conditions
- Speaker independent continuous speech recognition isn't as far
advanced
- Recognition accuracy is increased by supplying a context-free
grammar for what the user is expected to say (see the sketch after
this list)
- This requires carefully designed prompts to get the user to
speak within the specified grammar
- Alternative is to use word spotting, e.g. "bill" or "fault", as
in:
- I have a question about my bill
- I would like to report a fault
- The user can then be transferred to a customer service
representative or to a specific VoiceXML application
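The W3C speech grammar specification has an XML form. Here is a rough
sketch of a word-spotting grammar for the example above (the rule name
is illustrative; GARBAGE is the specification's special rule for
matching arbitrary surrounding speech):

<grammar xmlns="http://www.w3.org/2001/06/grammar"
    version="1.0" xml:lang="en" root="topic">
  <rule id="topic" scope="public">
    <!-- ignore whatever the caller says around the keyword -->
    <ruleref special="GARBAGE"/>
    <one-of>
      <item>bill</item>
      <item>fault</item>
    </one-of>
    <ruleref special="GARBAGE"/>
  </rule>
</grammar>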
Speech Synthesis
- Use of an actor or voice talent to record prompts
- Conventional formant speech synthesis sounds robotic, and is
tiring to listen to
- Non-uniform unit concatenative speech synthesis is much better
and is based on splicing together varying length fragments of
recorded human speech
Audio examples: formant-based synthesis (Festival) versus non-uniform
unit concatenative synthesis (Rhetorical)
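The W3C Speech Synthesis specification defines markup for controlling
how text is rendered to speech. A minimal sketch (the prosody and
break values are purely illustrative):

<speak version="1.0"
    xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  Your order will be ready in <emphasis>five</emphasis> minutes.
  <break time="500ms"/>
  <prosody rate="slow">Thank you for calling.</prosody>
</speak>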
What does VoiceXML look like?
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
    xml:lang="en">
  <form>
    <!-- collect the destination city using an external grammar -->
    <field name="city">
      <prompt>Where do you want to travel to?</prompt>
      <grammar src="travel.xml#cities"
          type="application/grammar+xml"/>
    </field>
    <!-- the built-in number grammar handles this field -->
    <field name="travellers" type="number">
      <prompt>How many are travelling to
          <value expr="city"/>?</prompt>
    </field>
    <!-- once both fields are filled, send the values to the server -->
    <block>
      <submit next="http://www.acmetravel/bookings"
          namelist="city travellers"/>
    </block>
  </form>
</vxml>
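The interpreter's form interpretation algorithm visits each field
that does not yet have a value, playing its prompt and listening for
input matching its grammar; the final block then submits the
collected values to the web server.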
What is it like to use a VoiceXML application?
Here is an example:
Computer: Welcome to Joe's Pizza ordering service
Computer: Select pizza size from large, medium
or small?
User: large
Computer: what number of these pizzas do you want?
User: two
Computer: Select first topping from mozzarella,
pepperoni and anchovies?
User: mozzarella
Computer: Do you want another topping, yes or no?
User: yes
Computer: Select second topping from mozzarella,
pepperoni and anchovies?
User: pepperoni
Computer: Do you want any other pizzas, yes or no?
...
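A field like the size question can be written with inline options in
place of an external grammar; a sketch:

<field name="size">
  <prompt>Select pizza size from large, medium or small?</prompt>
  <option>large</option>
  <option>medium</option>
  <option>small</option>
</field>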
Recovering from problems
Prompts are carefully designed to elicit simple responses. If the
user fails to respond, or the response can't be understood, the
application provides tapered prompts, for example:
Computer: what number of these pizzas do you want?
User: I reckon, er, two would do the job
Computer: please say the number on its own
User: two
Computer: Select first topping from mozzarella,
pepperoni and anchovies?
...
Recognition confidence scores can be used to determine when to
ask for confirmation, as in the sketch below
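One common pattern tests the field's confidence shadow variable and
only visits a confirmation field when the recognizer is unsure (a
sketch; the 0.5 threshold is purely illustrative):

<field name="city">
  <prompt>Where do you want to travel to?</prompt>
  <grammar src="travel.xml#cities"
      type="application/grammar+xml"/>
  <filled>
    <!-- skip confirmation when the recognizer is confident -->
    <if cond="city$.confidence &gt; 0.5">
      <assign name="confirm" expr="true"/>
    </if>
  </filled>
</field>
<field name="confirm" type="boolean">
  <prompt>Did you say <value expr="city"/>?</prompt>
  <filled>
    <if cond="!confirm">
      <!-- start over on both fields -->
      <clear namelist="city confirm"/>
    </if>
  </filled>
</field>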
Tapered Prompts
Offer progressively more guidance on each turn:
<field name="travellers" type="number">
<prompt count="1">
How many are travelling to <value expr="city"/>?
</prompt>
<prompt count="2">
Please tell me the number of people travelling.
</prompt>
<prompt count="3">
To book a flight, you must tell me the number
of people travelling to <value expr="city"/>.
</prompt>
<nomatch>
<prompt>Please say just a number.</prompt>
<reprompt/>
</nomatch>
</field>
For more examples:
http://www.w3.org/Voice/Guide/
Richer use of natural language
The dialog gets the job done, but is very rigid. With larger
grammars, a more natural interaction style becomes possible, for example:
Computer: Welcome to Joe's Pizza
Computer: What would you like?
User: I would like two large pizzas with mozzarella
and one small pizza with tomatoes and anchovies
Computer: would you like any drinks with that?
User: Sure, 3 large diet cokes, oh and add
pepperoni to the large pizzas
Computer: Is that all?
User: yes
Computer: Okay, that will be ready for you
in 5 minutes
User: thanks
This is impractical today, due to the
difficulties in determining all the ways people could respond.
More flexible natural language understanding is needed to deal
with this.
Multimodal Interaction
Why?
- Allow people to choose when to use their eyes, hands, ears and
mouths
- Speaking is so much easier than thumbing in text
- Sometimes speech is not the answer - too noisy, or it's
inappropriate to speak
- Complementing the transient nature of speech with longer-lived
visual information
- Ink: The use of a stylus for text input, gestures, specialized
notations such as math, music and chemistry, and for diagrams and
artwork
What?
- Name dialling - see who you want to call
- Seeing and hearing as means to browse messages
- Driving directions
- Search, e.g. for flight information
- ...
Multimodal must work on low-end phones!
A multimodal interface to the Web in every pocket!
- Cell phones need to be affordable to people from all walks of
life, which constrains the price
- Small physical size, and desire for long battery life further
constrains memory capacity and processor speed
- Cell phones are expected to support an increasing number of
specifications: XHTML, CSS, JPEG, SVG, SMIL, Java, MP3, scripting
and more ...
- It is therefore essential to be able to offload some of the
burden to the network
- A greater emphasis on the network will also scale better,
allowing for greater flexibility and more innovation without
requiring users to upgrade their phones to enjoy the benefits
W3C multimodal working group charter
- Multimodal work started in Voice Browser Working Group
- Joint W3C/WAP Forum workshop in Hong Kong (September 2000)
- Multimodal Interaction Working Group was formed in February 2002
- Primary focus on 2.5G and 3G mobile networks
- Markup for synchronization across modalities and devices with a
wide range of capabilities
- Support for local and remote speech services (recognition,
prompts and verification)
- Support for local and remote processing of ink entered using
stylus, pen or imaging device
- Support for combination of local visual system and remote dialog
system
- Support for coordination across multiple devices
Technical directions under consideration
- Microsoft-led SALT Forum proposes speech extensions to HTML, and
assumes reasonably powerful devices - PDAs and up
- IBM, Motorola and Opera Software propose XHTML+Voice, which
combines XHTML with VoiceXML for PDAs and up (see the sketch after
this list)
- IBM, Intel and Motorola working on XML format for electronic
ink, called InkXML
- W3C working group is looking at extending W3C event model
across network using XML, but independent of underlying transport
protocol
- Working group is expected to define standard for output
from speech/ink recognizers, plus framework for natural language
understanding
- Coordination with other W3C work, e.g. on XHTML, XForms, SMIL,
SVG, CSS, and Voice Browsers etc.
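As a rough sketch of the XHTML+Voice approach (the markup here is
simplified and illustrative, not verbatim from the published
profile): a VoiceXML form declared in the document head is bound to a
visual field through XML Events, so that focusing the field starts
the spoken dialog.

<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:vxml="http://www.w3.org/2001/vxml"
    xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Where to?</title>
    <!-- voice handler: asks the user for the city by voice -->
    <vxml:form id="askcity">
      <vxml:field name="city">
        <vxml:prompt>Where do you want to travel to?</vxml:prompt>
        <vxml:grammar src="travel.xml#cities"/>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <form action="http://www.acmetravel/bookings">
      <p>City: <input type="text" name="city"
          ev:event="focus" ev:handler="#askcity"/></p>
    </form>
  </body>
</html>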
Relationship with work in other organizations
- W3C work is intended to be complementary to other
organizations
- SIP for setting up multimodal sessions and providing control
throughout the session
- SIP Events are a natural fit for W3C work on distributing XML
events - a register/notify model, though this should be hidden from
application authors
- CATS (formerly MRCP) is the name of new IETF work on a protocol
for control of remote speech recognition, prompt and verification
services
- ETSI Aurora work on increasing resilience of speech recognizers
against transmission fade-outs and aural noise
- Device based preprocessing to address aural noise
- Codec optimized for speech recognition
- Work now picked up by 3GPP along with related work on transport
issues for multimodal services
- ETSI work on standard command and control vocabulary with
bindings for European languages
- WAP Forum may start work on interoperability issues for
multimodal aligned with W3C's work
Last thoughts
- Voice has the potential to be a natural interface to the Web,
opening it up to people with visual impairments
- Designing for voice requires a new set of skills
- Initially voice interfaces will be limited to simple prompted
responses, but over time richer and more natural interaction styles
will become possible
- Electronic ink offers exciting new possibilities for the Web
- Multimodal has the potential to allow you to choose whether you
want to use your voice or your hands and eyes, but this relies on
the application designer taking this into account
- What is the relationship between multimodal and device
independence?
- How does natural language interaction relate to the Semantic
Web?
Questions?