
From Voice Browsers to Multimodal Systems
The W3C Speech Interface Framework
Dave Raggett <dsr@w3.org>, W3C Lead for
Voice/Multimodal. Hong Kong, 1st May 2001
Voice – The Natural Interface available from over a
billion phones
Personal assistant functions:
- Name dialing and Search
- Personal Information Management
- Unified Messaging (mail, Fax & IM)
- Call screening & call routing
Voice Portals
- Access to news, information, entertainment, customer service
and V-commerce (e.g. Find a friend, Wine Tips, Flight info, Find a
hotel room, Buy ringing tones, Track a shipment)
Front-ends for Call Centers
- 90% cost savings over human agents
- Reduced call abandonment rates (IVR)
- Increased customer satisfaction
W3C Voice Browser Working Group
Mission
- Prepare and review markup languages to enable Internet-based
speech applications
- Has published requirements and specifications for languages
in the W3C Speech Interface Framework
- Is now due to be re-chartered with a clarified IP policy
Voice Browser working group members
Alcatel
AnyDevice
Ask Jeeves
AT&T
Avaya
BeVocal
Brience
BT
Canon
Cisco
Comverse
Conversay
EDF
France Telecom
General Magic
Hitachi
HP
IBM
Informio
Intel
IsSound
Lernout & Hauspie
Locus Dialogue
Lucent
Microsoft
Milo
Mitre
Motorola
Nokia
Nortel Networks
Nuance
Philips
Openwave
PipeBeach
SpeechHost
SpeechWorks
Sun Microsystems
Telecom Italia
Telera
Tellme
Unisys
Verascape
VoiceGenie
Voxeo
VoxSurf
Yahoo
W3C Speech Interface Framework
The Voice Browser working group is developing a suite of
specifications. The following diagram shows how each of the
specifications fits into the general framework for voice
browsers:
[Diagram: the W3C Speech Interface Framework]
Published Documents

Voice User Interfaces and VoiceXML
Why use voice as a user interface?
- Far more phones than PCs
- More wireless phones than PCs
- Hands- and eyes-free operation
Why do we need a language for specifying voice dialogs?
- High-level language simplifies application development
- Separates Voice interface from Application server
- Leverage existing Web application development tools
What does VoiceXML describe?
- Conversational dialogs: system and user take turns speaking
- Dialogs based on form-filling metaphor plus events and
links
W3C aims to standardize VoiceXML based upon the VoiceXML 1.0
submission by AT&T, IBM, Lucent and Motorola
VoiceXML Architecture
Brings the power of the Web to Voice

Reaching Out to Multiple Channels

VoiceXML Features
VoiceXML includes the following:
- Menus, Forms and Sub-dialogs
- Hooks for speech grammars, DTMF input and audio
recording
- Variables and scripting
- Throwing and catching of various events
- Following links and submitting data
- Simple Telephony control
- Access to proprietary features via <object>
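The menu and form examples that follow illustrate the first points. As a rough sketch of some of the other features (variables, scripting, event handling and links), a fragment along the following lines could be used; the URLs and field names here are made up for illustration:
<vxml version="1.0">
  <!-- a document-level variable -->
  <var name="attempts" expr="0"/>
  <!-- a link that is active throughout the document -->
  <link next="http://example.com/operator.vxml">
    <grammar type="application/x-jsgf"> operator </grammar>
  </link>
  <form>
    <field name="pin" type="digits">
      <prompt>Please say or key in your PIN.</prompt>
      <noinput>
        <!-- scripting: count the failed attempts -->
        <assign name="attempts" expr="attempts + 1"/>
        <if cond="attempts &gt;= 3">
          <!-- throw an event, caught by the handler below -->
          <throw event="help"/>
        </if>
        <reprompt/>
      </noinput>
      <catch event="help">
        Your PIN is the four digit code printed on your statement.
      </catch>
    </field>
    <block>
      <submit next="http://example.com/login" namelist="pin"/>
    </block>
  </form>
</vxml>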
Example of a simple voice menu
<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel.
      Do you want to fly to
      <emphasis> New York </emphasis>
      or
      <emphasis> Washington </emphasis>
    </speak>
  </prompt>
  <choice next="http://www.NY...">
    <grammar>
      <choice>
        <item> New York </item>
        <item> Big Apple </item>
      </choice>
    </grammar>
  </choice>
  <choice next="http://www.Wash...">
    <grammar>
      <choice>
        <item> Washington </item>
        <item> The Capital </item>
      </choice>
    </grammar>
  </choice>
</menu>
Example of a simple form with two fields
<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>
VoiceXML Implementations
- BeVocal
- General Magic
- HeyAnita
- IBM
- Lucent
- Motorola
- Nuance
- PipeBeach
- SpeechWorks
- Telera
- Tellme
- VoiceGenie
If your company has an implementation and would like to be
listed, contact Dave Raggett
Reusable Components

Principles:
- Express application at task level rather than interaction
level
- Save development time by reusing tried and tested modules
(see the sketch after the list of examples below)
- Increase consistency among applications
Possible examples include:
- Credit card number
- Date
- Name
- Address
- Telephone number
- Yes/No question
- Shopping cart
- Order status
- Weather
- Stock quotes
- Sport scores
- Word games
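For instance, a reusable date component from the list above could be packaged as its own dialog and invoked from VoiceXML via <subdialog>. The following is only a sketch: the date.vxml component, its returned "date" variable and the URLs are assumptions made for illustration.
<form id="book_flight">
  <!-- a dialog-level variable to hold the result -->
  <var name="depart_date"/>
  <!-- reuse a shared date dialog rather than re-implementing it here -->
  <subdialog name="get_date" src="http://example.com/library/date.vxml">
    <filled>
      <!-- assumes date.vxml returns a variable named "date"
           via <return namelist="date"/> -->
      <assign name="depart_date" expr="get_date.date"/>
    </filled>
  </subdialog>
  <block>
    <submit next="http://example.com/book" namelist="depart_date"/>
  </block>
</form>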
Speech Grammars
- Specify the words and patterns of words that a
speaker-independent recognizer listens for
- May be specified
- Inline as part of a VoiceXML page
- Referenced and stored separately on Web servers
- Context-free grammars
- XML and ABNF formats (interchangeable)
- Action tags for "binding data"
- N-gram statistical language models
An example in both the XML and ABNF syntax; first, in XML:
<rule id="state" scope="public">
  <one-of>
    <item> Oregon </item>
    <item> Maine </item>
  </one-of>
</rule>
When rewritten in ABNF this becomes:
public $state = Oregon | Maine;
Speech Synthesis
Synthesis engines have detailed knowledge about language.
Consider the following text:
Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He
plays bass in a blues band. He also likes to fish; last week he
caught a 20 lb. bass.
The first instance of "Dr" is spoken as "Doctor" but the
second is spoken as "Drive". Numbers in street addresses are (in
the USA) spoken specially. Some words like "bass" have context
dependent pronunciations. The previous text is spoken as if it
were written as:
Doctor Jones lives at one seventy-five Park Drive. He
weighs one hundred and seventy-five pounds. He plays base in a
blues band. He also likes to fish; last week he caught a
twenty-pound bass.
Processing is done in stages:
- Structure analysis
- Text normalization
- Text to Phoneme conversion
- Prosody analysis
- Waveform production.
The W3C speech synthesis markup language gives authors control
over each of these steps.
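As a rough sketch of what that control might look like in markup: element names such as speak, say-as, phoneme, break, prosody and emphasis follow the working drafts, but the attribute names and values below are illustrative assumptions rather than quotations from the specification.
<speak>
  <!-- text normalization: hint that the first "Dr." is a title
       and that 175 is part of a street address -->
  <say-as type="name">Dr. Jones</say-as> lives at
  <say-as type="address">175 Park Dr.</say-as>
  <break/>
  <!-- text-to-phoneme conversion: force the "base" pronunciation -->
  He plays <phoneme ph="b ey s">bass</phoneme> in a blues band.
  <!-- prosody: slow down and stress the punch line -->
  <prosody rate="slow">
    Last week he caught a <emphasis>twenty pound</emphasis> bass.
  </prosody>
</speak>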
Lexicon Markup Language
Why?
- Accurate pronunciations are essential in EVERY speech
application
- Platform default lexicons do not give 100% coverage of user
speech

The lexicon can be used, for example, to get the synthesis
engine to say "either" as /ay th r/ and to get the recognition
engine to accept this pronunciation as well as the alternate
pronunciation: /iy th r/.
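The lexicon markup itself had not yet been published, so purely as a hypothetical sketch, an entry for "either" with both pronunciations might look something like the following; every element name here is an illustrative assumption, not the actual LexiconML syntax:
<lexicon>
  <!-- hypothetical entry: one written form, two accepted pronunciations -->
  <lexeme>
    <grapheme>either</grapheme>
    <phoneme>ay th r</phoneme> <!-- preferred pronunciation for synthesis -->
    <phoneme>iy th r</phoneme> <!-- also accepted by the recognizer -->
  </lexeme>
</lexicon>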
LexiconML - Key Requirements
- Meets both synthesis and recognition requirements
- Pronunciations for any language (including tonal)
- reuse standard alphabets, support for suprasegmentals
- Multiple pronunciations per word
- Alternate orthographies
- Spelling variations: "colour" and "color"
- Alternative writing systems: Japanese Kanji and Kana
- Abbreviations and acronyms, e.g. Dr., BT
- Homophones, e.g. "read" and "reed" (same sound)
- Homographs, e.g. "read" and "read" (same spelling)
Natural Language Semantics ML
A standard for representing the output from language
understanding. This markup will be generated and consumed by
software rather than by people. It builds upon XForms for
representing the values and types of data extracted from spoken
utterances.
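As an illustration only, the interpretation of a weather query might be represented along the following lines; the element names (result, interpretation, input, instance) are sketched from the direction of the working draft and should be read as assumptions rather than the final syntax:
<result>
  <!-- one interpretation of the spoken utterance, with a confidence score -->
  <interpretation confidence="0.85">
    <input mode="speech">what is the weather in San Francisco</input>
    <instance>
      <!-- values extracted from the utterance, typed as in XForms -->
      <city>San Francisco</city>
      <country>USA</country>
    </instance>
  </interpretation>
</result>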

Interaction Style
- Voice user interfaces needn't be dull
- Choose prompts to reflect an explicit choice of
personality
- Introduce variety in prompts rather than always repeating the
same thing
- Politeness, helpfulness and sense of humor
- Target different groups of users e.g. Gen Y
- Allow users to select personality (skin)
Call Control
Give developers control over speech and telephony resources at
the network edge.

- Examples
- Call management—Place outbound call, conditionally
answer inbound call, outbound fax
- Call leg management—Create, redirect, interact while on
hold
- Conference management—Create, join, exit
- Intersession communication—Asynchronous events
- Interpreter context—Invoke, terminate
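No call control markup had been published at this point, so the following is a purely hypothetical sketch of the kinds of operation listed above: placing an outbound call and joining it to a conference. Every element and attribute name here is invented for illustration.
<!-- hypothetical call control markup, for illustration only -->
<session>
  <!-- call management: place an outbound call -->
  <createcall id="leg1" dest="tel:+85212345678"/>
  <!-- conference management: create a conference and join the new leg -->
  <createconference id="conf1"/>
  <join leg="leg1" conference="conf1"/>
  <!-- intersession communication: raise an asynchronous event
       for another session to catch -->
  <send event="conference.started" target="supervisor-session"/>
</session>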
Multimodal Interaction = Voice + Displays
- Say which city you want weather for and see the information
on your phone:
"What is the weather in San Francisco?"
- Say which bands/CDs you want to buy and confirm the choices
visually:
"I want to place an order for 'Hotshot' by Shaggy"
W3C work on multimodal
- Multimodal applications
- Voice + Display + Key pad + Stylus etc.
- User is free to switch between voice interaction and use of
display/key pad/clicking/handwriting
- July 2000 Published Multimodal Requirements Draft
- Demonstrations of Multimodal prototypes at Paris face to face
meeting of Voice Browser WG
- Joint W3C/WAP Forum workshop on Multimodal – Hong Kong
September 2000
- February 2001 – W3C publishes Multimodal Request for
Proposals
- Plan to set up Multimodal Working Group later this year
assuming we get appropriate submission(s)
Requirements
- Primary market is mobile wireless
- cell phones, personal digital assistants and cars
- Timescale is driven by deployment of 3G networks
- Input modes:
- speech, keypads, pointing devices, and electronic ink
- Output modes:
- speech, audio, and bitmapped or character cell displays
- Architecture should allow for both local and remote speech
processing
VoiceXML IP Issues
- Technical work on VoiceXML 2.0 is proceeding well
- Related specifications for speech grammars, speech synthesis,
natural language semantics, the lexicon, and call control have
been or will shortly be published
- But publication of the VoiceXML 2.0 working draft is held up
over IP issues (although an internal version is accessible to all
W3C Members)
- W3C and VoiceXML Forum Management are in the process of
developing a formal Memorandum of Understanding
- W3C Patent Policy allows for both Royalty-free and RAND
working groups
- W3C is convening a Patent Advisory Group to recommend
detailed IP Policy for re-chartering the Voice Browser
Activity