W3C

From Voice Browsers to Multimodal Systems

The W3C Speech Interface Framework

Dave Raggett <dsr@w3.org>, W3C Lead for Voice/Multimodal. Hong Kong, 1st May 2001

Voice – The Natural Interface available from over a billion phones

Personal assistant functions:

Voice Portals

Front-ends for Call Centers

W3C Voice Browser Working Group

Mission

Voice Browser working group members

Alcatel
AnyDevice
Ask Jeeves
AT&T
Avaya
BeVocal
Brience
BT
Canon
Cisco
Comverse
Conversay
EDF
France Telecom
General Magic
Hitachi
HP
IBM
Informio
Intel
IsSound
Lernout & Hauspie
Locus Dialogue
Lucent
Microsoft
Milo
Mitre
Motorola
Nokia
Nortel Networks
Nuance
Philips
Openwave
PipeBeach
SpeechHost
SpeechWorks
Sun Microsystems
Telecom Italia
Telera
Tellme
Unisys
Verascape
VoiceGenie
Voxeo
VoxSurf
Yahoo

Note: the list includes invited experts as well as representatives of W3C Member companies.

W3C Speech Interface Framework

The Voice Browser working group is developing a suite of specifications. The following diagram shows how each of the specifications fits into the general framework for voice browsers:

[Diagram: the voice browser framework]

Published Documents

[Figure: publication status]

Voice User Interfaces and VoiceXML

Why use voice as a user interface?

Why do we need a language for specifying voice dialogs?

What does VoiceXML describe?

W3C aims to standardize VoiceXML based upon the VoiceXML 1.0 submission by AT&T, IBM, Lucent and Motorola

VoiceXML Architecture

Brings the power of the Web to Voice

[Diagram: VoiceXML architecture]

Reaching Out to Multiple Channels

[Diagram: server architecture]

VoiceXML Features

VoiceXML includes menus, forms with fields, prompts, grammars and event handling, as illustrated by the following examples:

Example of a simple voice menu

<menu>
  <prompt>
    <speak>
      Welcome to Ajax Travel.
      Do you want to fly to
      <emphasis>
        New York
      </emphasis>
      or
      <emphasis>
        Washington
      </emphasis>?
    </speak>
  </prompt>

  <choice next="http://www.NY...">
    <grammar>
      <one-of>
        <item> New York </item>
        <item> Big Apple </item>
      </one-of>
    </grammar>
  </choice>

  <choice next="http://www.Wash...">
    <grammar>
      <one-of>
        <item> Washington </item>
        <item> The Capital </item>
      </one-of>
    </grammar>
  </choice>
</menu>

Example of a simple form with two fields

<form id="weather_info">
  <block>Welcome to the international weather service.</block>
  <field name="country">
    <prompt>What country?</prompt>
    <grammar src="country.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the country for which you want the weather.
    </catch>
  </field>
  <field name="city">
    <prompt>What city?</prompt>
    <grammar src="city.gram" type="application/x-jsgf"/>
    <catch event="help">
      Please say the city for which you want the weather.
    </catch>
  </field>
  <block>
    <submit next="/servlet/weather" namelist="city country"/>
  </block>
</form>

VoiceXML Implementations

If your company has an implementation and would like to be listed, contact Dave Raggett

Reusable Components

[Diagram: designing for reusability]

Principles:

Possible examples include:

Speech Grammars

An example of the same rule in both the XML and ABNF syntaxes. First, in XML:

<rule id="state" scope="public">
   <one-of> 
       <item> Oregon </item>
       <item>Maine </item>
    </one-of> 
</rule>

When rewritten in ABNF this becomes:

public $state = Oregon | Maine;
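
Rules can also reference one another, so larger grammars compose from smaller ones. A minimal sketch in the XML syntax (the reference mechanism, here a ruleref element, varied between drafts, so treat the details as illustrative):

<rule id="destination" scope="public">
  I want to fly to <ruleref uri="#state"/>
</rule>

This would match utterances such as "I want to fly to Oregon".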

Speech Synthesis

Synthesis engines have detailed knowledge about language. Consider the following text:

Dr. Jones lives at 175 Park Dr. He weighs 175 lb. He plays bass in a blues band. He also likes to fish; last week he caught a 20 lb. bass.

The first instance of "Dr" is spoken as "Doctor" but the second is spoken as "Drive". Numbers in street addresses are (in the USA) spoken specially. Some words like "bass" have context dependent pronunciations. The previous text is spoken as if it were written as:

Doctor Jones lives at one seventy-five Park Drive. He weighs one hundred and seventy-five pounds. He plays base in a blues band. He also likes to fish; last week he caught a twenty-pound bass.

Processing is done in stages:

  1. Structure analysis
  2. Text normalization
  3. Text to Phoneme conversion
  4. Prosody analysis
  5. Waveform production.

The W3C speech synthesis markup language gives authors control over each of these steps.
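
For example, where the engine's own analysis cannot be relied upon, an author can mark up the intended reading directly. A minimal sketch using say-as and phoneme elements (element and attribute names shifted between drafts, so treat the details as illustrative):

<speak>
  Dr. Jones lives at
  <!-- text normalization hint: read as a street address -->
  <say-as type="address"> 175 Park Dr. </say-as>
  He plays
  <!-- pronunciation override: the instrument /b ey s/, not the fish -->
  <phoneme ph="b ey s"> bass </phoneme>
  in a blues band.
</speak>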

Lexicon Markup Language

Why?

[Figure: lexicon example for 'either']

The lexicon is used in this example to get the synthesis engine to say "either" as /ay th r/, and to get the recognition engine to accept this pronunciation as well as the alternative pronunciation /iy th r/.
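
No lexicon markup has been standardized yet. As a purely hypothetical sketch (the element names lexeme, grapheme and phoneme are illustrative, not taken from any published draft), an entry for "either" might pair the written form with both pronunciations:

<lexeme>
  <grapheme> either </grapheme>
  <!-- preferred pronunciation, used by the synthesis engine -->
  <phoneme> ay th r </phoneme>
  <!-- alternative, also accepted by the recognition engine -->
  <phoneme> iy th r </phoneme>
</lexeme>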

LexiconML - Key Requirements

Natural Language Semantics ML

A standard for representing the output from language understanding. This markup will be generated and consumed by software rather than by people. It builds upon XForms for representing the values and types of data extracted from spoken utterances.

[Figure: flow diagram]
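
As a sketch of the intended shape, a recognizer hearing "I want to fly to Oregon" might emit something like the following, with the extracted data carried in an XForms instance (names and structure are illustrative of the working draft, not final):

<result xmlns:xf="http://www.w3.org/2000/xforms">
  <interpretation confidence="80">
    <input mode="speech"> I want to fly to Oregon </input>
    <!-- extracted data, typed via the XForms data model -->
    <xf:instance>
      <destination>
        <state> Oregon </state>
      </destination>
    </xf:instance>
  </interpretation>
</result>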

Interaction Style

Call Control

Give developers control over speech and telephony resources at the network edge.

Multimodal Interaction = Voice + Displays

Say which city you want the weather for and see the information on your phone

"What is the weather in San Francisco?"

[Figure: browser screen shot]




Say which bands/CDs you want to buy and confirm the choices visually

"I want to place an order for 'Hotshot' by Shaggy"

[Figure: browser screen shot]

W3C work on multimodal

Requirements

VoiceXML IP Issues


Copyright © 2001 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.