VoiceXML and the Web

Dave Raggett, W3C Fellow (Canon)

W3C Activity Lead for Voice Browsers
and Multimodal interaction

http://www.w3.org/Voice/

An Introduction to the
Voice Browser Activity

Enabling voice access to suitably designed Web applications
- Builds upon well established Web technology
Made possible by advances in speaker independent speech recognition and speech synthesis
Accessible from over a billion telephones
Biggest take up is for Call Centers
Despite current economic setbacks, speech technology and voice applications remain buoyant:
"a recent study by the Radicati Group estimates that the total market value for Voice Enabled Services will reach $202 million by year end 2002, and grow to over $659 million by 2006"

Voice Browsers

voice browser

User dials Voice Browser, e.g. +1 800 555 8355
Browser downloads markup from Web server
VoiceXML markup directs browser to:
- Prompt user using recorded or synthetic speech
- Interpret user's spoken response using speech grammar
- Determine what to do next
VoiceXML provides a form filling metaphor
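
A minimal sketch of this prompt, interpret, decide cycle, assuming VoiceXML 2.0 syntax and a platform that supports the optional built-in boolean grammar. The example.com URL is hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <form id="confirm">
    <!-- One form field: the browser prompts, then listens for yes or no -->
    <field name="answer" type="boolean">
      <prompt>Would you like to hear the weather forecast?</prompt>
      <!-- Runs once the spoken response has been recognized -->
      <filled>
        <if cond="answer">
          <!-- Hypothetical URL: fetch the next page of markup -->
          <goto next="http://example.com/weather.vxml"/>
        <else/>
          <prompt>Goodbye.</prompt>
          <exit/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```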

W3C Speech Interface Framework

VoiceXML — markup for voice dialogs
- Candidate Recommendation - 28 Jan 2003
SRGS — markup for speech grammars
- Candidate Recommendation — 26 June 2002
SSML — markup for speech synthesis
- Candidate Recommendation expected 2nd quarter 2003
Semantic Interpretation — extracting recognition results (in an ECMAScript variable or as XML)
- Latest Working Draft — 1 April 2003
CCXML — markup for call control
- Latest Working Draft — 11 October 2002

Call Control eXtensible Markup Language (CCXML)

Designed as an adjunct language to provide telephony call control support for VoiceXML or other dialog systems
CCXML adds support for:
- Multi-party conferencing
- Sophisticated multiple-call handling and control, including the ability to place outgoing calls
- Handling for richer and more asynchronous events
- Ability to receive events and messages from external computational entities, e.g. interacting with an outside call queue, or placing calls on behalf of a document server

CCXML Features

CCXML defines an event driven state machine which can:

Accept or reject an incoming phone call
Make an outbound call
Terminate a phone connection
Create and destroy multi-party audio conferences
Redirect an incoming call to a new end-point
Connect and disconnect two audio sources
Start and stop a dialog (VoiceXML)

CCXML - markup sample

<?xml version="1.0" encoding="UTF-8"?>

<ccxml version="1.0">
  <var name="in_callid" expr="''"/>
  <var name="currentstate" expr="'initial'"/>
  
  <eventhandler statevariable="currentstate">
    <transition state="'initial'"
     event="connection.CONNECTION_ALERTING" name="evt">
        <assign name="currentstate" expr="'alerting'"/>
        <assign name="in_callid" expr="evt.callid"/>
        <accept callid="in_callid"/>
    </transition>

    <transition state="'alerting'"
     event="connection.CONNECTION_CONNECTED" name="evt">
        <assign name="currentstate" expr="'fetching'"/>
        <fetch next="'http://acme.com/conference.asp'" namelist="in_callid"/>
    </transition>

    <transition state="'fetching'" event="fetch.done" name="evt">
        <goto fetchid="evt.fetchid"/>
    </transition>

  </eventhandler>
</ccxml>

Related work

ECMA standard Computer Supported Telecommunications Applications (CSTA)
SALT Forum: SALT Call control object
- SALT provides speech tags for use with markup languages like HTML and scripting languages like ECMAScript
Call Processing Language (CPL) is an XML based language for user scripting of incoming VoIP calls

Speech Synthesis (SSML)

Markup language for prompting users via a combination of prerecorded speech, synthetic speech and music
You can select voice characteristics (name, gender and age) and the speed, volume, pitch, and emphasis
Provision for overriding the synthesis engine's default pronunciation
Used within VoiceXML 2.0
Expect to advance SSML to W3C Candidate Recommendation status during Summer 2003
Matching Speech Synthesis style sheet module for CSS3
- Replaces aural cascading style sheets in CSS2
- Developed as collaboration between VBWG and CSS WG

SSML markup sample

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <voice gender="female"> 
    Any female voice here.
    <voice age="6"> 
      A female child voice here.
      <paragraph xml:lang="ja"> 
        <!-- A female child voice in Japanese. -->
      </paragraph>
    </voice>
  </voice>
</speak>
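
A further sketch of the speed, volume, emphasis and pronunciation controls, plus prerecorded audio with a synthetic fallback. Element and attribute names are assumed to follow SSML 1.0 and may differ in earlier drafts; the audio URL and the IPA transcriptions are illustrative only:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Prerecorded audio; the text content is spoken if the file cannot be played -->
  <audio src="http://example.com/welcome.wav">Welcome to the travel desk.</audio>

  <!-- Slower, louder delivery with one strongly emphasized word -->
  <prosody rate="slow" volume="loud">
    Your flight leaves at <emphasis level="strong">nine</emphasis> tomorrow.
  </prosody>

  <!-- Override the synthesis engine's default pronunciation -->
  You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>,
  I say <phoneme alphabet="ipa" ph="təˈmɑːtəʊ">tomato</phoneme>.
</speak>
```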

Speech Recognition Grammar (SRGS)

Context free grammar for expressing user's response
Expressed in XML as a grammar over words/characters
Used to guide speech recognition
- Recognizer maps words into an internal phonetic representation
- Used in statistical recognition process (Hidden Markov Models)
- May return an N-best list of results
Also used as a basis for extracting information from spoken utterances (see next slide)
Alternative augmented BNF syntax (little used)
Includes support for touch tone (DTMF) input
Expect to advance SRGS to W3C Proposed Recommendation status in Summer 2003
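
How an application might inspect an N-best list: a hedged VoiceXML 2.0 sketch using the maxnbest property and the application.lastresult$ array. The city names are placeholders, and whether more than one candidate is returned depends on the platform:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- Ask the recognizer for up to three alternative results -->
  <property name="maxnbest" value="3"/>
  <form>
    <field name="city">
      <prompt>Which city?</prompt>
      <option>Boston</option>
      <option>Austin</option>
      <filled>
        <!-- application.lastresult$ holds the candidates, best match first -->
        <log>
          Heard <value expr="application.lastresult$[0].utterance"/>
          with confidence <value expr="application.lastresult$[0].confidence"/>
          out of <value expr="application.lastresult$.length"/> candidates
        </log>
      </filled>
    </field>
  </form>
</vxml>
```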

SRGS markup sample

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                  "http://www.w3.org/TR/speech-grammar/grammar.dtd">
 
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar 
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         xml:lang="en-US" version="1.0">
 
  <!-- 
     single language attachment to tokens
     "yes" inherits US English language
     "oui" is Canadian French language
  -->
  <rule id="yes">
    <one-of>
      <item>yes</item>
      <item xml:lang="fr-CA">oui</item>
    </one-of> 
  </rule> 
  
  <!-- Single language attachment to an expansion -->
  <rule id="people1">
    <one-of xml:lang="fr-CA">
      <item>Michel Tremblay</item>
      <item>André Roy</item>
    </one-of>
  </rule>
  
  <!--
     Handling language-specific pronunciations of the same word
     A capable speech recognizer will listen for Mexican Spanish 
     and US English pronunciations.
  -->
  <rule id="people2">
    <one-of>
      <item xml:lang="en-US">Jose</item>
      <item xml:lang="es-MX">Jose</item>
    </one-of>
  </rule>
  
  <!-- Multi-lingual input is possible -->
  <rule id="request" scope="public">
    <example> may I speak with André Roy </example>
    <example> may I speak with Jose </example>
  
    may I speak with
    <one-of>
      <item> <ruleref uri="#people1"/> </item>
      <item> <ruleref uri="#people2"/> </item>
    </one-of>
  </rule>
</grammar>
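
Touch tone input uses the same grammar format with mode="dtmf". A small sketch, assuming a hypothetical four digit PIN terminated by the pound key:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" mode="dtmf" root="pin">

  <!-- Any single key from 0 to 9 -->
  <rule id="digit">
    <one-of>
      <item>0</item>
      <item>1</item>
      <item>2</item>
      <item>3</item>
      <item>4</item>
      <item>5</item>
      <item>6</item>
      <item>7</item>
      <item>8</item>
      <item>9</item>
    </one-of>
  </rule>

  <!-- Exactly four digits followed by the pound key -->
  <rule id="pin" scope="public">
    <item repeat="4"><ruleref uri="#digit"/></item>
    #
  </rule>
</grammar>
```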

Semantic Interpretation

Annotations for speech grammars
Written in a subset of ECMAScript
Used to extract results from speech recognition
Result is either an ECMAScript variable or XML (EMMA)
- EMMA is being developed in the W3C Multimodal Interaction working group
Example:
- User says "I want to fly to Chicago"
- Application sees: <destination>Chicago</destination>
- Implementation adds further information about recognition confidence, timestamps and alternative interpretations
Expect to publish Last Call working draft in Summer 2003
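
A sketch of how a grammar might be annotated to yield the destination in the example above. The tag syntax changed between drafts; this sketch assumes the notation of the later Semantic Interpretation 1.0 Recommendation (tag-format="semantics/1.0", with the out and rules variables). The rule names and the second city are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="flight"
         tag-format="semantics/1.0">

  <!-- Matches "I want to fly to Chicago" -->
  <rule id="flight" scope="public">
    I want to fly to
    <ruleref uri="#city"/>
    <!-- Copy the city rule's result into the destination property -->
    <tag>out.destination = rules.city;</tag>
  </rule>

  <!-- Each alternative sets this rule's result to a string -->
  <rule id="city">
    <one-of>
      <item>Chicago <tag>out = "Chicago";</tag></item>
      <item>Boston <tag>out = "Boston";</tag></item>
    </one-of>
  </rule>
</grammar>
```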

VoiceXML 2.0

VoiceXML is a dialog markup language aimed at telephony applications
- synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and simple mixed initiative conversations
Builds upon other W3C Speech Interface Framework specifications
Menus and Links — support for navigation commands like "main menu", "help", "appointments", "sport", "news"
Form filling support
- Leading user through sequence of form fields
- Form level grammars for utterances that fill out multiple fields
- Tapered prompts — giving progressively stronger guidance when user is having difficulties
- Adapting to user's level of experience (novice vs power user)
- Uses ECMAScript for storing data and conditional behavior
- Sessions that persist over a number of VoiceXML pages
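
A sketch of menu and link navigation, assuming VoiceXML 2.0 syntax. The dialog ids and prompt wording are made up for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

  <!-- Document-level link: saying "main menu" works from any dialog below -->
  <link next="#main">
    <grammar mode="voice" version="1.0" root="cmd">
      <rule id="cmd" scope="public">main menu</rule>
    </grammar>
  </link>

  <menu id="main">
    <!-- enumerate reads out the choices listed below -->
    <prompt>Main menu. Say one of: <enumerate/></prompt>
    <choice next="#appointments">appointments</choice>
    <choice next="#sport">sport</choice>
    <choice next="#news">news</choice>
    <noinput>Please say one of: <enumerate/></noinput>
  </menu>

  <form id="appointments">
    <block>You have no appointments today. <goto next="#main"/></block>
  </form>

  <form id="sport">
    <block>No sports results yet. <goto next="#main"/></block>
  </form>

  <form id="news">
    <block>No news headlines yet. <goto next="#main"/></block>
  </form>
</vxml>
```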

VoiceXML markup sample

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
<form>

<field name="city">
<prompt>Where do you want to travel to?</prompt>
<option>Edinburgh</option>
<option>New York</option>
<option>London</option>
<option>Paris</option>
<option>Stockholm</option>
</field>

<field name="travellers" type="number">
<prompt>How many are travelling to <value expr="city"/>?</prompt>
</field>

<block>
<submit next="http://localhost/handler" namelist="city travellers"/>
</block>

</form>
</vxml>
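
Tapered prompts can be sketched on the same city field with the count attribute: each time the user is reprompted, the prompt with the highest count not exceeding the current prompt counter is played. The wording of the extra prompts is invented for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <form>
    <field name="city">
      <!-- First attempt: short, open prompt -->
      <prompt count="1">Where do you want to travel to?</prompt>
      <!-- Second attempt: more guidance -->
      <prompt count="2">Please say the name of a city.</prompt>
      <!-- Third and later attempts: spell out the options -->
      <prompt count="3">
        Please say Edinburgh, New York, London, Paris or Stockholm.
      </prompt>
      <option>Edinburgh</option>
      <option>New York</option>
      <option>London</option>
      <option>Paris</option>
      <option>Stockholm</option>
      <!-- reprompt replays the field's prompts with the increased counter -->
      <noinput>I did not hear you. <reprompt/></noinput>
      <nomatch>I did not understand. <reprompt/></nomatch>
    </field>
  </form>
</vxml>
```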

Some possible applications

with thanks to Jim Larson

Be notified of interesting events — stock movements, availability of special offers, flight delays, ...
Conference with colleagues — meet at a specific Web site, with ability to whisper private messages
Find me — application uses information about the whereabouts of the called person to handle the call
Find the whereabouts of a friend — based upon cellular network's knowledge of the location of your friend's cellphone
Instant voice messaging. "Notify me when John Smith calls into his voicemail." Or "Voicemail from John Smith. He's checking his messages right now, would you like me to connect you?"
Voice control your home appliances — when returning from a vacation phone home to switch the heating on in advance

Patents

Voice Browser activity is chartered to produce specifications that can be implemented without the need to pay royalty fees
Under the Current Patent Practice Note, a Patent Advisory Group (PAG) must be launched upon:
- Incomplete patent disclosures by working group members currently in good standing
- Disclosure of essential claims for patents that are not available on royalty free (RF) terms
Several such patents were disclosed and as a result a PAG was launched
Unless resolved, this would prevent affected Voice Browser specifications from advancing to Proposed Recommendation status
The outlook is now positive, as companies have either offered to license the patents on royalty free terms or have indicated that the patent doesn't have essential claims on the Voice Browser specifications
The Voice Browser PAG expects to complete its work in June 2003

Next Steps

Driving current specs through to W3C REC status
Gathering requirements for next version of VoiceXML, e.g.
- Modularization
  - For use with XHTML for multimodal interaction
- Incremental changes building on success of 2.0
- Richer kinds of dialog with better support for tuning
- Support for XForms?
Possible interest in resuming work on pronunciation lexicons and stochastic language models

Demo1 — a small sample of what it would feel like to access the W3C website via VoiceXML

Demo2 — an example of the improvement in speech synthesis, and a hint at the opportunities for richer dialog

Note: commercial applications generally use prerecorded prompts made using human actors

What lies beyond ...

A personal view of the
challenges for voice interaction

More Robust Speech Recognition

speech spectrogram

Picking the nuances of speech out of the sound field
- Understanding the acoustic environment
Combining stochastic and linguistic models of speech
Using better knowledge about groups of speakers
- Something in between speaker dependent and speaker independent recognition?

Reducing the effort needed to tune new applications

Today this involves collecting lots of usage data, and carefully tuning the prompts and grammars accordingly
How can we make this easier?
- By sharing knowledge across applications
- By combining task specific knowledge with general knowledge
- With deeper natural language understanding
This is a research challenge!

Helping computers to understand things our way

Today's computers know nothing about the everyday world and that's why they are so stupid!
- Computers need to learn some common sense!
Common sense is estimated to involve hundreds of millions of ideas
Cyc, Thought Treasure and Open Mind
- Projects that have tackled common sense
Broad coverage natural language parsers
- Combining statistical and linguistic knowledge
- Natural language for knowledge representation
Harnessing the Web to tackle really big projects
- Open source projects with thousands of volunteers

Thank you for listening ....

Any Questions?