VoiceXML and the Web

Dave Raggett, W3C Fellow (Canon)

W3C Activity Lead for Voice Browsers
and Multimodal Interaction

http://www.w3.org/Voice/

An Introduction to the
Voice Browser Activity

Voice Browsers

[Figure: voice browser]

W3C Speech Interface Framework

Call Control eXtensible Markup Language (CCXML)

CCXML Features

CCXML defines an event-driven state machine which can:

CCXML - markup sample

<?xml version="1.0" encoding="UTF-8"?>

<ccxml version="1.0">
  <var name="in_callid" expr="''"/>
  <var name="currentstate" expr="'initial'"/>
  
  <eventhandler statevariable="currentstate">
    <transition state="'initial'"
     event="connection.CONNECTION_ALERTING" name="evt">
        <assign name="currentstate" expr="'alerting'"/>
        <assign name="in_callid" expr="evt.callid"/>
        <accept callid="in_callid"/>
    </transition>

    <transition state="'alerting'"
     event="connection.CONNECTION_CONNECTED" name="evt">
        <assign name="currentstate" expr="'fetching'"/>
        <fetch next="'http://acme.com/conference.asp'" namelist="in_callid"/>
    </transition>

    <transition state="'fetching'" event="fetch.done" name="evt">
        <goto fetchid="evt.fetchid"/>
    </transition>

  </eventhandler>
</ccxml>
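
The control flow of the sample above can be sketched outside CCXML as a small event-driven state machine. The Python below is only an illustration under stated assumptions: Event, TRANSITIONS and run are hypothetical names, not part of CCXML, and the actions are recorded as labels rather than carried out by a telephony platform.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a CCXML event (name plus payload)."""
    name: str
    data: dict = field(default_factory=dict)


# (current state, event name) -> (next state, action label).
# Mirrors the three <transition> elements of the sample; the labels are
# illustrative and nothing is actually accepted, fetched or executed.
TRANSITIONS = {
    ("initial", "connection.CONNECTION_ALERTING"): ("alerting", "accept"),
    ("alerting", "connection.CONNECTION_CONNECTED"): ("fetching", "fetch"),
    ("fetching", "fetch.done"): ("fetching", "goto"),
}


def run(events):
    """Feed events through the table and return the final state,
    the captured call id and the list of actions taken."""
    state, in_callid, log = "initial", "", []
    for ev in events:
        step = TRANSITIONS.get((state, ev.name))
        if step is None:
            continue  # no matching transition: this sketch ignores the event
        state, action = step
        if action == "accept":
            in_callid = ev.data.get("callid", "")
        log.append(action)
    return state, in_callid, log
```

Events that arrive in a state with no matching transition are simply dropped here; a real CCXML interpreter has its own rules for unhandled events.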

Related work

Speech Synthesis (SSML)

SSML markup sample

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <voice gender="female"> 
    Any female voice here.
    <voice age="6"> 
      A female child voice here.
      <paragraph xml:lang="ja"> 
        <!-- A female child voice in Japanese. -->
      </paragraph>
    </voice>
  </voice>
</speak>
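
The nesting in the sample works because an inner voice element inherits the attributes of the voice elements around it. As a rough sketch of that inheritance, the Python below walks a shortened copy of the markup and reports which voice attributes apply to each piece of text. The function name effective_voices is hypothetical, and the code only merges attributes; it does not select or drive a real synthesis voice.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/10/synthesis}"

# Shortened copy of the sample, kept to the two nested voice elements.
SSML = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice gender="female">
    Any female voice here.
    <voice age="6">
      A female child voice here.
    </voice>
  </voice>
</speak>"""


def effective_voices(elem, inherited=None):
    """Yield (text, attributes) pairs, merging voice attributes down the tree."""
    attrs = dict(inherited or {})
    if elem.tag == NS + "voice":
        attrs.update(elem.attrib)  # inner values extend or override outer ones
    if elem.text and elem.text.strip():
        yield elem.text.strip(), attrs
    for child in elem:
        yield from effective_voices(child, attrs)
        # Text after a child's closing tag belongs to the current element.
        if child.tail and child.tail.strip():
            yield child.tail.strip(), attrs


result = list(effective_voices(ET.fromstring(SSML)))
```

For the inner text the merged attributes contain both gender and age, which matches the "female child voice" wording in the sample.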

Speech Recognition Grammar (SRGS)

SRGS markup sample

<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                  "http://www.w3.org/TR/speech-grammar/grammar.dtd">
 
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar 
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         xml:lang="en-US" version="1.0">
 
  <!-- 
     single language attachment to tokens
     "yes" inherits US English language
     "oui" is Canadian French language
  -->
  <rule id="yes">
    <one-of>
      <item>yes</item>
      <item xml:lang="fr-CA">oui</item>
    </one-of> 
  </rule> 
  
  <!-- Single language attachment to an expansion -->
  <rule id="people1">
    <one-of xml:lang="fr-CA">
      <item>Michel Tremblay</item>
      <item>André Roy</item>
    </one-of>
  </rule>
  
  <!--
     Handling language-specific pronunciations of the same word
     A capable speech recognizer will listen for Mexican Spanish 
     and US English pronunciations.
  -->
  <rule id="people2">
    <one-of>
      <item xml:lang="en-US">Jose</item>
      <item xml:lang="es-MX">Jose</item>
    </one-of>
  </rule>
  
  <!-- Multi-lingual input is possible -->
  <rule id="request" scope="public">
    <example> may I speak with André Roy </example>
    <example> may I speak with Jose </example>
  
    may I speak with
    <one-of>
      <item> <ruleref uri="#people1"/> </item>
      <item> <ruleref uri="#people2"/> </item>
    </one-of>
  </rule>
</grammar>
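
To make the structure of the grammar above more concrete, the sketch below rewrites the people1, people2 and request rules as plain Python data and lists every phrase the request rule can produce. This is a simplification under stated assumptions: GRAMMAR, expand and accepts are hypothetical names, the language tags are dropped, and matching is done on text strings, whereas a speech recognizer matches audio against pronunciations.

```python
from itertools import product

# Each rule is a sequence of parts; each part is a list of alternatives.
# An alternative is a literal phrase or ("ref", rule_name), like <ruleref>.
GRAMMAR = {
    "people1": [["Michel Tremblay", "André Roy"]],
    "people2": [["Jose"]],  # both language variants share one spelling
    "request": [["may I speak with"], [("ref", "people1"), ("ref", "people2")]],
}


def expand(rule):
    """Return every phrase the named rule can produce."""
    choices = []
    for part in GRAMMAR[rule]:
        options = []
        for alt in part:
            if isinstance(alt, tuple):
                options.extend(expand(alt[1]))  # follow the rule reference
            else:
                options.append(alt)
        choices.append(options)
    return [" ".join(words) for words in product(*choices)]


def accepts(utterance, rule="request"):
    """True if the text is one of the phrases the rule can produce."""
    return utterance.strip() in expand(rule)
```

Enumerating all phrases is only workable for tiny grammars such as this one; it is used here purely to show how one-of and ruleref combine.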

Semantic Interpretation

VoiceXML 2.0

VoiceXML markup sample

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xml:lang="en">
<form>

<field name="city">
<prompt>Where do you want to travel to?</prompt>
<option>Edinburgh</option>
<option>New York</option>
<option>London</option>
<option>Paris</option>
<option>Stockholm</option>
</field>

<field name="travellers" type="number">
<prompt>How many are travelling to <value expr="city"/>?</prompt>
</field>

<block>
<submit next="http://localhost/handler" namelist="city travellers"/>
</block>

</form>
</vxml>
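
The dialog in the sample above can be approximated as a loop that visits each field in document order, plays its prompt and keeps asking until the answer is acceptable. The Python below is a much reduced sketch of that idea and not the form interpretation algorithm of the VoiceXML specification: FIELDS, run_form and the two validators are hypothetical names, and typed answers stand in for recognized speech.

```python
CITIES = ["Edinburgh", "New York", "London", "Paris", "Stockholm"]


def validate_city(answer):
    """Accept only the cities listed as <option> elements."""
    return answer if answer in CITIES else None


def validate_number(answer):
    """Rough stand-in for the built-in number type: digits only."""
    return int(answer) if answer.isdigit() else None


# (field name, prompt template, validator), in document order.
FIELDS = [
    ("city", "Where do you want to travel to?", validate_city),
    ("travellers", "How many are travelling to {city}?", validate_number),
]


def run_form(answers):
    """Fill each field in order, repeating its prompt after invalid input.
    Returns the filled fields and the prompts that were played."""
    answers = iter(answers)
    filled, transcript = {}, []
    for name, prompt, validate in FIELDS:
        while name not in filled:
            # Earlier fields are available to later prompts, like <value expr="city"/>.
            transcript.append(prompt.format(**filled))
            value = validate(next(answers))
            if value is not None:
                filled[name] = value
    return filled, transcript
```

The returned dictionary holds the values that the submit element would send through its namelist, here city and travellers.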

Some possible applications

with thanks to Jim Larson

Patents

Next Steps

Demo1 — a small sample of what it would feel like to access the W3C website via VoiceXML

Demo2 — an example of the improvement in speech synthesis, and a hint at the opportunities for richer dialog

Note: commercial applications generally use prerecorded prompts made using human actors

What lies beyond ...

A personal view of the
challenges for voice interaction

More Robust Speech Recognition

[Figure: speech spectrogram]

Reducing the effort needed to tune new applications

Helping computers to understand things our way

Thank you for listening ....

Any Questions?