Getting started with VoiceXML 2.0

Dave Raggett, revised 14th November 2001.

This is a short introduction to VoiceXML. It owes its heritage to a well-travelled set of slides by Scott McGlashan of HP. Scott co-chairs the W3C Voice Browser working group and leads the VoiceXML subgroup.

VoiceXML Overview
Key Concepts of VoiceXML
VoiceXML Examples
Further Reading

VoiceXML Overview

[Figure: telephone, VoiceXML interpreter and website]

What is VoiceXML? It's an XML language for writing Web pages that you interact with by listening to spoken prompts and jingles, and that you control by means of spoken input. VoiceXML brings the Web to telephones. If you want a hands-on feeling for what this is like, there are an increasing number of voice portals that you can phone and try out for yourself. Several sites also offer free hosting for VoiceXML. Some pointers to these sites can be found in the FAQ on the overview page.

VoiceXML isn't HTML. HTML was designed for visual Web pages and lacks the control over the user-application interaction that is needed for a speech-based interface. With speech you can only hear one thing at a time (kind of like looking at a newspaper through a 10x magnifying glass). VoiceXML has been carefully designed to give authors full control over the spoken dialog between the user and the application. The application and user take turns to speak: the application prompts the user, and the user in turn responds.

VoiceXML documents describe:

spoken prompts (synthetic speech)
output of audio files and streams
recognition of spoken words and phrases
recognition of touch tone (DTMF) key presses
recording of spoken input
control of dialog flow
telephony control (call transfer and hangup)

VoiceXML makes it easy to create new applications rapidly and shields developers from low-level implementation details. It separates user interaction from service logic. The W3C VoiceXML 2.0 specification is the definitive reference to VoiceXML. You can also find other related work on W3C's voice browser overview page.

Key Concepts

A session begins when the user starts to interact with a VoiceXML interpreter and continues as VoiceXML documents are loaded and unloaded. The session ends when requested by the user, VoiceXML document or interpreter context. The platform defines the default session behavior, although this can be overridden in part by VoiceXML.

VoiceXML documents define applications as a set of named dialog states. The user is always in one dialog state at any time. Each dialog specifies the next dialog to transition to using a URL.
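
As a rough sketch of this idea, the document below defines two dialog states, a form called "greeting" and a form called "goodbye". The first hands over to the second with a goto element whose next attribute is a URL fragment. The form names and prompt text are made up for illustration:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

  <!-- First dialog state: entered when the document is loaded -->
  <form id="greeting">
    <block>
      <prompt>Hello.</prompt>
      <!-- Transition to the dialog named "goodbye" in this document -->
      <goto next="#goodbye"/>
    </block>
  </form>

  <!-- Second dialog state -->
  <form id="goodbye">
    <block>
      <prompt>Goodbye.</prompt>
      <exit/>
    </block>
  </form>

</vxml>

A next value that names another document, with or without a fragment, would load that document instead of staying in this one.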

VoiceXML has two kinds of dialogs: forms and menus. A menu presents the user with a choice of options and then transitions to another dialog state based upon the user's selection. A form defines an interaction that collects values for each of the fields in the form. Each field may specify a prompt, the expected input, and evaluation rules. The form can be submitted to a server in much the same way as for HTML.

An application is a set of VoiceXML documents that share the same application root document. The root document is automatically loaded whenever one of the application documents is loaded, and remains loaded until there is a transition to a different application or the call is disconnected. The information in the root document is available to all documents in the same application.

[Figure: several VoiceXML documents can share the same application root document]

Each dialog state has one or more grammars associated with it that describe the expected user input, either spoken input or touch-tone (DTMF) key presses. In the simplest case, only the dialog's grammars are active in that dialog. In more complex cases, other grammars can be active as well. The possible sources are:

grammars defined within the dialog itself
external grammars referenced by links
grammars defined at the document level and marked as being globally active
grammars defined in the root application document and active throughout the application
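
The sketch below shows one way a grammar can be active outside the dialog that is currently running. It assumes a link element placed at the document level, so that its grammar is listened for in every dialog of the document and a match takes the user to the URL given in next. The URLs and the grammar file are placeholders, not real resources:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

  <!-- Document-level link: its grammar is active in all dialogs below -->
  <link next="http://www.example.com/help.vxml">
    <!-- Hypothetical external grammar, e.g. one that matches "help" -->
    <grammar src="http://www.example.com/grammars/help.xml"
             type="application/grammar+xml"/>
  </link>

  <form id="main">
    <field name="city">
      <!-- The option elements define this field's own grammar -->
      <prompt>Which city?</prompt>
      <option>London</option>
      <option>Paris</option>
    </field>
  </form>

</vxml>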

A subdialog is like a function call: it allows you to call out to a new dialog and then return to the original dialog, retaining the local state information for that dialog. Subdialogs can be used to handle confirmations and to create a library of reusable dialogs for common tasks.

[Figure: inheritance model for catching events]

VoiceXML allows you to define named variables for holding data. These can be defined at any level and their scope follows an inheritance model. You can test the values of variables to determine what dialog state to transition to next. Variable expressions can also be used for conditional prompts and grammars etc.
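
Here is a minimal sketch of variables at two levels, used both for conditional prompts and for choosing the next dialog state. The variable names, the form names and the preset value 'Paris' are all assumptions made for the example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

  <!-- Document-level variable: visible in every dialog of this document -->
  <var name="preferred_city" expr="'Paris'"/>

  <form id="start">
    <!-- Dialog-level variable: visible only inside this form -->
    <var name="has_preference" expr="preferred_city != ''"/>

    <block>
      <!-- Conditional prompts: only those whose cond is true are spoken -->
      <prompt cond="has_preference">
        Your usual destination is <value expr="preferred_city"/>.
      </prompt>
      <prompt cond="!has_preference">Welcome to Travel Planner.</prompt>

      <!-- Test a variable to decide which dialog state comes next -->
      <if cond="has_preference">
        <goto next="#confirm"/>
      <else/>
        <goto next="#ask"/>
      </if>
    </block>
  </form>

  <form id="confirm">
    <block><prompt>Confirming your usual destination.</prompt></block>
  </form>

  <form id="ask">
    <block><prompt>Let us pick a destination.</prompt></block>
  </form>

</vxml>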

Events are thrown when the user fails to respond to a prompt, or when the input can't be understood. VoiceXML allows you to write handlers for catching events. These follow an inheritance model, and events can be caught at a higher level if there is no corresponding handler at the dialog level.
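
The following sketch illustrates the inheritance of handlers. It assumes one noinput handler at the document level and one nomatch handler on a single field. The wording of the messages and the form and field names are invented for the example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

  <!-- Document-level handler: used by any dialog or field that has
       no noinput handler of its own -->
  <noinput>
    I did not hear anything.
    <reprompt/>
  </noinput>

  <form id="trip">

    <field name="city">
      <prompt>Where do you want to travel to?</prompt>
      <option>London</option>
      <option>Paris</option>
      <!-- Field-level handler: catches nomatch for this field only -->
      <nomatch>
        Please say London or Paris.
        <reprompt/>
      </nomatch>
    </field>

    <field name="travellers" type="number">
      <prompt>How many are travelling?</prompt>
      <!-- No handlers here: noinput is caught by the document-level
           handler above, nomatch by the platform default -->
    </field>

  </form>

</vxml>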

VoiceXML allows you to use scripting (ECMAScript) when you need additional control over the application. VoiceXML employs a form-filling metaphor. You can define a complex grammar that collects the values of several fields in a single response. Any fields left unfilled are then collected one at a time by the field elements defined within the form.
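
As a small illustration of the scripting support mentioned above, the sketch below defines an ECMAScript function in a script element and calls it from a value expression. The function name totalPrice, the price of 120 per person and the currency are assumptions made up for the example:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

  <script>
    <![CDATA[
      // Hypothetical helper: total price for a group of travellers.
      // The multiplication also converts the field value to a number.
      function totalPrice(travellers) {
        var perPerson = 120;
        return travellers * perPerson;
      }
    ]]>
  </script>

  <form id="quote">
    <field name="travellers" type="number">
      <prompt>How many are travelling?</prompt>
      <filled>
        <prompt>
          The total price is <value expr="totalPrice(travellers)"/> pounds.
        </prompt>
      </filled>
    </field>
  </form>

</vxml>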

VoiceXML Examples

Here is a very simple VoiceXML application. It says "Welcome to Travel Planner!", plays a short audio advertising jingle and then exits:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <form>
    <block>
      <prompt bargein="false">Welcome to Travel Planner!
        <audio src="http://www.adline.com/mobile?code=12s4"/>
      </prompt>
    </block>
  </form>
</vxml>

The following example offers a menu of three choices: sports, weather or news.

<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
<menu>
  <prompt>
    Say one of: <enumerate/>
  </prompt>
  <choice next="http://www.sports.example/start.vxml">
     Sports
  </choice>
  <choice next="http://www.weather.example/intro.vxml">
     Weather
  </choice>
  <choice next="http://www.news.example/news.vxml">
     News
  </choice>
  <noinput>Please say one of <enumerate/></noinput>
</menu>
</vxml>

This dialog might proceed as follows:

Computer:		Say one of: Sports; Weather; News.
Human:		Astrology
Computer:		I did not understand what you said. (a platform-specific default message.)
Computer:		Say one of: Sports; Weather; News.
Human:		Sports
Computer:		(proceeds to http://www.sports.example/start.vxml)
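
The same menu can also accept touch-tone input. The variation below is a sketch rather than part of the original example: it assumes a dtmf attribute on each choice, so that the user can either say the word or press the matching key. The prompt wording is invented:

<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
<menu>
  <prompt>
    For sports say sports or press 1.
    For weather say weather or press 2.
    For news say news or press 3.
  </prompt>
  <!-- Each choice can be selected by voice or by its DTMF key -->
  <choice dtmf="1" next="http://www.sports.example/start.vxml">
     Sports
  </choice>
  <choice dtmf="2" next="http://www.weather.example/intro.vxml">
     Weather
  </choice>
  <choice dtmf="3" next="http://www.news.example/news.vxml">
     News
  </choice>
  <noinput>Please say a choice or press 1, 2 or 3.</noinput>
</menu>
</vxml>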

Here is another example, this time using a form to ask the user to choose a city and the number of travellers. Once this information has been collected, it is submitted to a web server:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <form>

    <field name="city">
      <prompt>Where do you want to travel to?</prompt>
      <option>Edinburgh</option>
      <option>New York</option>
      <option>London</option>
      <option>Paris</option>
      <option>Stockholm</option>
    </field>

    <field name="travellers" type="number">
      <prompt>How many are travelling to <value expr="city"/>?</prompt>
    </field>

    <block>
      <submit next="http://localhost/handler" namelist="city travellers"/>
    </block>

  </form>
</vxml>

VoiceXML allows you to give progressively more detailed prompts when the user is having difficulty answering. This relies on a prompt counter that increments each time the user is prompted again. The following example shows how this works for a field that collects the number of people travelling. The user is initially asked: "How many are travelling to Boston?". If this doesn't get a satisfactory answer, the user is then asked: "Please tell me the number of people travelling". The nomatch element allows you to provide a reminder if the user said something other than a number:

<field name="travellers" type="number">

  <prompt count="1">
    How many are travelling to <value expr="city"/>?
  </prompt>

  <prompt count="2">
    Please tell me the number of people travelling.
  </prompt>

  <prompt count="3">
    To book a flight, you must tell me the number
    of people travelling to <value expr="city"/>.
  </prompt>

  <nomatch>
    <prompt>Please say just a number.</prompt>
    <reprompt/>
  </nomatch>

</field>

Here is an example that checks the value of a field after it has been collected. This is used to issue a warning when the number of travellers in the group is greater than twelve:

<field name="travellers" type="number">
  <prompt>How many are travelling to <value expr="city"/>?</prompt>

  <filled>
    <var name="num_travellers" expr="travellers + 0"/>
    <if cond="num_travellers > 12">
      <prompt>
        Sorry, we only handle groups of up to 12 people.
      </prompt>
      <clear namelist="travellers"/>
    </if>
  </filled>

</field>

VoiceXML allows you to define subdialogs that can be used for common tasks. Subdialogs are analogous to subroutines in programming languages. Here is an example of a confirmation subdialog that asks the user whether an earlier input was recognized correctly:

<form id="ynconfirm">
  <var name="user_input"/>

  <field name="yn" type="boolean">

    <prompt>Did you say <value expr="user_input"/></prompt>

    <filled>
      <var name="result" expr="false"/>
      <if cond="yn">
        <assign name="result" expr="true"/>
      </if>
      <return namelist="result"/>
    </filled>

  </field>

</form>

If the speech recognizer indicates that it wasn't quite sure of what the user said, VoiceXML allows you to tailor the dialog appropriately. In the following example, the user is asked for a confirmation if the confidence score for the city name is less than 0.7, but if it is less than 0.3, the user will be asked to say the city name again:

<field name="city">
  <prompt>Which city?</prompt>
  ...
  <filled>
    <!-- "<" must be written as &lt; inside an XML attribute value -->
    <if cond="city$.confidence &lt; 0.3">
      <prompt>Sorry, I didn't get that</prompt>
      <clear namelist="city"/>
    <elseif cond="city$.confidence &lt; 0.7"/>
      <assign name="utterance" expr="city$.utterance"/>
      <goto nextitem="confirmcity"/>
    </if>
  </filled>
</field>

<subdialog name="confirmcity" src="#ynconfirm" cond="false">
  <param name="user_input" expr="utterance"/>
  <filled>
    <if cond="confirmcity.result==false">
      <clear namelist="city"/>
    </if>
  </filled>
</subdialog>

If the confidence is less than 0.3, the user is told "Sorry, I didn't get that" and is then reprompted for the city name. If the confidence is at least 0.3 but less than 0.7, the generic confirmation subdialog is invoked. The subdialog element acts like a subroutine call. The param element is used to pass data to the subdialog.

You can also use grammars in separate files. The following example makes use of grammars in "trade.xml":

<form id="trader">

  <field name="company">
    <prompt>Which company do you want to trade?</prompt>
    <grammar src="trade.xml#company" type="application/grammar+xml"/>
  </field>

  <field name="action">
    <prompt>
      Do you want to buy or sell shares in
      <value expr="company"/>?
    </prompt>
    <grammar src="trade.xml#action" type="application/grammar+xml"/>
  </field>

</form>

You can use the import element to import grammar rules so that you can refer to them in locally defined grammars. The following example assumes that "politeness.xml" defines rules named "startPolite" (e.g. "please") and "endPolite" (e.g. "thank you"):

<grammar xml:lang="en">

  <import uri="http://please.com/politeness.xml" name="polite"/>

  <!-- Matches e.g. "please buy ericsson thank you" -->
  <rule name="command" scope="public">
    <ruleref import="polite#startPolite"/>
    <ruleref uri="#action"/>
    <ruleref uri="#company"/>
    <ruleref import="polite#endPolite"/>
  </rule>

  <rule name="action" scope="public">
    <choice>
      <item tag="buy"> buy </item>
      <item tag="sell"> sell </item>
    </choice>
  </rule>

  <rule name="company" scope="public">
    <choice>
      <item tag="ericsson"> ericsson </item>
      <item tag="nokia"> nokia </item>
    </choice>
  </rule>

</grammar>

In the following example for a stock trading application, the user can respond with a short phrase such as "buy ericsson" that sets both the company and the trade (buy or sell). The grammar for this is defined in the file "trade.xml". If the user fails to respond adequately, the application tries a simpler approach, prompting first for the company and then for the trade. A field element is skipped if the corresponding field value has already been filled.

<form id="trader">

  <grammar src="trade.xml#command" type="application/grammar+xml"/>

  <initial name="start">
    <prompt>What trade do you want to make?</prompt>
    <nomatch count="1">
      <prompt>Please say something like 'buy ericsson'</prompt>
      <reprompt/>
    </nomatch>
    <nomatch count="2">
      Sorry, I didn't understand your request. Let's try something simpler.
      <!-- Marking the initial item as filled moves on to the fields below -->
      <assign name="start" expr="true"/>
    </nomatch>
  </initial>

  <field name="company"> ... </field>
  <field name="action"> ... </field>
</form>

The application may give the user the chance to change to a different task by speaking the appropriate command. The grammar for this can be specified at the document level or in the application root document. Here is an example of a document level command menu:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">

<form id="trader">
 ...
</form>

<!-- next takes a literal URL; expr would be evaluated as ECMAScript -->
<menu id="portal-commands" scope="document">
  <choice next="http://www.wl.com?action=car">Car hire</choice>
  <choice next="http://www.wl.com?action=hotel">Hotel reservations</choice>
  <choice next="http://www.wl.com?action=news">Today's news</choice>
</menu>

...
</vxml>

To reference the application root document, you use the application attribute on the vxml element:

<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en"
   application="http://buster/portal?sessionID=12d4rf65hg4">

...
</vxml>

Here is an example of a root document that makes available a command for returning to the portal home page. The example also includes a handler for catching "noinput" events in case these haven't been caught by lower-level handlers, e.g. on each dialog:

<form id="portal-commands" scope="document">
  <field name="action">
    <grammar src="http://buster/portal/commands.xml"
             type="application/grammar+xml"/>
  </field>
  <block>
    <submit next="http://www.wl.com"/>
  </block>
</form>

<!-- Variable names must be ECMAScript identifiers, and expr is an
     ECMAScript expression, so the string needs its own quotes -->
<var name="portal_help" expr=
"'To return to your portal home, say home page, or press 0.'"/>

<catch event="noinput">
  Sorry, I didn't hear anything.
</catch>

Getting Further Information

W3C's specification for VoiceXML 2.0 is the authoritative specification for VoiceXML. For further information on the W3C Speech Interface Framework and related specifications, take a look at the W3C Voice Browser Activity. W3C Members can get access to the latest specs under development by the Voice Browser working group. Further tutorials and lots of other useful pointers can be found at the VoiceXML Forum's website.

I plan to add further sections on speech grammars and speech synthesis as well as commentaries on W3C's work on multimodal and other topics.

Best of luck and get writing!

Dave Raggett <dsr@w3.org>

Copyright © 2002-2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.
