Dave Raggett, revised 14th November 2001.
This is a short introduction to VoiceXML. It owes its heritage to a well-travelled set of slides by Scott McGlashan of HP. Scott co-chairs the W3C Voice Browser working group and leads the VoiceXML subgroup.
What is VoiceXML? It is an XML language for writing Web pages that you interact with by listening to spoken prompts and jingles, and that you control by means of spoken input. VoiceXML brings the Web to telephones. If you want to get a hands-on feeling for what this is like, there is an increasing number of voice portals that you can phone into and try out for yourself. Several sites also offer free hosting for VoiceXML. Some pointers to these sites can be found in the FAQ on the overview page.
VoiceXML isn't HTML. HTML was designed for visual Web pages and lacks the control over the user-application interaction that is needed for a speech-based interface. With speech you can only hear one thing at a time (kind of like looking at a newspaper through a 10x magnifying glass). VoiceXML has been carefully designed to give authors full control over the spoken dialog between the user and the application. The application and user take it in turns to speak: the application prompts the user, and the user in turn responds.
VoiceXML documents describe:
VoiceXML makes it easy to rapidly create new applications and shields developers from low-level implementation details. It separates user interaction from service logic. The W3C VoiceXML 2.0 specification is the definitive reference to VoiceXML. You can also find other related work on W3C's voice browser overview page.
A session begins when the user starts to interact with a VoiceXML interpreter and continues as VoiceXML documents are loaded and unloaded. The session ends when requested by the user, VoiceXML document or interpreter context. The platform defines the default session behavior, although this can be overridden in part by VoiceXML.
VoiceXML documents define applications as a set of named dialog states. The user is always in one dialog state at any time. Each dialog specifies the next dialog to transition to using a URL.
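As a minimal sketch of dialog states and URL transitions, the document below holds two dialogs and moves from one to the other with a fragment URL. The dialog ids and the example.com address are made up for illustration, and the xmlns and xml:lang attributes follow the final VoiceXML 2.0 Recommendation rather than anything stated in this tutorial.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- First dialog state: the interpreter starts with the first dialog in the document. -->
  <form id="greeting">
    <block>
      <prompt>Welcome.</prompt>
      <!-- A fragment identifier selects another dialog in this same document. -->
      <goto next="#goodbye"/>
    </block>
  </form>

  <!-- Second dialog state, reached through the URL "#goodbye". -->
  <form id="goodbye">
    <block>
      <prompt>Thank you for calling.</prompt>
      <!-- A full URL would move to a dialog in another document instead, e.g.
           next="http://www.example.com/next.vxml#start" (hypothetical address). -->
      <exit/>
    </block>
  </form>
</vxml>
```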
VoiceXML dialogs include forms and menus. A menu presents the user with a choice of options and then transitions to another dialog state based upon the user's selection. A form defines an interaction that collects values for each of the fields in the form. Each field may specify a prompt, the expected input, and evaluation rules. The form can be submitted to a server in much the same way as for HTML.
An application is a set of VoiceXML documents that share the same application root document. The root document is automatically loaded whenever one of the application documents is loaded, and remains loaded until there is a transition to a different application, or when the call is disconnected. The root document information is available to all documents in the same application.
Each dialog state has one or more grammars associated with it that describe the expected user input, either spoken input or touch-tone (DTMF) key presses. In the simplest case, only the dialog's grammars are active in that dialog. In more complex cases, other grammars can be active.
A subdialog is like a function call: it allows you to call out to a new dialog and then return to the original dialog, retaining the local state information for that dialog. Subdialogs can be used to handle confirmations and to create a library of re-usable dialogs for common tasks.
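One simple way to accept both spoken and touch-tone input is sketched below. It relies on the option element, which is assumed here to generate the field's grammar implicitly, with the dtmf attribute adding a key press for each choice. The field name and prompt wording are invented for illustration.

```xml
<field name="window_seat">
  <prompt>
    Do you want a window seat? Say yes or press 1. Say no or press 2.
  </prompt>
  <!-- Each option contributes a spoken phrase and a DTMF key to the
       grammar that is active while this field is being collected. -->
  <option dtmf="1" value="yes">yes</option>
  <option dtmf="2" value="no">no</option>
</field>
```

Whichever way the user answers, the field variable is set to the option's value attribute.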
VoiceXML allows you to define named variables for holding data. These can be defined at any level, and their scope follows an inheritance model. You can test the values of variables to determine which dialog state to transition to next. Variable expressions can also be used for conditional prompts, grammars and so on.
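The following sketch illustrates variables declared at two levels, a conditional prompt, and a test that picks the next dialog state. All variable names and dialog ids are assumptions made for this illustration, as are the xmlns and xml:lang attributes, which follow the final VoiceXML 2.0 Recommendation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- Document scope: visible to every dialog in this document. -->
  <var name="max_group" expr="12"/>

  <form id="booking">
    <!-- Dialog scope: visible only inside this form. -->
    <var name="greeted" expr="false"/>

    <block>
      <!-- Conditional prompt: spoken only while "greeted" is false. -->
      <prompt cond="!greeted">Welcome to group booking.</prompt>
      <assign name="greeted" expr="true"/>
    </block>

    <field name="travellers" type="number">
      <prompt>How many are travelling?</prompt>
      <filled>
        <!-- Test variable values to choose the next dialog state. -->
        <if cond="Number(travellers) > max_group">
          <goto next="#too_many"/>
        <else/>
          <goto next="#confirm"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="too_many">
    <block>Sorry, that group is too large.</block>
  </form>

  <form id="confirm">
    <block>Thank you, your group size has been noted.</block>
  </form>
</vxml>
```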
Events are thrown when the user fails to respond to a prompt, or when the input can't be understood. VoiceXML allows you to write handlers for catching events. These follow an inheritance model, and events can be caught at a higher level if there is no corresponding handler at the dialog level.
VoiceXML allows you to use scripting (ECMAScript) when you need additional control over the application. VoiceXML employs a form-filling metaphor. You can define a complex grammar for collecting the values of several fields in a single response. Any unfilled fields can be handled by special subdialogs defined inline within each dialog.
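A small sketch of the handler inheritance described above: the field supplies its own nomatch handler, while noinput events fall through to a handler defined at the document level. The dialog id, field name and prompt wording are illustrative assumptions, as are the xmlns and xml:lang attributes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <!-- Document-level handler: used when no handler closer to the
       current field catches the noinput event. -->
  <catch event="noinput">
    <prompt>Sorry, I did not hear anything.</prompt>
    <reprompt/>
  </catch>

  <form id="ask_city">
    <field name="city">
      <prompt>Which city?</prompt>
      <option>London</option>
      <option>Paris</option>
      <!-- Field-level handler: catches nomatch events thrown in this field. -->
      <nomatch>
        <prompt>Please say London or Paris.</prompt>
        <reprompt/>
      </nomatch>
    </field>
  </form>
</vxml>
```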
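As an illustration of scripting, the sketch below defines an ECMAScript function in a script element and calls it from a prompt. The function totalFare, the price of 100 and the dialog id are hypothetical and not part of the original tutorial.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en">
  <script>
    <![CDATA[
      // Hypothetical helper: total fare for a group at a fixed price per person.
      function totalFare(travellers, pricePerPerson) {
        return Number(travellers) * pricePerPerson;
      }
    ]]>
  </script>

  <form id="quote">
    <field name="travellers" type="number">
      <prompt>How many are travelling?</prompt>
    </field>
    <block>
      <!-- The expression is evaluated once the field has been filled. -->
      <prompt>
        The total fare is <value expr="totalFare(travellers, 100)"/> pounds.
      </prompt>
    </block>
  </form>
</vxml>
```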
Here is a very simple VoiceXML application. It says "Welcome to Travel Planner!", plays a short audio advertising jingle and then exits:
<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xml:lang="en">
  <form>
    <block>
      <prompt bargein="false">
        Welcome to Travel Planner!
        <audio src="http://www.adline.com/mobile?code=12s4"/>
      </prompt>
    </block>
  </form>
</vxml>
The following example offers a menu of three choices: sports, weather or news.
<?xml version="1.0"?>
<vxml version="2.0">
  <menu>
    <prompt>
      Say one of: <enumerate/>
    </prompt>
    <choice next="http://www.sports.example/start.vxml">
      Sports
    </choice>
    <choice next="http://www.weather.example/intro.vxml">
      Weather
    </choice>
    <choice next="http://www.news.example/news.vxml">
      News
    </choice>
    <noinput>Please say one of <enumerate/></noinput>
  </menu>
</vxml>
This dialog might proceed as follows:
Computer: | Say one of: Sports; Weather; News.
---|---
Human: | Astrology
Computer: | I did not understand what you said. (a platform-specific default message.)
Computer: | Say one of: Sports; Weather; News.
Human: | Sports
Computer: | (proceeds to http://www.sports.example/start.vxml)
Here is another example, this time using a form to ask the user for a destination city and the number of travellers. Once this information has been collected, it is submitted to a web server:
<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xml:lang="en">
  <form>
    <field name="city">
      <prompt>Where do you want to travel to?</prompt>
      <option>Edinburgh</option>
      <option>New York</option>
      <option>London</option>
      <option>Paris</option>
      <option>Stockholm</option>
    </field>
    <field name="travellers" type="number">
      <prompt>How many are travelling to <value expr="city"/>?</prompt>
    </field>
    <block>
      <submit next="http://localhost/handler" namelist="city travellers"/>
    </block>
  </form>
</vxml>
VoiceXML allows you to give progressively more detailed prompts when the user is having difficulty answering. This relies on a counter that increments each time the user is prompted for the field. The following example shows how this works for a field that collects the number of people travelling. The user is initially asked: "How many are travelling to Boston?". If this doesn't get a satisfactory answer, the user is then asked: "Please tell me the number of people travelling". The nomatch element allows you to provide a reminder if the user said something other than a number:
<field name="travellers" type="number">
  <prompt count="1">
    How many are travelling to <value expr="city"/>?
  </prompt>
  <prompt count="2">
    Please tell me the number of people travelling.
  </prompt>
  <prompt count="3">
    To book a flight, you must tell me the number
    of people travelling to <value expr="city"/>.
  </prompt>
  <nomatch>
    <prompt>Please say just a number.</prompt>
    <reprompt/>
  </nomatch>
</field>
Here is an example that checks the value of a field after it has been collected. This is used to issue a warning when the number of travellers in the group is greater than twelve:
<field name="travellers" type="number">
  <prompt>How many are travelling to <value expr="city"/>?</prompt>
  <filled>
    <!-- Number() gives a numeric value even if the platform returns a digit string -->
    <var name="num_travellers" expr="Number(travellers)"/>
    <if cond="num_travellers > 12">
      <prompt>
        Sorry, we only handle groups of up to 12 people.
      </prompt>
      <clear namelist="travellers"/>
    </if>
  </filled>
</field>
VoiceXML allows you to define subdialogs that can be used for common tasks. Subdialogs are analogous to subroutines in programming languages. Here is an example of a confirmation subdialog that asks the user whether an earlier input should be accepted:
<form id="ynconfirm">
  <var name="user_input"/>
  <field name="yn" type="boolean">
    <prompt>Did you say <value expr="user_input"/>?</prompt>
    <filled>
      <var name="result" expr="false"/>
      <if cond="yn">
        <assign name="result" expr="true"/>
      </if>
      <return namelist="result"/>
    </filled>
  </field>
</form>
If the speech recognizer indicates that it wasn't quite sure of what the user said, VoiceXML allows you to tailor the dialog appropriately. In the following example, the user is asked for a confirmation if the confidence score for the city name is less than 0.7, but if it is less than 0.3, the user is asked to say the city name again:
<field name="city">
  <prompt>Which city?</prompt>
  ...
  <filled>
    <if cond="city$.confidence &lt; 0.3">
      <prompt>Sorry, I didn't get that</prompt>
      <clear namelist="city"/>
    <elseif cond="city$.confidence &lt; 0.7"/>
      <assign name="utterance" expr="city$.utterance"/>
      <goto nextitem="confirmcity"/>
    </if>
  </filled>
</field>
<subdialog name="confirmcity" src="#ynconfirm" cond="false">
  <param name="user_input" expr="utterance"/>
  <filled>
    <if cond="confirmcity.result==false">
      <clear namelist="city"/>
    </if>
  </filled>
</subdialog>
If the confidence is less than 0.3, the user is told "Sorry, I didn't get that" and is then reprompted for the city name. Otherwise, if the confidence is less than 0.7, the generic confirmation subdialog is invoked. The subdialog element acts like a subroutine call. The param element is used to pass data to the subdialog.
You can also use grammars in separate files. The following example makes use of grammars in "trade.xml":
<form id="trader">
  <field name="company">
    <prompt>Which company do you want to trade?</prompt>
    <grammar src="trade.xml#company" type="application/grammar+xml"/>
  </field>
  <field name="action">
    <prompt>
      Do you want to buy or sell shares in <value expr="company"/>?
    </prompt>
    <grammar src="trade.xml#action" type="application/grammar+xml"/>
  </field>
</form>
You can use the import element to import grammar rules so that you can refer to them in locally defined grammars. In the following it is assumed that "politeness.xml" defines rules named "startPolite" (e.g. please) and "endPolite" (e.g. thank you):
<grammar xml:lang="en">
  <import uri="http://please.com/politeness.xml" name="polite"/>
  <rule name="command" scope="public">
    <ruleref import="polite#startPolite"/>
    <ruleref uri="#action"/>
    <!-- refers to the "company" rule defined below -->
    <ruleref uri="#company"/>
    <ruleref import="polite#endPolite"/>
  </rule>
  <rule name="action" scope="public">
    <choice>
      <item tag="buy"> buy </item>
      <item tag="sell"> sell </item>
    </choice>
  </rule>
  <rule name="company" scope="public">
    <choice>
      <item tag="ericsson"> ericsson </item>
      <item tag="nokia"> nokia </item>
    </choice>
  </rule>
</grammar>
In the following example for a stock trading application, the user can respond with a short phrase such as "buy ericsson" that sets both the company and the trade (buy or sell). The grammar for this is defined in the file "trade.xml". If the user fails to respond adequately, then the application tries a simpler approach, prompting first for the company and then for the trade. The field elements are skipped if the corresponding field value has already been filled.
<form id="trader">
  <grammar src="trade.xml#command" type="application/grammar+xml"/>
  <initial name="start">
    <prompt>What trade do you want to make?</prompt>
    <nomatch count="1">
      <prompt>Please say something like 'buy ericsson'</prompt>
      <reprompt/>
    </nomatch>
    <nomatch count="2">
      Sorry, I didn't understand your request.
      Let's try something simpler.
      <!-- Marking the initial item as filled lets the fields below prompt one at a time -->
      <assign name="start" expr="true"/>
    </nomatch>
  </initial>
  <field name="company">
    ...
  </field>
  <field name="action">
    ...
  </field>
</form>
The application may give the user the chance to change to a different task by speaking the appropriate command. The grammar for this can be specified at the document level or in the application root document. Here is an example of a document-level command menu:
<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xml:lang="en">
  <form id="trader">
    ...
  </form>
  <!-- next takes a literal URL; expr would be evaluated as an ECMAScript expression -->
  <menu id="portal-commands" scope="document">
    <choice next="http://www.wl.com?action=car">Car hire</choice>
    <choice next="http://www.wl.com?action=hotel">Hotel reservations</choice>
    <choice next="http://www.wl.com?action=news">Today's news</choice>
  </menu>
  ...
</vxml>
To reference the application root document, you use the application attribute on the vxml element:
<?xml version="1.0" encoding="ISO-8859-1"?>
<vxml version="2.0" xml:lang="en"
      application="http://buster/portal?sessionID=12d4rf65hg4">
  ...
</vxml>
Here is an example of a root document that makes available a command for returning to the portal home page. The example also includes a handler for catching "noinput" events in case these haven't been caught by lower-level handlers, e.g. on each dialog:
<form id="portal-commands" scope="document">
  <field name="action">
    <grammar src="http://buster/portal/commands.xml"
             type="application/grammar+xml"/>
  </field>
  <block>
    <submit next="http://www.wl.com"/>
  </block>
</form>
<!-- expr is an ECMAScript expression, so the help text is a quoted string
     and the variable name avoids the hyphen -->
<var name="portal_help"
     expr="'To return to your portal home, say home page, or press 0.'"/>
<catch event="noinput">
  Sorry, I didn't hear anything.
</catch>
W3C's specification for VoiceXML 2.0 is the authoritative specification for VoiceXML. For further information on the W3C Speech Interface Framework and related specifications, take a look at the W3C Voice Browser Activity. W3C Members can get access to the latest specs under development by the Voice Browser working group. Further tutorials and lots of other useful pointers can be found at the VoiceXML Forum's website.
I plan to add further sections on speech grammars and speech synthesis, as well as commentaries on W3C's work on multimodal and other topics.
Best of luck and get writing!