Copyright ©2001 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document specifies VoiceXML, the Voice Extensible Markup Language. VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications.
This specification describes markup for representing audio dialogs, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only). This document is for public review, and comments and discussion are welcomed on the public mailing list <www-voice@w3.org>. To subscribe, send an email to <www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online.
The proposed XML-based media types used in this specification have been submitted to the IETF for registration. Please note that during the registration process, the proposed media types may be modified or removed.
The Memorandum of Understanding between the W3C and the VoiceXML Forum has paved the way for the publication of this working draft, with the VoiceXML Forum committing to abandon trademark applications involving the name "VoiceXML".
This document seeks Member and public comment on both the technical design and the patent licensing issues arising out of the disclosure and licensing statements that have been made. Our decision to publish this first public working draft has been made to secure early comments from the community, but does not imply that all questions of patent licensing have been resolved or clarified. They must be resolved or work on this document in W3C will stop. As things stand at the time of publication of this specification, implementations conforming to this specification may require royalty bearing licenses for essential IPR. Further information can be found in the patent disclosures page. The patent policy for W3C as a whole is under wide discussion. A set of commitments by all participants in the Voice Browser Activity to royalty free is a possibility for the future but has NOT been made at time of publication.
This is a W3C Working Draft for review by W3C Members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".
This is work in progress and does not imply endorsement by the W3C membership. A list of current W3C Recommendations and other technical documents, including Working Drafts and Notes, can be found at http://www.w3.org/TR/.
In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in RFC 2119 and indicate requirement levels for compliant VoiceXML implementations.
This document defines VoiceXML, the Voice Extensible Markup Language. Its background, basic concepts and use are presented in Section 1. The dialog constructs of form, menu and link, and the mechanism (Form Interpretation Algorithm) by which they are interpreted are then introduced in Section 2. User input using DTMF and speech grammars is covered in Section 3, while Section 4 covers system output using speech synthesis and recorded audio. Mechanisms for manipulating dialog control flow, including variables, events, and executable elements, are explained in Section 5. Environment features such as parameters and properties as well as resource handling are specified in Section 6. The appendices provide additional information including the VoiceXML DTD, a detailed specification of the Form Interpretation Algorithm and timing, audio file formats, and statements relating to conformance, internationalization, accessibility and privacy.
Developers familiar with VoiceXML 1.0 are particularly directed to Changes from Previous Public Version, which summarizes how VoiceXML 2.0 differs from VoiceXML 1.0.
VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. Its major goal is to bring the advantages of web-based development and content delivery to interactive voice response applications.
Here are two short examples of VoiceXML. The first is the venerable "Hello World":
<?xml version="1.0"?>
<vxml version="2.0">
<form>
<block>Hello World!</block>
</form>
</vxml>
The top-level element is <vxml>, which is mainly a container for dialogs. There are two types of dialogs: forms and menus. Forms present information and gather input; menus offer choices of what to do next. This example has a single form, which contains a block that synthesizes and presents "Hello World!" to the user. Since the form does not specify a successor dialog, the conversation ends.
Our second example asks the user for a choice of drink and then submits it to a server script:
<?xml version="1.0"?>
<vxml version="2.0">
<form>
<field name="drink">
<prompt>Would you like coffee, tea, milk, or nothing?</prompt>
<grammar src="drink.grxml" type="application/grammar+xml"/>
</field>
<block>
<submit next="http://www.drink.example.com/drink2.asp"/>
</block>
</form>
</vxml>
A field is an input field. The user must provide a value for the field before proceeding to the next element in the form. A sample interaction is:
C (computer): Would you like coffee, tea, milk, or nothing?
H (human): Orange juice.
C: I did not understand what you said. (a platform-specific default message.)
C: Would you like coffee, tea, milk, or nothing?
H: Tea
C: (continues in document drink2.asp)
This section contains a high-level architectural model, whose terminology is then used to describe the goals of VoiceXML, its scope, its design principles, and the requirements it places on the systems that support it.
The architectural model assumed by this document has the following components:

Figure 1: Architectural Model
A document server (e.g. a web server) processes requests from a client application, the VoiceXML Interpreter, through the VoiceXML interpreter context. The server produces VoiceXML documents in reply, which are processed by the VoiceXML Interpreter. The VoiceXML interpreter context may monitor user inputs in parallel with the VoiceXML interpreter. For example, one VoiceXML interpreter context may always listen for a special escape phrase that takes the user to a high-level personal assistant, and another may listen for escape phrases that alter user preferences like volume or text-to-speech characteristics.
The implementation platform is controlled by the VoiceXML interpreter context and by the VoiceXML interpreter. For instance, in an interactive voice response application, the VoiceXML interpreter context may be responsible for detecting an incoming call, acquiring the initial VoiceXML document, and answering the call, while the VoiceXML interpreter conducts the dialog after answer. The implementation platform generates events in response to user actions (e.g. spoken or character input received, disconnect) and system events (e.g. timer expiration). Some of these events are acted upon by the VoiceXML interpreter itself, as specified by the VoiceXML document, while others are acted upon by the VoiceXML interpreter context.
VoiceXML’s main goal is to bring the full power of web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs. A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server replies with another VoiceXML document to continue the user’s session with other dialogs.
VoiceXML is a markup language that:
Minimizes client/server interactions by specifying multiple interactions per document.
Shields application authors from low-level and platform-specific details.
Separates user interaction code (in VoiceXML) from service logic (CGI scripts).
Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.
Is easy to use for simple interactions, and yet provides language features to support complex dialogs.
While VoiceXML strives to accommodate the requirements of a majority of voice response services, services with stringent requirements may best be served by dedicated applications that employ a finer level of control.
The language describes the human-machine interaction provided by voice response systems, which includes:
Output of synthesized speech (text-to-speech).
Output of audio files.
Recognition of spoken input.
Recognition of DTMF input.
Recording of spoken input.
Control of dialog flow.
Telephony features such as call transfer and disconnect.
The language provides means for collecting character and/or spoken input, assigning the input to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Universal Resource Identifiers (URIs).
VoiceXML is an XML application. For details about XML, refer to the Annotated XML Specification.
The language promotes portability of services through abstraction of platform resources.
The language accommodates platform diversity in supported audio file formats, speech grammar formats, and URI schemes. While producers of platforms may support various grammar formats, the language requires a common grammar format, namely the XML Form of the W3C Speech Recognition Grammar Format, to facilitate interoperability. Similarly, while various audio formats for playback and recording may be supported, the audio formats described in Appendix E must be supported.
The language supports ease of authoring for common types of interactions.
The language has a well-defined semantics that preserves the author's intent regarding the behavior of interactions with the user. Client heuristics are not required to determine document element interpretation.
The language has a control flow mechanism.
The language enables a separation of service logic from interaction behavior.
It is not intended for intensive computation, database operations, or legacy system operations. These are assumed to be handled by resources outside the document interpreter, e.g. a document server.
General service logic, state management, dialog generation, and dialog sequencing are assumed to reside outside the document interpreter.
The language provides ways to link documents using URIs, and also to submit data to server scripts using URIs.
VoiceXML provides ways to identify exactly which data to submit to the server, and which HTTP method (get or post) to use in the submittal.
The language does not require document authors to explicitly allocate and deallocate dialog resources, or deal with concurrency. Resource allocation and concurrent threads of control are to be handled by the implementation platform.
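The principle that a document identifies exactly which data to submit, and with which HTTP method, can be sketched as follows. This is a minimal illustration rather than a normative example: the second field and the grammar file size.grxml are hypothetical, and the server URI is borrowed from the drink example earlier in this document.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<form id="order">
<field name="drink">
<prompt>Would you like coffee, tea, milk, or nothing?</prompt>
<grammar src="drink.grxml" type="application/grammar+xml"/>
</field>
<!-- Hypothetical second field, added only for illustration. -->
<field name="size">
<prompt>Small, medium, or large?</prompt>
<grammar src="size.grxml" type="application/grammar+xml"/>
</field>
<block>
<!-- Only the variables named in "namelist" are sent, using HTTP POST. -->
<submit next="http://www.drink.example.com/drink2.asp"
method="post" namelist="drink size"/>
</block>
</form>
</vxml>
```

Naming the variables explicitly keeps the request independent of any other variables the form may declare.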
This section outlines the requirements on the hardware/software platforms that will support a VoiceXML interpreter.
Document acquisition. The interpreter context is expected to acquire documents for the VoiceXML interpreter to act on. The "http" URI protocol must be supported. In some cases, the document request is generated by the interpretation of a VoiceXML document, while other requests are generated by the interpreter context in response to events outside the scope of the language, for example an incoming phone call. When issuing document requests via http, the interpreter context identifies itself using the "User-Agent" header variable with the value "<name>/<version>", for example, "acme-browser/1.2".
Audio output. An implementation platform must support audio output using audio files and text-to-speech (TTS). The platform must be able to freely sequence TTS and audio output. Audio files are referred to by a URI. The language specifies a required set of audio file formats which must be supported (see Appendix E); additional audio file formats may also be supported.
Audio input. An implementation platform is required to detect and report character and/or spoken input simultaneously and to control input detection interval duration with a timer whose length is specified by a VoiceXML document.
It must report characters (for example, DTMF) entered by a user. Platforms must support the DTMF grammar format described in XML Form of the W3C Speech Recognition Grammar Format. They should also support the DTMF grammar format described in Augmented BNF Form of the W3C Speech Recognition Grammar Format.
It must be able to receive speech recognition grammar data dynamically. It must be able to use speech grammar data in the XML Form of the W3C Speech Recognition Grammar Format. It should be able to receive speech recognition grammar data in the Augmented BNF Form of the W3C Speech Recognition Grammar Format, and may support other formats such as the JSpeech Grammar Format or proprietary formats. Some VoiceXML elements contain speech grammar data; others refer to speech grammar data through a URI. The speech recognizer must be able to accommodate dynamic update of the spoken input for which it is listening through either method of speech grammar data specification.
It must be able to record audio received from the user. The implementation platform must be able to make the recording available to a request variable. The language specifies a required set of recorded audio file formats which must be supported (see Appendix E); additional formats may also be supported.
Transfer. The platform should be able to support making a third party connection through a communications network, such as the telephone.
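From the author's point of view, the audio output and recording requirements above might look like the following sketch. The file name chime.wav, the script path /cgi-bin/save_message, and the attribute values are assumptions made for illustration only; the attributes shown (beep, maxtime, enctype) are taken to be supported as described in the element sections of this specification.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<form id="leave_message">
<block>
<!-- An audio file and synthesized speech sequenced in one prompt.
The text inside <audio> is an alternative for when the file
cannot be played. -->
<prompt>
<audio src="chime.wav">Welcome.</audio>
Please leave a message after the beep.
</prompt>
</block>
<!-- The recording is made available through the variable "message". -->
<record name="message" beep="true" maxtime="10s"/>
<block>
<submit next="/cgi-bin/save_message" method="post"
enctype="multipart/form-data" namelist="message"/>
</block>
</form>
</vxml>
```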
A VoiceXML document (or a set of documents called an application) forms a conversational finite state machine. The user is always in one conversational state, or dialog, at a time. Each dialog determines the next dialog to transition to. Transitions are specified using URIs, which define the next document and dialog to use. If a URI does not refer to a document, the current document is assumed. If it does not refer to a dialog, the first dialog in the document is assumed. Execution is terminated when a dialog does not specify a successor, or if it has an element that explicitly exits the conversation.
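The two defaulting rules for transition URIs can be illustrated with a small sketch. The document name main_menu.vxml and the dialog names are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<form id="start">
<field name="again" type="boolean">
<prompt>Would you like to hear the main menu?</prompt>
</field>
<block>
<if cond="again">
<!-- The URI names only a document: its first dialog is assumed. -->
<goto next="main_menu.vxml"/>
<else/>
<!-- The URI names only a dialog: the current document is assumed. -->
<goto next="#goodbye"/>
</if>
</block>
</form>
<form id="goodbye">
<!-- <exit/> explicitly ends the conversation. -->
<block>Goodbye! <exit/></block>
</form>
</vxml>
```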
There are two kinds of dialogs: forms and menus. Forms define an interaction that collects values for a set of field item variables. Each field may specify a grammar that defines the allowable inputs for that field. If a form-level grammar is present, it can be used to fill several fields from one utterance. A menu presents the user with a choice of options and then transitions to another dialog based on that choice.
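Since the examples so far show only forms, a menu might be sketched as follows. The destination URIs are placeholders, and the elements used (<choice>, <enumerate>, <noinput>) are those listed in the element table of this document.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<menu>
<!-- <enumerate/> expands to a spoken list of the choices. -->
<prompt>Welcome. Say one of: <enumerate/></prompt>
<choice next="http://www.sports.example.com/vxml/start.vxml">
Sports
</choice>
<choice next="http://www.weather.example.com/intro.vxml">
Weather
</choice>
<noinput>Please say one of <enumerate/></noinput>
</menu>
</vxml>
```

Each <choice> pairs a phrase the user may say with the dialog to transition to when it is chosen.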
A subdialog is like a function call, in that it provides a mechanism for invoking a new interaction, and returning to the original form. Variable instances, grammars, and state information are saved and are available upon returning to the calling document. Subdialogs can be used, for example, to create a confirmation sequence that may require a database query; to create a set of components that may be shared among documents in a single application; or to create a reusable library of dialogs shared among many applications.
A session begins when the user starts to interact with a VoiceXML interpreter context, continues as documents are loaded and processed, and ends when requested by the user, a document, or the interpreter context.
An application is a set of documents sharing the same application root document. Whenever the user interacts with a document in an application, its application root document is also loaded. The application root document remains loaded while the user is transitioning between other documents in the same application, and it is unloaded when the user transitions to a document that is not in the application. While it is loaded, the application root document’s variables are available to the other documents as application variables, and its grammars can also be set to remain active for the duration of the application.
Figure 2 shows the transition of documents (D) in an application that share a common application root document (root).

Figure 2: Transitioning between documents in an application.
Each dialog has one or more speech and/or DTMF grammars associated with it. In machine directed applications, each dialog’s grammars are active only when the user is in that dialog. In mixed initiative applications, where the user and the machine alternate in determining what to do next, some of the dialogs are flagged to make their grammars active (i.e., listened for) even when the user is in another dialog in the same document, or on another loaded document in the same application. In this situation, if the user says something matching another dialog’s active grammars, execution transitions to that other dialog, with the user’s utterance treated as if it were said in that dialog. Mixed initiative adds flexibility and power to voice applications.
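A rough sketch of a dialog flagged in this way is given below, using the scope attribute of <form>. The grammar files and the script path /cgi-bin/weather are hypothetical. While the user is in the "order" form, an utterance matching the form-level grammar of the "weather" form would move execution to that form.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<form id="order">
<field name="drink">
<prompt>Would you like coffee, tea, milk, or nothing?</prompt>
<grammar src="drink.grxml" type="application/grammar+xml"/>
</field>
</form>
<!-- scope="document" keeps this form's grammars active in every
dialog of the document, not only while the form itself is active. -->
<form id="weather" scope="document">
<grammar src="weather.grxml" type="application/grammar+xml"/>
<field name="city">
<prompt>For which city?</prompt>
<grammar src="city.grxml" type="application/grammar+xml"/>
</field>
<block>
<submit next="/cgi-bin/weather" namelist="city"/>
</block>
</form>
</vxml>
```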
VoiceXML provides a form-filling mechanism for handling "normal" user input. In addition, VoiceXML defines a mechanism for handling events not covered by the form mechanism.
Events are thrown by the platform under a variety of circumstances, such as when the user does not respond, doesn't respond intelligibly, requests help, etc. The interpreter also throws events if it finds a semantic error in a VoiceXML document. Events are caught by catch elements or their syntactic shorthand. Each element in which an event can occur may specify catch elements. Catch elements are also inherited from enclosing elements "as if by copy". In this way, common event handling behavior can be specified at any level, and it applies to all lower levels.
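The inheritance of catch elements can be sketched as follows. The handlers at document level apply to every dialog in the document, while the <nomatch> handler inside the field applies only to that field. The wording of the messages is illustrative.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<!-- Document-level handlers: inherited by every dialog below. -->
<catch event="error.semantic">
Sorry, something went wrong.
<exit/>
</catch>
<noinput>I did not hear anything. <reprompt/></noinput>
<form id="order">
<field name="drink">
<prompt>Would you like coffee, tea, milk, or nothing?</prompt>
<grammar src="drink.grxml" type="application/grammar+xml"/>
<!-- Field-level handler: applies only to this field. -->
<nomatch>Please say coffee, tea, milk, or nothing. <reprompt/></nomatch>
</field>
</form>
</vxml>
```

Here <noinput> and <nomatch> are the shorthand forms of <catch> for the corresponding events.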
A link supports mixed initiative. It specifies a grammar that is active whenever the user is in the scope of the link. If user input matches the link’s grammar, control transfers to the link’s destination URI. A link can be used to throw an event or to go to a destination URI.
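A link that throws an event, rather than naming a destination, might look like the sketch below. The grammar file help.grxml is hypothetical; the event attribute is assumed to behave as described in the section on <link>.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<!-- Document-level link: its grammar is active in every dialog here.
A match throws the "help" event instead of going to a URI. -->
<link event="help">
<grammar src="help.grxml" type="application/grammar+xml"/>
</link>
<form id="order">
<field name="drink">
<prompt>Would you like coffee, tea, milk, or nothing?</prompt>
<grammar src="drink.grxml" type="application/grammar+xml"/>
<help>Say the name of the drink you want. <reprompt/></help>
</field>
</form>
</vxml>
```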
| Element | Purpose | Section |
|---|---|---|
| <assign> | Assign a variable a value | 5.3.2 |
| <audio> | Play an audio clip within a prompt | 4.1.3 |
| <block> | A container of (non-interactive) executable code | 2.3.1 |
| <catch> | Catch an event | 5.2.2 |
| <choice> | Define a menu item | 2.2 |
| <clear> | Clear one or more form item variables | 5.3.3 |
| <disconnect> | Disconnect a session | 5.3.11 |
| <else> | Used in <if> elements | 5.3.4 |
| <elseif> | Used in <if> elements | 5.3.4 |
| <enumerate> | Shorthand for enumerating the choices in a menu | 2.2 |
| <error> | Catch an error event | 5.2.3 |
| <exit> | Exit a session | 5.3.9 |
| <field> | Declares an input field in a form | 2.3.1 |
| <filled> | An action executed when fields are filled | 2.4 |
| <form> | A dialog for presenting information and collecting data | 2.1 |
| <goto> | Go to another dialog in the same or different document | 5.3.7 |
| <grammar> | Specify a speech recognition or DTMF grammar | 3.1 |
| <help> | Catch a help event | 5.2.3 |
| <if> | Simple conditional logic | 5.3.4 |
| <initial> | Declares initial logic upon entry into a (mixed-initiative) form | 2.3.3 |
| <link> | Specify a transition common to all dialogs in the link’s scope | 2.5 |
| <log> | Generate a debug message | 5.3.13 |
| <menu> | A dialog for choosing amongst alternative destinations | 2.2 |
| <meta> | Define a meta data item as a name/value pair | 6.2 |
| <noinput> | Catch a noinput event | 5.2.3 |
| <nomatch> | Catch a nomatch event | 5.2.3 |
| <object> | Interact with a custom extension | 2.3.5 |
| <option> | Specify an option in a <field> | 2.3 |
| <param> | Parameter in <object> or <subdialog> | 6.4 |
| <prompt> | Queue speech synthesis and audio output to the user | 4.1 |
| <property> | Control implementation platform settings | 6.3 |
| <record> | Record an audio sample | 2.3.6 |
| <reprompt> | Play a field prompt when a field is re-visited after an event | 5.3.6 |
| <return> | Return from a subdialog | 5.3.10 |
| <script> | Specify a block of ECMAScript client-side scripting logic | 5.3.12 |
| <subdialog> | Invoke another dialog as a subdialog of the current one | 2.3.4 |
| <submit> | Submit values to a document server | 5.3.8 |
| <throw> | Throw an event | 5.2.1 |
| <transfer> | Transfer the caller to another destination | 2.3.7 |
| <value> | Insert the value of an expression in a prompt | 4.1.4 |
| <var> | Declare a variable | 5.3.1 |
| <vxml> | Top-level element in each VoiceXML document | 1.5.1 |
A VoiceXML document is primarily composed of top-level elements called dialogs. There are two types of dialogs: forms and menus. A document may also have <meta> elements, <var> and <script> elements, <property> elements, <catch> elements, and <link> elements.
Document execution begins at the first dialog by default. As each dialog executes, it determines the next dialog. When a dialog doesn’t specify a successor dialog, document execution stops.
Here is "Hello World!" expanded to illustrate some of this. It now has a document level variable called "hi" which holds the greeting. Its value is used as the prompt in the first form. Once the first form plays the greeting, it goes to the form named "say_goodbye", which prompts the user with "Goodbye!" Because the second form does not transition to another dialog, it causes the document to be exited.
<?xml version="1.0"?>
<vxml version="2.0">
<meta name="author" content="John Doe"/>
<meta name="maintainer" content="hello-support@hi.example.com"/>
<var name="hi" expr="'Hello World!'"/>
<form>
<block>
<value expr="hi"/>
<goto next="#say_goodbye"/>
</block>
</form>
<form id="say_goodbye">
<block>
Goodbye!
</block>
</form>
</vxml>
Stylistically it is best to combine the forms:
<?xml version="1.0"?>
<vxml version="2.0">
<meta name="author" content="John Doe"/>
<meta name="maintainer" content="hello-support@hi.example.com"/>
<var name="hi" expr="'Hello World!'"/>
<form>
<block>
<value expr="hi"/> Goodbye!
</block>
</form>
</vxml>
Attributes of <vxml> include:
| Attribute | Description |
|---|---|
| version | The version of VoiceXML of this document (required). The current version number is 2.0. |
| base | The base URI. As in HTML, an absolute URI which all relative references within the document take as their base. |
| xml:lang | The language and locale type for this document as defined in RFC 1766. If omitted, the value is a platform-specific default. |
| application | The URI of this document’s application root document, if any. |
Language information is inherited down the document hierarchy: the value of "xml:lang" is inherited by elements which also define the "xml:lang" attribute, such as <grammar> and <prompt>, unless these elements specify an alternative value.
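As an illustration of this inheritance, the sketch below sets a document-wide language and overrides it on a single prompt. The language tags are examples only.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xml:lang="en-US">
<form>
<block>
<!-- Inherits en-US from the enclosing <vxml> element. -->
<prompt>Good morning.</prompt>
<!-- Overrides the inherited value for this prompt only. -->
<prompt xml:lang="fr-FR">Bonjour.</prompt>
</block>
</form>
</vxml>
```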
Normally, each document runs as an isolated application. In cases where you want multiple documents to work together as one application, you select one document to be the application root document, and refer to it in the other documents’ <vxml> elements.
When this is done, every time the interpreter is told to load a document in this application, it also loads the application root document if it is not already loaded. The application root document remains loaded until the interpreter is told to load a document that belongs to a different application. Thus one of the following two conditions always holds during interpretation:
The application root document is loaded and the user is executing in it.
The application root document and one other document in the application, known as an application leaf document, are both loaded and the user is executing in the leaf document.
If there is a chain of subdialogs defined in separate documents, then there may be more than one leaf document loaded although execution will only be in one of these documents.
There are several benefits to multi-document applications. First, the application root document’s variables are available for use by the other documents in the application, so that information can be shared and retained. Second, the grammars of the application root document are active even when the user is in other application documents, so that the user can always interact with forms, links, and menus from the application root document. Third, catch elements of the application root document can define default event handling for all documents within the application. Fourth, property elements in the application root document can specify default values for properties used throughout the application.
Here is a two-document application illustrating this:
Application root document (app-root.vxml)
<?xml version="1.0"?>
<vxml version="2.0">
<var name="bye" expr="'Ciao'"/>
<link next="operator_xfer.vxml">
<grammar>
<rule id="root" scope="public">operator</rule>
</grammar>
</link>
</vxml>
Leaf document (leaf.vxml)
<?xml version="1.0"?>
<vxml version="2.0" application="app-root.vxml">
<form id="say_goodbye">
<field name="answer" type="boolean">
<prompt>Shall we say <value expr="application.bye"/>?</prompt>
<filled>
<if cond="answer">
<exit/>
</if>
<clear namelist="answer"/>
</filled>
</field>
</form>
</vxml>
In this example, the application is designed so that leaf.vxml must be loaded first. Its application attribute specifies that app-root.vxml should be used as the application root document. So, app-root.vxml is then loaded, which creates the application variable bye and also defines a link that navigates to operator_xfer.vxml whenever the user says "operator". The user starts out in the say_goodbye form:
C: Shall we say Ciao?
H: Si.
C: I did not understand what you said. (a platform-specific default message.)
C: Shall we say Ciao?
H: Ciao
C: I did not understand what you said.
H: Operator.
C: (Goes to operator_xfer.vxml, which transfers the caller to a human operator.)
Note that when the user is in a multi-document application, at most two documents are loaded at any one time: the application root document and, unless the user is actually interacting with the application root document, an application leaf document. A root document's <vxml> element does not have an application attribute specified. A leaf document's <vxml> element does have an application attribute specified. An interpreter always has an application root document loaded; it does not always have an application leaf document loaded.
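The third and fourth benefits of multi-document applications listed earlier (default event handling and default property values) could be sketched by extending app-root.vxml as follows. The property name "timeout" and its value are assumptions made for illustration; properties are specified in Section 6.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<var name="bye" expr="'Ciao'"/>
<!-- Assumed property: a default that applies to every document
in the application unless a leaf document overrides it. -->
<property name="timeout" value="5s"/>
<!-- Default event handling inherited by leaf documents. -->
<nomatch>Sorry, I did not catch that. <reprompt/></nomatch>
<link next="operator_xfer.vxml">
<grammar>
<rule id="root" scope="public">operator</rule>
</grammar>
</link>
</vxml>
```

With this root document, leaf.vxml would use the <nomatch> message above in place of the platform-specific default.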
The name of the interpreter's current application is the application root document's absolute URI. The absolute URI includes a query string, if present, but it does not include a fragment identifier. The interpreter remains in the same application as long as the name remains the same. When the name changes, a new application is entered and its root context is initialized. The application's root context consists of the variables, grammars, catch elements, and properties in application scope.
During a user session an interpreter transitions from one document to another as requested by <choice>, <goto>, <link>, <subdialog>, and <submit> elements. Some transitions are within an application, others are between applications. The preservation or initialization of the root context depends on the type of transition:
If a document refers to a non-existent application root document, or if a document's application attribute refers to a document that also has an application attribute specified, an error.semantic event is thrown.
The following diagrams illustrate the effect of the transitions between root and leaf documents on the application root context. In these diagrams, boxes represent documents, changes in box texture identify root context initialization, solid arrows symbolize transitions to the URI in the arrow's label, and dashed vertical arrows indicate an application attribute whose URI is the arrow's label.

Figure 3: Transitions that Preserve the Root Context
In this diagram, all the documents belong to the same application. The transitions are identified by the numbers 1-4 across the top of the figure. They are:
The next diagram illustrates transitions which initialize the root context.

Figure 4: Transitions that Initialize the Root Context
A subdialog is a mechanism for decomposing complex sequences of dialogs to better structure them, or to create reusable components. For example, the solicitation of account information may involve gathering several pieces of information, such as account number and home telephone number. A customer care service might be structured with several independent applications that could share this basic building block, so it would be reasonable to construct it as a subdialog. This is illustrated in the example below. The first document, app.vxml, seeks to adjust a customer’s account, and in doing so must get the account information and then the adjustment level. The account information is obtained by using a <subdialog> element that invokes another VoiceXML document to solicit the user input. While the second document is being executed, the calling dialog is suspended, awaiting the return of information. The second document provides the results of its user interactions using a <return> element, and the resulting values are accessed through the variable defined by the name attribute on the <subdialog> element.
Customer Service Application (app.vxml)
<?xml version="1.0"?>
<vxml version="2.0">
<form id="billing_adjustment">
<var name="account_number"/>
<var name="home_phone"/>
<subdialog name="accountinfo" src="acct_info.vxml#basic">
<filled>
<!-- Note the variable defined by "accountinfo" is
returned as an ECMAScript object and it contains two
properties defined by the variables specified in the
"return" element of the subdialog. -->
<assign name="account_number" expr="accountinfo.acctnum"/>
<assign name="home_phone" expr="accountinfo.acctphone"/>
</filled>
</subdialog>
<field name="adjustment_amount" type="currency">
<prompt>
What is the value of your account adjustment?
</prompt>
<filled>
<submit next="/cgi-bin/updateaccount"/>
</filled>
</field>
</form>
</vxml>
Document Containing Account Information Subdialog (acct_info.vxml)
<?xml version="1.0"?>
<vxml version="2.0">
<form id="basic">
<field name="acctnum" type="digits">
<prompt> What is your account number? </prompt>
</field>
<field name="acctphone" type="phone">
<prompt> What is your home telephone number? </prompt>
<filled>
<!-- The values obtained by the two fields are supplied
to the calling dialog by the "return" element. -->
<return namelist="acctnum acctphone"/>
</filled>
</field>
</form>
</vxml>
Subdialogs add a new execution context when they are invoked. The subdialog could be a new dialog within the existing document, or a new dialog within a new document.
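The first case, a subdialog that is a dialog within the existing document, might be sketched as follows. The dialog names, the parameter "item", and the use of <param> to pass a value into the subdialog are illustrative assumptions; <param> is specified in Section 6.4.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
<form id="order">
<field name="drink">
<prompt>Would you like coffee, tea, milk, or nothing?</prompt>
<grammar src="drink.grxml" type="application/grammar+xml"/>
</field>
<!-- Invokes the "confirm" dialog of this same document. -->
<subdialog name="check" src="#confirm">
<param name="item" expr="drink"/>
<filled>
<!-- If the user rejects the value, ask for the drink again. -->
<if cond="!check.ok">
<clear namelist="drink check"/>
</if>
</filled>
</subdialog>
<block>
<submit next="http://www.drink.example.com/drink2.asp" namelist="drink"/>
</block>
</form>
<form id="confirm">
<!-- Receives the value passed by <param> in the calling dialog. -->
<var name="item"/>
<field name="ok" type="boolean">
<prompt>You said <value expr="item"/>. Is that right?</prompt>
<filled>
<return namelist="ok"/>
</filled>
</field>
</form>
</vxml>
```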
Subdialogs can be composed of several documents. Figure 5 shows the execution flow where a sequence of documents (D) transitions to a subdialog (SD) and then back.

Figure 5: Subdialog composed of several documents, returning from the last subdialog document.
The execution context in dialog D2 is suspended when it invokes the subdialog SD1 in document sd1.vxml. This subdialog specifies that execution is to be transferred to the dialog in sd2.vxml (using <goto>). Consequently, when the dialog in sd2.vxml returns, control is returned directly to dialog D2.
Figure 6 shows an example of a multi-document subdialog where control is transferred from one subdialog to another.

Figure 6: Subdialog composed of several documents, returning from the first subdialog document.
The subdialog in sd1.vxml specifies that control is to be transferred to a second subdialog, SD2, in sd2.vxml. When executing SD2, there are two suspended contexts: the dialog context in D2 is suspended awaiting SD1 to return; and the dialog context in SD1 is suspended awaiting SD2 to return. When SD2 returns, control is returned to SD1, which in turn returns control to dialog D2.
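The Figure 6 flow might be sketched as follows. This is only an illustration under stated assumptions: the file names sd1.vxml and sd2.vxml come from the text above, but the form ids, the field name acctnum, the variable sd2result, and the pin property assumed to be returned by SD2 are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- sd1.vxml: dialog SD1, invoked as a subdialog by D2. -->
  <form id="sd1">
    <field name="acctnum" type="digits">
      <prompt>What is your account number?</prompt>
    </field>
    <!-- Visiting this item suspends SD1 and starts SD2 in sd2.vxml.
         Two contexts are now suspended: D2 and SD1. -->
    <subdialog name="sd2result" src="sd2.vxml#sd2">
      <filled>
        <!-- SD2 has returned to SD1. Copy the (assumed) pin property
             out of its result object, then return control to D2. -->
        <var name="pin" expr="sd2result.pin"/>
        <return namelist="acctnum pin"/>
      </filled>
    </subdialog>
  </form>
</vxml>
```

In this sketch the dialog in D2 would see a single result object with the properties acctnum and pin, regardless of how many documents were involved in producing them.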
Forms are the key component of VoiceXML documents. A form contains:
A set of form items, elements that are visited in the main loop of the form interpretation algorithm. Form items are subdivided into field items, those that define the form’s field item variables, and control items, those that help control the gathering of the form’s fields.
Declarations of non-field item variables.
Event handlers.
"Filled" actions, blocks of procedural logic that execute when certain combinations of field items are filled in.
Form attributes are:
| id | The name of the form. If specified, the form can be referenced within the document or from another document. For instance <form id="weather">, <goto next="#weather">. |
|---|---|
| scope | The default scope of the form’s grammars. If it is dialog then the form grammars are active only in the form. If the scope is document, then the form grammars are active during any dialog in the same document. If the scope is document and the document is an application root document, then the form grammars are active during any dialog in any document of this application. Note that the scope of individual form grammars takes precedence over the default scope; for example, in a non-root document, if a form has the default scope "dialog" and one of its form grammars has the scope "document", then that grammar is active in any dialog in the document. |
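As a rough illustration of the two ways a form grammar can obtain document scope, consider the sketch below. The form ids, field names, and grammar file names are hypothetical and chosen only for this example.

```xml
<?xml version="1.0"?>
<vxml version="2.0">
  <!-- scope="document": this form's grammars stay active while the
       user is in any other dialog of this document. -->
  <form id="car_rental" scope="document">
    <grammar src="car_rental.grxml" type="application/grammar+xml"/>
    <field name="car_class">
      <prompt>What class of car would you like?</prompt>
      <grammar src="car_class.grxml" type="application/grammar+xml"/>
    </field>
  </form>

  <!-- The form keeps the default scope "dialog", but the form grammar
       sets scope="document" for itself and so overrides the default. -->
  <form id="hotel">
    <grammar src="hotel.grxml" type="application/grammar+xml"
             scope="document"/>
    <field name="nights" type="number">
      <prompt>How many nights will you stay?</prompt>
    </field>
  </form>
</vxml>
```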
This section describes some of the concepts behind forms, and then gives some detailed examples of their operation.
Forms are interpreted by an implicit form interpretation algorithm (FIA). The FIA has a main loop that repeatedly selects a form item and then visits it. The selected form item is the first in document order whose guard condition is not satisfied. For instance, a field item’s default guard condition tests to see if the field item variable has a value, so that if a simple form contains only field items, the user will be prompted for each field item in turn.
Interpreting a form item generally involves:
Selecting and playing one or more prompts;
Collecting a user input, either a response that fills in one or more fields, or a throwing of some event (help, for instance); and
Interpreting any <filled> actions that pertain to the newly filled in fields.
The FIA ends when it interprets a transfer of control statement (e.g. a <goto> to another dialog or document, or a <submit> of data to the document server). It also ends with an implied <exit> when no form item remains eligible to select.
The FIA is described in more detail in Section 2.1.6.
A form’s form items are the elements that can be visited in the main loop of the form interpretation algorithm. Field items direct the FIA to gather a specific field. When the FIA selects a control item, the control item may contain a block of procedural code to execute, or it may tell the FIA to set up the initial prompt-and-collect for a mixed initiative form.
A field item specifies a field item variable to gather from the user. Field items have prompts to tell the user what to say or key in, grammars that define the allowed inputs, and event handlers that process any resulting events. A field item may also have a <filled> element that defines an action to take just after the field item variable is filled in. Field items are subdivided into:
| <field> | A field item whose value is obtained via ASR or DTMF grammars. |
|---|---|
| <record> | A field item whose value is an audio clip recorded by the user. A <record> element could collect a voice mail message, for instance. |
| <transfer> | A field item which transfers the user to another telephone number. If the transfer returns control, the field variable will be set to the result status. |
| <object> | This field item invokes a platform-specific "object" with various parameters. The result of the platform object is an ECMAScript Object with one or more properties. One platform object could be a builtin dialog that gathers credit card information. Another could gather a text message using some proprietary DTMF text entry method. There is no requirement for implementations to provide platform-specific objects, although implementations must handle the <object> element by throwing error.unsupported.object if the particular platform-specific object is not supported. |
| <subdialog> | A <subdialog> field item is roughly like a function call. It invokes another dialog on the current page, or invokes another VoiceXML document. It returns an ECMAScript Object as its result. |
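A form can mix several kinds of field item. The following sketch, which is an assumption-laden illustration rather than normative text, combines a <record> item with a <field> that uses a builtin grammar. The form id, variable names, and the URI /cgi-bin/save_message are hypothetical.

```xml
<form id="leave_message">
  <!-- <record>: the field item variable msg holds the recorded audio. -->
  <record name="msg" beep="true" maxtime="30s">
    <prompt>Please leave your message after the beep.</prompt>
    <noinput>I did not hear anything, please try again.</noinput>
  </record>

  <!-- <field>: the value is obtained through the builtin boolean
       grammar, by speech or DTMF. -->
  <field name="send_it" type="boolean">
    <prompt>Shall I send your message?</prompt>
    <filled>
      <if cond="send_it">
        <submit next="/cgi-bin/save_message" method="post"
                enctype="multipart/form-data" namelist="msg"/>
      </if>
      <!-- Not confirmed: undefine both variables so the FIA records
           a new message and asks again. -->
      <clear namelist="msg send_it"/>
    </filled>
  </field>
</form>
```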
There are two types of control items:
| <block> | A sequence of procedural statements used for prompting and computation, but not for gathering input. A block has a (normally implicit) form item variable that is set to true just before it is interpreted. |
|---|---|
| <initial> | This element controls the initial interaction in a mixed initiative form. Its prompts should be written to encourage the user to say something matching a form level grammar. When at least one field item variable is filled as a result of recognition during an <initial> element, the form item variable of <initial> becomes true, thus removing it as an alternative for the FIA. |
Each form item has an associated form item variable, which by default is set to undefined when the form is entered. This form item variable will contain the result of interpreting the form item. A field item’s form item variable is also called a field item variable, and it holds the value collected from the user. A form item variable can be given a name using the name attribute, or left nameless, in which case an internal name is generated.
Each form item also has a guard condition, which governs whether or not that form item can be selected by the form interpretation algorithm. The default guard condition just tests to see if the form item variable has a value. If it does, the form item will not be visited.
Typically, field items are given names, but control items are not. Generally form item variables are not given initial values and additional guard conditions are not specified. But sometimes there is a need for more detailed control. One form may have a form item variable initially set to hide a field, and later cleared (e.g., using <clear>) to force the field’s collection. Another field may have a guard condition that activates it only when it has not been collected, and when two other fields have been filled. A block item could execute only when some condition holds true. Thus, fine control can be exercised over the order in which form items are selected and executed by the FIA. In general, however, many dialogs can be constructed without resorting to this level of complexity.
In summary, all form items have the following attributes:
| name | The name of a dialog-scoped form item variable that will hold the value of the form item. |
|---|---|
| expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be executed unless the form item variable is cleared. |
| cond | An expression to evaluate in conjunction with the test of the form item variable. If absent, this defaults to true, or in the case of <initial>, a test to see if any field item variable has been filled in. |
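The three attributes can be seen working together in the sketch below. It is a hypothetical order form, not an example from this specification: the field names, prompts, and the URI /cgi-bin/order are assumptions made for illustration.

```xml
<form id="order">
  <!-- expr gives promo_code an initial value, so the form item is
       not visited unless the variable is cleared later. -->
  <field name="promo_code" type="digits" expr="'none'">
    <prompt>Please key in your promotion code.</prompt>
  </field>

  <field name="quantity" type="number">
    <prompt>How many would you like?</prompt>
  </field>

  <field name="has_promo" type="boolean">
    <prompt>Do you have a promotion code?</prompt>
    <filled>
      <if cond="has_promo">
        <!-- Clearing the variable makes promo_code eligible for
             selection on the next FIA iteration. -->
        <clear namelist="promo_code"/>
      </if>
    </filled>
  </field>

  <!-- cond is tested together with the form item variable: this
       block is selected only for orders of more than ten items. -->
  <block cond="quantity &gt; 10">
    Orders of more than ten items ship free.
  </block>

  <block>
    <submit next="/cgi-bin/order" namelist="quantity promo_code"/>
  </block>
</form>
```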
The simplest and most common type of form is one in which the form items are executed exactly once in sequential order to implement a computer-directed interaction. Here is a weather information service that uses such a form.
<form id="weather_info">
<block>Welcome to the weather information service.</block>
<field name="state">
<prompt>What state?</prompt>
<grammar src="state.grxml" type="application/grammar+xml"/>
<catch event="help">
Please speak the state for which you want the weather.
</catch>
</field>
<field name="city">
<prompt>What city?</prompt>
<grammar src="city.grxml" type="application/grammar+xml"/>
<catch event="help">
Please speak the city for which you want the weather.
</catch>
</field>
<block>
<submit next="/servlet/weather" namelist="city state"/>
</block>
</form>
This dialog proceeds sequentially:
C (computer): Welcome to the weather information service. What state?
H (human): Help
C: Please speak the state for which you want the weather.
H: Georgia
C: What city?
H: Tbilisi
C: I did not understand what you said. What city?
H: Macon
C: The conditions in Macon Georgia are sunny and clear at 11 AM ...
The form interpretation algorithm’s first iteration selects the first block, since its (hidden) form item variable is initially undefined. This block outputs the main prompt, and its form item variable is set to true. On the FIA’s second iteration, the first block is skipped because its form item variable is now defined, and the state field is selected because the dialog variable state is undefined. This field prompts the user for the state, and then sets the variable state to the answer. The third form iteration prompts and collects the city field. The fourth iteration executes the final block and transitions to a different URI.
Each field in this example has a prompt to play in order to elicit a response, a grammar that specifies what to listen for, and an event handler for the help event. The help event is thrown whenever the user asks for assistance. The help event handler catches these events and plays a more detailed prompt.
Here is a second directed form, one that prompts for credit card information:
<form id="get_card_info">
<block>We now need your credit card type, number,
and expiration date.</block>
<field name="card_type">
<prompt count="1">What kind of credit card
do you have?</prompt>
<prompt count="2">Type of card?</prompt>
<!-- This is an inline grammar. -->
<grammar>
<rule id="r2" scope="public">
<one-of>
<item>visa</item>
<item>master <count number="optional">card</count></item>
<item>amex</item>
<item>american express</item>
</one-of>
</rule>
</grammar>
<help> Please say Visa, Mastercard, or American Express.</help>
</field>
<!-- The grammar for type="digits" is built in. -->
<field name="card_num" type="digits">
<prompt count="1">What is your card number?</prompt>
<prompt count="2">Card number?</prompt>
<catch event="help">
<if cond="card_type =='amex' || card_type =='american express'">
Please say or key in your 15 digit card number.
<else/>
Please say or key in your 16 digit card number.
</if>
</catch>
<filled>
<if cond="(card_type == 'amex' || card_type == 'american express')
&amp;&amp; card_num.length != 15">
American Express card numbers must have 15 digits.
<clear namelist="card_num"/>
<throw event="nomatch"/>
<elseif cond="card_type != 'amex'
&amp;&amp; card_type != 'american express'
&amp;&amp; card_num.length != 16"/>
Mastercard and Visa card numbers have 16 digits.
<clear namelist="card_num"/>
<throw event="nomatch"/>
</if>
</filled>
</field>
<field name="expiry_date" type="digits">
<prompt count="1">What is your card's expiration date?</prompt>
<prompt count="2">Expiration date?</prompt>
<help>
Say or key in the expiration date, for example one two oh one.
</help>
<filled>
<!-- validate the mmyy -->
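<!-- mm starts as an empty string so that dates of the wrong length fail the check below -->
<var name="mm" expr="''"/>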
<var name="i" expr="expiry_date.length"/>
<if cond="i == 3">
<assign name="mm" expr="expiry_date.substring(0,1)"/>
<elseif cond="i == 4"/>
<assign name="mm" expr="expiry_date.substring(0,2)"/>
</if>
<if cond="mm == '' || mm &lt; 1 || mm &gt; 12">
<clear namelist="expiry_date"/>
<throw event="nomatch"/>
</if>
</filled>
</field>
<field name="confirm" type="boolean">
<prompt>
I have <value expr="card_type"/> number
<value expr="card_num"/>, expiring on
<value expr="expiry_date"/>.
Is this correct?
</prompt>
<filled>
<if cond="confirm">
<submit next="place_order.asp"
namelist="card_type card_num expiry_date"/>
</if>
<clear namelist="card_type card_num expiry_date confirm"/>
</filled>
</field>
</form>
Note that the grammar alternatives 'amex' and 'american express' return literal values which need to be handled separately in the conditional expressions. Section 3.1.5 describes how semantic attachments in the grammar can be used to return a single representation of these inputs.
The dialog might go something like this:
C: We now need your credit card type, number, and expiration date.
C: What kind of credit card do you have?
H: Discover
C: I did not understand what you said. (a platform-specific default message.)
C: Type of card? (the second prompt is used now.)
H: Shoot. (fortunately treated as "help" by this platform)
C: Please say Visa, Mastercard, or American Express.
H: Uh, Amex. (this platform ignores "uh")
C: What is your card number?
H: One two three four ... wait ...
C: I did not understand what you said.
C: Card number?
H: (uses DTMF) 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 #
C: What is your card’s expiration date?
H: one two oh one
C: I have Amex number 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 expiring on 1 2 0 1. Is this correct?
H: Yes
Fields are the major building blocks of forms. A field declares a variable and specifies the prompts, grammars, DTMF sequences, help messages, and other event handlers that are used to obtain it. Each field declares a VoiceXML field item variable in the form’s dialog scope. These may be submitted once the form is filled, or copied into other variables.
Each field has its own speech and/or DTMF grammars, specified explicitly using <grammar> elements, or implicitly using the type attribute. The type attribute is used for standard builtin grammars, like digits, boolean, or number. The type attribute also governs how that field’s value is spoken by the speech synthesizer.
Each field can have one or more prompts. If there is one, it is repeatedly used to prompt the user for the value until one is provided. If there are many, they must be given count attributes. These determine which prompt to use on each attempt. In the example, prompts become shorter. This is called tapered prompting.
The <catch event="help"> elements are event handlers that define what to do when the user asks for help. Help messages can also be tapered. These can be abbreviated, so that the following two elements are equivalent:
<catch event="help"> Please say visa, mastercard, or amex. </catch>
<help> Please say visa, mastercard, or amex. </help>
The <filled> element defines what to do when the user provides a recognized input for that field. One use is to specify integrity constraints over and above the checking done by the grammars, as with the date field above.
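A <filled> element can also be placed at the form level to check a combination of fields. The sketch below is a hypothetical travel form (the form id, field names, and the URI /servlet/trip are assumptions); it uses the mode and namelist attributes of <filled> so that the action runs once both fields have values.

```xml
<form id="trip">
  <field name="from_city">
    <prompt>Where are you leaving from?</prompt>
    <grammar src="city.grxml" type="application/grammar+xml"/>
  </field>

  <field name="to_city">
    <prompt>Where are you going?</prompt>
    <grammar src="city.grxml" type="application/grammar+xml"/>
  </field>

  <!-- Form-level integrity constraint over two fields. With
       mode="all" the action runs only when both are filled. -->
  <filled mode="all" namelist="from_city to_city">
    <if cond="from_city == to_city">
      Your origin and destination must be different.
      <clear namelist="from_city to_city"/>
    </if>
  </filled>

  <block>
    <submit next="/servlet/trip" namelist="from_city to_city"/>
  </block>
</form>
```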
The last section talked about forms implementing rigid, computer-directed conversations. To make a form mixed initiative, where both the computer and the human direct the conversation, it must have one or more <initial> form items and one or more form-level grammars.
If a form has form-level grammars:
Its fields can be filled in any order.
More than one field can be filled as a result of a single user utterance.
The filling of field variables when using a form-level grammar is described in Section 3.1.6.
Also, the form’s grammars can be active when the user is in other dialogs. If a document has two forms on it, say a car rental form and a hotel reservation form, and both forms have grammars that are active for that document, a user could respond to a request for hotel reservation information with information about the car rental, and thus direct the computer to talk about the car rental instead. The user can speak to any active grammar, and have fields set and actions taken in response.
Example. Here is a second version of the weather information service, showing mixed initiative. It has been "enhanced" for illustrative purposes with advertising and with a confirmation of the city and state:
<form id="weather_info">
<grammar src="cityandstate.grxml" type="application/grammar+xml"/>
<!-- Caller can't barge in on today's advertisement. -->
<block>
<prompt bargein="false">
Welcome to the weather information service.
<audio src="http://www.online-ads.example.com/wis.wav"/>
</prompt>
</block>
<initial name="start">
<prompt>
For what city and state would you like the weather?
</prompt>
<help>
Please say the name of the city and
state for which you would like a weather report.
</help>
<!-- If user is silent, reprompt once, then
try directed prompts. -->
<noinput count="1"> <reprompt/></noinput>
<noinput count="2"> <reprompt/>
<assign name="start" expr="true"/></noinput>
</initial>
<field name="state">
<prompt>What state?</prompt>
<help>
Please speak the state for which you want the weather.
</help>
</field>
<field name="city">
<prompt>Please say the city in <value expr="state"/>
for which you want the weather.</prompt>
<help>Please speak the city for which you
want the weather.</help>
<filled>
<!-- Most of our customers are in LA. -->
<if cond="city == 'Los Angeles' &amp;&amp; state == undefined">
<assign name="state" expr="'California'"/>
</if>
</filled>
</field>
<field name="go_ahead" type="boolean" modal="true">
<prompt>Do you want to hear the weather for
<value expr="city"/>, <value expr="state"/>?
</prompt>
<filled>
<if cond="go_ahead">
<prompt bargein="false">
<audio src="http://www.online-ads.example.com/wis2.wav"/>
</prompt>
<submit next="/servlet/weather" namelist="city state"/>
</if>
<clear namelist="start city state go_ahead"/>
</filled>
</field>
</form>
Here is a transcript showing the advantages for even a novice user:
C: Welcome to the weather information service. Buy Joe’s Spicy Shrimp Sauce.
C: For what city and state would you like the weather?
H: Uh, California.
C: Please say the city in California for which you want the weather.
H: San Francisco, please.
C: Do you want to hear the weather for San Francisco, California?
H: No
C: For what city and state would you like the weather?
H: Los Angeles.
C: Do you want to hear the weather for Los Angeles, California?
H: Yes
C: Don’t forget, buy Joe’s Spicy Shrimp Sauce tonight!
C: Mostly sunny today with highs in the 80s. Lows tonight from the low 60s ...
The go_ahead field has its modal attribute set to true. This causes all grammars to be disabled except the ones defined in the current form item, so that the only grammar active during this field is the builtin grammar for boolean.
An experienced user can get things done much faster (but is still forced to listen to the ads):
C: Welcome to the weather information service. Buy Joe’s Spicy Shrimp Sauce.
C: What ...
H (barging in): LA
C: Do you ...
H (barging in): Yes
C: Don’t forget, buy Joe’s Spicy Shrimp Sauce tonight!
C: Mostly sunny today with highs in the 80s. Lows tonight from the low 60s ...
The form interpretation algorithm can be customized in several ways. One way is to assign a value to a form item variable, so that its form item will not be selected. Another is to use <clear> to set a form item variable to undefined; this forces the FIA to revisit the form item.
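Both techniques are illustrated in the sketch below. The form, its variable names, the sample telephone number, and the URI /cgi-bin/callback are hypothetical and serve only to show the mechanism.

```xml
<form id="callback">
  <!-- Hypothetical number already known to the application. -->
  <var name="phone_on_file" expr="'5551234'"/>

  <block>
    <!-- Assigning a value to the field item variable means the FIA
         will not select the callback_number field. -->
    <if cond="phone_on_file != undefined">
      <assign name="callback_number" expr="phone_on_file"/>
    </if>
  </block>

  <field name="callback_number" type="phone">
    <prompt>What number should we call you back on?</prompt>
  </field>

  <field name="ok" type="boolean">
    <prompt>
      We will call you on <value expr="callback_number"/>.
      Is that right?
    </prompt>
    <filled>
      <if cond="!ok">
        <!-- Setting both variables back to undefined forces the FIA
             to collect the number and the confirmation again. -->
        <clear namelist="callback_number ok"/>
      </if>
    </filled>
  </field>

  <block>
    <submit next="/cgi-bin/callback" namelist="callback_number"/>
  </block>
</form>
```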
Another method is to explicitly specify the next field item to visit using <goto nextitem>. This forces an immediate transfer to that field item. No variables, conditions or counters in the targeted form item will be reset. The form item's prompt will be played even if it has already been visited. If the <goto nextitem> occurs in a <filled> action, the rest of the <filled> action and any pending <filled> actions will be skipped.
Here is an example <goto nextitem> executed in response to the exit event:
<form id="survey_2000_03_30">
<catch event="exit">
<reprompt/>
<goto nextitem="confirm_exit"/>
</catch>
<block>
<prompt>
Hello, you have been called at random to answer questions
critical to U.S. foreign policy.
</prompt>
</block>
<field name="q1" type="boolean">
<prompt>Do you agree with the IMF position on
privatizing certain functions of Burkina Faso’s
agriculture ministry?</prompt>
</field>
<field name="q2" type="boolean">
<prompt>If this privatization occurs, will its
effects be beneficial mainly to Ouagadougou and
Bobo-Dioulasso?</prompt>
</field>
<field name="q3" type="boolean">
<prompt>Do you agree that sorghum and millet output
might thereby increase by as much as four percent per
annum?</prompt>
</field>
<block>
<submit next="register" namelist="q1 q2 q3"/>
</block>
<field name="confirm_exit" type="boolean">
<prompt>You have elected to exit. Are you
sure you want to do this, and perhaps adversely affect
U.S. foreign policy vis-à-vis sub-Saharan Africa for
decades to come?</prompt>
<filled>
<if cond="confirm_exit">
Okay, but the U.S. State Department is displeased.
<exit/>
<else/>
Good, let's pick up where we left off.
<clear namelist="confirm_exit"/>
</if>
</filled>
</field>
</form>
If the user says "exit" in response to any of the survey questions, an exit event is thrown by the platform and caught by the <catch> event handler. This handler directs that confirm_exit be the next visited field. The confirm_exit field would not be visited during normal completion of the survey because the preceding <block> element transfers control to the registration script.
We’ve presented the form interpretation algorithm (FIA) at a conceptual level. In this section we describe it in more detail. A more formal description is provided in Appendix C.
Whenever a form is entered, it is initialized. Internal prompt counter variables (in the form’s dialog scope) are reset to 1. Each variable (form-level <var> elements and form item variables) is initialized, in document order, to undefined or to the value of the relevant expr attribute.
The main loop of the FIA has three phases:
The select phase: the next form item is selected for visiting.
The collect phase: the selected form item is visited, which prompts the user for input, enables the appropriate grammars, and then waits for and collects an input (such as a spoken phrase or DTMF key presses) or an event (such as a request for help or a no input timeout).
The process phase: an input is processed by filling form items and executing <filled> elements to perform actions such as input validation. An event is processed by executing the appropriate event handler for that event type.
Note that the FIA may be given an input (a set of grammar slot/slot value pairs) that was collected while the user was in a different form’s FIA. In this case the first iteration of the main loop skips the select and collect phases, and goes right to the process phase with that input.
The purpose of the select phase is to select the next form item to visit. This is done as follows:
If the last main loop iteration’s process phase executed a <goto nextitem>, then the specified form item is selected.
Otherwise the first form item whose guard condition is false is chosen to be visited.
If no guard condition is false, then the last iteration completed the form without encountering an explicit transfer of control, so the FIA does an implicit <exit> operation.
The purpose of the collect phase is to collect an input or an event. The selected form item is visited, which performs actions that depend on the type of form item:
If a field item is visited, the FIA selects and queues up any prompts based on the field item’s prompt counter and the prompt conditions. Then it listens for the field level grammar(s) and any active higher-level grammars, and waits for a grammar recognition or for some event.
If a <transfer> is visited, the prompts are queued based on the item’s prompt counter and the prompt conditions. The item grammars are activated. The queue is played before the transfer is executed.
If a <subdialog> or <object> is visited, the prompts are queued based on the item’s prompt counter and the prompt conditions. Grammars are not activated. Instead, the input collection behavior is specified by the executing context for the subdialog or object. The queue is not played before the subdialog or object is executed, but instead should be played during the subsequent input collection.
If an <initial> is visited, the FIA selects and queues up prompts based on the <initial>’s prompt counter and prompt conditions. Then it listens for the form level grammar(s) and any active higher-level grammars. It waits for a grammar recognition or for an event.
A <block> element is visited by setting its form item variable to true, evaluating its content, and then bypassing the process phase. No input is collected, and the next iteration of the FIA’s main loop is entered.
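The select and collect behavior described above can be traced through a small form. The sketch below is illustrative only; the form id, item names, prompts, and URIs are hypothetical, and the comments paraphrase the phases as described in this section.

```xml
<form id="membership">
  <!-- Iteration 1. Select: this block is first in document order and
       its hidden form item variable is undefined. Collect: the
       variable is set to true and the content is evaluated. The
       process phase is bypassed. -->
  <block>Welcome to the club line.</block>

  <!-- Iteration 2. Select: the block above now has a value, so this
       field is chosen. Collect: the prompt is queued, the builtin
       boolean grammar is enabled along with any active higher-level
       grammars, and the FIA waits. Process: a recognized answer fills
       "member" and runs the filled action; a noinput event runs the
       noinput handler instead. -->
  <field name="member" type="boolean">
    <prompt>Are you already a member?</prompt>
    <noinput>Sorry, I did not hear you. <reprompt/></noinput>
    <filled>
      <if cond="!member">
        <!-- The next select phase picks "signup" directly. -->
        <goto nextitem="signup"/>
      </if>
    </filled>
  </field>

  <!-- Reached by normal selection only when member is true. -->
  <block>
    <submit next="/cgi-bin/member_menu" namelist="member"/>
  </block>

  <!-- Reached only through the goto above, because the preceding
       block transfers control away. -->
  <block name="signup">
    Let me tell you how to join.
    <goto next="signup.vxml"/>
  </block>
</form>
```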
The purpose of the process phase is to process the input or event collected during the collect phase, as follows:
If an input matches a grammar in this form, then:
After completion of the process phase, interpretation continues by returning to the select phase.
A more detailed form interpretation algorithm can be found in Appendix C.
A menu is a convenient syntactic shorthand for a form containing a single anonymous field that prompts the user to make a choice and transitions to different places based on that choice. Like a regular form, it can have its grammar scoped such that it is active when the user is executing another dialog. The following menu offers the user three choices:
<menu>
<prompt>
Welcome home. Say one of: <enumerate/>
</prompt>
<choice next="http://www.sports.example.com/vxml/start.vxml">
Sports
</choice>
<choice next="http://www.weather.example.com/intro.vxml">
Weather
</choice>
<choice next="http://www.stargazer.example.com/voice/astronews.vxml">
Stargazer astrophysics news
</choice>
<noinput>Please say one of <enumerate/></noinput>
</menu>
This dialog might proceed as follows:
C: Welcome home. Say one of: sports; weather; Stargazer astrophysics news.
H: Astrology.
C: I did not understand what you said. (a platform-specific default message.)
C: Welcome home. Say one of: sports; weather; Stargazer astrophysics news.
H: sports.
C: (proceeds to http://www.sports.example.com/vxml/start.vxml)
The <menu> element identifies the menu and determines the scope of its grammars. Menu attributes are:
| id | The identifier of the menu. It allows the menu to be the target of a <goto> or a <submit>. |
|---|---|
| scope | The menu’s grammar scope. If it is dialog – the default – the menu’s grammars are only active when the user transitions into the menu. If the scope is document, its grammars are active over the whole document (or if the menu is in the application root document, any loaded document in the application). |
| dtmf | When set to true, any choices that do not have explicit DTMF elements are given the implicit ones "1", "2", etc. |
| accept | When set to "exact" (the default), the text of the choice elements in the menu defines the exact phrase to be recognized. When set to "approximate", the text of the choice elements defines an approximate recognition phrase (as described under grammar generation). Each <choice> can override this setting. |
The <choice> element serves several purposes:
It specifies a speech grammar fragment and/or a DTMF grammar fragment that determines when that choice has been selected.
The contents are used to form the <enumerate> prompt string.
It specifies the URI to go to when the choice is selected.
Choice attributes are:
| dtmf | The DTMF sequence for this choice. |
|---|---|
| accept | Override the setting for accept in <menu> for this particular choice. When set to "exact" (the default), the text of the choice element defines the exact phrase to be recognized. When set to "approximate", the text of the choice element defines an approximate recognition phrase (as described under grammar generation). |
| next | The URI of next dialog or document. |
| event | Specify an event to be thrown instead of specifying a next. The 'next' and 'expr' attributes have precedence over the 'event' attribute. |
| expr | Specify an expression to evaluate as a URI to transition to instead of specifying a next. The 'next' attribute has precedence over the 'expr' attribute. |
| fetchaudio | See Section 6.1. This defaults to the fetchaudio property. |
| fetchhint | See Section 6.1. This defaults to the documentfetchhint property. |
| fetchtimeout | See Section 6.1. This defaults to the fetchtimeout property. |
| maxage | See Section 6.1. This defaults to the documentmaxage property. |
| maxstale | See Section 6.1. This defaults to the documentmaxstale property. |
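Several of these attributes are combined in the sketch below. The menu id, the event name com.example.operator, and the document operator.vxml are hypothetical; the sports and weather URIs are reused from the earlier example.

```xml
<menu id="main" dtmf="true" accept="approximate">
  <prompt>Main menu. Say one of: <enumerate/></prompt>

  <!-- No dtmf attribute: the menu assigns an implicit sequence. -->
  <choice next="http://www.sports.example.com/vxml/start.vxml">
    Sports
  </choice>

  <!-- Overrides the menu's accept setting for this choice only. -->
  <choice accept="exact" next="http://www.weather.example.com/intro.vxml">
    Weather
  </choice>

  <!-- Throws an event instead of naming a URI to go to. -->
  <choice dtmf="0" event="com.example.operator">
    Operator
  </choice>

  <catch event="com.example.operator">
    Please hold for an operator.
    <goto next="operator.vxml"/>
  </catch>
</menu>
```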
If a <grammar> element is specified in <choice>, then the external grammar is used instead of an automatically generated grammar. This allows the developer to precisely control the <choice> grammar; for example:
<menu>
  <choice next="http://www.sports.example.com/vxml/start.vxml">
    <grammar src="sports.grxml" type="application/grammar+xml"/>
    Sports
  </choice>
  <choice next="http://www.weather.example.com/intro.vxml">
    <grammar src="weather.grxml" type="application/grammar+xml"/>
    Weather
  </choice>
  <choice next="http://www.stargazer.example.com/voice/astronews.vxml">
    <grammar src="astronews.grxml" type="application/grammar+xml"/>
    Stargazer astrophysics
  </choice>
</menu>
Menus can rely purely on speech, purely on DTMF, or both in combination by including a <property> element in the <menu>. Here is a DTMF-only menu with explicit DTMF sequences given to each choice, using the choice’s dtmf attribute:
<menu>
  <property name="inputmodes" value="dtmf"/>
  <prompt>
    For sports press 1, For weather press 2, For Stargazer
    astrophysics press 3.
  </prompt>
  <choice dtmf="1" next="http://www.sports.example.com/vxml/start.vxml"/>
  <choice dtmf="2" next="http://www.weather.example.com/intro.vxml"/>
  <choice dtmf="3" next="http://www.stargazer.example.com/astronews.vxml"/>
</menu>
Alternatively, you can set the <menu>’s dtmf attribute to true to assign sequential DTMF digits to each of the first nine choices: the first choice has DTMF "1", and so on:
<menu dtmf="true">
<property name="inputmodes" value="dtmf"/>
<prompt>
For sports press 1, For weather
press 2, For Stargazer astrophysics press 3.
</prompt>
<choice next="http://www.sports.example.com/vxml/start.vxml"/>
<choice next="http://www.weather.example.com/intro.vxml"/>
<choice
next="http://www.stargazer.example.com/voice/astronews.vxml"/>
</menu>
The <enumerate> element is an automatically generated description of the choices available to the user. It specifies a template that is applied to each choice in the order they appear in the menu. If it is used with no content, a default template that lists all the choices is used, determined by the interpreter context. If it has content, the content is the template specifier. This specifier may refer to two special variables: _prompt is the choice’s prompt, and _dtmf is the choice’s assigned DTMF sequence. For example, if the menu were rewritten as
<menu dtmf="true">
<prompt>
Welcome home.
<enumerate>
For <value expr="_prompt"/>, press <value
expr="_dtmf"/>.
</enumerate>
</prompt>
<choice next="http://www.sports.example.com/vxml/start.vxml">
sports </choice>
<choice next="http://www.weather.example.com/intro.vxml">
weather </choice>
<choice next="http://www.stargazer.example.com/voice/astronews.vxml">
Stargazer astrophysics news
</choice>
</menu>
then the menu’s prompt would be:
C: Welcome home. For sports, press 1. For weather, press 2. For Stargazer astrophysics news, press 3.
The <enumerate> element may be used within the prompts and the catch elements associated with <menu> elements and with <field> elements that contain <option> elements, as discussed in Section 2.3.1.3. An error.semantic event is thrown if <enumerate> is used elsewhere.
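A field with <option> elements can use <enumerate> in the same way as a menu. The following is a minimal sketch; the form id, field name, option values, and the URI /cgi-bin/maincourse.cgi are assumptions made for illustration.

```xml
<form id="lunch">
  <field name="maincourse">
    <prompt>
      Please select an entree.
      <!-- The template is applied to each option in document order. -->
      <enumerate>
        For <value expr="_prompt"/>, press <value expr="_dtmf"/>.
      </enumerate>
    </prompt>
    <option dtmf="1" value="fish">swordfish</option>
    <option dtmf="2" value="beef">roast beef</option>
    <option dtmf="3" value="chicken">grilled chicken</option>
    <filled>
      <submit next="/cgi-bin/maincourse.cgi" method="post"
              namelist="maincourse"/>
    </filled>
  </field>
</form>
```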
Any choice phrase specifies a set of words and phrases to listen for. A choice phrase is constructed from the PCDATA of the elements contained directly or indirectly in the <choice> element.
If the accept attribute is "exact", then the user must say all the words of the choice phrase, in the same order in which they occur in the choice phrase.
If the accept attribute is "approximate", then the choice may be matched when a user says a subphrase of the expression. For example, in response to the prompt "Stargazer astrophysics news" a user could say "Stargazer", "astrophysics", "Stargazer news", "astrophysics news", and so on. The equivalent grammar may be language and platform dependent.
The following menu uses "exact" and "approximate" in different choices:
<menu accept="approximate">
<choice next="..."> Stargazer Astrophysics News </choice>
<choice accept="exact" next="..."> Physics Weekly </choice>
<choice accept="exact" next="..."> Particle Physics Update </choice>
<choice next="..."> Astronomy Today </choice>
</menu>
Because the menu specifies "approximate" and the first choice does not override it, the user may say a subphrase when matching the first choice; for instance, "Stargazer" or "Astrophysics News". However, because "exact" is specified in the second and third choices, only a complete phrase will match them: "Physics Weekly" and "Particle Physics Update".
The following example shows the use of PCDATA contained in descendants of the <choice> element:
<choice accept="exact"
next="http://www.stargazer.example.com/voice/astronews.vxml">
<audio src="http://www.stargazer.example.com/space.wav">
Stargazer <emphasis>astrophysics</emphasis> news
</audio>
</choice>
This choice would be read from the audio file, or as "Stargazer Astrophysics News" if the file could not be played. The grammar for the choice would be the exact phrase "Stargazer astrophysics news" gleaned from the PCDATA of the <choice> element’s descendants.
A menu behaves like a form with a single field that does all the work. The menu prompts become field prompts. The menu event handlers become the field event handlers. The menu grammars become form grammars.
Upon entry, the menu’s grammars are built and enabled, and the prompt is played. When the user input matches a choice, control transitions according to the value of the next, expr, or event attribute of the <choice>, only one of which may be specified. If an event attribute is specified but its event handler does not cause the interpreter to exit or transition control, then the FIA will clear the form item variable of the menu's anonymous field, causing the menu to be executed again.
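The three transition attributes can be sketched side by side. This is only an illustration: the variable weatherUrl and the event name com.example.operator are hypothetical and do not appear elsewhere in this document, and the <var> is assumed to be declared at document scope.

```xml
<!-- Assumed document-level variable holding a target URI (hypothetical name) -->
<var name="weatherUrl" expr="'http://www.weather.example.com/intro.vxml'"/>

<menu>
  <prompt>Say sports, weather, or operator.</prompt>
  <!-- next: transition to a fixed URI -->
  <choice next="http://www.sports.example.com/vxml/start.vxml"> sports </choice>
  <!-- expr: transition to the URI obtained by evaluating an expression -->
  <choice expr="weatherUrl"> weather </choice>
  <!-- event: throw an event instead of transitioning (hypothetical event name) -->
  <choice event="com.example.operator"> operator </choice>
  <!-- This handler neither exits nor transitions, so the menu is executed again -->
  <catch event="com.example.operator">
    No operator is available right now.
  </catch>
</menu>
```

Each <choice> carries exactly one of the three attributes, as required above.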
A form item is an element of a <form> that can be visited during form interpretation. The form items are <field>, <block>, <initial>, <subdialog>, <object>, <record>, and <transfer>.
All form items have the following characteristics:
They have a result variable, specified by the name attribute. This variable may be given an initial value with the expr attribute.
They have a guard condition specified with the cond attribute.
Form items are subdivided into field items, those that define the form’s field item variables, and control items, those that help control the gathering of the form’s fields. Field items (<field>, <subdialog>, <object>, <record>, and <transfer>) generally may contain the following elements:
<filled> elements containing some action to execute at the moment the result field is filled in.
<property> elements to specify properties that are in effect for this field item.
<prompt> elements to specify prompts to be played when this field is visited.
<grammar> elements to specify allowable spoken and character input for this field item.
<catch> elements and catch shorthands that are in effect for this field item.
Each field item may have an associated set of shadow variables. Shadow variables are used to return results from the execution of a field item, other than the value stored under the name attribute. For example, it may be useful to know the confidence level that was obtained as a result of a recognized grammar in a <field> element. A shadow variable is referenced as name$.shadowvar where name is the value of the field item’s name attribute, and shadowvar is the name of a specific shadow variable. For example, the <field> element returns a shadow variable confidence. The code fragment below illustrates how this shadow variable is accessed.
<field name="state">
<prompt> Please say the name of a state. </prompt>
<grammar src="http://mygrammars.example.com/states.grm"
type="application/grammar"/>
<filled>
<if cond="state$.confidence &lt; 0.4">
<throw event="nomatch"/>
</if>
</filled>
</field>
In the example, the confidence of the result is examined, and the result is rejected if the confidence is too low.
A field specifies an input item to be gathered from the user. Attributes of fields include:
| Attribute | Description |
|---|---|
| name | The field item variable in the dialog scope that will hold the result. The name must be a unique variable name within the scope of the form. If the name is not unique, then a badfetch error is thrown when the document is fetched. The name must conform to the variable naming conventions in Section 5.1. |
| expr | The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared. |
| cond | A boolean condition that must also evaluate to true in order for the form item to be visited. |
| type | The type of field, i.e., the name of an internal grammar. This name must be from a standard set supported by all conformant platforms. If not present, <grammar> elements can be specified instead. |
| slot | The name of the grammar slot used to populate the variable (if it is absent, it defaults to the variable name). This attribute is useful in the case where the grammar format being used has a mechanism for returning sets of slot/value pairs and the slot names differ from the field item variable names. If the grammar returns only one slot, as do the builtin type grammars like boolean, then no matter what the slot’s name, the field item variable gets the value of that slot. |
| modal | If this is false (the default) all active grammars are turned on while collecting this field. If this is true, then only the field’s grammars are enabled: all others are temporarily disabled. |
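
As a rough sketch of how several of these attributes combine, the form below visits the second field only when the first was answered affirmatively. The field names, the slot name level, and the grammar file coverage.grxml are hypothetical and used for illustration only.

```xml
<form id="rental">
  <!-- Builtin boolean grammar; the result is ECMAScript true or false -->
  <field name="want_insurance" type="boolean">
    <prompt>Would you like insurance?</prompt>
  </field>
  <!-- cond: visited only once want_insurance is true.
       modal: only this field's grammars are active while collecting it.
       slot: the grammar is assumed to return its result in a slot named "level". -->
  <field name="coverage" cond="want_insurance" modal="true" slot="level">
    <prompt>Basic or premium coverage?</prompt>
    <grammar src="coverage.grxml" type="application/grammar+xml"/>
  </field>
</form>
```
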
The shadow variables of a <field> element whose name is name are the same as those used in the application.lastresult$ array: name$.confidence, name$.utterance, name$.inputmode, and name$.interpretation. The value of each of these shadow variables will necessarily be the same as that found in the first element of the array: application.lastresult$[0].confidence, application.lastresult$[0].utterance, application.lastresult$[0].inputmode, and application.lastresult$[0].interpretation, respectively. See Section 5.1.5 for a description of the contents of these variables.
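A minimal sketch of reading these shadow variables in a <filled> action follows. The field name city and the grammar file cities.grxml are hypothetical, and the inputmode value "dtmf" is assumed to be the one defined in Section 5.1.5.

```xml
<field name="city">
  <prompt>Which city?</prompt>
  <grammar src="cities.grxml" type="application/grammar+xml"/>
  <filled>
    <!-- Assumes inputmode is "dtmf" for keyed input, as described in Section 5.1.5 -->
    <if cond="city$.inputmode == 'dtmf'">
      <prompt>You keyed in <value expr="city"/>.</prompt>
    <else/>
      <!-- utterance holds what was actually recognized, not the field value -->
      <prompt>I heard <value expr="city$.utterance"/>.</prompt>
    </if>
  </filled>
</field>
```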
The <field> type attribute is used to specify a builtin grammar for one of the fundamental types, and also specifies how its value is to be spoken if subsequently used in a <value> element in a prompt. An example:
<field name="lo_fat_meal" type="boolean">
<prompt>
Do you want a low fat meal on this flight?
</prompt>
<help>
Low fat means less than 10 grams of fat, and under
250 calories.
</help>
<filled>
<prompt>
I heard <emphasis><value expr="lo_fat_meal"/></emphasis>.
</prompt>
</filled>
</field>
In this example, the boolean type indicates that inputs are various forms of true and false. The value actually put into the field is either true or false. The field would be read "yes" or "no" in prompts.
In the next example, digits indicates that input will be spoken or keyed digits. The result is stored as a string, and rendered as digits, i.e., "one-two-three", not "one hundred twenty-three". The <filled> action tests the field to see if it has 12 digits. If not, the user hears the error message.
<field name="ticket_num" type="digits">
<prompt>
Read the 12 digit number from your ticket.
</prompt>
<help>The 12 digit number is to the lower left.</help>
<filled>
<if cond="ticket_num.length != 12">
<prompt>
Sorry, I didn't hear exactly 12 digits.
</prompt>
<assign name="ticket_num" expr="undefined"/>
</if>
</filled>
</field>
Each builtin type has a convention for the format of the value returned. These are independent of locale and of the implementation. The return type for builtin fields is string except for the boolean field type. To access the actual recognition result, the author can reference the shadow variable name$.utterance.
The builtin types are defined in such a way that a VoiceXML application developer can assume some consistency of user input across implementations. This permits help messages and other prompts to be independent of platform in many instances. For example, the boolean type’s grammar should minimally allow "yes" and "no" responses, but each implementation is free to add other choices, such as "yeah" and "nope".
In cases where an application requires specific behavior or different behavior than defined for a builtin, it should use an explicit field grammar. The following are circumstances in which an application must provide an explicit field grammar in order to ensure portability of the application with a consistent user interface:
A platform is not required to implement a grammar that accepts all possible values that might be returned by a builtin. For instance, the currency builtin defines the return value formatting for a very broad range of currencies (by reference to ISO 4217:1995). The platform is not required to support spoken input for all of the world's currencies, since that can negatively impact recognition accuracy. Similarly, the number builtin can return positive or negative floating point numbers, but the grammar is not required to support all possible spoken floating point numbers.
Builtins are also limited in their ability to handle underspecified spoken input. For instance, "20 peso" cannot be resolved to a specific ISO 4217:1995 currency code because the "peso" is the name of the currency of numerous nations. In such cases the platform may return a specific currency code according to the locale or may omit the currency code.
All builtin types must support both voice and DTMF entry.
The builtin types are:
| Type | Description |
|---|---|
| boolean | Inputs include affirmative and negative phrases appropriate to the current locale. DTMF 1 is yes and 2 is no. The result is ECMAScript true for "yes" or false for "no". The value will be submitted as the string "true" or the string "false". If the field value is subsequently used in a prompt, it will be spoken as an affirmative or negative phrase appropriate to the current locale. |
| date | Valid spoken inputs include phrases that specify a date, including a month, day, and year. DTMF inputs are: four digits for the year, followed by two digits for the month, and two digits for the day. The result is a fixed-length date string with format yyyymmdd, e.g. "20000704". If the year is not specified, yyyy is returned as "????"; if the month is not specified, mm is returned as "??"; and if the day is not specified, dd is returned as "??". The set of accepted spoken date formats is platform dependent and may vary by locale. |
| digits | Valid spoken or DTMF inputs include one or more digits, 0 through 9. The result is a string of digits. If the field value is subsequently used in a prompt, it will be spoken as a sequence of digits. A user can say for example "two one two seven", but not "twenty one hundred and twenty-seven". A platform may support constructs such as "two double-five eight". |
| currency | Valid spoken inputs include phrases that specify a currency amount. For DTMF input, the "*" key will act as the decimal point. The result is a string with the format UUUmm.nn, where UUU is the three character currency indicator according to ISO standard 4217:1995, or mm.nn if the currency is not spoken by the user or if the currency cannot be reliably determined (e.g. "dollar" and "peso" are ambiguous). If the field value is subsequently used in a prompt, it will be spoken as a currency amount appropriate to the current locale. The set of accepted spoken currency formats is platform dependent and may vary by locale. |
| number | Valid spoken inputs include phrases that specify numbers, such as "one hundred twenty-three", or "five point three". Valid DTMF input includes positive numbers entered using digits and "*" to represent a decimal point. The result is a string of digits from 0 to 9 and may optionally include a decimal point (".") and/or a plus or minus sign. ECMAScript automatically converts result strings to numerical values when used in numerical expressions. The result must not use a leading zero (which would cause ECMAScript to interpret it as an octal number). The set of accepted spoken number formats is platform dependent and may vary by locale. |
| phone | Valid spoken inputs include phrases that specify a phone number. DTMF asterisk "*" represents "x". The result is a string containing a telephone number consisting of a string of digits and optionally containing the character "x" to indicate a phone number with an extension. For North America, a result could be "8005551234x789". The range of accepted spoken phone formats is platform dependent and may vary by locale. |
| time | Valid spoken inputs include phrases that specify a time, including hours and minutes. The result is a five character string in the format hhmmx, where x is one of "a" for AM, "p" for PM, "h" to indicate a time specified using 24 hour clock, or "?" to indicate an ambiguous time. Input can be via DTMF. Because there is no DTMF convention for specifying AM/PM, in the case of DTMF input, the result will always end with "h" or "?". If the field value is subsequently used in a prompt, the value will be spoken as a time appropriate to the current locale. The set of accepted spoken time formats is platform dependent and may vary by locale. |
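
Because the date builtin can return "?" placeholders, an application that needs a complete date may want to check the result before using it. The following is a sketch only, assuming the yyyymmdd result format described in the table; the field name depart is hypothetical.

```xml
<field name="depart" type="date">
  <prompt>What date do you want to leave?</prompt>
  <filled>
    <!-- A "?" anywhere in the result means the year, month, or day was not given -->
    <if cond="depart.indexOf('?') != -1">
      <prompt>Please give the month, day, and year.</prompt>
      <!-- Clearing the field item variable causes the field to be visited again -->
      <clear namelist="depart"/>
    </if>
  </filled>
</field>
```
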
Explicit grammars can be specified via a URI, which can be absolute or relative:
<field name="flavor">
<prompt>What is your favorite ice cream?</prompt>
<grammar src="../grammars/ice_cream.grxml"
type="application/grammar+xml"/>
</field>
Grammars can be specified inline, for example using a W3C Augmented BNF grammar:
<field name="flavor">
<prompt>What is your favorite flavor?</prompt>
<help>Say one of vanilla, chocolate, or strawberry.</help>
<grammar type="application/grammar">
vanilla | chocolate | strawberry
</grammar>
</field>
If both the src attribute and an inline grammar are provided, the grammar identified by the src attribute takes precedence.
When a simple set of alternatives is all that is needed to specify the legal input values for a field, it may be more convenient to use an option list than a grammar. An option list is represented by a set of <option> elements contained in a <field> element. Each <option> element contains PCDATA that is used to generate a grammar for the spoken input it accepts using the same method described for <choice>. It also has attributes specifying the DTMF key for selecting the option and the value to assign to the field when the option is chosen.
The following field offers the user three choices and assigns the value of the value attribute of the selected option to the maincourse variable:
<form>
<field name="maincourse">
<prompt>
Please select an entree. Today, we’re featuring <enumerate/>
</prompt>
<option dtmf="1" value="fish"> swordfish </option>
<option dtmf="2" value="beef"> roast beef </option>
<option dtmf="3" value="chicken"> frog legs </option>
<filled>
<submit next="/cgi-bin/maincourse.cgi"
method="post" namelist="maincourse"/>
</filled>
</field>
</form>
This conversation might sound like:
C: Please select an entree. Today, we’re featuring swordfish; roast beef; frog legs.
H: frog legs
C: (assigns "chicken" to "maincourse", then submits "maincourse=chicken" to /cgi-bin/maincourse.cgi)
The following example shows proper and improper use of <enumerate> in a catch element of a form with several fields containing <option> elements:
<form>
<block>
We need a few more details to complete your order.
</block>
<field name="color">
<prompt>Which color?</prompt>
<option>red</option>
<option>blue</option>
<option>green</option>
</field>
<field name="size">
<prompt>Which size?</prompt>
<option>small</option>
<option>medium</option>
<option>large</option>
</field>
<field name="quantity" type="number">
<prompt>How many?</prompt>
</field>
<block>
Thank you. Your order is being processed.
<submit next="details.cgi"/>
</block>
<catch event="help nomatch">
Your options are <enumerate/>.
</catch>
</form>
A scenario might be:
C: We need a few more details to complete your order. Which color?
H: help. (throws "help" event caught by form-level <catch>)
C: Your options are red, blue, green.
H: red.
C: Which size?
H: 7 (throws "nomatch" event caught by form-level <catch>)
C: Your options are small, medium, large.
H: small.
In the steps above, the <enumerate/> in the form-level catch had something to enumerate: the <option> elements in the "color" and "size" <field> elements. The next <field>, however, is different:
C: How many?
H: a lot. (throws "nomatch" event caught by form-level <catch>)
The form-level <catch>'s use of <enumerate> causes an "error.semantic" event to be thrown because the "quantity" <field> does not contain any <option> elements that can be enumerated.
One solution is to add a field-level <catch> to the "quantity" <field>:
<catch event="help nomatch">
Please say the number of items to be ordered.
</catch>
The "nomatch" event would then be caught locally, resulting in the following possible completion of the scenario:
C: Please say the number of items to be ordered.
H: 50
C: Thank you. Your order is being processed.
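For reference, the "quantity" <field> from the form above would then read as follows once the field-level <catch> is in place:

```xml
<field name="quantity" type="number">
  <prompt>How many?</prompt>
  <!-- Handles help and nomatch locally, so the form-level <catch>
       with its <enumerate/> is never reached for this field -->
  <catch event="help nomatch">
    Please say the number of items to be ordered.
  </catch>
</field>
```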
The <enumerate> element is also discussed in Section 2.2.
The attributes of <option> are:
| Attribute | Description |
|---|---|
| dtmf | The DTMF sequence for this option. |
| value | The string to assign to the field item variable when a user selects this option, whether by speech or DTMF. The default assignment is the CDATA content of the <option> element with leading and trailing white space removed. If a DTMF sequence is specified, but neither value nor CDATA content is present, then the field variable is assigned the DTMF sequence. |
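
The three assignment cases can be sketched in one field. The field name and option values here are hypothetical; the comments state the value that the rules above would assign.

```xml
<field name="size">
  <prompt>Say small or medium, or press 1, 2, or 3.</prompt>
  <!-- value present: size is assigned "S" -->
  <option dtmf="1" value="S"> small </option>
  <!-- no value: size is assigned the trimmed content, "medium" -->
  <option dtmf="2"> medium </option>
  <!-- neither value nor content: size is assigned the DTMF sequence, "3" -->
  <option dtmf="3"/>
</field>
```
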
The use of <option> does not preclude the simultaneous use of <grammar>. The result would be the match from either grammar, not unlike the occurrence of two <grammar> elements in the same <field> representing a disjunction of choices.
Fundamental builtin grammars are explicitly referenced using the special-purpose "builtin:" URI scheme, which allows access to resources such as speech grammars, DTMF grammars and audio files. In addition, the "builtin:" URI scheme may also be used to access platform-specific builtin grammars that are supported by particular interpreter contexts. It is recommended that platform-specific builtin grammar names begin with the string "x-", as this namespace will not be used in future versions of the standard.
Examples of fundamental builtin grammars:
<grammar src="builtin:grammar/boolean"/>
<grammar src="builtin:dtmf/boolean"/>
where the first <grammar> references the builtin boolean speech grammar, and the second references the builtin boolean DTMF grammar.
Examples of platform-specific builtin grammars:
<grammar src="builtin:grammar/x-sample"/>
<grammar src="builtin:dtmf/x-sample"/>
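As an illustration, a field might reference the fundamental boolean grammars explicitly rather than through the type attribute. This is a sketch with a hypothetical field name; how the returned value is formatted and spoken in this case is not stated here and may differ from a field declared with type="boolean".

```xml
<field name="confirm">
  <prompt>Shall I place the order?</prompt>
  <!-- Builtin boolean speech grammar -->
  <grammar src="builtin:grammar/boolean"/>
  <!-- Builtin boolean DTMF grammar -->
  <grammar src="builtin:dtmf/boolean"/>
</field>
```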
Some builtin field types and grammars can be parameterized. This may be done by explicitly referring to builtin grammars using the special-purpose "builtin:" URI s