W3C

Voice Extensible Markup Language (VoiceXML) 3.0

W3C Working Draft 25 August 2009

This version:
http://www.w3.org/TR/2009/WD-voicexml30-20090825/
Latest version:
http://www.w3.org/TR/voicexml30/
Previous version:
http://www.w3.org/TR/2009/WD-voicexml30-20090602/
Editors:
Scott McGlashan, Hewlett-Packard (co-Editor-in-Chief)
Daniel C. Burnett, Voxeo (co-Editor-in-Chief)
Rahul Akolkar, IBM
RJ Auburn, Voxeo
Paolo Baggia, Loquendo
Jim Barnett, Genesys Telecommunications Laboratories
Michael Bodell, Microsoft
Jerry Carter, Nuance
Matt Oshry, Microsoft
Kenneth Rehor, Cisco
Milan Young, Nuance
Xu Yang, Aspect
Rafah Hosn (until 2008, when at IBM)

Abstract

This document specifies VoiceXML 3.0, a modular XML language for creating interactive media dialogs that feature synthesized speech, recognition of spoken and DTMF key input, telephony, mixed initiative conversations, and recording and presentation of a variety of media formats including digitized audio and digitized video.

Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 25 August 2009 Third Public Working Draft of "Voice Extensible Markup Language (VoiceXML) 3.0". The main differences from the previous draft are as follows: Added clarification to the description in "2 Overview" and "3 Data Flow Presentation (DFP) Framework". Added a brief description of "Speaker Identification and Verification" functionality to "5.4 SIV Resource". Added clarification to the description in "6.10.1 Syntax". Added "6.11 Builtin Grammar Module", "6.12 Data Access and Manipulation Module", "6.13 External Communication Module" and "6.14 Session Root Module". Added clarification to the text in "7 Profiles". Renamed "7.2 Media Server Profile" to "7.2 Basic Profile", and added clarification to the text. Added "7.3 Maximal Profile" and "7.4 Convenience Syntax (Syntactic Sugar)". Added "9 Integration with Other Markup Languages". See Appendix E Major changes since the last Working Draft.

A diff-marked version of this document is also available for comparison purposes.

This document is very much a work in progress. Many sections are incomplete, only stubbed out, or missing entirely. To get early feedback, the group focused on defining enough functionality, modules, and profiles to demonstrate the general framework. To complete the specification, the group expects to introduce additional functionality (for example, speaker identification and verification, and external eventing) and to describe the existing functionality at the level of detail given for the Prompt and Field modules. We explicitly request feedback on the framework, particularly any concerns about its implementability or suitability for expected applications. By early 2010 the group expects to have all existing functionality defined in detail, the new functionality stubbed out, and the VoiceXML 2.1 profile largely defined. Subsequently the group expects to have all functionality and both profiles defined in detail.

Applications written as VoiceXML 2.1 documents can be run on a 3.0 processor using the 2.1 profile. As an example, the Implementation Report tests for 2.1 (which include the IR tests for 2.0) will be supported on a 3.0 processor. Exceptions will be limited to clarifications and changes needed to improve interoperability.

This document is a W3C Working Draft. It has been produced as part of the Voice Browser Activity. The authors of this document are participants in the Voice Browser Working Group ( W3C members only ). For more information see the Voice Browser FAQ . The Working Group expects to advance this Working Draft to Recommendation status.

Comments are welcome on www-voice@w3.org ( archive ). See W3C mailing list and archive usage guidelines .

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy . W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy .

Table of Contents

1 Terminology
2 Overview
    2.1 Structure of VoiceXML 3.0
    2.2 Structure of this document
    2.3 How to read this document
3 Data Flow Presentation (DFP) Framework
    3.1 Data
    3.2 Flow
    3.3 Presentation
4 Core Concepts
    4.1 Syntactic and Semantic descriptions
    4.2 Semantic Conceptual Model
        4.2.1 Top Level Controller
    4.3 Syntax
    4.4 Event Model
        4.4.1 Internal Events
            4.4.1.1 Event Interfaces
                4.4.1.1.1 Event
                4.4.1.1.2 EventTarget
                4.4.1.1.3 EventListener
            4.4.1.2 Event Flow
                4.4.1.2.1 Event Listener Registration
                4.4.1.2.2 Event Listener Activation
            4.4.1.3 Event Categories
        4.4.2 External Events
    4.5 Document Initialization and Execution
        4.5.1 Initialization
        4.5.2 Execution
            4.5.2.1 Subdialogs
            4.5.2.2 Application Root
            4.5.2.3 Summary of Syntax/Semantics Interaction
5 Resources
    5.1 Datamodel Resource
        5.1.1 Data Model Resource API
    5.2 Prompt Queue Resource
        5.2.1 State Chart Representation
        5.2.2 SCXML Representation
        5.2.3 Defined Events
        5.2.4 Device Events
        5.2.5 Open Issue
    5.3 Recognition Resources
        5.3.1 Definition
        5.3.2 Defined Events
        5.3.3 Device Events
        5.3.4 State Chart Representation
        5.3.5 SCXML Representation
    5.4 SIV Resource
6 Modules
    6.1 Grammar Module
        6.1.1 Syntax
            6.1.1.1 Attributes
            6.1.1.2 Content Model
        6.1.2 Semantics
            6.1.2.1 Definition
            6.1.2.2 Defined Events
            6.1.2.3 External Events
            6.1.2.4 State Chart Representation
        6.1.3 Events
        6.1.4 Examples
    6.2 Inline SRGS Grammar Module
        6.2.1 Syntax
        6.2.2 Semantics
            6.2.2.1 Definition
            6.2.2.2 Defined Events
            6.2.2.3 External Events
            6.2.2.4 State Chart Representation
            6.2.2.5 SCXML Representation
        6.2.3 Events
        6.2.4 Examples
    6.3 External Grammar Module
        6.3.1 Syntax
            6.3.1.1 Attributes
            6.3.1.2 Content Model
        6.3.2 Semantics
            6.3.2.1 Definition
            6.3.2.2 Defined Events
            6.3.2.3 External Events
            6.3.2.4 State Chart Representation
            6.3.2.5 SCXML Representation
        6.3.3 Events
        6.3.4 Examples
    6.4 Prompt Module
        6.4.1 Syntax
            6.4.1.1 Attributes
            6.4.1.2 Content Model
        6.4.2 Semantics
            6.4.2.1 Definition
            6.4.2.2 Defined Events
            6.4.2.3 External Events
            6.4.2.4 State Chart Representation
            6.4.2.5 SCXML Representation
        6.4.3 Events
        6.4.4 Examples
    6.5 Builtin SSML Module
        6.5.1 Syntax
        6.5.2 Semantics
        6.5.3 Examples
    6.6 Media Module
        6.6.1 Syntax
            6.6.1.1 Attributes
            6.6.1.2 Content Model
                6.6.1.2.1 Tips (informative)
        6.6.2 Semantics
        6.6.3 Examples
    6.7 Parseq Module
        6.7.1 Syntax
        6.7.2 Semantics
        6.7.3 Examples
    6.8 Foreach Module
        6.8.1 Syntax
            6.8.1.1 Attributes
            6.8.1.2 Content Model
        6.8.2 Semantics
        6.8.3 Examples
    6.9 Form Module
        6.9.1 Syntax
        6.9.2 Semantics
            6.9.2.1 Form RC
                6.9.2.1.1 Definition
                6.9.2.1.2 Defined Events
                6.9.2.1.3 External Events
                6.9.2.1.4 State Chart Representation
                6.9.2.1.5 SCXML Representation
    6.10 Field Module
        6.10.1 Syntax
        6.10.2 Semantics
            6.10.2.1 Field RC
                6.10.2.1.1 Definition
                6.10.2.1.2 Defined Events
                6.10.2.1.3 External Events
                6.10.2.1.4 State Chart Representation
                6.10.2.1.5 SCXML Representation
            6.10.2.2 PlayandRecognize RC
                6.10.2.2.1 Definition
                6.10.2.2.2 Defined Events
                6.10.2.2.3 External Events
                6.10.2.2.4 State Chart Representation
                6.10.2.2.5 SCXML Representation
    6.11 Builtin Grammar Module
        6.11.1 Usage of Platform Grammars
        6.11.2 Platform Requirements
        6.11.3 Syntax and Semantics
        6.11.4 Examples
    6.12 Data Access and Manipulation Module
        6.12.1 Overview
        6.12.2 Semantics
            6.12.2.1 The scope stack
            6.12.2.2 Relevance of scope stack to properties
            6.12.2.3 Implicit variables
            6.12.2.4 Variable resolution
            6.12.2.5 Standard session variables
            6.12.2.6 Standard application variables
            6.12.2.7 Legal variable values and expressions
        6.12.3 Syntax
            6.12.3.1 Creating variables: the <var> element
            6.12.3.2 Reading variables: "expr" and "cond" attributes and the <value> element
                6.12.3.2.1 Inserting variable values in prompts: The <value> element
            6.12.3.3 Updating variables: the <assign> and <data> elements
                6.12.3.3.1 The <assign> element
                6.12.3.3.2 The <data> element
            6.12.3.4 Deleting variables: the <clear> element
            6.12.3.5 Relevance for properties
        6.12.4 Backward compatibility with VoiceXML 2.1
        6.12.5 Implicit functions using XPath
    6.13 External Communication Module
        6.13.1 Receiving external messages within a voice application
            6.13.1.1 External Message Reflection
            6.13.1.2 Receiving External Messages Asynchronously
            6.13.1.3 Receiving External Messages Synchronously
                6.13.1.3.1 <receive>
        6.13.2 Sending messages from a voice application
            6.13.2.1 sendtimeout
    6.14 Session Root Module
        6.14.1 Syntax
        6.14.2 Semantics
        6.14.3 Examples
7 Profiles
    7.1 VoiceXML 2.1 Legacy Profile
        7.1.1 Conformance
        7.1.2 Vxml Root Module Requirements
        7.1.3 Form Module Requirements
        7.1.4 Field Module Requirements
        7.1.5 Prompt Module Requirements
        7.1.6 Grammar Module Requirements
        7.1.7 Data Access and Manipulation Module Requirements
    7.2 Basic Profile
    7.3 Maximal Profile
    7.4 Convenience Syntax (Syntactic Sugar)
8 Environment
    8.1 Resource Fetching
        8.1.1 Fetching
        8.1.2 Caching
            8.1.2.1 Controlling the Caching Policy
        8.1.3 Prefetching
        8.1.4 Protocols
    8.2 Properties
        8.2.1 Speech Recognition Properties
        8.2.2 DTMF Recognition Properties
        8.2.3 Prompt and Collect Properties
        8.2.4 Media Properties
        8.2.5 Fetch Properties
        8.2.6 Miscellaneous Properties
    8.3 Speech and DTMF Input Timing Properties
        8.3.1 DTMF Grammars
            8.3.1.1 timeout, No Input Provided
            8.3.1.2 interdigittimeout, Grammar is Not Ready to Terminate
            8.3.1.3 interdigittimeout, Grammar is Ready to Terminate
            8.3.1.4 termchar and interdigittimeout, Grammar Can Terminate
            8.3.1.5 termchar Empty When Grammar Must Terminate
            8.3.1.6 termchar Non-Empty and termtimeout When Grammar Must Terminate
            8.3.1.7 termchar Non-Empty and termtimeout When Grammar Must Terminate
            8.3.1.8 Invalid DTMF Input
        8.3.2 Speech Grammars
            8.3.2.1 timeout When No Speech Provided
            8.3.2.2 completetimeout With Speech Grammar Recognized
            8.3.2.3 incompletetimeout with Speech Grammar Unrecognized
    8.4 Value Designations
        8.4.1 Integers
        8.4.2 Real Numbers
        8.4.3 Times
9 Integration with Other Markup Languages
    9.1 Embedding of VoiceXML within SCXML
    9.2 Integrating Flow Control Languages into VoiceXML
        9.2.1 SCXML for Dialog Management
            9.2.1.1 System-driven Dialog
            9.2.1.2 User-driven Dialog
        9.2.2 Graceful Degradation
        9.2.3 SCXML as Basis for Recursive MVC

Appendices

A Acknowledgements
B References
    B.1 Normative References
    B.2 Informative References
C Glossary of Terms
D VoiceXML 3.0 XML Schema
    D.1 Schema for VXML Root Module
    D.2 Schema for Form Module
    D.3 Schema for Field Module
    D.4 Schema for Prompt Module
    D.5 Schema for Builtin SSML Module
    D.6 Schema for Foreach Module
    D.7 Schema for Data Access and Manipulation Module
    D.8 Schema for Legacy Profile
E Major changes since the last Working Draft


1 Terminology

In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate required levels for compliant VoiceXML 3.0 implementations.

Terms used in this specification are defined in Appendix C Glossary of Terms .

2 Overview

How does one build a successor to VoiceXML 2.0/2.1? Requests for improvements to VoiceXML fell into two main categories: extensibility and new functionality.

To accommodate both, the Voice Browser Working Group

  1. Developed the detailed semantic descriptions of VoiceXML functionality that versions 2.0 and 2.1 lacked. The semantic descriptions clarify the meaning of the VoiceXML 2.0 and 2.1 functionalities and how they relate to each other. The semantic descriptions are represented in this document as English text, UML state chart visual diagrams [ref] and/or textual SCXML representations [ref]. Figure 1 illustrates the VoiceXML 3.0 framework, which contains some abstract UML state chart visual diagrams representing some existing VoiceXML functionality.

    Figure 1: VoiceXML 3.0 Framework - The red-filled cells indicate some functionality from VoiceXML 2.0 expressed as state charts

  2. Described the detailed semantics for new functionality. New functions include, for example, speaker identification and verification, video capture and replay, and a more powerful prompt queue. The semantic descriptions for these new functions are also represented in this document as English text, UML state chart visual diagrams [ref] and/or textual SCXML representations [ref]. Figure 2 contains some abstract UML state chart visual diagrams representing new functionality.

    Figure 2: VoiceXML 3.0 Framework - The red-filled cells indicate new functionality

  3. Organized the functionality into modules, with each module implementing different functions. One reason for the introduction of a more rigorous semantic definition is that it allows us to assign semantics to individual modules. This makes it easier to understand what happens when modules are combined or new ones are defined. In contrast, VoiceXML 2.0 and 2.1 had a single global semantic definition (the FIA), which made it difficult to understand what would happen if certain elements were removed from the language or if new ones were added. Figure 3 contains some modules, each containing VoiceXML 3.0 functionality. Vendors may extend VoiceXML functionality by creating additional modules with functionality not described in this document. For example, a vendor might create a new GPS input module. Application developers should be cautious about using vendor-specific modules because the resulting application may not be portable.

    Figure 3: VoiceXML 3.0 Framework - The red bolded rectangles indicate modules

  4. Defined the syntax of each module to incorporate any new functionality. Application developers use the syntax of each module as an API to invoke the module's functions. Figure 4 illustrates some simplified syntax associated with modules.

    Figure 4: VoiceXML 3.0 Framework - The bolded red text indicates syntax

  5. Introduced the concept of a profile (language), which incorporates the syntax of several modules. Figure 5 illustrates two profiles. For example, a VoiceXML 2.1 profile incorporates the syntax of most of the modules corresponding to VoiceXML 2.1 functionality, which will support most existing VoiceXML 2.1 applications. Thus most VoiceXML 2.1 applications can be easily ported to VoiceXML 3.0 using the VoiceXML 2.1 profile. Another profile omits the VoiceXML 2.1 Form Interpretation Algorithm (FIA); this profile may be used by developers who want to define their own flow control rather than using the FIA. Profiles enable platform developers to select just the functionality that application developers need for a platform or class of applications: for example, a lean profile for portable devices, or a full-function profile for server-based applications using all of the new functionality of VoiceXML 3.0.

    Figure 5: VoiceXML 3.0 Framework - The dotted red area and the dashed green area indicate two profiles

One of the benefits of detailed semantic descriptions is improved portability within VoiceXML. Two vendors may implement the same functionality differently; however, the functionality must be consistent with the semantic meanings described in this document so that application authors are isolated from the different implementations. This increases portability among platforms that support the same syntax. Note that there are many other factors that affect portability and are outside the scope of this document (e.g. speech recognition capabilities, telephony).

2.1 Structure of VoiceXML 3.0

This document covers the following:

  • This document explains the core of VoiceXML 3.0, an extensible framework that describes how semantics are defined, how syntax is defined, and how the two are connected together. In this document, "semantics" means behavior represented as English text, SCXML syntax and/or state chart diagrams. The term "syntax" refers to XML elements and attributes that are an application author's programming interface to the functionality defined by the "semantics".
  • Within this document, all the functionality of VoiceXML 3.0 is grouped into modules of related capabilities.
  • Modules can be combined together to create complete profiles (languages). This document describes how to define both modules and profiles.
  • In addition to describing the general framework, this document explicitly defines a broad range of functionality, several modules and two profiles.

2.2 Structure of this document

The remainder of this document is structured as follows:

3 Data Flow Presentation (DFP) Framework presents the Data-Flow-Presentation Framework, its importance for the development of VoiceXML 3.0 and how VoiceXML 3.0 fits into the model.

4 Core Concepts explains the core concepts underlying the new structure for VoiceXML, including resources, resource controllers, the relationship between syntax and semantics, DOM eventing, modules and profiles.

5 Resources presents the resources defined for the language. These provide the key presentation-related functionality in the language.

6 Modules presents the modules defined for the language. Each module consists of a syntax piece (with its user-visible events), a semantics piece (with its behind-the-scenes events) and a description of how the two are connected.

7 Profiles presents two profiles. The first, the VoiceXML 2.1 profile, shows how a language similar to VoiceXML 2.1 can be created using the structure and functionality of VoiceXML 3.0. The second, the Basic profile, is a simple compilation of all of the functionality available in VoiceXML 3.0.

The Appendices provide useful references and a glossary of terms used in the specification.

2.3 How to read this document

For everyone: Please first read 3 Data Flow Presentation (DFP) Framework . The data-flow-presentation distinction applies not only to VoiceXML 3.0, but to many of W3C's specifications. Understanding VoiceXML's role as a presentation language is crucial context for understanding the rest of the specification.

For application authors: we recommend that you begin with syntax and only gradually explore details of the semantics as you need to understand behavioral specifics.

  1. If you are familiar with VoiceXML 2 you might want to begin with the VoiceXML 2.1 Legacy profile in 7.1 VoiceXML 2.1 Legacy Profile to see an example of all the syntactic pieces in the finished profile.
  2. You should then review the syntax sections of each of the modules in 6 Modules , along with the Basic profile in 7.2 Basic Profile . When you need to understand how a bit of syntax is implemented, read the semantics section corresponding to that syntax.
  3. Along the way you will definitely want to review the parts of 4 Core Concepts that are relevant to your other reading (profiles, modules, syntax, semantics, and DOM eventing).

For VoiceXML platform developers: we recommend that you begin with the functionality and framework and only focus on syntax later.

  1. If you are familiar with VoiceXML 2 you might want to begin with the VoiceXML 2.1 Legacy profile in 7.1 VoiceXML 2.1 Legacy Profile to see the user-visible differences between the original VoiceXML 2.1 language and the new VoiceXML 2.1 Legacy profile. A brief review of the Basic profile in 7.2 Basic Profile would be good as well.
  2. Next you should review 4 Core Concepts in detail, since the rest of the language is built upon the framework described there.
  3. 5 Resources and 6 Modules (the semantics part) should be the bulk of your focus. Remember that they are semantic descriptions only and that you can implement the functionality any way you wish as long as the semantics remain the same.
  4. For document authors, one significant difference from VoiceXML 2.1 is support for the DOM event model (see 4.4 Event Model ).

3 Data Flow Presentation (DFP) Framework

Unlike VoiceXML 2.0/2.1, the focus in VoiceXML 3.0 is almost exclusively on the user interface portions of the language. By choice, very little work has gone into the development of data storage and manipulation or control flow capabilities. In short, VoiceXML 3.0 has been designed from the ground up as a *presentation* language, according to the definition presented in the Data Flow Presentation ( [DFP] ) Framework.

Although VoiceXML 3.0 is a presentation language, it also contains within it all three levels of the DFP framework ( Figure 6).

DFP Architecture

Figure 6: DFP Architecture

The Data Flow Presentation (DFP) Framework is an instance of the Model-View-Controller paradigm, where computation and control flow are kept distinct from application data and from the way in which the application communicates with the outside world. This partitioning of an application allows for any one layer to be replaced independently of the other two. In addition, it is possible to simultaneously make use of more than one Data (Model) language, Flow (Controller), and/or Presentation (View) language.

3.1 Data

The Data layer of VoiceXML 3.0 is responsible for maintaining all presentation-specific information in a format that is easily accessible and easily editable. Note that the data layer of VoiceXML 3.0 is very different from the backend data for an application. Presentation-specific information disappears when the VoiceXML 3.0 application terminates, while data in the backend database continues to exist after VoiceXML 3.0 terminates. Examples of presentation-specific data include the status of the dialog in collecting certain information, which prompts have just been played, how many of various error conditions have occurred so far, and the values entered by the user until they are transmitted to the back-end database or file system.

Within VoiceXML 3.0 the Data layer is realized through a pluggable data language and a data access or manipulation language. Access to and use of the data is aligned with options available in SCXML for simpler interaction with the Flow layer (see the next section). This specification defines two specific data languages, XML and ECMAScript, and two data access and manipulation languages, E4X/DOM and XPath. Others may be defined by implementers.
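As an illustrative sketch of how the Data layer surfaces in markup, the following hypothetical fragment keeps presentation-specific state in the ECMAScript data language using the <var> and <assign> elements described in 6.12 Data Access and Manipulation Module (the element names follow VoiceXML 2.1 conventions retained by the Legacy profile; the fragment is not normative syntax):

```xml
<!-- Illustrative only: presentation-specific state held in the
     ECMAScript data language. This state disappears when the
     application terminates, unlike backend data. -->
<form id="survey">
  <!-- counts how many noinput error conditions have occurred so far -->
  <var name="noinputCount" expr="0"/>
  <field name="drink">
    <prompt>Would you like coffee or tea?</prompt>
    <catch event="noinput">
      <assign name="noinputCount" expr="noinputCount + 1"/>
      <reprompt/>
    </catch>
  </field>
</form>
```

Here the counter is pure presentation state: it influences prompt selection during the dialog but is discarded when the session ends.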

3.2 Flow

The Flow layer of VoiceXML 3.0 is responsible for all application control flow, including business logic, dialog management, and anything else that is not strictly data or presentation. VoiceXML 3.0 provides primitives that contain the control flow needed to implement them, but all combination between and among elements at the syntax level is done via calls to external control flow processors. Two that are likely to be used with VoiceXML are CCXML and SCXML. Note that flow control components written outside of VoiceXML may be communicating not only with a VoiceXML processor but with an HTML browser, a video game controller, or any of a variety of other input and output components.
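For example, an external SCXML flow controller might invoke a VoiceXML dialog as its presentation component and branch on the result. The sketch below is hypothetical: the invoke type token, file name, and event names are illustrative only, and the integration mechanics are the subject of 9 Integration with Other Markup Languages.

```xml
<!-- Hypothetical sketch: SCXML as the Flow layer driving a
     VoiceXML presentation dialog. -->
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="collect">
  <state id="collect">
    <!-- hand presentation off to a VoiceXML dialog -->
    <invoke type="vxml3" src="drink.vxml"/>
    <!-- resume flow control when the dialog completes or fails -->
    <transition event="done.invoke" target="confirm"/>
    <transition event="error" target="apologize"/>
  </state>
  <state id="confirm"/>
  <state id="apologize"/>
</scxml>
```

The same controller could equally invoke an HTML browser or another presentation component in place of the VoiceXML dialog, which is the point of keeping the Flow layer separate.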

3.3 Presentation

The Presentation layer of VoiceXML 3.0 is responsible for all interaction with the outside world, i.e., human beings and external software components. VoiceXML 3.0 *is* the Presentation layer. Designed originally for human-computer interaction, VoiceXML "presents" a dialog by accepting audio and DTMF input and producing audio and video output. All [?] of the modules defined in this document belong to the VoiceXML 3.0 presentation layer.

4 Core Concepts

4.1 Syntactic and Semantic descriptions

This document specifies the VoiceXML 3.0 language as a collection of modules. Each module is described at two levels:

  1. Syntax level -- The syntax is a set of XML elements, attributes, and events used by VoiceXML 3.0 application developers to specify applications. The VoiceXML 3.0 elements and attributes are specified within each module and in the XML schema in appendix TBD. The events are DOM level 3 events. This document provides a textual description of each element, attribute, and event.
  2. Semantics level -- The semantics of each module is described in terms of resources, resource controllers, and semantic events that the resource controllers may generate and consume. Semantics is described by both UML state chart visual diagrams and SCXML representations.

The visual UML state chart diagrams are informative. They are included for ease of reading and quick understanding. The more detailed textual SCXML representations are normative.
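To give a flavor of what a normative SCXML representation looks like, the following is a hypothetical, highly simplified resource controller that starts playback on a semantic event and returns to idle when the resource reports completion. The state and event names here are illustrative inventions, not drawn from any module defined in this specification.

```xml
<!-- Hypothetical sketch of an SCXML semantic description:
     a minimal resource controller for a playback resource. -->
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="idle">
  <state id="idle">
    <!-- a semantic event from the syntax layer starts playback -->
    <transition event="play.request" target="playing"/>
  </state>
  <state id="playing">
    <!-- the resource reports completion with another semantic event -->
    <transition event="queue.done" target="idle"/>
  </state>
</scxml>
```

An equivalent UML state chart diagram would show the same two states and two transitions; the SCXML form is simply the machine-readable, normative rendering.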

The semantic descriptions are important for reasons including the following:

  • Enable alternative implementations of a module. The modular description in this document describes the semantics ("what" a module does), but not "how" a module is implemented. A module may be implemented differently on different platforms.
  • Enable alternative syntax for the semantic description code. For example, one platform developer might use a particular syntax for server-based modules, while another platform developer might use an alternative syntax for an embedded mobile platform.
  • The same semantic descriptions can be reused in multiple modules. For example, the semantic description of the Prompt module can be reused (referenced) from within the PromptQueue module.

4.2 Semantic Conceptual Model

The resources, resource controllers, and the events they generate are intended only to describe the semantics of VoiceXML 3 modules. Implementations are not required to use SCXML to implement VoiceXML 3 modules, nor must they create objects corresponding to resources, resource controllers, and the SCXML events they raise. The logical components are useful for describing how different syntax uses similar resources, or for future extensions to the language that may use these resources or hook into specific places in the semantic framework, but only the exposed behavior is necessary for a conformant VoiceXML 3 interpreter.

It is important to note that this model places no burden or requirement that a VoiceXML interpreter implement behavior as described in the model. Rather, the requirement is that the behavior must be the same as if it were implemented as described; it is permitted to have optimizations or a different architecture behind the implementation of the markup interpretation.

The logical SCXML events must be distinguished from the author-visible DOM events that are a mandatory part of the VoiceXML 3 language. Implementations MUST raise these DOM events and process them in the manner described in Section 4.4 Event Model. The interaction between actual DOM events and logical SCXML events is described in Section 4.5 Document Initialization and Execution, below.

The semantic model is a conceptual representation of the underlying behavior of VoiceXML. Each VoiceXML 3.0 module, such as form interpretation or prompt selection, is described using SCXML notation and optionally a UML state chart representation of the underlying behavior, expressed in terms of resources and resource controllers. While the resources and resource controllers are not exposed directly in the markup, they are used to define the semantics of VoiceXML 3.0 markup elements. For example, Figure 7 illustrates the relationship among resource controllers, resources, and media devices. The arrows represent events exchanged among components. A more concrete example is represented in Figure 8, which illustrates the Prompt Resource controller (further defined in Section 6.4.2), the PromptQueue Resource, and the SSML Media Player.

Semantic model overview

Figure 7: Semantic model with Resources and Resource Controllers

It is important to note that this Semantic model places no burden or requirements that a VoiceXML interpreter must implement behavior as described in the model. Rather, the requirement is that the behavior must be the same as if it were implemented as described, but it is permitted to have optimizations or a different architecture behind the implementation of the markup interpretation.

Semantic model details

Figure 8: Semantic model with Specific Examples

4.2.1 Top Level Controller

In addition to the resource controllers associated with modules like <form>, there is a top-level controller associated with the <vxml> element. It is responsible for starting processing and for deciding which Resource Controller to execute next (i.e., for <form> or other interaction elements). The top-level controller also holds session level properties and is responsible for returning results to the Flow Level when script execution terminates.

4.3 Syntax

VoiceXML 3.0 elements are defined using Schema and represented in DOM (Level 3).

4.4 Event Model

4.4.1 Internal Events

The event model for VoiceXML 3.0 builds upon the DOM Level 3 Events [DOM3Events] specification. DOM Level 3 Events offer a robust set of interfaces for managing the listener registration, dispatching, propagation, and handling of events, as well as a description of how events flow through an XML tree.

The DOM 3.0 event model offers VoiceXML developers a rich set of interfaces that allow them to easily add behavior to their applications. In addition, conforming to the standard DOM event model enables authors to integrate their Voice applications in next generation multimodal or multi-namespaced frameworks such as MMI and CDF with minimal efforts.

Within the VoiceXML 3.0 semantic model, the DOM Level 3 Events APIs are available to all Resource Controllers that have markup elements associated with them. Indeed, this section covers the eventing APIs as available to VoiceXML 3.0 markup elements. The following section describes how the semantic model ties in with the DOM eventing model.

4.4.1.1 Event Interfaces

All VoiceXML 3.0 markup elements implement interfaces that support the following:

  • Subscription to events by event listeners and, symmetrically, the removal of event listeners.
  • Publishing of the events emitted by their resources.
  • Event handling.
4.4.1.1.1 Event

The VoiceXML 3.0 Event interface extends the DOM Level 3 Event interface to support voice specific event information. In particular, the VoiceXML 3.0 Event interface supports a count integer that stores the number of times a resource emits a particular event type. The semantic model manages the count field by incrementing its value and resetting it as described in the section that follows.

Note:

RH: should we expose the count to authors? If so, should we have a special variable like event.count or something similar?
4.4.1.1.2 EventTarget

VoiceXML 3.0 markup elements implement the DOM Level 3 EventTarget interface. This interface allows registration and removal of event listeners as well as dispatching of events.

4.4.1.1.3 EventListener

The VoiceXML 3.0 markup elements implement the DOM Level 3 EventListener interface. This interface allows the activation of handlers associated with a particular event. When a listener is activated, the event handler execution is done in the semantic model as described in the section that follows.

4.4.1.2 Event Flow

[To be updated by Michael Bodell due April 1 2008]

Events propagate through markup elements as per the DOM event flow. Event listeners may be registered on any VoiceXML markup element.

When processing a VoiceXML 2.0 profile, event listeners are not allowed to be registered for the capture phase, as this contradicts the as-if-by-copy event semantics of VoiceXML 2.0. If a listener is registered with the capture phase set to true in a VoiceXML 2.0 document, an error.event.illegalphase event will be dispatched onto the root document and the listener registration will be ignored (does that sound reasonable to people?).

4.4.1.2.1 Event Listener Registration

The DOM Level 3 Event specification supports the notion of partial ordering using the event listener group; all events within a group are ordered. As such, in VoiceXML 3.0, event listeners are registered as they are encountered in the document. Furthermore, all event listeners registered on an element belong to the same default group. Both of these provisions ensure that event handlers will execute in document order.

4.4.1.2.2 Event Listener Activation

An event listener is triggered if:

  1. The type of the event propagating through the markup element matches the type or category of the event listener.
  2. The event listener is registered for the same phase. Note that in a VoiceXML 2.0 profile, only the at-target and bubble phases are supported.
  3. The event propagation has not been stopped for the listener's group.
  4. The conditional expression, if present on the handler's cond attribute, must evaluate to true.
  5. The value of the count attribute on a handler, if present, must be less than or equal to the count field inside the event being propagated.

Once an event listener is triggered, the execution is handled by the semantic model as described in the section below. Event propagation blocks until it is notified by the semantic model to proceed.
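The five activation conditions above can be collected into a single predicate. The following Python sketch is purely illustrative: the `Listener` and `VxmlEvent` classes and their field names are assumptions for this example, not part of the specification.

```python
# Hypothetical sketch of the listener-activation test described above.
# Class and field names (Listener, VxmlEvent) are illustrative, not normative.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VxmlEvent:
    type: str                      # e.g. "connection.disconnect.hangup"
    count: int = 1                 # times this event type has been emitted
    stopped_groups: frozenset = frozenset()

@dataclass
class Listener:
    category: str                  # specific event type or category prefix
    phase: str = "bubble"          # "capture", "at-target", or "bubble"
    group: str = "default"
    cond: Optional[Callable[[], bool]] = None  # handler's cond attribute
    count: Optional[int] = None                # handler's count attribute

def is_triggered(listener: Listener, event: VxmlEvent, phase: str) -> bool:
    parts = event.type.split(".")
    prefixes = {".".join(parts[:i + 1]) for i in range(len(parts))}
    if listener.category not in prefixes:                      # condition 1
        return False
    if listener.phase != phase:                                # condition 2
        return False
    if listener.group in event.stopped_groups:                 # condition 3
        return False
    if listener.cond is not None and not listener.cond():      # condition 4
        return False
    if listener.count is not None and listener.count > event.count:  # cond. 5
        return False
    return True
```

For example, a listener registered for the `connection` category in the bubble phase is triggered by a `connection.disconnect` event propagating in that phase, but not by the same event in the capture phase.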

4.4.1.3 Event Categories

The VoiceXML 3.0 specification extends the DOM 3 Event specification to support partial name matching on events. VoiceXML 3.0 creates categories of events (the list of categories needs to be specified in the VoiceXML 3.0 spec) and allows authors and the platform to register listeners for either a specific event type or for all events within a particular category or subcategory. For example, VoiceXML 3.0 may create a connection category such as:

      {"http://www.example.org/2007/v3","connection"}

The spec may also declare a subcategory of connection or a specific event type that belongs to this category:

      {"http://www.example.org/2007/v3","connection.disconnect"}
      {"http://www.example.org/2007/v3","connection.disconnect.hangup"}

Following this declaration, the VoiceXML 3.0 Event specification uses partial name matching to associate events propagating through the DOM to listeners registered on the tree. The VoiceXML 3.0 Event specification follows the prefix matching used in VoiceXML 2.0 for associating events with their categories.
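The prefix matching inherited from VoiceXML 2.0 operates on whole dot-separated tokens: a catch name matches an event if it equals the event name or is a token-boundary prefix of it. A minimal illustrative sketch (the function name is not part of the specification):

```python
# Illustrative prefix matching on dot-separated event names, as inherited from
# VoiceXML 2.0: a catch name matches an event if it equals the event name or
# is a whole-token prefix of it (never a partial token).
def matches(catch_name: str, event_name: str) -> bool:
    return event_name == catch_name or event_name.startswith(catch_name + ".")
```

Note the token-boundary check: a catch for `connection.disconnect` matches `connection.disconnect.hangup` but not `connection.disconnected`.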

Note:

It might be useful to introduce the "*" notation to specify a catch for all events irrespective of their type and/or category.

4.4.2 External Events

VoiceXML 3.0 interpreters may receive events from external sources, for example SCXML engines. In particular, an interpreter may receive the life cycle events specified as part of the Multimodal Architecture and Interfaces specification [MMI]. These life cycle events allow the flow component of the DFP architecture to control the presentation layer by starting and stopping the processing of markup. By handling these events, the VoiceXML interpreter acts as a 'modality component' in the multimodal architecture, while the flow component acts as an 'interaction manager'. As a result, VoiceXML 3 applications can be easily extended into multimodal applications. However, it is important to note that support for the life cycle events is required by the DFP framework in all applications, whether uni- or multimodal.

The interpreter must handle the following life cycle events automatically:

  • PrepareRequest. This event instructs the interpreter to prepare to run a VoiceXML script. The event contains: a) either the URI of the markup to run or the actual markup itself, b) a context ID which will be used in subsequent messages referring to the same markup, and c) a specification of media channel to use. Note that the interpreter does not actually start running the markup in question when it receives this message. Once the interpreter has finished its preparation, it sends a PrepareResponse event in reply. The PrepareResponse event contains the context ID that was sent in the PrepareRequest plus a status field containing 'success' or 'failure'. This message with a status of 'success' thus indicates that the interpreter is now ready to run the specified markup.
  • StartRequest. This event instructs the interpreter to run the specified script. It will contain the same context ID as the preceding PrepareRequest, and may optionally contain a new specification of the markup to run and the media channel to use, overriding those contained in the PrepareRequest. This event may also be sent without a preceding PrepareRequest. When the interpreter receives this event, it must start running the specified markup using the specified media channel. It will then send a StartResponse event in reply.
  • CancelRequest with 'immediate' flag set to true. This event instructs the interpreter to disconnect from the media and to stop processing the markup. This event contains the context ID. The interpreter may continue processing clean-up handlers etc. after it receives this event, but it should not send any events back to the sender other than the CancelResponse, which acknowledges the receipt of this command. If the 'immediate' flag is set to false, the interpreter passes the event up to be handled by author code, as described below.
  • ClearContextRequest???

All other life cycle events and all other external events are ignored unless the External Communication Module (6.13) is included in the profile. If the External Communication Module is present, all other external events are passed up to the application, placed in the application event queue and then handled as specified by the developer using the functionality defined in that module.
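The automatic handling of these life cycle events can be sketched as a simple dispatcher. In the sketch below, the interpreter object and its `prepare`/`start`/`stop` methods are hypothetical stand-ins for implementation-specific operations, and events are modeled as plain dictionaries; none of these names are normative.

```python
# Hedged sketch of the automatic life cycle event handling described above.
# The interpreter object and its prepare/start/stop methods are hypothetical
# stand-ins for implementation-specific operations; events are plain dicts.
def handle_lifecycle(interpreter, event):
    name, ctx = event["name"], event.get("contextID")
    if name == "PrepareRequest":
        ok = interpreter.prepare(ctx, event.get("markup"),
                                 event.get("mediaChannel"))
        return {"name": "PrepareResponse", "contextID": ctx,
                "status": "success" if ok else "failure"}
    if name == "StartRequest":
        # May carry new markup/media channel, overriding the PrepareRequest.
        interpreter.start(ctx, event.get("markup"), event.get("mediaChannel"))
        return {"name": "StartResponse", "contextID": ctx}
    if name == "CancelRequest" and event.get("immediate"):
        interpreter.stop(ctx)  # disconnect media, stop processing the markup
        return {"name": "CancelResponse", "contextID": ctx}
    return None  # other external events: ignored or passed to author code
```

Each response echoes the context ID of the request, as required by the event descriptions above.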

Editorial note  
Open Issue: Should ClearContextRequest be handled automatically? Should Done be sent automatically when the document is finished? Where do these response events get sent?

4.5 Document Initialization and Execution

4.5.1 Initialization

The initialization ordering described here is a logical one, specifying which objects and information are available at each stage. Implementations are allowed to use a different ordering (in particular, they are allowed to interleave the construction of the DOM with the creation of semantic objects) as long as they behave as if they were following the order specified here. Similarly, we refer to a 'semantic constructor' as a cover term for whatever mechanism is used to create the Resource Controllers for a given node. No particular implementation is implied or required.

Before a VoiceXML 3.0 application is first loaded, all Resources are created. Whenever a document is loaded within that application, its DOM (level 3) is created. Then the initialization process creates the Resource Controllers by invoking the semantic constructor for the root <vxml> node of the DOM. The root <vxml> node constructor is responsible for invoking the constructors for all nodes in the document that have them. When it does this, it will call the semantic constructor routine passing it

  1. a pointer to the node that has the constructor
  2. a pointer to the root of the DOM
  3. an arbitrary data structure
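The construction pass described above can be sketched as a recursive walk. In this sketch the `CONSTRUCTORS` registry and the dict-based node representation are assumptions for illustration only; no particular implementation is implied.

```python
# Illustrative recursive construction pass: the root <vxml> constructor walks
# the DOM and invokes the semantic constructor of every node that has one,
# passing the three arguments listed above.  The CONSTRUCTORS registry and
# dict-based nodes are assumptions for this sketch, not normative.
CONSTRUCTORS = {}   # tag name -> constructor(node, dom_root, context)

def construct(node, dom_root, context):
    rcs = []
    constructor = CONSTRUCTORS.get(node["tag"])
    if constructor is not None:
        # Args: 1) the node that has the constructor, 2) the DOM root,
        # 3) an arbitrary data structure (context).
        rcs.append(constructor(node, dom_root, context))
    for child in node.get("children", []):
        rcs.extend(construct(child, dom_root, context))
    return rcs
```

Calling `construct` on the root `<vxml>` node thus yields the Resource Controllers for every node in the document that defines a constructor.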
Editorial note  
Open Issue: we must specify the operation of the root node constructor in more detail as part of the V3 specification. Other people can define modules, but we must specify how they are assembled into a full semantic representation of the application. If there is an application root document specified, the root node constructor will have to construct its RCs as well, by calling its root node constructor.

Note that the initial construction process creates the RCs but does not necessarily fully configure them. Further initialization, including in particular the creation of variables and variable scopes, will happen only when the RCs are activated at runtime (e.g. by visiting a Form). However, at this point the list of children for each element (and thus each RC) is known. For each RC this list of children will be populated into the appropriate place in the RC data model before semantic initialization of the RC.

Once the RCs are constructed, they are independent of the DOM, except for the interactions specified below. However, while they are running the RCs often make use of what appears to be syntactic information. For example, the concept of 'next item' relies heavily on document order, while <goto> can take a specific syntactic label as its target. We provide for this by assuming that RCs can maintain a shadow copy of relevant syntactic information, where "shadow copy" is intended to allow a variety of implementations. In particular, platforms may make an actual copy of the information or may maintain pointers back into the DOM. The construction process may create multiple RCs for a given node. In that case, one of the RCs will be marked as the primary RC. It is the one that will be invoked when the flow of control reaches that (shadow) node.

Editorial note   This population needs to happen after creation of the RCs and before general semantic initialization. The mapping from syntax to RCs occurs after the RCs are created, and that is when the list of children becomes known.

4.5.2 Execution

After initialization, the semantic control flow does a <goto> to the initial Resource Controller. Once a RC is running, it invokes Resources and other RCs by sending them events. The DOM is not involved in this process. At various points in the processing, however, an RC may decide to raise an author-visible event. It does this by creating an event targeted at a specific DOM node and sending it back to the DOM. When the DOM receives the event, it performs the standard bubble/capture cycle with the target specified in the event. In the course of the bubble/capture cycle, various event handlers may fire. Their execution is a semantic action and occurs back in the semantic 'side' of the environment. The DOM sends messages back to the appropriate semantic objects to cause this to happen. Note that this means that the DOM must store some sort of link to the appropriate RCs. The event handlers may update the data model, execute script, or raise other DOM events. When the handler finishes processing on the semantic side, it sends a notification back to the DOM so that it can resume the bubble/capture phase. (N.B. This notification is NOT a DOM event.) When the DOM finishes the bubble/capture processing of the event, it sends a notification back to the RC that raised the event so that it can continue processing.
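The round trip above can be modeled synchronously in a few lines. In this sketch the dict-based nodes and handler signatures are illustrative; returning from `dispatch()` plays the role of the (non-DOM) notification sent back to the RC that raised the event, so RC processing stays paused for the whole bubble/capture cycle.

```python
# Minimal synchronous model of the RC/DOM round trip described above, using
# illustrative dict-based nodes.  An RC raises an author-visible event by
# calling dispatch(); the DOM runs the capture phase root-first, then the
# at-target and bubble handlers; each handler runs on the "semantic side"
# (here, a plain callable).  Returning from dispatch() models the non-DOM
# notification sent back to the raising RC when the cycle completes.
def dispatch(target, event, log):
    chain = []                                  # target node up to the root
    node = target
    while node is not None:
        chain.append(node)
        node = node.get("parent")
    for node in reversed(chain[1:]):            # capture phase, root first
        for handler in node.get("capture", []):
            handler(event, log)
    for node in chain:                          # at-target, then bubble phase
        for handler in node.get("bubble", []):
            handler(event, log)
```

Because `dispatch` is synchronous, the raising RC cannot observe any intermediate state of the cycle, mirroring the concurrency restriction noted below.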

Editorial note  
Open Issue: Is this notification a standard semantic event? Note that RC processing must pause during the bubble/capture phase to avoid concurrency problems.
4.5.2.1 Subdialogs

A subdialog has a completely separate context from the invoking application. Thus it has a separate DOM and a separate set of RCs. However it shares the same set of Resources since they are global. When a subdialog is entered, the Datamodel Resource will have to create a new scope for the subdialog and hide the calling document's scopes. When the subdialog is exited, the Datamodel resource will destroy the subdialog scope(s) and restore the calling document's scope(s).
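The hide-and-restore behavior of the Datamodel Resource at subdialog boundaries can be sketched as follows; the class and method names are illustrative assumptions, not part of the specification.

```python
# Illustrative model of the Datamodel Resource at subdialog boundaries: the
# calling document's scopes are hidden, not destroyed, while the subdialog
# runs, and restored when it exits.  All names here are sketch assumptions.
class ScopeStack:
    def __init__(self):
        self.visible = [("Global", {})]   # the active scope stack
        self.hidden = []                  # stacks hidden by subdialog entry

    def set(self, name, value):
        self.visible[-1][1][name] = value

    def get(self, name):
        for _, variables in reversed(self.visible):
            if name in variables:
                return variables[name]
        raise LookupError(name)           # hidden scopes are not searched

    def enter_subdialog(self):
        self.hidden.append(self.visible)  # hide the caller's scopes
        self.visible = [("subdialog", {})]

    def exit_subdialog(self):
        self.visible = self.hidden.pop()  # restore the caller's scopes
```

While the subdialog runs, a lookup of a caller variable fails, because only the subdialog's own scope chain is visible.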

4.5.2.2 Application Root

To handle event propagation from the leaf application to the application root document, we create a Document Manager to handle all communication between the documents. This means that the DOMs of the two documents remain separate. When an event is not handled in the leaf document, the Document Manager will propagate it to the application root, where it will be targeted at the <vxml> node. Requests to fetch properties or to activate grammars will be handled by the Document Manager in a similar fashion. To handle platform- and/or language-level defaults, we will create a "super-root" document above the application root. The Document Manager will pass it events and requests that are not handled in the root document. If the root and super-root documents do not handle an event, the Document Manager will ensure that the event is thrown away.
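The Document Manager's fallback chain can be sketched in a few lines; the per-document `handle` method below is a hypothetical stand-in for delivering the event to that document's DOM.

```python
# Sketch of the Document Manager's fallback chain: an event unhandled by the
# leaf document is retargeted at the application root, then at the platform
# "super-root"; if still unhandled, it is thrown away.  The per-document
# handle() method is a hypothetical stand-in, not part of the specification.
def route(event, leaf_doc, root_doc, super_root_doc):
    for doc in (leaf_doc, root_doc, super_root_doc):
        if doc.handle(event):
            return doc        # the document that consumed the event
    return None               # no document handled it: event is discarded
```

Requests for properties or grammar activation would follow the same leaf-to-super-root order.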

4.5.2.3 Summary of Syntax/Semantics Interaction

There are four kinds of interactions between RCs and the DOM at runtime:

  1. RCs can inject DOM events into the DOM.
  2. The DOM can invoke the RCs for specific event handlers.
  3. The event handler RCs can signal the DOM when they have finished processing.
  4. The DOM can signal the emitting RC when it has finished processing a DOM event.
Editorial note  
Open Issue: DOM Modification. There are two possibilities: 1) we can refuse to allow the DOM to be modified (or ignore the modifications if it is) 2) we can reconstruct the relevant resource controllers when the DOM is modified. In the latter case, the straightforward approach would be: a) find the least node that is an ancestor of all the changes and that has a constructor b) call its constructor as during initialization, using the current state of the DOM and RCs as context.

5 Resources

This section describes semantic models for common VoiceXML resources. Resources have a life cycle of creation and destruction. Specific resources may specify detailed requirements on these phases. All resources must be created prior to their use by a VoiceXML interpreter.

Editorial note  
Standard lifecycle events are expected to be defined in later versions: create event: from idle to created; destroy event: from created to idle.

Resources are defined in terms of a state model and the events which the resource processes within defined states. Events may be divided into those which are defined by the resource itself and events defined by other conceptual entities which the resource receives or sends within these states. These conceptual entities include resource controllers and a 'device' which provides an implementation of the services defined by the resource.

The semantic model is specified in both UML state chart diagrams and SCXML representations. In case of ambiguity, the SCXML representation takes precedence over UML diagrams. Note that SCXML is used here to define the states and events for resources and this definitional usage should not be confused with the use of SCXML to specify application flow (see 3.2 Flow). Furthermore, these resource events are conceptual, not DOM events: they are used to define the relationship with other conceptual entities and are not exposed at the markup level. The relationship between conceptual events and DOM events is described in XXX.

The following resources are defined: data model (5.1 Datamodel Resource), prompt queue (5.2 Prompt Queue Resource) and DTMF and ASR recognition (5.3 Recognition Resources).

[Later versions will define the following resources: recorder, SIV. Later versions may define the following resources: session recorder, ...]

5.1 Datamodel Resource

Editorial note  

Later versions of this document will clarify that different datamodels may be instanced, such as ECMAScript, XML, etc. Conformance requirements will be stated at a later stage.

The datamodel is a repository for both user- and system-defined data and properties. To simplify variable lookup, we define the datamodel with a synchronous function-call API, rather than an asynchronous one based on events. The data model API does not assume any particular underlying representation of the data or any specific access language, thus allowing implementations to plug in different concrete data model languages.

There is a single global data model that is created when the system is first initialized. Access to data is controlled by means of scopes, which are stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "Global" is created. Thereafter scopes are explicitly created and destroyed by the data model's clients.

Editorial note  
Resource and Resource controller description to be updated with API calls rather than events.

5.1.1 Data Model Resource API

Table 1: Data Model API
Function Arguments Return Value Sequencing Description
CreateScope name(optional) Success or Failure Creates a new scope object and pushes it on top of the scope stack. If no name is provided the scope is anonymous and may be accessed only when it is on the top of the scope stack. A Failure status is returned if a scope already exists with the specified name.
DeleteScope name(optional) Success or Failure Removes a scope from the scope stack. If no name is provided, the topmost scope is removed. Otherwise the scope with provided name is removed. A Failure status is returned if the stack is empty or no scope with the specified name exists.
CreateVariable variableName, value(optional), scopeName(optional) Success or Failure Creates a variable. If scopeName is not specified, the variable is created in the topmost scope on the scope stack. If no value is provided, the variable is created with the default value specified by the underlying datamodel. A Failure status is returned if a variable of the same name already exists in the specified scope.
DeleteVariable variableName, scopeName(optional) Success or Failure Deletes the variable with the specified name from the specified scope. If no scopeName is provided, the variable is deleted from the topmost scope on the stack. The status Failure is returned if no variable with the specified name exists in the scope.
UpdateVariable variableName, newValue, scopeName(optional) Success or Failure Assigns a new value to the variable specified. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. A Failure status is returned if the specified variable or scope cannot be found.
ReadVariable variableName, scopeName(optional) value Returns the value of the variable specified. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. An error is raised if the specified variable or scope cannot be found.
EvaluateExpression expr, scopeName(optional) value Evaluates the specified expression and returns its value. If scopeName is not specified, the expression is evaluated in the topmost scope on the stack. An error is raised if the specified scope cannot be found.
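A hedged Python sketch of Table 1's synchronous API follows (DeleteVariable and EvaluateExpression are omitted for brevity). The class name, the status strings used as return values, and the use of LookupError for read errors are illustrative choices for this sketch, not normative.

```python
# Hedged sketch of the Data Model Resource API in Table 1, using a stack of
# (name, variables) scopes.  The synchronous call-and-return shape mirrors
# the table; everything else here is an illustrative assumption.
class DataModel:
    def __init__(self):
        # A single scope named "Global" exists after initialization.
        self.stack = [("Global", {})]

    def _scope(self, scope_name):
        if scope_name is None:
            return self.stack[-1][1]           # default: topmost scope
        for name, variables in reversed(self.stack):
            if name == scope_name:
                return variables
        return None

    def create_scope(self, name=None):
        if name is not None and any(n == name for n, _ in self.stack):
            return "Failure"                   # duplicate scope name
        self.stack.append((name, {}))
        return "Success"

    def delete_scope(self, name=None):
        if not self.stack:
            return "Failure"
        if name is None:
            self.stack.pop()                   # remove topmost scope
            return "Success"
        for i, (n, _) in enumerate(self.stack):
            if n == name:
                del self.stack[i]
                return "Success"
        return "Failure"

    def create_variable(self, var, value=None, scope_name=None):
        scope = self._scope(scope_name)
        if scope is None or var in scope:
            return "Failure"
        scope[var] = value                     # None models the default value
        return "Success"

    def update_variable(self, var, value, scope_name=None):
        scope = self._scope(scope_name)
        if scope is None or var not in scope:
            return "Failure"
        scope[var] = value
        return "Success"

    def read_variable(self, var, scope_name=None):
        scope = self._scope(scope_name)
        if scope is None or var not in scope:
            raise LookupError(var)             # "an error is raised"
        return scope[var]
```

A concrete data model language (ECMAScript, XML, etc.) would sit behind this interface without changing its shape.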

Issue ():

Do we need event listeners on the data model, e.g., to notify when the value of a variable changes?

Resolution:

None recorded.

5.2 Prompt Queue Resource

5.2.1 State Chart Representation

Here is a UML representation of the prompt queue. This state machine assumes that "queue" and "play" are separate commands and that a separate "play" will always be issued to trigger the play. When the "play" is issued, the system plays any queued prompts, up to and including the first fetch audio in the queue. Then it halts, even if there are additional prompts or fetch audio in the queue, and waits for another "play" command.
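The queue/play split described above can be modeled in miniature; in this sketch queue entries are represented as (label, is-fetch-audio) pairs purely for illustration.

```python
# Illustrative model of the queue/play split: play() drains the queue up to
# and including the first fetch audio entry, then halts and waits for the
# next "play" command.  Entries are (label, is_fetch_audio) pairs; this is a
# sketch of the behavior described above, not a normative algorithm.
def play(queue):
    played = []
    while queue:
        label, is_fetch_audio = queue.pop(0)
        played.append(label)      # send this entry to the player device
        if is_fetch_audio:
            break                 # halt after the first fetch audio
    return played
```

Entries after the first fetch audio remain queued until the next "play" command arrives.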

Editorial note  

Open issue: Can queued prompt commands, either audio or TTS, be left un-fetched or un-rendered until a play command is issued to the prompt resource? This may result in delays or gaps in the production of the actual audio, as the rendering or fetching may not produce playable audio fast enough to avoid inter-prompt delays.

The prompt structure assumed here is fairly abstract. It consists of a specification of the audio along with optional parameters controlling playback (for example, speed or volume). The audio may be presented in-line, as SSML or some other markup language, or as a pointer to a file or streaming audio source. Logically, URLs are dereferenced at the time the prompt is queued, but implementations are not required to fetch the actual media until the prompt in question is sent to the player device. Note that the player device is assumed to be able to handle both recorded prompts and TTS, and to be able to interpret SSML. Platforms are free to optimize their implementations as long as they conform to the state machine specified here. In particular, platforms may prefetch audio or begin TTS processing in the background before the prompt is sent to the player device. For applications that make use of VCR controls (speed up, skip forward, etc.), actual performance may depend on whether the platform has implemented such optimizations. For example, a request to skip forward on a platform that does not prefetch prompts may result in a long delay. Such performance issues are outside the scope of this specification.

This diagram assumes that SSML mark information is delivered in the Player.Done event, and that the player returns a Player.Done event when it is sent a 'halt' event (otherwise mark information would get lost on barge-in and hangup, etc).

Note that the "FetchAudio" state is shown stubbed out for reasons of space, and is expanded in a separate diagram below the main one.

Semantic model for prompt queue semantics

Figure X: Prompt Queue Model

Semantic model for fetch audio

Figure Y: Fetch audio Model

5.2.2 SCXML Representation

<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data name="queue"/>
    <data name="markName"/>
    <data name="markTime"/>
    <data name="bargeInType"/>
  </datamodel>
   <state id="Created">
    <initial id="Idle"/>
    <transition event="QueuePrompt">
     <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/prompt"/>
    </transition>
     <transition event="QueueFetchAudio">
       <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt"> 
         <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
         <else/>
          <assign loc="$node[@bargeInType]" val="unbargeable"/>
         </if>
       </foreach>
    <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/audio"/>
    </transition>
    <transition event="setParameter">
     <send target="player" event="setParameter" namelist="_eventData.paramName, _eventData.newValue"/>
    </transition>
    <transition event="Cancel" target="Idle">
     <send target="player" event="halt"/>
     <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
     <delete loc="datamodel/data[@name='queue']/prompt"/>
    </transition>
    <transition event="CancelFetchAudio">
       <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt"> 
         <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
         </if>
       </foreach>
    </transition>
    <state id="Idle">
     <onentry>
      <assign loc="/datamodel/data[@name='markName']" val=""/>
      <assign loc="/datamodel/data[@name='markTime']" val="-1"/>
      <assign loc="/datamodel/data[@name='bargeInType']" val=""/>
     </onentry>
     <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'false'" target="PlayingPrompt"/>
     <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state>
    <state id="PlayingPrompt">
     <datamodel>
      <data name="currentPrompt"/>
     </datamodel>
     <onentry>
      <assign loc="/datamodel/data[@name='currentPrompt']/prompt" val="/datamodel/data[@name='queue']/prompt[1]"/>
      <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
      <if cond="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType] != /datamodel/data[@name='bargeInType']">
       <send event="BargeInChange" namelist="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
       <assign loc="/datamodel/data[@name='bargeInType']" expr="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
      </if>
     </onentry>
     <invoke targettype="player" srcexpr="/datamodel/data[@name='currentPrompt']/prompt"/>
     <finalize>
       <if cond="_eventData/MarkTime neq '-1'">
         <assign loc="/datamodel/data[@name='markName']" val="_eventData/markName.text()"/>
         <assign loc="/datamodel/data[@name='markTime']" val="_eventData/markTime.text()"/>
       </if>
     </finalize>
     <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[last()] le '1'" target="Idle">
      <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
     </transition>
     <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] neq 'true'" target="PlayingPrompt"/>
     <transition event="player.Done"
         cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state> <!-- end PlayingPrompt -->
    <state id="FetchAudio">
      <initial id="WaitFetchAudio"/>
      <transition event="player.Done" target="FetchAudioFinal"/>
      <state id="WaitFetchAudio">
        <onentry>
          <send target="self" event="fetchAudioDelay"
          delay="/datamodel/data[@name='queue']/prompt[1][@fetchaudiodelay]"/>
        </onentry>
       <transition event="fetchAudioDelay" target="StartFetchAudio"/>
       <transition event="cancelFetchAudio" target="FetchAudioFinal"/>
      </state>
     <state id="StartFetchAudio">
      <datamodel>
       <data name="fetchAudio"/>
      </datamodel>
      <onentry>
       <assign loc="/datamodel/data[@name='fetchAudio']" expr="/datamodel/data[@name='queue']/prompt[1]"/>
       <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
       <send target="self" event="fetchAudioMin" delay="/datamodel/data[@name='fetchAudio'][@fetchaudiominimum]"/>
       <send target="player" event="Play" namelist="/datamodel/data[@name='fetchAudio']"/>
       <if cond="/datamodel/data[@name='bargeInType'].text() ne 'fetchAudio'">
         <send event="BargeInChange" namelist="fetchAudio"/>
       </if>
      </onentry>
       <transition event="cancelFetchAudio" target="WaitFetchMinimum"/>
       <transition event="fetchAudioMin" target="WaitFetchCancel"/>
     </state>
     <state id="WaitFetchMinimum">
       <transition event="fetchAudioMin" target="FetchAudioFinal">
         <send target="player" event="halt"/>
       </transition>
     </state>
     <state id="WaitFetchCancel">
        <transition event="cancelFetchAudio" target="FetchAudioFinal">
         <send target="player" event="halt"/>
       </transition>
     </state>
     <state id="FetchAudioFinal" final="true" />
     <!-- could put cleanup handling here -->
    </state> <!-- end FetchAudio -->
   </state> <!-- end Created -->
</scxml>

5.2.3 Defined Events

The prompt queue resource can be controlled by means of the following events:

Table 2: Events received by prompt queue resource
Event | Source | Payload | Sequencing | Description
queuePrompt | any | prompt (M), properties (O) | | Adds prompt to queue, but does not cause it to be played.
queueFetchAudio | any | prompt (M) | | Adds fetch audio to queue, removing any existing fetch audio from the queue. Does not cause it to be played.
play | any | | | Causes any queued prompts or fetch audio to be played.
changeParameter | any | paramName, newValue | | Sets the value of paramName to newValue, which may be either an absolute or relative value. The new setting takes effect immediately, even if a prompt is already playing.
cancelFetchAudio | any | | | Deletes any queued fetch audio. Also cancels any fetch audio that is already playing, unless fetchAudioMin has been specified and not yet reached.
cancel | any | | | Immediately cancels any prompt or fetch audio that is playing and clears the queue.
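For illustration, a resource controller driving the prompt queue could queue two prompts and then start playback with an SCXML fragment like the following. This is a non-normative sketch: the event names follow Table 2, but the target name and the prompt data locations are assumptions.

```xml
<!-- Illustrative sketch only. Event names are from Table 2;
     'promptQueue', 'welcomePrompt' and 'menuPrompt' are assumed names. -->
<onentry>
  <send target="promptQueue" event="queuePrompt"
        namelist="/datamodel/data[@name='welcomePrompt']"/>
  <send target="promptQueue" event="queuePrompt"
        namelist="/datamodel/data[@name='menuPrompt']"/>
  <!-- Queued prompts are not played until an explicit play event. -->
  <send target="promptQueue" event="play"/>
</onentry>
```

Per Table 3, the controller would then receive a prompt.Done event once both prompts have played to completion and the queue is empty.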

The prompt queue resource returns the following events to its invoker:

Table 3: Events sent by prompt queue resource
Event | Target | Payload | Sequencing | Description
prompt.Done | controller | markName (O), markTime (O) | | Indicates the prompt queue has played to completion and is now empty.
bargeintypeChange | controller | one of: unbargeable, hotword, energy, fetchAudio | | Sent at start of prompt play and whenever a new prompt or fetch audio is played whose bargeinType differs from the preceding one.

Issue:

Do we need 'fetchAudio' as a distinct bargein type?

Resolution:

None recorded.

5.2.4 Device Events

The prompt queue receives the following events from the underlying player:

Table 4: Prompt Queue: Events from Device
Event | Payload | Sequencing | Description
player.Done | | | Sent whenever a single prompt or piece of fetch audio finishes playing.

and sends the following events to the underlying device:

Table 5: Prompt Queue: Events sent to Device
Event | Payload | Sequencing | Description
play | prompt (M) | | Sent to the platform to cause a single prompt to be played.
setParameter | paramName (M), value (O) | | Sent to the platform to change the value of a playback parameter such as speed or volume. The new value may be absolute or relative. The change takes effect immediately.

5.2.5 Open Issue

Issue:

Differences in PromptQueue Definition: see Details (members only).

Resolution:

None recorded.

5.3 Recognition Resources

Two types of recognition resources are defined: DTMF recognition for recognition of DTMF input; and ASR recognition for recognition of speech input. Both recognition resources are associated with a device which implements their respective recognition services. Each device represents one or more actual recognizer instances. In case of a device implemented with multiple recognizers - for example two different speech recognition engines - it is the responsibility of the interpreter implementation to ensure that they adhere to the semantic model defined in this section.

DTMF and ASR recognition resources are semantically similar. They share the same state and eventing model as well as recognition processing, timing and result handling. However, the resources differ in the following respects:

  • Properties: the DTMF resource uses DTMF properties (vxml20, 6.3.3), while the ASR resource uses speech recognition properties (vxml20, 6.3.2).
  • Mode: the DTMF resource has the mode value 'dtmf' and the ASR resource has the value 'voice' (vxml20, inputmodes, 6.3.6).
  • Buffering: only the DTMF resource may buffer input when the resource is not active (e.g. in the FIA transition state, vxml20: 4.1.8).

Otherwise, ASR and DTMF recognition resources share the same semantic model.

If a resource controller activates both DTMF and ASR recognition resources, then that resource controller is responsible for managing the resources so that only a single recognition result is produced per recognition cycle.

5.3.1 Definition

In the recognition resource's created state, grammars are added to the resource and subsequently prepared on the device; recognition with these grammars can be activated and suspended, and recognition results are returned.

When the recognition resource is ready to recognize (at least one active grammar), one or more recognition cycles may occur in sequence.

  • A recognition cycle is initiated when the resource sends the device an event instructing it to listen to the input stream.
  • A recognition cycle is terminated if the device sends the resource an error event, or the device is instructed to stop recognition by the resource. When terminated, the device removes partially or wholly processed input from its buffer, and the resource awaits grammars to prepare.
  • During the recognition cycle, the device may send events to the resource indicating ongoing recognition status, and recognition results describing one or more input sequences which match active grammars.
  • During the recognition cycle, the device may receive instructions to suspend recognition. When the device is suspended, input is not buffered and the device must not send any events until it receives instructions to re-start or terminate recognition.
  • When the resource receives recognition results from the device during the recognition cycle, it passes them to its controller. A recognition cycle is now complete and the resource awaits instructions either to start another recognition cycle or to terminate recognition.

Thus a recognition resource may enter multiple recognition cycles (as required for 'hotword' recognition), while requiring that a device, even if it has multiple instantiations, produce only one set of recognition results per recognition cycle.

The recognition resource is defined in terms of a data model and state model.

The data model is composed of the following elements:

  • activeGrammars: an ordered list of grammars with which to recognize. Each item in the list contains the following information:
    • content: a URI or inline content to the grammar itself
    • properties: grammar-specific properties (vxml20: weight, mode, type, maxage, maxstale, etc)
    • listener: a resource controller associated with this grammar
  • properties: properties pertaining to the recognition process. These properties differ depending on the type of the recognition resource: for a DTMF recognition resources, the properties include DTMF properties, and for ASR recognition resource, speech recognition properties. The properties may also include platform-specific recognition properties.
  • controller: the resource controller to which recognition status, results and error events are sent.
  • mode: the recognition resource's inputmode: 'voice' for an ASR recognition resource, and 'dtmf' for a DTMF recognition resource.

The state model is composed of states corresponding to functional state: idle, preparing grammars, ready to recognize, recognizing, suspended recognizing and waiting for results.

In the idle state, the resource awaits events from resource controllers to activate grammars for recognition on the device. The data model - activeGrammars, properties, controller and mode - is (re-)initialized upon entry to this state: activeGrammars is cleared, and properties and controller are set to null. If the resource receives an 'addGrammar' event, a new item is added to activeGrammars using the grammar, properties and listener data in the event payload. If the resource receives a 'prepare' event, it updates its data model with the event data: 'properties' is updated with the properties event data and 'controller' with the controller event data. Subsequent event notifications and responses are sent to the resource controller identified as the 'controller'. The recognition resource then moves into the preparing grammars state.

In the preparing grammars state, the resource behavior depends on whether activeGrammars is empty or not. If activeGrammars is empty (i.e. no active grammars are defined for this recognition resource), the resource sends the controller a 'notPrepared' event and returns to the idle state. If activeGrammars is non-empty, the resource sends a 'prepare' event to the device. The event payload includes 'grammars' and 'properties' parameters. The 'grammars' value is an ordered list where each list item is a grammar's content and its properties, extracted from activeGrammars. The order of grammars in the 'grammars' parameter must follow the order in the activeGrammars data model. If the device sends a 'prepared' event, the resource sends a 'prepared' event to the controller and transitions into the ready to recognize state.

When the recognition resource is in the ready to recognize state, it may receive a 'stop' event. In this case, the resource sends a 'stop' event to the device and returns to the idle state. If the resource receives a 'listen' event, it sends a 'listen' event to the device and moves into the recognizing state.

When the resource is in the recognizing state, it can toggle between this state and the suspended recognizing state by receiving further 'listen' and 'suspend' events. If the resource receives a 'suspend' event, it moves into the suspended recognizing state and sends the device a 'suspend' event, which causes the device to suspend recognition and delete any buffered input. No input is buffered while the device is in a suspended state. If the resource then receives a 'listen' event, it moves back into the recognizing state.

When in the recognizing state, the resource may receive an 'inputStarted' event from the device, indicating that user input has been detected. The resource then moves into the waiting for results state. The device may send an 'error' event (for example, if maximum time has been exceeded), causing the resource to return to the idle state and send the controller an 'error' event. Alternatively, the device may send a 'recoResults' event, which contains a results parameter: a data structure representing recognition results in VoiceXML 2.0 or EMMA format. The structure may contain zero or more recognition results. Each result must specify the grammar associated with the recognition (using the same grammar name as used in the payload of the 'prepare' event), its recognition confidence and its input mode. The resource sends its controller a 'recoResult' event with event data containing the device's results parameter together with a listener parameter whose value is the listener associated with the grammar of the first result with the highest confidence (if there are no results, the listener parameter is not defined). The resource then returns to the ready to recognize state, awaiting either a 'stop' event to terminate recognition or a 'listen' event to start another recognition cycle using the same active grammars and recognition properties.
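The lifecycle above can also be illustrated from the controller's side. The following non-normative SCXML sketch shows a controller activating a single grammar, preparing the resource, running one recognition cycle, and handling the outcome. The event names follow Tables 6 and 7; the state ids, target name and data locations are assumptions.

```xml
<!-- Illustrative controller sketch; 'asrResource', 'cityGrammar',
     'fieldRC' and 'recoProperties' are assumed names. -->
<state id="CollectingInput">
  <onentry>
    <!-- Activate one grammar, then prepare the resource (Table 6). -->
    <send target="asrResource" event="addGrammar"
          namelist="/datamodel/data[@name='cityGrammar'], /datamodel/data[@name='fieldRC']"/>
    <send target="asrResource" event="prepare"
          namelist="/datamodel/data[@name='fieldRC'], /datamodel/data[@name='recoProperties']"/>
  </onentry>
  <!-- Once prepared, start a recognition cycle. -->
  <transition event="prepared">
    <send target="asrResource" event="listen"/>
  </transition>
  <!-- A result completes the cycle; here we stop rather than start another. -->
  <transition event="recoResult" target="ProcessingResult">
    <send target="asrResource" event="stop"/>
  </transition>
  <transition event="error" target="HandlingError"/>
</state>
```

A 'hotword'-style controller would instead send another 'listen' from the ready to recognize state rather than 'stop', reusing the same active grammars.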

5.3.2 Defined Events

A recognition resource is defined by the events it receives:

Table 6: Events received by recognition resource
Event | Source | Payload | Sequencing | Description
addGrammar | any | grammar (M), listener (M), properties (O) | | Creates a grammar item composed of the grammar, listener and properties, and adds it to activeGrammars.
prepare | any | controller (M), properties (M) | | Prepares the device for recognition using activeGrammars and properties.
listen | any | | | Initiates/resumes recognition.
suspend | any | | | Suspends recognition.
stop | any | | | Terminates recognition.

and the events it sends:

Table 7: Events sent by recognition resource
Event | Target | Payload | Sequencing | Description
prepared | controller | one-of: prepared, notPrepared | | Positive response to prepare (activeGrammars prepared).
notPrepared | controller | one-of: prepared, notPrepared | | Negative response to prepare (no activeGrammars defined).
inputStarted | controller | | | Notification that onset of input has been detected.
inputFinished | controller | | | Notification that the end of input has been detected.
partialResult | controller | results (M), listener (O) | | Notification of a partial recognition result.
recoResult | controller | results (M), listener (O) | | Notification of a complete recognition result, including the results structure and a listener.
error | controller | error status (M) | | Notification that an error has occurred.

5.3.3 Device Events

The resource receives from the recognition device the following events:

Table 8: Recognition: Events from Device
Event | Payload | Sequencing | Description
prepared | | | Response to prepare indicating that activeGrammars have been successfully prepared.
inputStarted | | | Notification that the onset of input has been detected.
inputFinished | | | Notification that the end of input has been detected.
partialResults | results (M) | | Notification of partial recognition results.
recoResults | results (M) | | Notification of final recognition results.
error | error status (M) | | An error occurred.

and sends to the recognition device the following events:

Table 9: Recognition: Events sent to Device
Event | Payload | Sequencing | Description
prepare | grammars (M), properties (M) | | The recognition device is prepared with grammars and properties.
clear | | | All grammars and properties in the recognition device are to be cleared.
listen | | | Recognition is to be initiated.
suspend | | | Recognition is to be suspended.
stop | | | Recognition is to be stopped.

5.3.4 State Chart Representation

The state model for an ASR recognition resource is shown in Figure 9. The DTMF resource model differs only in that the value for the mode data is 'dtmf' instead of 'voice'.

[generalize stop event returning resource to idle state ...]

Recognition Resource States

Figure 9: Recognition Resource States

5.3.5 SCXML Representation

<?xml version="1.0" encoding="UTF-8"?>

<scxml initialstate="Created">
  <datamodel>
    <data name="activeGrammars"/>
    <data name="properties"/>
    <data name="controller"/>
    <data name="mode"/>
  </datamodel>
  <state id="Created">
    <initial id="idle"/>

    <state id="idle">
      <onentry>
        <foreach var="node" nodeset="datamodel/data[@name='activeGrammars']">
          <delete loc="$node"/>
        </foreach>
        <assign loc="/datamodel/data[@name='properties']" val="null"/>
        <assign loc="/datamodel/data[@name='controller']" val="null"/>
        <assign loc="/datamodel/data[@name='mode']" val="voice"/>
      </onentry>
      <transition event="addGrammar">
        <datamodel>
          <data name="gram"/>
        </datamodel>
        <assign loc="/datamodel/data[@name='gram']/grammar" expr="_eventData/grammar" />
        <assign loc="/datamodel/data[@name='gram']/properties" expr="_eventData/properties" />
        <assign loc="/datamodel/data[@name='gram']/listener" expr="_eventData/listener" />
        <insert pos="after" loc="/datamodel/data[@name='activeGrammars']" val="gram"/>
      </transition>
      <transition event="prepare" target="preparingGrammars">
        <assign loc="/datamodel/data[@name='properties']" expr="_eventData/properties"/>
        <assign loc="/datamodel/data[@name='controller']" expr="_eventData/controller"/>
      </transition>
    </state>    <!-- end idle -->
    <state id="preparingGrammars">
      <onentry>
        <if cond="isEmpty(/datamodel/data[@name='activeGrammars']) eq 'false'">
          <send target="device" event="dev:clear"/>
          <send target="device" event="dev:prepare" namelist="/datamodel/data[@name='activeGrammars'], /datamodel/data[@name='properties']"/>
        </if>
      </onentry>
      <transition cond="isEmpty(/datamodel/data[@name='activeGrammars']) eq 'true'" target="idle">
        <send target="controller" event="notPrepared"/>
      </transition>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
      <transition event="dev:prepared" target="readyToRecognize">
        <send target="controller" event="prepared"/>
      </transition>
    </state>    <!-- end preparingGrammars -->
    <state id="readyToRecognize">
      <transition event="listen" target="recognizing" />
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end readyToRecognize -->
    <state id="recognizing">
      <onentry>
        <send target="device" event="dev:listen"/>
      </onentry>
      <transition event="suspend" target="suspendedRecognizing"/>
      <transition event="dev:inputStarted" target="waitingForResult"/>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end recognizing -->
    <state id="suspendedRecognizing">
      <onentry>
        <send target="device" event="dev:suspend"/>
      </onentry>
      <transition event="listen" target="recognizing"/>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end suspendedRecognizing -->
    <state id="waitingForResult">
      <onentry>
        <send target="controller" event="inputStarted"/>
      </onentry>
      <transition event="dev:inputFinished">
        <send target="controller" event="inputFinished"/>
      </transition>
      <transition event="dev:partialResult">
        <send target="controller" event="partialResult" namelist="_eventData/results,_eventData/grammar/listener"/>
      </transition>
      <transition event="dev:recoResults" target="readyToRecognize">
        <send target="controller" event="recoResult" namelist="_eventData/results,_eventData/grammar/listener"/>
      </transition>
      <transition event="dev:error" target="idle">
        <send target="controller" event="error" namelist="_eventData/errorStatus"/>
      </transition>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end waitingForResult -->
  </state>  <!-- end Created -->
</scxml>

5.4 SIV Resource

The working group plans to define an SIV (Speaker Identification and Verification) resource within this document. The group currently expects it to have the following characteristics:

  • It will be defined as an additional recognition resource (along with the existing DTMF and speech recognition resources) which can be activated alone or simultaneously with the other recognition resources.
  • It will make use of voice models rather than grammars.
  • It additionally has the notion of creation of voice models through an enrollment process.
  • Administration of voice biometrics including deletion of voice models and mapping of identity is explicitly out of scope of VoiceXML 3.

6 Modules

In VoiceXML 3.0, the language is partitioned into independent modules which can be combined in various ways. In addition to the modules defined in this section, it is also possible for third parties to define their own modules (see Section XXX).

Each module is assigned a schema, which defines its syntax, plus one or more Resource Controllers (RCs), which define its semantics, plus a "constructor" that knows how to create them from the syntactic representation at initialization time. Only DOM nodes that have schemas and constructors (and hence RCs) assigned to them can be modules in VoiceXML 3.0. However, we may choose to define constructors and RCs for nodes that are not modules. Nodes that do not have constructors and RCs ultimately depend on some module for their interpretation. (Those modules are usually ancestor nodes, but we do not require this.) There can be multiple modules associated with the same VoiceXML element. They may set properties differently, add different child elements, etc. In many cases, some of the modules will be extensions of the others, but we don't require this.

Note there is not necessarily a one-to-one relationship between semantic RCs and syntactic markup elements. It may take several RCs to implement the functionality of a single markup element.

6.1 Grammar Module

This module describes the syntactic and semantic features of a <grammar> element which defines grammars used in ASR and DTMF recognition. Grammars defined via this module are used by other modules.

The attributes and content model of <grammar> are specified in 6.1.1 Syntax . Its semantics are specified in 6.1.2 Semantics .

6.1.1 Syntax

[See XXX for schema definitions].

6.1.1.1 Attributes

The <grammar> element has the attributes specified in Table 10.

Table 10: <grammar> Attributes
Name | Type | Description | Required | Default Value
mode | The only allowed values are "voice" and "dtmf" | Defines the mode of the grammar following the modes of the W3C Speech Recognition Grammar Specification [SRGS]. | No | The value of the document property "grammarmode"
weight | A simple positive floating point value without exponentials; legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits | Specifies the weight of the grammar. See vxml2: Section 3.1.1.3. | No | 1.0
fetchhint | One of the values "safe" or "prefetch" | Defines when the interpreter context should retrieve content from the server. "prefetch" indicates a file may be downloaded when the page is loaded, whereas "safe" indicates a file that should only be downloaded when actually needed. | No | None
fetchtimeout | Time Designation | The interval to wait for the content to be returned before throwing an error.badfetch event. | No | None
maxage | An unsigned integer | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. | No | None
maxstale | An unsigned integer | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. | No | None
Editorial note  

The default value of the "grammarmode" document property (see XXXX) is "voice".

6.1.1.2 Content Model

The content model of <grammar> consists of exactly one of the following: an Inline SRGS Grammar (see 6.2) or an External Grammar (see 6.3).

6.1.2 Semantics

The grammar RC is the primary RC for the <grammar> element.

6.1.2.1 Definition

The grammar RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this grammar RC
  • properties: weight attribute value, fetchtimeout, maxage, maxstale, charset, encoding, and language
  • mode: mode
  • fetchhint: fetchhint

The grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.

While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the grammar RC first initializes its child.

  • The values of the fetchtimeout attribute and the grammarfetchtimeout property (**REF**) are used to determine the fetchtimeout property value according to section XXXX.
  • The values of the maxage attribute and the grammarmaxage property (**REF**) are used to determine the maxage property value according to section XXXX.
  • The values of the maxstale attribute and the grammarmaxstale property (**REF**) are used to determine the maxstale property value according to section XXXX.
  • The values of the fetchhint attribute and the grammarfetchhint property (**REF**) are used to determine the fetchhint parameter value according to section XXXX.

Next, the language, charset, and encoding parameters are set to the values in effect at this point in the document. If the fetchhint parameter value is "Prefetch", the RC sends the Prefetch event to the DTMF or ASR Recognizer resource, as appropriate (see below), with the following data: the child RC, fetchtimeout, maxage, maxstale. Finally, the RC sends the controller an 'initialized' event and transitions to the Ready state.

In the Ready state, when the grammar RC receives an 'execute' event it transitions to the Executing state.

In the Executing state,

  • The values of the fetchtimeout attribute and the grammarfetchtimeout property (**REF**) are used to determine the fetchtimeout property value according to section XXXX.
  • The values of the maxage attribute and the grammarmaxage property (**REF**) are used to determine the maxage property value according to section XXXX.
  • The values of the maxstale attribute and the grammarmaxstale property (**REF**) are used to determine the maxstale property value according to section XXXX.

If the child RC is an External Grammar, the grammar RC sends an 'execute' event to the child RC and waits for it to complete.

Then, the grammar RC sends an AddGrammar event to the DTMF Recognizer Resource if mode="dtmf" or to the ASR Recognizer Resource if mode="voice", with the following as event data: the child RC, the fetchhint, language, charset, and encoding parameter values, and the controller RC (e.g., link, field, or form) as the handler for recognition results.

Finally, the grammar RC sends the controller an executed event and transitions to the Ready state.
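The Idle, Initializing, Ready and Executing lifecycle described above can be sketched in SCXML as follows. This is a non-normative outline in the style of the resource charts in Section 5: the property-resolution steps are abbreviated to comments, and the target names 'recognizer' and 'controller' are assumptions.

```xml
<!-- Non-normative sketch of the grammar RC lifecycle (6.1.2.1). -->
<state id="GrammarRC" initialstate="Idle">
  <state id="Idle">
    <transition event="initialize" target="Initializing">
      <assign loc="/datamodel/data[@name='controller']" expr="_eventData/controller"/>
    </transition>
  </state>
  <state id="Initializing">
    <onentry>
      <!-- Resolve fetchtimeout, maxage, maxstale and fetchhint from the
           attributes and grammar* properties; initialize the child RC. -->
      <if cond="/datamodel/data[@name='fetchhint'] eq 'prefetch'">
        <send target="recognizer" event="Prefetch"/>
      </if>
      <send target="controller" event="initialized"/>
    </onentry>
    <transition target="Ready"/>
  </state>
  <state id="Ready">
    <transition event="execute" target="Executing"/>
  </state>
  <state id="Executing">
    <onentry>
      <!-- Re-resolve fetch properties, execute an External Grammar child
           if present, then hand the grammar to the DTMF or ASR resource. -->
      <send target="recognizer" event="addGrammar"/>
      <send target="controller" event="executed"/>
    </onentry>
    <transition target="Ready"/>
  </state>
</state>
```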

Editorial note  

Initializing: Validate that behavior of sending a pointer to the child RC to the ASR resource. Is this acceptable, or do we need to extract the grammar data from the child RC and then send that data? The advantage of sending the RC pointer is that it makes clear what kind of grammar info it is -- inline SRGS or external reference.

Execute issues:

  • Note that a mismatch between the "mode" attribute value and any mode param returned in the media type cannot be detected at this stage because the document hasn't been fetched yet.
  • Still need to add a 'cond' capability as we have for prompts.
  • Should we allow explicit scope indication a la VoiceXML 2's "scope" attribute? How do we handle document-scoped grammars defined syntactically at a lower level? In this case should the handler be the controller RC or the controller for the document? Which RC actually executes the grammar RC?
  • How does 'as if by copy' change this for <link> grammars?

Editor will write new section 4.5 "Other" and subsections 4.5.1 "property/attribute resolution" and 4.5.2 "language resolution". Depending on the text, we may need to update the semantics to refer to section 4.5.2 when describing how xml:lang is used.

6.1.2.2 Defined Events

The Grammar RC is defined to receive the following events:

Table 11: Events received by Grammar RC
Event | Source | Payload | Description
initialize | any | controller (M) | Causes the element and its children to be initialized.
execute | controller | | Adds the grammar to the appropriate Recognition Resource.

and the events it sends:

Table 12: Events sent by Grammar RC
Event | Target | Payload | Description
initialized | controller | | Response to initialize event indicating that it has been successfully initialized.
executed | controller | | Response to execute event indicating that it has been successfully executed.
6.1.2.3 External Events

The external events sent and received by the Grammar RC are those defined in this table:

Table 13: Grammar RC External Events
Event | Source | Target | Description
addGrammar | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Adds grammar to list of currently active grammars.
Prefetch | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Requests that the grammar be fetched/compiled in advance, if possible.

 

6.1.2.4 State Chart Representation

6.1.3 Events

The events in this table may be raised during initialization and execution of the <grammar> element.

Table 14: <grammar> Events
Event | Description | State
error.semantic | Indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution

Note that additional errors may occur when the grammar is fetched or added by the ASR or DTMF resource. Please check there for details.

6.1.4 Examples

[TBD: put all examples here.]
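Pending the final examples, a plausible use of <grammar> wrapping an External Grammar child might look like the following sketch (the grammar URI is hypothetical, and attribute values are illustrative of the types in Table 10):

```xml
<grammar mode="voice" weight="1.2" fetchhint="prefetch"
         fetchtimeout="10s" maxage="60">
  <!-- Hypothetical external SRGS grammar; see 6.3 for <externalgrammar>. -->
  <externalgrammar src="http://www.example.com/grammars/city.grxml"
                   type="application/srgs+xml"/>
</grammar>
```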

6.2 Inline SRGS Grammar Module

This module describes the syntactic and semantic features of inline SRGS grammars used in ASR and DTMF recognition.

Editorial note  

Issue: Do we need to support inline ABNF SRGS?

The attributes and content model of Inline SRGS grammars are specified in 6.2.1 Syntax . Its semantics are specified in 6.2.2 Semantics .

6.2.1 Syntax

[See XXX for schema definitions].

The syntax of the Inline SRGS Grammar Module is precisely all of the XML markup for a legal stand-alone XML form grammar as described in SRGS ( [SRGS] ), minus the XML Prolog. Note that both elements and attributes must be in the SRGS namespace (http://www.w3.org/2001/06/grammar).

6.2.2 Semantics

6.2.2.1 Definition

The Inline SRGS grammar RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the grammar RC controlling this inline grammar RC
  • grammar: the text of the entire grammar
Editorial note  

Should the contents of the grammar parameter be parsed rather than the raw document text? For example, should it be the DOM representation of the grammar, or just the XML Info set, or what?

The Inline SRGS grammar RC's state model consists of the following states: Idle, Initializing, and Ready. Unlike most other modules, this module is primarily a data model for storing a grammar; the module itself has no execution semantics.

While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the syntactic contents of the grammar are saved into the grammar parameter. The RC sends the controller an 'initialized' event and transitions to the Ready state.

 

6.2.2.2 Defined Events

The Inline SRGS Grammar RC is defined to receive the following events:

Table 15: Events received by Inline SRGS Grammar RC
Event | Source | Payload | Description
initialize | any | controller (M) | Causes the element and its children to be initialized.

and the events it sends:

Table 16: Events sent by Inline SRGS Grammar RC
Event | Target | Payload | Description
initialized | controller | | Response to initialize event indicating that it has been successfully initialized.
6.2.2.3 External Events

The Inline SRGS Grammar Module does not send or receive any external events.

6.2.2.4 State Chart Representation
6.2.2.5 SCXML Representation
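No SCXML representation is given in this draft. The following non-normative sketch illustrates the state model described in 6.2.2.1, in the style of the Prompt RC representation in 6.4.2.5. The documentContent() call is a hypothetical accessor for the raw markup of the grammar element; it is not defined by this specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="grammar"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$grammar" expr="null"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
      </transition>
    </state>    <!-- end Idle -->
    <state id="Initializing">
      <onentry>
        <!-- documentContent() is a hypothetical accessor returning the
             syntactic contents of the grammar element -->
        <assign location="$grammar" expr="documentContent()"/>
        <send target="controller" event="initialized"/>
      </onentry>
      <transition target="Ready"/>
    </state>    <!-- end Initializing -->
    <state id="Ready"/>
  </state>
  <!-- end Created -->
</scxml>
```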

6.2.3 Events

No module-specific events are raised during initialization of an Inline SRGS Grammar. Note that validity failure of the inline SRGS content would be detected at document parse time.

 

6.2.4 Examples

[TBD: put all examples here.]
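In the interim, the following non-normative sketch shows an inline grammar as described in 6.2.1: a legal stand-alone SRGS XML grammar, minus the XML Prolog, with all elements in the SRGS namespace. The rule content is illustrative only.

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en" mode="voice" root="yesno">
  <rule id="yesno" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```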

6.3 External Grammar Module

This module describes the syntactic and semantic features of an <externalgrammar> element which defines external grammars used in ASR and DTMF recognition.

Editorial note  

The name of this element is still under discussion.

The attributes and content model of <externalgrammar> are specified in 6.3.1 Syntax . Its semantics are specified in 6.3.2 Semantics .

6.3.1 Syntax

[See XXX for schema definitions].

6.3.1.1 Attributes

The <externalgrammar> element has the attributes specified in Table 17.

Table 17: <externalgrammar> Attributes
Name Type Description Required Default Value
src anyURI The URI specifying the location of the grammar and optionally a rulename within that grammar, if it is external. The URI is interpreted as a rule reference as defined in Section 2.2 of the Speech Recognition Grammar Specification [SRGS] but not all forms of rule reference are permitted from within VoiceXML. The rule reference capabilities are described in detail below this table. No
srcexpr A data model expression Equivalent to src, except that the URI is dynamically determined by evaluating the content as a data model expression. No
type string

The preferred media type of the grammar. A resource indicated by the URI reference in the src attribute may be available in one or more media types. The author may specify the preferred media-type via the type attribute. When the content represented by a URI is available in many data formats, a VoiceXML platform may use the preferred media-type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media-type to order the preferences in the negotiation.

The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'application/srgs+xml' document). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative.

Three special cases may arise. The declared media-type may not be supported by the processor; in this case, an error.unsupported.format is thrown by the platform. The declared media-type may be supported but the actual media-type may not match; an error.badfetch is thrown by the platform. Finally, there may be no declared media-type; the behavior depends on the specific URI scheme and the capabilities of the grammar processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616] , section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples:

Table 18: Informative media-type examples
Scenario | Media-type returned by the resource owner | Preferred media-type appearing in the grammar | Declared media-type | Behavior if the actual media-type is application/srgs+xml
HTTP 1.1 request | text/plain | Not applicable; the returned type takes precedence | text/plain | error.badfetch thrown; the declared and actual types do not match
HTTP 1.1 request | application/srgs+xml | Not applicable; the returned type takes precedence | application/srgs+xml | The declared and actual types match; success if application/srgs+xml is supported by the processor; otherwise an error.unsupported.format is thrown
Local file access | <none> | application/srgs+xml | application/srgs+xml | The declared and actual types match; success if application/srgs+xml is supported by the processor; otherwise an error.unsupported.format is thrown
Local file access | <none> | <none> | <none> | Scheme specific; the processor might introspect the document to determine the type.
No None
Editorial note  

Error messages for "type" attribute need to be updated.

See 6.3.1.2 Content Model for restrictions on occurrence of src and srcexpr attributes.

The value of the src attribute is a URI specifying the location of the grammar with an optional fragment for the rulename. Section 2.2 of the Speech Recognition Grammar Specification [SRGS] defines several forms of rule reference. The following are the forms that are permitted on a grammar element in VoiceXML:

  • Reference to a named rule in an external grammar: src attribute is an absolute or relative URI reference to a grammar which includes a fragment with a rulename. This form of rule reference to an external grammar follows the behavior defined in Section 2.2.2 of [SRGS] . If the URI cannot be fetched or if the rulename is not defined in the grammar or is not a public (activatable) rule of that grammar then an error.badfetch is thrown.
  • Reference to the root rule of an external grammar: src attribute is an absolute or relative URI reference to a grammar but does not include a fragment identifying a rulename. This form implicitly references the root rule of the grammar as defined in Section 2.2.2 of [SRGS] . If the URI cannot be fetched or if the grammar cannot be referenced by its root (see Section 4.7 of [SRGS] ) then an error.badfetch is thrown.

The following are the forms of rule reference defined by [SRGS] that are not supported in VoiceXML 3.

  • Local rule reference: a fragment-only URI is not permitted. (See definition in Section 2.2.1 of [SRGS] ). A fragment-only URI value for the src attribute causes an error.semantic event.
  • Reference to special rules: there is no support for special rule references (NULL, VOID, GARBAGE) on the <grammar> element itself. In the XML form of the SRGS specification, the only way to include NULL, VOID, and GARBAGE is via the use of the "special" attribute on the <ruleref> element. Thus, it is not possible to reference individual uses of NULL, VOID, and GARBAGE rules in a separate SRGS document, since that would require a fragment identifier to place on the end of the URI referencing the document, which in turn would require an id within that document for the given use of NULL, VOID, or GARBAGE. Note that the external grammar referenced from the <grammar> element may itself be an SRGS grammar that contains a <ruleref> element with a special attribute to reference NULL, VOID, or GARBAGE.
6.3.1.2 Content Model

The <externalgrammar> element is empty.

The <externalgrammar> element has the following co-occurrence constraints:

  • Exactly one of the "src" or "srcexpr" attributes must be specified; otherwise, an error.badfetch event is thrown.
Editorial note  

Editor: please remove the "otherwise, an error.badfetch ..." from the above and all other co-occurrence text and write general text somewhere describing what happens when a co-occurrence constraint is violated.

6.3.2 Semantics

6.3.2.1 Definition

The External Grammar RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • properties: src, type attribute values
  • controller: the controller RC of this <externalgrammar>
  • srcexpr: srcexpr attribute value

The External Grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.

While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the RC sends the controller an 'initialized' event and transitions to the Ready state.

In the Ready state, when the External Grammar RC receives an 'execute' event it transitions to the Executing state.

In the Executing state, if the srcexpr variable is set it is evaluated against the data model as a data model expression, and the value is placed into the src variable; if srcexpr cannot be evaluated, an error.semantic event is thrown. If execution succeeds, the RC sends an 'executed' event to the controller RC and transitions into the Ready state.

 

6.3.2.2 Defined Events

The External Grammar RC is defined to receive the following events:

Table 19: Events received by External Grammar RC
Event Source Payload Description
initialize any controller(M) causes the element and its children to be initialized
execute controller Evaluates srcexpr and populates src variable

and the events it sends:

Table 20: Events sent by External Grammar RC
Event Target Payload Description
initialized controller response to initialize event indicating that it has been successfully initialized
executed controller response to execute event indicating that it has been successfully executed
6.3.2.3 External Events

The External Grammar Module does not send or receive any external events.

6.3.2.4 State Chart Representation
6.3.2.5 SCXML Representation
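No SCXML representation is given in this draft. The following non-normative sketch illustrates the state model described in 6.3.2.1, in the style of the Prompt RC representation in 6.4.2.5. The evaluate() call is a hypothetical operation that evaluates srcexpr against the data model (its failure corresponds to the error.semantic case); it is not defined by this specification.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="properties"/>
    <data id="srcexpr"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
      </transition>
    </state>    <!-- end Idle -->
    <state id="Initializing">
      <onentry>
        <send target="controller" event="initialized"/>
      </onentry>
      <transition target="Ready"/>
    </state>    <!-- end Initializing -->
    <state id="Ready">
      <transition event="execute" target="Executing"/>
    </state>    <!-- end Ready -->
    <state id="Executing">
      <onentry>
        <!-- evaluate() is hypothetical; evaluation failure raises
             an error.semantic event -->
        <if cond="$srcexpr">
          <assign location="$properties/src" expr="evaluate($srcexpr)"/>
        </if>
        <send target="controller" event="executed"/>
      </onentry>
      <transition target="Ready"/>
    </state>    <!-- end Executing -->
  </state>
  <!-- end Created -->
</scxml>
```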

6.3.3 Events

The events that may be raised during initialization and execution of the <externalgrammar> element are those defined in Table 21 below.

Table 21: <externalgrammar> Events
Event Description State
error.semantic indicates that there was an error in the evaluation of the srcexpr attribute.

6.3.4 Examples

[TBD: put all examples here.]
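In the interim, the following non-normative sketch illustrates the two permitted forms of rule reference from 6.3.1.1. The URIs and the data model expression are illustrative only.

```xml
<!-- reference to the public 'month' rule of an external SRGS grammar -->
<externalgrammar src="http://www.example.com/grammars/date.grxml#month"
                 type="application/srgs+xml"/>

<!-- reference to the root rule of an external grammar, with the URI
     determined dynamically from a data model expression -->
<externalgrammar srcexpr="appdata.grammarURI"/>
```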

6.4 Prompt Module

This module defines the syntactic and semantic features of a <prompt> element which controls media output. The content model of this element is empty: content is defined in other modules which extend this element's content model (for example 6.5 Builtin SSML Module , 6.6 Media Module and 6.7 Parseq Module ).

The attributes and content model of <prompt> are specified in 6.4.1 Syntax . Its semantics are specified in 6.4.2 Semantics , including how the final prompt content is determined and how the prompt is queued for playback using the PromptQueue Resource ( 5.2 Prompt Queue Resource ).

6.4.1 Syntax

[See XXX for schema definitions].

6.4.1.1 Attributes

The <prompt> element has the attributes specified in Table 22.

Table 22: <prompt> Attributes
Name Type Description Required Default Value
bargein boolean Controls whether the prompt can be interrupted. No bargein property
bargeintype string On prompts that can be interrupted, determines the type of bargein, either 'speech', or 'hotword'. No bargeintype property
cond data model expression A data model expression that must evaluate to true after conversion to boolean in order for the prompt to be played. No true
count positive integer A number indicating the repetition count, allowing a prompt to be activated or not depending on the current repetition count. No 1
timeout Time Designation The time to wait for user input. No timeout property
xml:lang string The language identifier for the prompt. No document's "xml:lang" attribute
xml:base string Declares the base URI from which relative URIs in the prompt are resolved. No document's "xml:base" attribute
6.4.1.2 Content Model

The content model of the <prompt> element is empty.

Other modules can extend the content model. These modules must define how the content is evaluated and processed before being added to the prompt queue.

6.4.2 Semantics

The prompt RC is the primary RC for the <prompt> element.

6.4.2.1 Definition

The prompt RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this prompt RC
  • children: array of children's (primary) RC
  • count: count attribute value
  • cond: cond attribute expression
  • properties: bargein, bargeintype and timeout attribute values
  • xml:lang: xml:lang attribute value
  • xml:base: xml:base attribute value

The prompt RC's state model consists of the following states: Idle, Initializing, Ready, FormReady, and Executing. The initial state is the Idle state.

While in the Idle state, the prompt RC may receive an 'initialize' event, whose controller event data is used to update the data model. The prompt RC then transitions into the Initializing state.

In the Initializing state, the prompt RC initializes its children: this is modeled as a separate RC (see XXX). The children may return an error for initialization. If a child sends an error, then the prompt RC returns an error. When all children are initialized, the prompt RC sends the controller an 'initialized' event and transitions to the Ready state.

In the Ready state, the prompt RC can receive a 'checkStatus' event to check whether this prompt is eligible for execution or not. The value of the cond parameter in its data model is checked against the data model resource: the status is true if the value of the cond parameter evaluates to true. The status, together with its count data, is sent in a 'checkedStatus' event to the controller RC. The controller RC then determines if the prompt is selected for execution ([vxml20: 4.1.6], see PromptSelectionRC, Section XXX). The prompt RC will then transition to the FormReady state. If the prompt RC receives an 'execute' event and the cond parameter evaluates to true, it transitions to the Executing state; if the cond parameter evaluates to false, it will send the controller the executed event and stay in the Ready state.

In the FormReady State, if the prompt RC receives a 'checkStatus' event, it will again check the cond parameter and send the 'checkedStatus' event to the controller RC as in the Ready State. In this state, if the RC receives an 'execute' event it transitions to the Executing state.

In the Executing state, the prompt RC sends an evaluate event to its children. Each child returns either an error, or content (which may include parameters) for playback. If a child sends an error, then the prompt RC returns an error. Once evaluation is complete, the RC sends a queuePrompt event to the Prompt Queue Resource with the <prompt> parameters (bargein, bargeintype, timeout) with event data consisting of the list of content returned by its children. The prompt RC then sends the controller an executed event and transitions to the Ready state.

Editorial note  

SSML validation issue: what if evaluation results in a non-valid structure?

6.4.2.2 Defined Events

The Prompt RC is defined to receive the following events:

Table 23: Events received by Prompt RC
Event Source Payload Description
initialize any controller(M) causes the element and its children to be initialized
checkStatus controller causes evaluation of the cond parameter against the data model
execute controller causes the evaluation of its content and conversion to a format suitable for queueing on the PromptQueue Resource

and the events it sends:

Table 24: Events sent by Prompt RC
Event Target Payload Description
initialized controller response to initialize event indicating that it has been successfully initialized
checkedStatus controller status (M), count (M) response to checkStatus event with count parameter and status indicating evaluation of cond parameter
executed controller response to execute event indicating that it has been successfully executed
6.4.2.3 External Events

Table 25 shows the events exchanged between the prompt RC and the resources and other RCs which define the events.

Table 25: Prompt RC External Events
Event Source Target Description
evaluate PromptRC DataModel used to evaluate the cond parameter (see XXX)
queuePrompt PromptRC PromptQueue adds prompt content and properties to the Prompt Queue (see XXX)
6.4.2.4 State Chart Representation
Prompt RC in UML State Chart

Figure 10: Prompt RC States

6.4.2.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="content"/>
    <data id="properties"/>
    <data id="count"/>
    <data id="cond"/>
    <data id="xml:lang"/>
    <data id="xml:base"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$children" expr="null"/>
        <assign location="$content" expr="null"/>
        <assign location="$properties/bargein" expr="true"/>
        <assign location="$properties/bargeintype" expr="speech"/>
        <assign location="$properties/timeout" expr="5s"/>
        <assign location="$count" expr="1"/>
        <assign location="$cond" expr="true"/>
        <assign location="$xml:lang" expr=""/>
        <assign location="$xml:base" expr=""/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
        <assign location="$children" expr="_eventData/children"/>
      </transition>
    </state>    <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <assign location="$childcounter" expr="0"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize"
          namelist="$child/child"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done" cond="$childcounter eq
      $children.size()-1" target="Ready">
        <assign location="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialized"/>
      </transition>
      <transition event="Initializing.done">
        <assign location="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error" target="Idle">
        <assign location="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error"
        namelist="_eventData/error_status"/>
      </transition>
    </state>    <!-- end Initializing -->
    <state id="Ready">
      <datamodel>
        <data id="status"/>
      </datamodel>
      <transition event="checkStatus" target="FormReady">
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus"
        namelist="$status, $count"/>
      </transition>
      <transition event="execute" cond="checkStatus() eq 'true'"
      target="Executing"/>
      <transition event="execute" cond="checkStatus() eq 'false'">
        <send target="controller" event="executed"/>
      </transition>
    </state>    <!-- end Ready -->
    <state id="FormReady">
      <datamodel>
        <data id="status"/>
      </datamodel>
      <transition event="checkStatus">
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus"
        namelist="$status, $count"/>
      </transition>
      <transition event="execute" target="Executing"/>
    </state>    <!-- end FormReady -->
    <state id="Executing">
      <datamodel>
        <data id="prompt"/>
        <data id="counter"/>
      </datamodel>
      <onentry>
        <assign location="$counter" expr="0"/>
        <assign location="$prompt" expr="null"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="evaluateChild"/>
        </foreach>
      </onentry>
      <transition event="Executing.done" cond="$counter eq
      $children.size()-1" target="Ready">
        <insert pos="after" name="$prompt" expr="_eventData/prompts"/>
        <send target="PromptQueue" event="queuePrompt"
        namelist="$prompt, $properties"/>
        <send target="controller" event="executed"/>
      </transition>
      <transition event="Executing.done">
        <assign location="$counter" expr="$counter + 1"/>
        <insert pos="after" name="$prompt" expr="_eventData/prompts"/>
      </transition>
      <transition event="Executing.error" target="Idle">
        <send target="controller" event="Executing.error"
        namelist="_eventData/error_status"/>
      </transition>
    </state>    <!-- end Executing -->
  </state>
  <!-- end Created -->
</scxml>

6.4.3 Events

The events in Table 26 may be raised during initialization and execution of the <prompt> element.

Table 26: <prompt> Events
Event Description State
error.unsupported.language indicates that an unsupported language was encountered. The unsupported language is indicated in the event message variable. execution
error.unsupported. element indicates that the element within the <prompt> element is not supported initialization
error.badfetch indicates that the prompt content is malformed ... initialization, execution
error.noresource indicates that a Prompt Queue resource is not available for rendering the prompt content. execution
error.semantic indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. execution
Editorial note  

The relationship between the user-visible events defined in the above table and the semantic event model has yet to be defined.

Can we really determine whether errors are raised in initialization (syntax) or execution (evaluation) states? How does this fit in with errors returned when prompts are played in PromptQueue player implementation? ACTION: Clarify which specific cases are affected by 'error.badfetch' ambiguity re. initialization versus execution states.

Clarify that error.semantic doesn't apply to evaluation of src/expr with <audio> (e.g. fallback).

Clarify that errors are recorded? (vxml21??)

Should media control properties (e.g. clipBegin, speed, etc) of <media> be also available on <prompt>?

We should clarify where the error.badfetch gets thrown. For instance, if we are loading a document with malformed prompt elements, the error.badfetch may get thrown back to the calling document. If we are throwing error.badfetch during execution, then it will be thrown back to the malformed document itself?

6.4.4 Examples

[TBD: put all examples here.]
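In the interim, the following non-normative sketch illustrates the attributes of Table 22. It assumes a profile in which the Builtin SSML Module (6.5) extends the <prompt> content model so that text may appear inside the element; the cond expression is illustrative only.

```xml
<prompt count="2" cond="caller.isReturning" bargein="true"
        bargeintype="speech" timeout="4s" xml:lang="en">
  Welcome back. Please say the name of the department you want.
</prompt>
```

With count="2", this prompt is eligible for selection only on the second repetition, and only when the cond expression evaluates to true.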

6.5 Builtin SSML Module

This module describes the syntactic and semantic features of SSML elements built into VoiceXML.

This module is designed to extend the content model of the <prompt> element defined in 6.4 Prompt Module .

The attributes and content model of SSML elements are specified in 6.5.1 Syntax . Its semantics are specified in 6.5.2 Semantics , including how elements are evaluated to yield final content for playback.

6.5.1 Syntax

[See XXX for schema definitions].

This module defines an SSML ( [SSML] ) Conforming Speech Synthesis Markup Language Fragment where:

  • there is no explicit <speak> element.
  • these elements are part of the VoiceXML namespace rather than the SSML namespace
  • the <foreach> element may appear inside these elements prior to evaluation
  • the <value> element may appear inside these elements prior to evaluation
  • The <audio> element is extended with the attributes specified in Table 27.
Table 27: <audio> Attributes defined in VoiceXML
Name Type Description Required Default Value
fetchtimeout See fetchtimeout definition No fetchtimeout property
fetchhint See fetchhint definition No audiofetchhint property
maxage See maxage definition No audiomaxage property
maxstale See maxstale definition No audiomaxstale property
expr A data model expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded (see Record Module) or evaluate to the URI of an audio resource to fetch. No undefined

Exactly one of "src" or "expr" attributes must be specified; otherwise, an error.badfetch event is thrown.

Editorial note  

SSML 1.1 required for fetching attributes like fetchtimeout? Or profile dependent?

Support for 'say-as' extension to SSML 1.0?

Support for <enumerate>?

Note that profiles specify which media formats are required

6.5.2 Semantics

When the RC receives an evaluate event, its children are evaluated in order to return an SSML Conforming Stand-Alone Speech Synthesis Markup Language Document which can be processed by a Conforming Speech Synthesis Markup Language Processor .

Evaluation consists of the following steps:

  • data model expressions are evaluated against the data model.
  • If a <foreach> element is present, it is evaluated so as to yield content for each defined item in the array.
  • If an <audio> element has a specified expr attribute, then the attribute value is evaluated to provide a URI value for the src attribute. If the expr evaluation results in an ECMAScript undefined value, then the <audio> element, including its alternate content, is ignored.
  • If a <value> element is present, its expr attribute is evaluated to return a CDATA value.
  • construction of a <speak> element with appropriate version, namespace, and xml:lang attributes. The xml:lang attribute value is inherited from the <prompt> element (see 6.4 Prompt Module ).
  • if an unsupported language is encountered, the platform throws an error.unsupported.language event which specifies the language in its message variable
Editorial note  

We may want to refine the description that the output of evaluation is an SSML Document. One rationale is that we don't want to prohibit that SSML extensions are lost during evaluation. The output may be another Fragment rather than a Document.

Clarify exact nature of <audio> expr value for skipping - undefined vs. null?

Need to specify further error cases

Do these elements have RCs? They are in the VoiceXML namespace but are just enhanced SSML elements.

Need to clarify unsupported languages and external (e.g. MRCP) SSML processors.

6.5.3 Examples

In this example

<prompt>
 <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
 </foreach>
</prompt>

evaluation returns a sequence of content for each item in <foreach> with <audio> and <value> elements.

Assume that the array consists of two items, where item.audio evaluates to 'one.wav' and 'two.wav' respectively, and item.tts evaluates to 'one' and 'two' respectively. Evaluation of <foreach> is then equivalent to the following:

<prompt>
    <audio expr="'one.wav'"><value expr="'one'"/></audio>
    <break time="300ms"/>
    <audio expr="'two.wav'"><value expr="'two'"/></audio>
    <break time="300ms"/>
</prompt>

further evaluation of the <audio> and <value> elements results in

<prompt>
    <audio src="one.wav">one</audio>
    <break time="300ms"/>
    <audio src="two.wav">two</audio>
    <break time="300ms"/>
</prompt>

and finally the prompt content is converted into a stand-alone SSML document (assuming the <prompt>'s xml:lang attribute evaluates to 'en'):

<speak version="1.0" xml:lang="en" 
 xmlns="http://www.w3.org/2001/10/synthesis">
    <audio src="one.wav">one</audio>
    <break time="300ms"/>
    <audio src="two.wav">two</audio>
    <break time="300ms"/>
</speak>

This content is queued and played using the PromptQueue: each audio URI, or fallback content, is played, followed by a 300 millisecond break.

6.6 Media Module

The media module defines the syntax and semantics of <media> element.

The module is designed to extend the content model of <prompt> in the prompt module ( 6.4 Prompt Module ).

The <media> element can be seen as an enhanced and generalized version of the VoiceXML <audio> element. It is enhanced in that it provides additional attributes describing the type of media, conditional selection, as well as control over playback. It is a generalization of the <audio> element in that it permits media other than audio to be played; for example, media formats which contain both audio and video tracks.

6.6.1 Syntax

[See XXX for schema definitions].

6.6.1.1 Attributes

The <media> element has the attributes specified in Table 28.

Table 28: <media> Attributes
Name Type Description Required Default Value
src The URI specifying the location of the media source. No None
srcexpr A data model expression which evaluates to a URI indicating the location of the media resource. No undefined
cond A data model expression that must evaluate to true after conversion to boolean in order for the media to be played. No true
type

The preferred media type of the output resource. A resource indicated by the URI reference in the src or expr attributes may be available in one or more media types. The author may specify the preferred media type via the type attribute. When the content represented by a URI is available in many data formats, a VoiceXML platform may use the preferred media-type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media-type to order the preferences in the negotiation.

The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'audio/x-wav' or 'video/3gpp' resource). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative.

Three special cases may arise.

  1. The declared media-type may not be supported by the processor. No error is thrown by the platform and the content of the media element is played instead.
  2. The declared media-type may be supported but the actual media-type may not match. No error is thrown by the platform and the content of the media element is played instead.
  3. Finally, there may be no declared media-type; the behavior depends on the specific URI scheme and the media capabilities of the VoiceXML processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616] , section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines.
No undefined
clipBegin Time Designation offset from start of media to begin rendering. This offset is measured in normal media playback time from the beginning of the media. No 0s
clipEnd Time Designation offset from start of media to end rendering. This offset is measured in normal media playback time from the beginning of the media. No None
repeatDur Time Designation total duration for repeatedly rendering media. This duration is measured in normal media playback time from the beginning of the media. No None
repeatCount positive Real number number of iterations of media to render. A fractional value describes a portion of the rendered media. No 1
soundLevel signed ("+" or "-") CSS2 numbers immediately followed by "dB" Decibel values are interpreted as a ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0) and are defined in terms of dB: soundLevel(dB) = 20 log10 (a1 / a0) A setting of a large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half the amplitude of its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice the amplitude of its current signal amplitude (subject to hardware limitations). The absolute sound level of media perceived is further subject to system volume settings, which cannot be controlled with this attribute. No +0.0dB
speed x% (where x is a positive real value) the speed at which to play the referenced media, relative to the original speed. The speed is set to the requested percentage of the speed of the original media. For audio, a change in the speed will change the rate at which recorded samples are played back and this will affect the pitch. No 100%
outputmodes space separated list of media types Determines the modes used for media output. See 8.2.4 Media Properties for further details. No outputmodes property

See occurrence constraints for restrictions on occurrence of src and srcexpr attributes.
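As an informative illustration, the soundLevel relation in the attribute table above can be inverted to compute the amplitude ratio produced by a given dB value (Python sketch, not part of the markup language):

```python
def amplitude_ratio(sound_level_db):
    # Invert soundLevel(dB) = 20 * log10(a1 / a0) to obtain the ratio a1 / a0.
    return 10 ** (sound_level_db / 20.0)

# +6.0dB plays at roughly twice the current amplitude, -6.0dB at roughly half:
print(round(amplitude_ratio(6.0), 3))   # 1.995
print(round(amplitude_ratio(-6.0), 3))  # 0.501
print(amplitude_ratio(0.0))             # 1.0 (the +0.0dB default leaves the level unchanged)
```

Note that this is the signal amplitude only; the perceived loudness is further subject to system volume settings, as stated above.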

Calculations of rendered durations and interaction with other timing properties follow SMIL 2.1 Computing the active duration where

  • <media> is a time container
  • Time Designation values for clipBegin, clipEnd, and repeatDur are a subset of SMIL Clock-value
  • If the length of a media clip is not known in advance, then it is treated as indefinite. Consequently repeatCount will have no effect.
  • If clipEnd is after the end of the media, then rendering ends at the media end.
  • If clipBegin is after clipEnd, no audio will be produced.
  • If clipBegin equals clipEnd, the behavior depends upon the kind of media. For media with a concept of frames, such as video, the single frame at clipBegin is rendered for repeatDur.
  • repeatDur takes precedence over repeatCount in determining the maximum duration of the media.

Note that not all SMIL 2.1 Timing features are supported.
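As an informative illustration of the rules above for media of known duration, the following Python sketch computes the rendered duration (it deliberately ignores indefinite media and the frame-rendering case for clipBegin equal to clipEnd):

```python
def active_duration(media_dur, clip_begin=0.0, clip_end=None,
                    repeat_count=1.0, repeat_dur=None):
    # If clipEnd is after the end of the media, rendering ends at the media end.
    end = media_dur if clip_end is None else min(clip_end, media_dur)
    if clip_begin >= end:
        return 0.0                      # clipBegin at/after clipEnd: no output (simple case)
    single = end - clip_begin           # duration of one iteration
    total = single * repeat_count       # a fractional repeatCount renders a portion
    if repeat_dur is not None:
        total = min(total, repeat_dur)  # repeatDur caps the maximum duration
    return total

print(active_duration(10.0, clip_begin=2.0, clip_end=6.0))                    # 4.0
print(active_duration(10.0, clip_begin=2.0, clip_end=6.0, repeat_count=2.5))  # 10.0
print(active_duration(10.0, clip_begin=2.0, clip_end=6.0,
                      repeat_count=2.5, repeat_dur=7.0))                      # 7.0
```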

Editorial note  

Use SMIL 3.0 or SMIL 2.1 reference?

should trimming and media attributes also be defined in <prompt>?

do we need expr values for type, clipBegin, clipEnd, repeatDur, repeatCount, etc? (Perhaps add implied expr for every attribute?)

when is a property evaluation error thrown?

Add fetchtimeout, fetchhint, maxage and maxstale attributes

Major attribute candidate: errormode (flexible error handling which controls whether errors are thrown or fallback is used).

Other candidate attributes: id/idref (use case?)

6.6.1.2 Content Model

The <media> element content model consists of:

  • Inline content: SSML <speak> (0 or 1). Note that this content may include <value> and <foreach> elements from the VoiceXML namespace.
  • <desc> element (0 or more)
  • <media>: for fallback in the case where the resource referenced by the parent <media> element is unavailable (0 or more)
  • <property>: so media related properties can be set (0 or more)

The <media> element has the following co-occurrence constraints:

  • One of the src attribute or the srcexpr attribute or inline content must be specified; otherwise, an error.badfetch event is thrown.

Note that the type attribute does not affect inline content. The handling of inline XML content is in accordance with the namespace of the root element (such as SSML <speak>, SMIL <smil>, and so forth). CDATA content, or mixed content with VoiceXML <foreach> or <value> elements, must be treated as an SSML Fragment and evaluated as described in 6.6.2 Semantics .

Editorial note  

Permit other types of inline content apart from SSML?

Are child <property> elements necessary? Alternative: extended <prompt> so that <property> children are allowed?

6.6.1.2.1 Tips (informative)

Developers should be aware that there may be performance implications when using <media> depending on which attributes are specified, the media itself, its transport and processing.

Since operations such as trimming and soundLevel or speed modification are applied to the media, the SSML processor must begin generating output audio before these operations can be applied. If the clipBegin attribute is specified, this may require SSML generation of audio prior to clipBegin, depending on the implementation. This may lead to a gap between execution of the <media> element and the start of playback.

If the media is fetched with the HTTP protocol and the clipBegin attribute is specified, then, unless the resource is cached locally, the part of the media resource before clipBegin will still be fetched from the origin server. This may result in a gap between execution of the <media> element and playback actually beginning.

Note also that if <media> uses the RTSP protocol, and the VoiceXML platform supports this protocol, then the clipBegin attribute value may be mapped to the RTSP Range header field, thereby reducing the gap between element execution and the onset of playback.

6.6.2 Semantics

When a media RC receives an evaluate event, the following operations are performed:

  • attributes with data model expressions are evaluated against the data model
  • if inline SSML content is specified (i.e. <speak> root), then its RC is sent an evaluate event (where any <foreach> and <value> elements are evaluated).
  • if inline CDATA content is specified (i.e. a CDATA root), or the content consists of <foreach> and <value> root elements which evaluate to CDATA content, then it is treated as SSML <speak> content.
  • if the media resource has not yet been fetched, the resource is fetched. If SSML content is specified, then the evaluated content is processed by the SSML processor and a media resource returned.

The resulting media resource is returned together with resolved media operation properties (clipBegin, clipEnd, soundLevel, speed, outputmodes).

Editorial note  

Semantics needs to address a mixed content model; e.g. CDATA and XML elements as children of the root.

Do we require 'application/ssml+xml' type with SSML and CDATA content?

Need to clarify where resource fetching takes place in the semantic model. Eg. in prompt initializing or executing state? or in prompt queue?

This approach assumes the prompt queue applies media processing operations. Intended to fit with the VCR/RTC approach.

What about streaming cases? Allow streams to be returned?

Specify how errors are addressed.

6.6.3 Examples

Playback of external audio media resource.

<media type="audio/x-wav" src="http://www.example.com/resource.wav"/>

Application of media operations to an audio resource. The soundLevel setting plays the media at approximately twice its current signal amplitude and the speed is reduced to 50%.

<media type="audio/x-wav" soundLevel="+6.0dB" speed="50%" 
       src="http://www.example.com/resource.wav"/>

Playback of 3GPP media resource.

<media type="video/3gpp" src="http://www.example.com/resource.3gp"/>

Playback of 3GPP media resource with the speed doubled and playback ending after 5 seconds.

<media type="video/3gpp" clipEnd="5s" speed="200%" 
       src="http://www.example.com/resource.3gp"/>

Playback of external SSML document.

<media type="application/ssml+xml" 
       src="http://www.example.com/resource.ssml"/>

Inline CDATA content with a <value> element

<media>
    Ich bin ein Berliner, said <value expr="speaker"/>
</media>

which is syntactically equivalent to

<media>
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner, said <value expr="speaker"/>
    </speak>
</media>

Inline SSML content to which gain and clipping operations are applied.

<media soundLevel="+4.0dB" clipBegin="4s">
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
    </speak>
</media>

Inline SSML with audio media fallback.

<media soundLevel="+4.0dB" clipBegin="4s">
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
    </speak>
    <media type="audio/x-wav" src="ichbineinberliner.wav"/>
</media>

6.7 Parseq Module

This module defines the syntax and semantics of <par> and <seq> elements. The <par> element specifies playback of media in parallel, while <seq> specifies playback in sequence.

The module is designed to extend the content model of the <prompt> element ( 6.4 Prompt Module ).

This module is dependent upon the media module ( 6.6 Media Module ).

With connections which support multiple media streams, it is possible to play back multiple media types simultaneously. For media container formats like 3GPP, audio and video media can be generated simultaneously from the same media resource.

There are established use cases for simultaneous playback of multiple media which are specified in separate resources:

  • Video mail: an audio message has been left using a conventional audio only system. For playback on a system with video support, a video resource can be played simultaneously with an image of the person, or an avatar.
  • Enterprise: a video stream resource from a security camera with TTS voiceover providing additional information.
  • Education: a video resource showing a medical procedure with commentary provided by a lecturer in the student's language.
  • Talking heads: an animated avatar together with audio or TTS voiceover.

The intention is to provide support for basic use cases where audio or TTS output from one resource can be complemented with output from another resource, as permitted by the connection and platform capabilities.

6.7.1 Syntax

The <par> element is derived from the SMIL <par> element, a time container for parallel output of media resources. Media elements (or containers) within a <par> element are played back in parallel.

Editorial note  
SMIL reference should be added in B References . SMIL is Synchronized Multimedia Integration Language (SMIL). Reference to SMIL 1.0 (or later) Specification.

The <par> element has the attributes specified in Table 29.

Table 29: <par> Attributes
Name Type Description Required Default Value
endsync 'first' or 'last' Indicates when the element is considered complete. 'first' indicates that the element is complete when any media (or container) child reports that it is complete; 'last' indicates it is complete when all media children are complete. No last

The content model of <par> consists of:

  • <media> elements (0 or more)
  • <seq> elements (0 or more)

The <seq> element is derived from the SMIL <seq> element, a time container for sequential output of media resources. Media elements within a <seq> element are played back in sequence.

No attributes are defined for <seq>.

The content model of <seq> consists of:

  • <media> elements (0 or more)

6.7.2 Semantics

Editorial note  

Issue: how should parallel playback interact with the PromptQueue resource? The simplest assumption would be that if this module is supported, then prompt queue needs to be able to handle parallel playback.

For example when bargein event happens during the parallel execution, the synchronization between both prompt and for example video play should be handled. This information should be explained in the prompt queue resource section.

This module requires a PromptQueue resource which supports playback of parallel and sequential media. The following defines its playback completion, termination and error handling.

Completion of playback of the <par> element is determined according to the value of its endsync attribute. For instance, assume a <par> element containing <media> (or <seq>) elements A and B, and that B finishes before A. If endsync has the value first, then completion is reported upon B's completion. If endsync has the value last, then completion is reported upon A's completion.

Completion of playback of the <seq> element occurs when the last <media> is complete.
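Informally, these completion rules reduce to a minimum, maximum, or sum over the children's playback durations, as in the following illustrative Python sketch (not normative):

```python
def par_completion(child_durations, endsync="last"):
    # 'first': complete when any child completes; 'last' (the default): when all do.
    return min(child_durations) if endsync == "first" else max(child_durations)

def seq_completion(child_durations):
    # <seq> completes when its last <media> child completes.
    return sum(child_durations)

# Children A (8s) and B (5s) in a <par>: B finishes before A.
print(par_completion([8, 5], endsync="first"))  # 5
print(par_completion([8, 5], endsync="last"))   # 8
print(seq_completion([8, 5]))                   # 13
```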

If the <par> element playback is terminated, then playback of its <media> and <seq> children is terminated. Likewise, if the <seq> element playback is terminated, then playback of its (active) <media> elements is terminated.

If mark information is provided by <media> elements (for example with SSML), then the mark information associated with the last element played in sequence or parallel is exposed as described in XXX.

Editorial note  

Open issue: Clarify interaction with VCR media control model(s).

<reposition> approach would require that <par> and <seq> need to be able to restart from a specific position indicated by the markname/time of a <media> element contained within them.

RTC approach would require that for <par>, media operations are applied in parallel.

Error handling policy is inherited from the element in which <par> and <seq> element are children.

For instance if the policy is to ignore errors, then the following applies:

  • If an error occurs when playing a <media> element in <par>, then the error is ignored.
  • Likewise, if there is an error playing back a <media> element in <seq>, the error is ignored and the next <media> element in the sequence, if there is one, is played.
  • If the <media> element in which the error occurs is the final one in the <par> element, then completion of <par> playback is signaled when the error is detected.

If the policy is to terminate playback and report the error, then any error causes immediate termination of any playback and the error is reported.

If execution of the <par> and <seq> elements requires media capabilities which are not supported by the platform or the connection, or there is an error fetching or playing any <media> element within <par> or <seq>, then error handling follows the defined policy.

6.7.3 Examples

Video avatar with audio commentary. Note the use of the outputmodes attribute of <media> to ensure that only video is played.

 <par>
   <media type="audio/x-wav" src="commentary.wav"/>
   <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
 </par>

Video avatar with a sequence of audio and TTS commentary.

 <par>
   <seq>
     <media type="audio/x-wav" src="intro.wav"/>
     <media type="application/ssml+xml" src="commentary.ssml"/>
   </seq>
   <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
 </par>

6.8 Foreach Module

This module describes the syntactic and semantic features of the <foreach> element.

This module is designed to extend the content model of an element in another module. For example, SSML elements in the 6.5 Builtin SSML Module , the <prompt> element defined in 6.4 Prompt Module , etc.

The attributes and content model of the element are specified in 6.8.1 Syntax . Its semantics are specified in 6.8.2 Semantics .

6.8.1 Syntax

[See XXX for schema definitions].

6.8.1.1 Attributes

The <foreach> element has the attributes specified in Table 30.

Table 30: <foreach> Attributes
Name Type Description Required Default Value
array A data model expression that must evaluate to an array; otherwise, an error.semantic event is thrown. Note that the <foreach> element operates on a shallow copy of the array specified by the array attribute. Yes
item A data model variable that stores each array item upon each iteration of the loop. A new variable will be declared if it is not already defined within the parent's scope. Yes

Both "array" and "item" must be specified; otherwise, an error.badfetch event is thrown.

The iteration process starts from an index of 0 and increments by one to an index of array_name.length - 1, where array_name is the name of the shallow copied array operated on by the <foreach> element. For each index, a shallow copy or reference to the corresponding array element is assigned to the item variable (i.e. <foreach> assignment is equivalent to item = array_name[index] in ECMAScript); the assigned value could be undefined for a sparse array. Undefined array items are ignored.

VoiceXML 3.0 does not provide break functionality to interrupt a <foreach>.
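The iteration rules above can be mirrored in an illustrative Python sketch, in which a sentinel object stands in for ECMAScript undefined in a sparse array:

```python
UNDEFINED = object()  # stand-in for ECMAScript undefined

def foreach(array, body):
    snapshot = list(array)              # shallow copy: items are shared, the array structure is frozen
    for index in range(len(snapshot)):  # indices 0 .. length - 1; no break facility
        item = snapshot[index]          # item = array_name[index]
        if item is UNDEFINED:
            continue                    # undefined array items are ignored
        body(item)

played = []
foreach(["a.wav", UNDEFINED, "b.wav"], played.append)
print(played)  # ['a.wav', 'b.wav']
```

Because iteration runs over the snapshot, modifying the original array from within the loop body does not change the number of iterations.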

Editorial note  

Clarify that array items which evaluate to ECMAScript undefined are ignored?

6.8.1.2 Content Model

The content model of the <foreach> element is dependent upon the element in which it is a child. The profile in which this element is used must specify the content model(s) of this element.

6.8.2 Semantics

When the RC receives an evaluate event, the RC loops through the array, producing evaluated content for each item in the array.

6.8.3 Examples

Editorial note  

These examples may be moved to the respective profile section later.

The vxml21 profile defines the content model for the <foreach> element so that it may appear in executable content and within <prompt> elements.

Within executable content, except within a <prompt>, the <foreach> element may contain any elements of executable content; this introduces basic looping functionality by which executable content may be repeated for each element of an array.

When <foreach> appears within a <prompt> element as part of Builtin SSML content, it may contain only those elements valid within <enumerate> (i.e. the same elements allowed within <prompt> less <meta>, <metadata>, and <lexicon>); this allows for sophisticated concatenation of prompts.

In this example using Builtin SSML, each item in the array has an audio property with a URI value, and a tts property with SSML content. The element loops through the array, playing the audio URI or the SSML content as fallback, with a 300 millisecond break between each iteration.

<prompt>
 <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
 </foreach>
</prompt>

In the mediaserver profile, <foreach> may occur within <prompt> elements and has a content model of 0 or more <media> elements.

Play each media resource in the array.

  <foreach item="item" array="array">
   <media type="audio/x-wav" srcexpr="item.audio"/>
  </foreach>

Play each media resource in the array, with inline SSML content as fallback.

<foreach item="item" array="array">
   <media type="audio/x-wav" srcexpr="item.audio">
    <media type="application/ssml+xml">
     <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
      <value expr="item.tts"/>
      <break time="300ms"/>
     </speak>
    </media>
   </media>
</foreach>

6.9 Form Module

6.9.1 Syntax

Forms are the key component of VoiceXML documents. A form contains:

  • A set of form items, elements that are visited in the main loop of the form interpretation algorithm. Form items are subdivided into input items that can be 'filled' by user input and control items that cannot.
  • Declarations of non-form item variables.
  • Event handlers.
  • "Filled" actions, blocks of procedural logic that execute when certain combinations of input item variables are assigned.
Table 31: <form> Attributes
id The name of the form. If specified, the form can be referenced within the document or from another document. For instance <form id="weather">, <goto next="#weather">.
scope The default scope of the form's grammars. If it is dialog then the form grammars are active only in the form. If the scope is document, then the form grammars are active during any dialog in the same document. If the scope is document and the document is an application root document, then the form grammars are active during any dialog in any document of this application. Note that the scope of individual form grammars takes precedence over the default scope; for example, if a non-root document contains a form with the default scope "dialog" and a form grammar with the scope "document", then that grammar is active during any dialog in the same document.

6.9.2 Semantics

6.9.2.1 Form RC

The Form RC is the primary RC for the <form> element.

The Form RC interacts with resource controllers of other modules so as to provide the behavior of the VoiceXML 2.1/2.0 <form> tag. Input and control form items are modeled as resource controllers: for example, the <field> RC ( 6.10.2.1 Field RC ) of the Field Module.

The behavior of the Form RC follows the VoiceXML FIA, although some aspects are not modeled directly in this RC: external transition handling is not part of the form RC; input items use separate RCs to manage coordination between media resources, while recognition results can be received directly by form, field or other RCs.

[This initial version does not address all aspects of FIA behavior; for example, event handling, error handling and external transitions are not covered.]

6.9.2.1.1 Definition

The form RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this form RC
  • children: array of children's (primary) RC
  • activeItem: current form item being executed
  • active: Boolean indicating whether this form is active. Default: false.
  • justFilled: array of child RC which have just been filled
  • recoResult: The recognition result of the previously executed form item.
  • previousItem: The previous form item already executed.
  • nextItem: The next form item which is presently scheduled for execution.
  • modal: The modality of the current form item being executed.
  • id: The form identifier.

The form RC's state model consists of the following states: Idle, Initializing, Ready, SelectingItem, PreparingItem, PreparingFormGrammars, PreparingOtherGrammars, Executing, Active, ProcessingFormResult, Evaluating and Exit.

In the Idle state, the form RC can receive an 'initialize' event whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the RC creates a dialog scope in the Datamodel Resource and then initializes its children: this is modeled as a separate RC. When all children are initialized, the RC sends an 'initialized' event to its controller and transitions to the Ready state.

In the Ready state, the form RC sets its active status to false. It can receive one of two events: 'prepareGrammars' or 'execute'. A 'prepareGrammars' event indicates that another form is active, but this form's form-level grammars may be activated; an 'execute' event indicates that this form is active. If the RC receives a 'prepareGrammars' event, it transitions to the PreparingFormGrammars state. If the RC receives an 'execute' event, it sets its active data to true and transitions to the SelectingItem state.

In the SelectingItem state, the RC determines which form item to select as the active item. This is defined by a FormItemSelection RC which iterates over the children sending each a 'checkStatus' event. If a child returns a true status (indicating that it is ready for execution), the activeItem is set to this child RC and the RC transitions to the PreparingItem state. If no child returns this status, then the RC is complete and transitions to the Exit state.
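Informally, the FormItemSelection behavior amounts to choosing the first child that reports a true status. In the illustrative Python sketch below, the check_status method is a hypothetical stand-in for the 'checkStatus' event exchange:

```python
def select_form_item(children):
    for child in children:
        if child.check_status():  # hypothetical stand-in for sending 'checkStatus'
            return child          # becomes the activeItem
    return None                   # no selectable item: the form transitions to Exit

class Item:
    # Minimal stand-in for a form item RC, for illustration only.
    def __init__(self, name, ready):
        self.name, self.ready = name, ready
    def check_status(self):
        return self.ready

items = [Item("greeting", False), Item("city", True)]
print(select_form_item(items).name)  # city
```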

In the PreparingItem state, the activeItem is sent a 'prepare' event causing it to prepare itself; for example, the field RC prepares its prompts and grammars for execution. When the activeItem returns a 'prepared' event, the event data indicates whether the item is modal or not. If the item is modal, then the form RC transitions to the Executing state. If the item is not modal (other grammars can be activated), then the form RC transitions to the PreparingFormGrammars state.

In the PreparingFormGrammars state, the RC prepares form-level grammars. This is defined by a separate RC which iterates through and executes grammar children. When this is complete, the RC transitions to the Active state if the form is not active (active data), and transitions to the PreparingOtherGrammars state if the form is active.

In the PreparingOtherGrammars state, the RC sends a 'prepareGrammars' event to its controller RC (which in turn sends the event to appropriate form, document and application level RCs with grammars). When it receives a 'prepared' event from its controller, the RC transitions to the Executing state.

In the Executing state, the form RC sends an 'execute' event to the active form item. If the form item is a field, then this will cause prompts to be played and recognition to take place. The RC then transitions to the Active state awaiting a result.

In the Active state, the RC re-initializes the justFilled data to a new array and waits for a recognition result (as active or non-active form), or for a signal from its selected form item that it has received the recognition result. Recognition results are divided into two types: form item level results, received and processed by the form item; and form level results, which are received by the form RC that caused the grammar to be added. If a 'recoResult' event is received by the form RC, the RC transitions into the ProcessingFormResult state. If the active form item receives the recognition result (and locally updates itself), then the form RC receives a 'formItemResult' event, adds the active item to the justFilled array, and transitions into the Evaluating state.

In the ProcessingFormResult state, the recognition result is processed by iterating through the form item children, obtaining their name and slotname, and then attempting to match the slotname to the results. If the match is successful, the child's name variable in the data model is updated with the value from the recognition result and the child is added to the justFilled data array. When this process is complete, the form RC transitions to the Evaluating state.
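As an informative illustration, the slot-matching step can be sketched over a dictionary-based stand-in for the data model (the structures here are hypothetical, not the specification's data model):

```python
def process_form_result(children, reco_result):
    just_filled = []
    for child in children:
        slot = child.get("slot") or child["name"]  # slot defaults to the item name
        if slot in reco_result:
            child["value"] = reco_result[slot]     # fill the form item variable
            just_filled.append(child)
    return just_filled

children = [{"name": "city"}, {"name": "day", "slot": "weekday"}]
result = {"city": "Oslo", "weekday": "Friday"}
filled = process_form_result(children, result)
print([c["name"] for c in filled])  # ['city', 'day']
```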

In the Evaluating state, the form RC iterates through its children and, if a child is a member of the justFilled array, sends an 'evaluate' event to the form item RC, causing the appropriate filled RCs to be executed. If the child is a filled RC, then it is executed if appropriate. When evaluation is complete, the form RC transitions to the SelectingItem state so that the next form item can be selected for execution.

6.9.2.1.2 Defined Events
Table 32: Events received by <form> RC
Event Source Payload Description
initialize any controller(M) Update the data model
prepareGrammars controller Another form is active, but the current form's form-level grammars may be activated.
execute controller Current form is active
Table 33: Events sent by <form> RC
Event Target Payload Description
initialized controller Notification that initialization is complete
prepareGrammars controller Sent to prepare grammars to appropriate form, document and application level RCs.
execute controller Notification of complete recognition result from the field RC.
6.9.2.1.3 External Events

The following table shows the events sent and received by the form RC to resources and other RCs which define the events.

Table 34: <form> RC External Events
Event Source Target Description
checkStatus FormRC FormItem RC Check if ready for execution
createScope FormRC DataModel Creates a scope.
destroyScope FormRC DataModel Delete a scope.
evaluate FormRC FormItem RC Process form item being filled.
execute FormRC FormItem RCs Start execution.
prepare FormRC FormItem RC Initiates preparation needed before execution.
formItemResult FormItemRC FormRC Results received by the form item.
prepared FormItemRC FormRC Indicates that preparation is complete.
recoResult PlayAndRecognize RC FormRC Results filled at the form level and not form item level.
6.9.2.1.4 State Chart Representation
Form RC in UML State Chart

Figure 11: Form RC States

Editorial note  

Note that the chart for SelectingItem:FormItemSelection is missing. It will be defined later.

6.9.2.1.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="activeItem"/>
    <data id="active"/>
    <data id="previousItem"/>
    <data id="nextItem"/>
<data id="recoResult"/>
    <data id="name"/>
    <data id="JustFilled"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign loc="$controller" val="null"/>
        <assign loc="$children" val="null"/>
        <assign loc="$activeItem" val="null"/>
        <assign loc="$active" val="false"/>
        <assign loc="$previousItem" val="null"/>
        <assign loc="$nextItem" val="null"/>
        <assign loc="$recoResult" val="null"/>
        <assign loc="$name" val="null"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign loc="$controller" expr="_eventData/controller"/>
      </transition>
    </state>    <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <assign loc="$childcounter" val="0"/>
        <send target="datamodel" event="createScope" namelist="dialog"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize" namelist="$child/child"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done">
        <assign loc="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error">
        <assign loc="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" cond="$childcounter eq $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
    </state>    <!-- end Initializing -->
    <state id="Ready">
      <onentry>
        <assign loc="$active" val="false"/>
      </onentry>
      <transition event="execute" target="SelectingItem:FormItemSelection">
        <assign loc="$active" val="true"/>
      </transition>
      <transition event="prepareGrammars" target="PreparingFormGrammars"/>
    </state>    <!-- end Ready -->
    <state id="SelectingItem:FormItemSelection">
      <onentry>
        <send target="FormItemSelection" event="checkStatus" namelist="$children"/>
      </onentry>
      <transition event="SelectedFormItem.done" cond="activeItem eq 'null'" target="Exit"/>
      <transition event="SelectedFormItem.done" cond="activeItem neq 'null'" target="PreparingItem"/>
    </state>    <!-- end SelectingItem:FormItemSelection -->
    <state id="PreparingItem">
      <onentry>
        <send target="activeItem" event="prepare" />
      </onentry>
      <transition event="prepared" cond="_eventData/modal eq 'true'" target="Executing"/>
      <transition event="prepared" cond="_eventData/modal eq 'false'" target="PreparingFormGrammars"/>
    </state>    <!-- end PreparingItem-->
    <state id="Exit">
      <onentry>
        <send target="datamodel" event="destroyScope" namelist="dialog"/>
        <send target="parent" event="done"/>
      </onentry>
    </state>    <!-- end Exit-->
    <state id="PreparingFormGrammars">
      <transition event="PrepareFormGrammars.done" cond="active eq 'true'" target="PreparingOtherGrammars"/>
      <transition event="PrepareFormGrammars.done" cond="active eq 'false'" target="Active">
        <send target="controller" event="prepared"/>
      </transition>
    </state>    <!-- end PreparingFormGrammars -->
    <state id="PreparingOtherGrammars">
      <onentry>
        <send target="controller" event="prepareGrammars"/>
      </onentry>
      <transition event="PrepareOtherGrammars.done" target="Executing"/>
    </state>    <!-- end PreparingOtherGrammars -->
    <state id="Executing">
      <onentry>
        <send target="activeItem" event="execute"/>
      </onentry>
      <transition event="Executing.done" target="Active"/>
    </state>    <!-- end Executing -->

    <state id="Active">
      <onentry>
        <assign loc="$JustFilled" expr="new Array()"/>
      </onentry>
      <transition event="formItemResult" target="Evaluating">
        <insert pos="after" name="$JustFilled" val="currentitem"/>
      </transition>
      <transition event="PlayAndRecognize:RecogResult" target="ProcessingFormResult">
        <insert pos="after" name="$JustFilled" val="currentitem"/>
      </transition>
    </state>    <!-- end Active -->
    <state id="ProcessingFormResult">
      <onentry>
          <foreach var="child" array="$children">
           <if cond="$child.slotname eq _eventData/RecogResult/slotname">
                       <assign loc="$name" value="_eventData/RecogResult/name"/>
                       <insert pos="after" name="$JustFilled" val="$child"/>
                       <transition target="Evaluating"/>
           </if>
        </foreach>
      </onentry>
      <transition event="ProcessingFormResult.done" target="Evaluating"/>
    </state>    <!-- end ProcessingFormResult -->
    <state id="Evaluating">
      <onentry>
        <send target="activeItem" event="evaluate"/>
      </onentry>
      <transition event="Evaluating.done" target="SelectingItem:FormItemSelection"/>
    </state>    <!-- end Evaluating -->
  </state>  <!-- end Created -->
</scxml>

6.10 Field Module

6.10.1 Syntax

Table 35: <field> Attributes
name The form item variable in the dialog scope that will hold the result. The name must be unique among form items in the form. If the name is not unique, then a badfetch error is thrown when the document is fetched. The name must conform to the variable naming conventions in (TODO).
expr The initial value of the form item variable; default is ECMAScript undefined. If initialized to a value, then the form item will not be visited unless the form item variable is cleared.
cond An expression that must evaluate to true after conversion to boolean in order for the form item to be visited. The form item can also be visited if the attribute is not specified.
type The type of field, i.e., the name of a builtin grammar type ( 6.11 Builtin Grammar Module ). Note that platform support for builtin grammar types is optional. If the specified builtin type is not supported by the platform, an error.unsupported.builtin event is thrown.
slot The name of the grammar slot used to populate the variable (if it is absent, it defaults to the variable name). This attribute is useful in the case where the grammar format being used has a mechanism for returning sets of slot/value pairs and the slot names differ from the form item variable names.
modal If this is false (the default) all active grammars are turned on while collecting this field. If this is true, then only the field's grammars are enabled: all others are temporarily disabled.
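The effect of the modal attribute on grammar activation can be sketched as follows. This is a minimal, illustrative Python sketch; the function name and data shapes are hypothetical and not part of the specification.

```python
def enabled_grammars(field_grammars, other_active_grammars, modal):
    """Compute which grammars are enabled while collecting a field.

    When modal is false (the default), the field's grammars are enabled
    alongside all other active grammars; when modal is true, only the
    field's own grammars are enabled and all others are temporarily
    disabled."""
    if modal:
        return list(field_grammars)
    return list(field_grammars) + list(other_active_grammars)
```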

6.10.2 Semantics

The semantics of field elements are defined using the following resource controllers: Field ( 6.10.2.1 Field RC ), PlayandRecognize ( 6.10.2.2 PlayandRecognize RC ), ...

6.10.2.1 Field RC

The Field Resource Controller is the primary RC for the field element.

6.10.2.1.1 Definition

The field RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this field RC
  • children: array of children's (primary) RC
  • includePrompts: boolean indicating whether prompts are to be played. Default: true.
  • counter: prompt counter. Default: 1.
  • recoResult: the most recent recognition result received from the PlayandRecognize RC. Default: null.
  • name: the form item variable name, from the field's name attribute.
  • expr: the initial value of the form item variable, from the field's expr attribute.
  • cond: the guard expression, from the field's cond attribute.
  • modal: boolean indicating whether only the field's grammars are enabled, from the field's modal attribute. Default: false.
  • slot: the grammar slot name used to populate the variable, from the field's slot attribute.

The field RC's state model consists of the following states: Idle, Initializing, Ready, Preparing, Prepared, Executing and Evaluating.

While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the RC creates a variable in the Datamodel Resource: the variable name corresponds to the name in the RC's data model, and the variable value is set to the value of the RC's data model expr, if this is defined. The field RC then initializes its children: this is modeled as a separate RC (see XXX). When all children are initialized, the RC transitions to the Ready state.

In the Ready state, the field RC can receive a 'checkStatus' event to check whether it can be executed. The values of name and cond in its data model are checked: the status is true if the form item variable named by name is undefined and the value of cond evaluates to true. The status is returned in a 'checkedStatus' event sent back to the controller RC. If the RC receives a 'prepare' event, it updates includePrompts in its data model using the event data, and transitions to the Preparing state.
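The status check described above can be sketched in Python. This is an illustrative sketch, not part of the specification; the sentinel and function names are hypothetical.

```python
UNDEFINED = object()  # stands in for ECMAScript undefined

def check_status(variable_value, cond_result=True):
    """Status returned in the 'checkedStatus' event: true iff the form
    item variable is still undefined and cond evaluates to true (an
    unspecified cond behaves as if it were true)."""
    return variable_value is UNDEFINED and bool(cond_result)
```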

In the Preparing state, the field prepares its prompts and grammars. Prompts are prepared only if the includePrompts data is true; otherwise, prompts within the field are not prepared (e.g. field prompts aren't queued following a <reprompt>). Preparation of prompts is modeled as a separate RC (see XXX), as is preparation of grammars (see YYY). These RCs are summarized below.

Prompts are prepared by iterating through the children array. In the iteration, each prompt RC child is sent a 'checkStatus' event. If the prompt child returns true (its cond parameter evaluates to true), then it is added to a 'correct count' list together with its count. Once the iteration is complete, the RC determines the highest count on the 'correct count' list: the highest count among those on the list that is less than or equal to the current counter value. All children on the 'correct count' list whose count is not this highest count are removed. The RC then iterates through the 'correct count' list and sends an 'execute' event to each remaining prompt RC, causing it to be queued on the PromptQueue Resource.
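The tapered prompt selection rule above can be sketched in Python. The dict-based representation of prompt children is an illustrative assumption, not part of the specification.

```python
def select_prompts(prompt_children, current_count):
    """Tapered prompt selection: keep the children whose cond is true,
    find the highest count that is <= the current prompt counter, and
    return only the prompts carrying that count.

    Each child is modelled as a dict with 'count' and 'cond' keys."""
    correct = [p for p in prompt_children if p["cond"]]
    eligible_counts = [p["count"] for p in correct
                       if p["count"] <= current_count]
    if not eligible_counts:
        return []
    highest = max(eligible_counts)
    return [p for p in correct if p["count"] == highest]
```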

Grammars are prepared by recursing through the children array and sending each grammar RC child an 'execute' event. The grammar RC then, if appropriate, sends an 'addGrammar' event to the DTMF or ASR Recognizer Resource where the grammar itself, its properties and the field RC is sent as the handler for recognition results.
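Grammar preparation can be sketched as follows. The `send` callback and the dict shape of grammar children are illustrative assumptions; the event names follow the prose above.

```python
def prepare_grammars(grammar_children, send):
    """Walk the grammar children, send each an 'execute' event, and (as
    the child would) register the grammar with the matching recognizer,
    naming the field RC as the handler for recognition results."""
    for g in grammar_children:
        send(g["id"], "execute")
        recognizer = "DTMFRecognizer" if g["mode"] == "dtmf" else "ASRRecognizer"
        send(recognizer, "addGrammar", grammar=g["src"], handler="fieldRC")
```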

When prompts and grammars have been prepared, the prompt counter is incremented and the field RC sends a 'prepared' event to its controller with event data indicating its modal status, and then transitions into the Prepared state.

In the Prepared state, the field RC may receive an 'execute' event from its controller. The RC sends an 'execute' event to the PlayAndRecognize RC ( 6.10.2.2 PlayandRecognize RC ), causing any queued prompts to be played and recognition to be initiated. In the event data, the controller is set to this RC, and other data is derived from data model properties. The RC transitions to the Executing state.

In the Executing state, the PlayAndRecognize RC must send a recoResults event (or a noinput, nomatch, or error.semantic event) to the field RC.

If the field RC receives the recoResults, then it updates its name variable in the Datamodel Resource. The field RC then sends a 'fieldResult' event to its controller indicating that a field result has been received and processed.

If the recoResult is received by the field RC's controller, then the field receives an 'evaluate' event which causes it to transition to the Evaluating state.

In the Evaluating state, the field RC iterates through its children executing each filled RC: this is modeled by a separate RC (see XXX). When evaluation is complete, the RC sends an 'evaluated' event to its controller and transitions to the Ready state.

6.10.2.1.2 Defined Events

The Field RC is defined to receive the following events:

Table 36: Events received by Field RC
Event Source Payload Description
initialize any controller(M)
checkStatus controller
prepare controller includePrompts (M)
execute controller
evaluate controller

and the events it sends:

Table 37: Events sent by Field RC
Event Target Payload Description
initialized controller
checkedStatus controller
prepared controller
fieldResult controller
evaluated controller
6.10.2.1.3 External Events

Table 38 shows the events sent and received by the field RC to resources and other RCs which define the events.

Table 38: Field RC External Events
Event Source Target Description
create FieldRC DataModel
assign FieldRC DataModel
execute FieldRC PlayandRecognizeRC
recoResult PlayandRecognizeRC FieldRC
6.10.2.1.4 State Chart Representation
Field RC in UML State Chart

Figure 12: Field RC States

6.10.2.1.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="counter"/>
    <data id="recoResult"/>
    <data id="cond"/>
    <data id="name"/>
    <data id="expr"/>
    <data id="modal"/>
    <data id="slot"/>
    <data id="includePrompts"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$children" expr="new Array()"/>
        <assign location="$counter" expr="1"/>
        <assign location="$recoResult" expr="null"/>
        <assign location="$cond" expr="null"/>
        <assign location="$name" expr="null"/>
        <assign location="$expr" expr="null"/>
        <assign location="$modal" expr="false"/>
        <assign location="$slot" expr="null"/>
        <assign location="$includePrompts" expr="true"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
      </transition>
    </state>
    <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <if cond="$expr neq 'null'">
          <send target="datamodel" event="assign" namelist="$name, $expr"/>
        <else/>
          <send target="datamodel" event="create" namelist="$name"/>
        </if>
        <assign location="$childcounter" expr="0"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done" cond="$childcounter eq $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
      <transition event="Initializing.done">
        <assign location="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error">
        <assign location="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
    </state>
    <!-- end Initializing -->

    <state id="Ready">
      <transition event="checkStatus" >
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus" namelist="_eventData/status"/>
      </transition>
      <transition event="prepare" target="Preparing">
        <assign location="$includePrompts" expr="_eventData/includePrompts"/>
      </transition>
    </state>
    <!-- end Ready -->
    <state id="Preparing">
      <onentry>
        <if cond="$includePrompts eq 'true'">
          <send target="Prompts RC" event="initialize"/>
        </if>
        <send target="Grammars RC" event="initialize"/>
      </onentry>
      <transition event="preparing.done" target="Prepared">
        <send target="controller" event="prepared" namelist="modal"/>
      </transition>
    </state>
    <!-- end Preparing -->
    <state id="Prepared">
      <transition event="execute" target="Executing">
        <send target="PlayAndRecognize" event="execute" namelist="self, inputmodes"/>
      </transition>
    </state>
    <!-- end Prepared-->
    <state id="Executing">
      <datamodel>
        <data id="value"/>
      </datamodel>
      <transition event="playAndReco:recoResult">
        <assign location="$value" expr="processResults($name, $slot, _eventData/result)"/>
        <send target="datamodel" event="assign" namelist="$name, $value"/>
        <send target="controller" event="fieldResult"/>
      </transition>
    </state>
    <!-- end Executing-->
    <state id="Evaluating">
      <onentry>
        <send target="filled RC" event="executeFilleds"/>
      </onentry>
      <transition event="evaluating.done" target="Ready">
        <send target="controller" event="evaluated"/>
      </transition>
    </state>
    <!-- end Evaluating-->
  </state>
  <!-- end Created -->
</scxml>

6.10.2.2 PlayandRecognize RC

The PlayandRecognize RC coordinates media input with Recognizer resources and media output with the PromptQueue Resource.

The following use cases are covered:

  1. Bargein is not active and bargeintype is speech. Prompts are played to completion and the user provides positive input, negative input or no input.
  2. Bargein is active and bargeintype is speech. Prompts are played to completion and the user provides positive input, negative input or no input.
  3. Bargein is active and bargeintype is speech. User interrupts prompts and the user provides positive input, negative input or no input.
  4. Bargein is not active and bargeintype is hotword. Prompts are played to completion and the user provides positive input, negative input or no input. User may provide a positive input after one or more negative inputs. The 'nomatch' event is never generated.
  5. Bargein is active and bargeintype is hotword. User interrupts prompts and the user provides positive input, negative input or no input. User may provide a positive input after one or more negative inputs. The 'nomatch' event is never generated.
  6. Prompt sequences alternating between bargein and no bargein.
  7. Prompt sequences alternating between speech and hotword bargeintype.
Editorial note  
Open issue: should we remove the possibility for alternating speech and hotword bargein modes within the recognition cycle?
6.10.2.2.1 Definition

The PlayandRecognize RC coordinates media input with recognition resources and media output with the PromptQueue Resource on behalf of a form item.

This RC activates prompt queue playback, activates recognition resources, manages bargein behavior and handles results from recognition resources.

The RC is defined in terms of a data model and a state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this RC
  • bargein: Boolean indicates whether bargein is active or not. Default: true.
  • bargeintype: indicates the type of bargein, if active. Default: speech.
  • inputmodes: active recognition input modes. Default: voice and dtmf.
  • inputtimeout: timeout to wait for input. Default: 0s. (Required since the prompt queue may be empty).
  • dtmfProps: DTMF properties
  • asrProps: Speech recognition properties
  • maxnbest: maximum number of nbest results. Default: 1.
  • recoActive: boolean indicating whether recognition is active. Default: false.
  • markname: string indicating current markname. Default: null
  • marktime: time designator indicating current marktime. Default: 0s.
  • recoResult:
  • recoListener:
  • activeGrammars: Boolean indicating whether grammars are active. Default: false.

The RC model consists of the following states: idle, prepare recognition resources, start playing, playing prompts with bargein, playing prompts without bargein, recognizing with a timer, waiting for input, waiting for speech result and update results. The complexity of this model is partially a consequence of supporting the relationship between hotword bargein and recognition result processing.

While in the idle state, the RC may receive an 'execute' event, whose event data is used to update the data model. The event information includes: controller, inputmodes, inputtimeout, dtmfProps, asrProps and maxnbest. The RC then transitions to the prepare recognition resources state.

In the prepare recognition resources state, the RC sends 'prepare' events to the ASR and DTMF recognition resources. Both events specify this RC as the controller parameter, while the properties parameter differs. In this state, the RC can receive 'prepared' or 'notPrepared' events from either recognition resource. If neither resource returns a 'prepared' event, then activeGrammars is false (i.e. there is no active DTMF or speech grammar) and the RC sends an 'error.semantic' event to the controller and exits. If at least one resource returns a 'prepared' event, then the RC moves into the start playing state.

The start playing state begins by sending the PromptQueue resource a 'play' event. The PromptQueue responds with a 'playDone' event if there are no prompts in the prompt queue; as a result, this RC moves into the start recognizing with timer state. If there is at least one prompt in the queue, the PromptQueue sends this RC a 'playStarted' event whose data contains the bargein and bargeintype values for the first prompt, and the input timeout value for the last prompt in the queue. The data model is updated with this information.

Editorial note  
Open issue: PromptQueue Resource doesn't currently have playStarted event. If we don't add playStarted event, then is there a better way to get the bargein, bargeintype, and timeout information from the prompts in the PromptQueue?

Interaction with the recognizer during prompt playback is determined by the data model's bargein value. If bargein is true, then this RC transitions to the playing with bargein state. If bargein is false, the RC transitions to the playing without bargein state.

Editorial note  
Open Issue: The event "bargeinChange" as a one way notification could pose a problem, as it takes finite time for recognizer to suspend or resume. This might work better if PromptQueue Resource waited for an event "bargeinChangeAck" (or similar) from PlayandRecognize RC before starting the next play. PlayandRecognize RC will send the event "bargeinChangeAck" after it completed suspend or resume action on the recognizer.

In the playing without bargein state, recognition is suspended if it has been previously activated (the recoActive parameter of the data model tracks this). Suspending recognition is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is suspended; if 'voice' is in inputmodes, then ASR recognition is suspended. In this state, the PromptQueue can report to this RC changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event with the value 'hotword' or 'speech' causes the data model parameter 'bargein' to be set to 'true' and the 'bargeintype' parameter to be updated with the event data value. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
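The suspension rule above can be sketched in Python. The function name and list-based return value are illustrative, not part of the specification.

```python
def recognizers_to_suspend(inputmodes, reco_active):
    """In the 'playing without bargein' state, suspension only applies
    if recognition was previously activated (recoActive), and only for
    the modes listed in inputmodes."""
    if not reco_active:
        return []
    return [m for m in ("dtmf", "voice") if m in inputmodes]
```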

In the playing with bargein state, recognition is activated if it has not been previously activated (determined by the recoActive parameter in the data model). Activating recognition is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. In this state, the PromptQueue can report changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event where the event data value is not 'unbargeable' causes the data model 'bargeintype' parameter to be updated with the event data ('hotword' or 'speech'), while a 'bargeintypeChange' event where the event data value is 'unbargeable' causes the data model 'bargein' parameter to be set to false and the RC transitions to the playing without bargein state. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
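The 'bargeintypeChange' handling in the two playing states can be sketched as a transition function. This is an illustrative sketch; the dict-based return value is an assumption, not part of the specification.

```python
def on_bargeintype_change(value, state):
    """In 'PlayingWithBargein', 'unbargeable' turns bargein off and
    moves to 'PlayingWithoutBargein'; any other value just updates
    bargeintype. In 'PlayingWithoutBargein', 'speech' or 'hotword'
    turns bargein back on and moves to 'PlayingWithBargein'."""
    if state == "PlayingWithBargein":
        if value == "unbargeable":
            return {"bargein": False, "state": "PlayingWithoutBargein"}
        return {"bargeintype": value, "state": state}
    # PlayingWithoutBargein
    if value in ("speech", "hotword"):
        return {"bargein": True, "bargeintype": value,
                "state": "PlayingWithBargein"}
    return {"state": state}
```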

Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine if the recognition result is positive or negative.

Editorial note  
Open issue: further details on recognition processing to be added in later versions.

The recoResults data parameter is updated with the recognition results (truncated to maxnbest). A speech result is positive iff there is at least one result whose confidence level is equal to or greater than the recognition confidence level; otherwise the result is negative. DTMF results are always positive. The recoListener data parameter is defined as the listener associated with the best result if the result is positive.

If positive, the RC sends the PromptQueue a 'halt' event, and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.
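The positive/negative classification described above can be sketched in Python. The function name and the dict shape of n-best entries are illustrative assumptions; the confidence comparison follows the rule stated above.

```python
def classify_reco_result(nbest, confidence_level, maxnbest, is_dtmf=False):
    """Truncate the n-best list to maxnbest entries. A speech result is
    positive iff at least one hypothesis meets the recognition
    confidence level; DTMF results are always positive. A positive
    result carries the listener associated with the best hypothesis."""
    results = nbest[:maxnbest]
    if is_dtmf:
        listener = results[0]["listener"] if results else None
        return results, True, listener
    positive = any(r["confidence"] >= confidence_level for r in results)
    listener = None
    if positive:
        best = max(results, key=lambda r: r["confidence"])
        listener = best["listener"]
    return results, positive, listener
```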

In the start recognizing with timer state, an input timer is started with the value of the inputtimeout data parameter and, if recognition is not already active (determined by the recoActive data parameter), recognition is activated. Recognition activation is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. The RC then transitions into the waiting for input state.

In the waiting for input state, the RC waits for user input. If it receives a 'timerExpired' event, then the RC sends a 'stop' event to all recognition resources, sends a 'noinput' event to its controller and exits. Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine if the recognition result is positive or negative. If positive, the RC cancels the timer, and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.
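The event dispatch in this state can be sketched as follows. This is an illustrative sketch only; the string action labels are assumptions standing in for the sends and transitions described above.

```python
def waiting_for_input(event, bargeintype, positive=None):
    """Next actions per event in the 'waiting for input' state: timer
    expiry ends the cycle with noinput; inputStarted (speech bargein)
    moves on to wait for the speech result; hotword results are
    classified in place."""
    if event == "timerExpired":
        return ["stop recognizers", "send noinput", "exit"]
    if event == "inputStarted" and bargeintype == "speech":
        return ["cancel timer", "enter WaitingForSpeechResult"]
    if event == "recoResults" and bargeintype == "hotword":
        if positive:
            return ["cancel timer", "enter UpdateResults"]
        return ["send listen to source recognizer"]
    return []
```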

In the waiting for speech result state, the RC waits for a 'recoResult' event whose data is used to update the recoResult data parameter and to set the recoListener data parameter if the recognition result is positive. The RC then transitions to the update results state.

In the update results state, the RC sends 'assign' events to the data model resource, so that the lastresult object in application scope is updated with recognition results as well as markname and marktime information. If the recoListener data parameter is defined, then the RC sends a 'recoResult' event to the recognition listener RC; otherwise, it sends 'nomatch' event to its controller. The RC then exits.
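The update results behavior can be sketched in Python. The `send` callback and the dict-based datamodel are illustrative assumptions; the routing between listener and nomatch follows the prose above.

```python
def update_results(datamodel, reco_results, markname, marktime,
                   reco_listener, send):
    """Sketch of the 'update results' state: lastresult in application
    scope is updated with the results plus mark information, then the
    outcome is routed either to the recognition listener RC or back to
    the controller as nomatch."""
    datamodel["application.lastresult$"] = {
        "results": reco_results,
        "markname": markname,
        "marktime": marktime,
    }
    if reco_listener is not None:
        send(reco_listener, "recoResult", results=reco_results)
    else:
        send("controller", "nomatch")
```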

Editorial note  

Open issue: Behavior if one reco resource sends 'inputStarted' but other sends 'recoResults'? Race conditions between recognizers returning results? (This problem is inherent to the presence of two recognizers. For the sake of clear semantics, we could restrict only one recognizer to respond with 'inputStarted' and 'recoResults'. The other recognizer is always 'stopped'. But a better choice might be to have only one recognizer that handles both DTMF and speech, since semantically both recognizers are very similar.)

6.10.2.2.2 Defined Events

The PlayandRecognize RC is defined to receive the following events:

Table 39: Events received by PlayandRecognize RC
Event Source Payload Sequencing Description
execute any controller(M), inputmodes (O), inputtimeout (O), dtmfProps (M), recoProps (M), maxnbest (O)

and the events it sends:

Table 40: Events sent by PlayandRecognize RC
Event Target Payload Sequencing Description
recoResult any results (M) one-of: nomatch, noinput, error.*, recoResult
nomatch controller one-of: nomatch, noinput, error.*, recoResult
noinput controller one-of: nomatch, noinput, error.*, recoResult
error.semantic controller one-of: nomatch, noinput, error.*, recoResult
error.badfetch.grammar controller one-of: nomatch, noinput, error.*, recoResult
error.noresource controller one-of: nomatch, noinput, error.*, recoResult
error.unsupported.builtin controller one-of: nomatch, noinput, error.*, recoResult
error.unsupported.format controller one-of: nomatch, noinput, error.*, recoResult
error.unsupported.language controller one-of: nomatch, noinput, error.*, recoResult
6.10.2.2.3 External Events

The events in Table 41 are sent by the PlayandRecognize RC to resources which define the events.

Table 41: External Events send by PlayandRecognize RC
Event Target Payload Sequencing Description
play PromptQueue
halt PromptQueue
prepare Recognizer
listen Recognizer
suspend Recognizer
stop Recognizer

The events in Table 42 are received by this RC. Their definition is provided by the sending component.

Table 42: External Events received by PlayandRecognize RC
Event Source Payload Sequencing Description
playStarted PromptQueue bargein (O), bargeintype (O), inputtimeout (O) pq:play notification
playDone PromptQueue markname (O), marktime (O) pq:play response
bargeinChange PromptQueue bargein (M)
bargeintypeChange PromptQueue bargeintype (M)
prepared Recognizer prepare positive response
notPrepared Recognizer prepare negative response
inputStarted Recognizer
recoResult Recognizer results (M), listener (O)
6.10.2.2.4 State Chart Representation

The main states for the PlayandRecognize RC are shown in Figure 13.

Play and Recognize RC States

Figure 13: PlayandRecognize RC States

6.10.2.2.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="bargein"/>
    <data id="bargeintype"/>
    <data id="inputmodes"/>
    <data id="inputtimeout"/>
    <data id="dtmfProps"/>
    <data id="asrProps"/>
    <data id="maxnbest"/>
    <data id="recoActive"/>
    <data id="markname"/>
    <data id="marktime"/>
    <data id="recoResult"/>
    <data id="recoListener"/>
    <data id="activeGrammars"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$bargein" expr="true"/>
        <assign location="$bargeintype" expr="'speech'"/>
        <assign location="$inputmodes" expr="'voice dtmf'"/>
        <assign location="$inputtimeout" expr="'0s'"/>
        <assign location="$dtmfProps" expr="null"/>
        <assign location="$asrProps" expr="null"/>
        <assign location="$maxnbest" expr="1"/>
        <assign location="$recoActive" expr="false"/>
        <assign location="$markname" expr="null"/>
        <assign location="$marktime" expr="'0s'"/>
        <assign location="$recoResult" expr="null"/>
        <assign location="$recoListener" expr="null"/>
        <assign location="$activeGrammars" expr="false"/>
      </onentry>
      <transition event="execute" target="PrepareRecognitionResources">
        <assign location="$controller" expr="_eventData/controller"/>
        <assign location="$inputmodes" expr="_eventData/modes"/>
        <assign location="$inputtimeout" expr="_eventData/timeout"/>
        <assign location="$dtmfProps" expr="_eventData/dtmfProps"/>
        <assign location="$asrProps" expr="_eventData/asrProps"/>
        <assign location="$maxnbest" expr="_eventData/maxnbest"/>
      </transition>
    </state>
    <!-- end Idle -->
    <state id="PrepareRecognitionResources">
      <transition target="StartPlaying" cond="$activeGrammars eq 'true'"/>
      <transition target="Exit" cond="$activeGrammars eq 'false'">
        <send target="controller" event="error.semantic"/>
      </transition>
    </state>
    <!-- end PrepareRecognitionResources -->
    <state id="StartPlaying">
      <onentry>
        <send target="PromptQueue" event="pq:play"/>
      </onentry>
      <transition event="pq:playStarted" cond="_eventData/bargein eq 'true'" target="PlayingWithBargein">
        <assign location="$bargein" expr="_eventData/bargein"/>
      </transition>
      <transition event="pq:playStarted" cond="_eventData/bargein eq 'false'" target="PlayingWithoutBargein">
        <assign location="$bargein" expr="_eventData/bargein"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer"/>
    </state>
    <!-- end StartPlaying -->
    <state id="PlayingWithoutBargein">
      <onentry>
        <if cond="$recoActive eq 'true'">
          <if cond="in('dtmf',$inputmodes)">
            <send target="DTMFRecognizer" event="rec:suspend"/>
          </if>
          <if cond="in('voice',$inputmodes)">
            <send target="ASRRecognizer" event="rec:suspend"/>
          </if>
        </if>
      </onentry>
      <transition event="bargeintypeChange" cond="_eventData/value neq 'unbargeable'" target="PlayingWithBargein">
        <assign location="$bargein" expr="true"/>
        <assign location="$bargeintype" expr="_eventData/value"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer">
        <assign location="$markname" expr="_eventdata/markname"/>
        <assign location="$marktime" expr="_eventdata/marktime"/>
      </transition>
    </state>
    <!-- end PlayingWithoutBargein -->
    <state id="PlayingWithBargein">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <onentry>
        <if cond="in('dtmf',$inputmodes)">
          <send target="DTMFRecognizer" event="rec:listen"/>
        </if>
        <if cond="in('voice',$inputmodes)">
          <send target="ASRRecognizer" event="rec:listen"/>
        </if>
        <assign location="$recoActive" expr="true"/>
      </onentry>
      <transition event="bargeintypeChange" cond="_eventData/value neq 'unbargeable'">
        <assign location="$bargeintype" expr="_eventData/value"/>
      </transition>
      <transition event="bargeintypeChange" cond="_eventData/value eq 'unbargeable'" target="PlayingWithoutBargein">
        <assign location="$bargein" expr="false"/>
      </transition>
      <transition event="rec:recoResult">
        <assign location="$negorpos" expr="processRecoResult()"/>
        <send target="parent" event="negorpos"/>
      </transition>
      <transition event="negativeRecoResult">
        <send target="rec_source" event="listen"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer">
        <assign location="$markname" expr="_eventdata/markname"/>
        <assign location="$marktime" expr="_eventdata/marktime"/>
      </transition>
      <transition event="positiveRecoResult">
        <send target="PromptQueue" event="pq:halt"/>
      </transition>
      <transition event="rec:inputStarted" cond="$bargeintype eq 'speech'">
        <send target="PromptQueue" event="pq:halt"/>
      </transition>
    </state>
    <!-- end PlayingWithBargein -->
    <state id="StartRecognizingWithTimer">
      <onentry>
        <send target="Timer" event="start" namelist="$inputtimeout"/>
        <if cond="$recoActive eq 'false'">
          <if cond="in('dtmf',$inputmodes)">
            <send target="DTMFRecognizer" event="rec:listen"/>
          </if>
          <if cond="in('voice',$inputmodes)">
            <send target="ASRRecognizer" event="rec:listen"/>
          </if>
          <assign location="$recoActive" expr="true"/>
        </if>
      </onentry>
      <transition target="WaitingForInput"/>
    </state>
    <!-- end StartRecognizingWithTimer-->
    <state id="WaitingForInput">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <transition event="rec:recoResult">
        <assign location="$negorpos" expr="processResults()"/>
        <send target="parent" event="negorpos"/>
      </transition>
      <transition event="negativeRecoResult">
        <send target="rec_source" event="listen"/>
      </transition>
      <transition event="timerExpired" target="Exit">
        <send target="Recognizer" event="rec:stop"/>
        <send target="controller" event="noinput"/>
      </transition>
      <transition event="rec:inputStarted" cond="$bargeintype eq 'speech'" target="WaitingForSpeechResult">
        <send target="Timer" event="cancel"/>
      </transition>
    </state>
    <!-- end WaitingForInput-->
    <state id="WaitingForSpeechResult">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <!--TBD: the original diagram seems put the event at the wrong place-->
      <transition event="rec:recoResult" target="UpdateResults">
        <assign location="$negorpos" expr="processResults()"/>
        <send target="parent" event="negorpos"/>
      </transition>
    </state>
    <!-- end WaitingForSpeechResult-->
    <state id="UpdateResults">
      <onentry>
        <send target="datamodel" event="assign" namelist="application.lastresult$, recoResults"/>
        <if cond="$recoListener neq 'null'">
          <send target="recoListener" event="recoResult" namelist="recoResults"/>
          <else/>
          <send target="controller" event="nomatch"/>
        </if>
      </onentry>
      <transition target="Exit"/>
    </state>
    <!-- end UpdateResults-->
  </state>
  <!-- end Created -->
</scxml>

6.11 Builtin Grammar Module

6.11.1 Usage of Platform Grammars

VoiceXML developers are commonly required to sketch out an application for the purpose of a demo or other proof of concept. In such cases, it is convenient to use placeholder grammars for frequent dialogs like collecting a date, asking a yes/no question, etc. Builtin grammars (provided by the platform) are designed to serve this purpose.

Once the prototyping phase is complete, however, it is good practice to replace the builtin grammar references with developer written grammars. There are several reasons behind this suggestion:

  • There is little consistency of the builtin implementations across platforms. Relying on builtins complicates portability.
  • Application developers typically have no control over the coverage in the builtin grammar. This means that any modification to the accepted phrase(s) would require a completely new grammar. Discovering a limitation like this post-deployment could be disruptive.
  • Similar to the above, builtins are limited in their ability to handle underspecified spoken input. For instance, "20 peso" cannot be resolved to a specific [ISO4217] currency code because "peso" is the name of the currency of numerous nations. In such cases the platform may return a specific currency code according to the language, or may omit the currency code. Edge cases like this are not likely to be handled by platform builtins.

6.11.2 Platform Requirements

Support for builtin grammars is not required for conformance. But if a platform does support builtin types, then it MUST follow the description given in this module as closely as possible. This includes:
  • Supporting all builtin types for a given language. In other words, if a platform supports one builtin in a language, then it ought to support the others as well.
  • Following the type descriptions listed in the "Syntax and Semantics" section below. This requirement is primarily designed to ensure at least some amount of consistency on how the NL from the grammar is accessed.
  • Supporting both voice and DTMF modes for each type.
  • Supporting the corresponding <say-as> class for the grammar type.

6.11.3 Syntax and Semantics

Builtin grammars may be specified in one of two ways:

  • Through the "type" attribute on the <field> tag (see Table 35). For example <field type="boolean">. Note that when a builtin is specified in this way, it is in addition to any <grammar> under the <field>.
  • Using the "builtin" protocol for the grammar URI. For example: <grammar src="builtin:boolean"/>

Each builtin type has a convention for the format of the value returned. These are independent of language and of the implementation. The return type for builtin fields is a string except for the boolean field type. To access the actual recognition result, the author can reference the <field> shadow variable "name$.utterance". Alternatively, the developer can access application.lastresult$, where application.lastresult$.interpretation has the same string value as application.lastresult$.utterance.

Table 43: Builtin Grammar Types
Type Description
boolean Inputs include affirmative and negative phrases appropriate to the current language. DTMF 1 is affirmative and 2 is negative. The result is ECMAScript true for affirmative or false for negative. The value will be submitted as the string "true" or the string "false". If the field value is subsequently used in <say-as> with the interpret-as value "vxml:boolean", it will be spoken as an affirmative or negative phrase appropriate to the current language.
date Valid spoken inputs include phrases that specify a date, including a month, day and year. DTMF inputs are: four digits for the year, followed by two digits for the month, and two digits for the day. The result is a fixed-length date string with format yyyymmdd, e.g. "20000704". If the year is not specified, yyyy is returned as "????"; if the month is not specified mm is returned as "??"; and if the day is not specified dd is returned as "??". If the value is subsequently used in <say-as> with the interpret-as value "vxml:date", it will be spoken as a date phrase appropriate to the current language.
digits Valid spoken or DTMF inputs include one or more digits, 0 through 9. The result is a string of digits. If the result is subsequently used in <say-as> with the interpret-as value "vxml:digits", it will be spoken as a sequence of digits appropriate to the current language. A user can say for example "two one two seven", but not "twenty one hundred and twenty-seven". A platform may support constructs such as "two double-five eight".
currency Valid spoken inputs include phrases that specify a currency amount. For DTMF input, the "*" key will act as the decimal point. The result is a string with the format UUUmm.nn, where UUU is the three character currency indicator according to ISO standard 4217 [ISO4217], or mm.nn if the currency is not spoken by the user or if the currency cannot be reliably determined (e.g. "dollar" and "peso" are ambiguous). If the field is subsequently used in <say-as> with the interpret-as value "vxml:currency", it will be spoken as a currency amount appropriate to the current language.
number Valid spoken inputs include phrases that specify numbers, such as "one hundred twenty-three", or "five point three". Valid DTMF input includes positive numbers entered using digits and "*" to represent a decimal point. The result is a string of digits from 0 to 9 and may optionally include a decimal point (".") and/or a plus or minus sign. ECMAScript automatically converts result strings to numerical values when used in numerical expressions. The result must not use a leading zero (which would cause ECMAScript to interpret it as an octal number). If the field is subsequently used in <say-as> with the interpret-as value "vxml:number", it will be spoken as a number appropriate to the current language.
phone Valid spoken inputs include phrases that specify a phone number. DTMF asterisk "*" represents "x". The result is a string containing a telephone number consisting of a string of digits and optionally containing the character "x" to indicate a phone number with an extension. For North America, a result could be "8005551234x789". If the field is subsequently used in <say-as> with the interpret-as value "vxml:phone", it will be spoken as a phone number appropriate to the current language.
time Valid spoken inputs include phrases that specify a time, including hours and minutes. The result is a five character string in the format hhmmx, where x is one of "a" for AM, "p" for PM, "h" to indicate a time specified using 24 hour clock, or "?" to indicate an ambiguous time. Input can be via DTMF. Because there is no DTMF convention for specifying AM/PM, in the case of DTMF input, the result will always end with "h" or "?". If the field is subsequently used in <say-as> with the interpret-as value "vxml:time", it will be spoken as a time appropriate to the current language.
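The fixed-length return formats for the "date" and "time" types above can be sketched as small helpers. This is a minimal, hypothetical illustration in Python; the function names are not part of the specification.

```python
# Hypothetical helpers illustrating the fixed-length return formats in
# Table 43. These names are illustrative only, not part of VoiceXML 3.0.

def builtin_date_result(year=None, month=None, day=None):
    """Return the 8-character yyyymmdd string; unspecified parts become '?'."""
    yyyy = f"{year:04d}" if year is not None else "????"
    mm = f"{month:02d}" if month is not None else "??"
    dd = f"{day:02d}" if day is not None else "??"
    return yyyy + mm + dd

def builtin_time_result(hours, minutes, meridiem="?"):
    """Return the 5-character hhmmx string.

    meridiem is 'a' (AM), 'p' (PM), 'h' (24-hour clock) or '?' (ambiguous).
    DTMF input always yields 'h' or '?'.
    """
    assert meridiem in ("a", "p", "h", "?")
    return f"{hours:02d}{minutes:02d}{meridiem}"
```

For example, a fully specified July 4, 2000 yields "20000704", while the same date with an unspecified year yields "????0704".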

Both the "boolean" and "digits" types can be parameterized as follows:

Table 44: Digit and Boolean Grammar Parameterization
digits?minlength=n A string of at least n digits. Applicable to speech and DTMF grammars. If minlength conflicts with either the length or maxlength attributes then an error.badfetch event is thrown.
digits?maxlength=n A string of at most n digits. Applicable to speech and DTMF grammars. If maxlength conflicts with either the length or minlength attributes then an error.badfetch event is thrown.
digits?length=n A string of exactly n digits. Applicable to speech and DTMF grammars. If length conflicts with either the minlength or maxlength attributes then an error.badfetch event is thrown.
boolean?y=d A grammar that treats the keypress d as an affirmative answer. Applicable only to the DTMF grammar.
boolean?n=d A grammar that treats the keypress d as a negative answer. Applicable only to the DTMF grammar.

Note that more than one parameter may be specified, separated by the ";" character. This is illustrated in the last example below.
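The type?param=value;param=value syntax and the length-conflict rules in Table 44 can be sketched as follows. This parser is a hypothetical illustration of the described behavior, not a mandated implementation.

```python
# Sketch of parsing a parameterized builtin type such as
# "digits?minlength=3;maxlength=5" and checking the length-attribute
# conflicts that Table 44 says must raise error.badfetch.
# Illustrative only; not part of the specification.

class BadFetchError(Exception):
    """Stands in for the error.badfetch event."""

def parse_builtin(type_string):
    base, _, query = type_string.partition("?")
    params = {}
    if query:
        # Multiple parameters are separated by ";".
        for pair in query.split(";"):
            name, _, value = pair.partition("=")
            params[name] = value
    if base == "digits":
        if "length" in params and ("minlength" in params or "maxlength" in params):
            raise BadFetchError("length conflicts with minlength/maxlength")
        lo, hi = params.get("minlength"), params.get("maxlength")
        if lo is not None and hi is not None and int(lo) > int(hi):
            raise BadFetchError("minlength exceeds maxlength")
    return base, params
```

Parameters undefined for a given grammar type (e.g. y= on a speech grammar) would simply remain in the parameter map and be ignored by the platform.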

6.11.4 Examples

A <field> element with a builtin grammar type. In this example, the boolean type indicates that inputs are various forms of true and false. The value actually put into the field is either true or false. The field would be read out using the appropriate affirmative or negative response in prompts.

            <field name="lo_fat_meal" type="boolean">
                
                <prompt>
                    Do you want a low fat meal on this flight?
                </prompt>
                <help>
                    Low fat means less than 10 grams of fat, and under
                    250 calories.
                </help>
                <filled>
                    <prompt>
                        I heard <emphasis><say-as interpret-as="vxml:boolean">
                            <value expr="lo_fat_meal"/></say-as></emphasis>.
                    </prompt>
                </filled>
            </field>            
            

In the next example, digits indicates that input will be spoken or keyed digits. The result is stored as a string and rendered as digits using <say-as> with "vxml:digits" as the value of the interpret-as attribute, i.e., "one-two-three", not "one hundred twenty-three". The <filled> action tests the field to see if it has 12 digits. If not, the user hears the error message.

                <field name="ticket_num" type="digits">
                    <prompt>
                        Read the 12 digit number from your ticket.
                    </prompt>
                    <help>The 12 digit number is to the lower left.</help>
                    <filled>
                        <if cond="ticket_num.length != 12">
                            <prompt>
                                Sorry, I didn't hear exactly 12 digits.
                            </prompt>
                            <assign name="ticket_num" expr="undefined"/>
                            <else/>
                            <prompt>I heard <say-as interpret-as="vxml:digits"> 
                                <value expr="ticket_num"/></say-as>
                            </prompt>
                        </if>
                    </filled>
                </field>            
            

The builtin boolean grammar and builtin digits grammar can be parameterized. This is done by explicitly referring to builtin grammars using a platform-specific builtin URI scheme and a URI-style query syntax of the form type?param=value, either in the src attribute of a <grammar> element or in the type attribute of a <field>. In this example, the <grammar> parameterizes the builtin DTMF grammar; the first <field> parameterizes the builtin DTMF grammar (the speech grammar will be activated as normal); and the second <field> parameterizes both the builtin DTMF and speech grammars. Parameters which are undefined for a given grammar type will be ignored; for example, "builtin:grammar/boolean?y=7".

            <grammar src="builtin:dtmf/boolean?y=7;n=9"/>
            
            <field type="boolean?y=7;n=9">
                <prompt>
                    If this is correct say yes or press seven, if not, say no or press nine.
                </prompt>
            </field>
            
            <field type="digits?minlength=3;maxlength=5">
                <prompt>Please enter your passcode</prompt>
            </field>            
        

6.12 Data Access and Manipulation Module

6.12.1 Overview

Information in the Data layer must be easily accessible and easily editable throughout the VoiceXML 3.0 document. The Data Access and Manipulation Module describes the necessary mechanics by which application developers can express such interactions with the Data layer. Implementers must augment the supported data access and manipulation languages to provide the capabilities described in this section.

The remainder of this section covers the semantics of the Data Access and Manipulation Module in Section 6.12.2 and the corresponding syntax in Section 6.12.3. Backward compatibility with VoiceXML 2.1 is discussed in Section 6.12.4.

6.12.2 Semantics

The semantics of Data Access and Manipulation can be described in terms of the various scopes in VoiceXML 3.0, the relevance to platform properties, the corresponding implicit variables that platforms must support, the variable resolution mechanism, standard session and application variables and the set of legal data values and expressions.

6.12.2.1 The scope stack

Access to data is controlled by means of scopes, which are conceptually stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "session" is created. Thereafter scopes are explicitly created and destroyed by the data model resource's clients as necessary. Likewise, during the lifetime of each scope, data is added, read, updated and deleted by the data model resource's clients as necessary.

Implementation note: The API is defined in 5.1.1 Data Model Resource API.
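The scope stack described above can be sketched as follows. This is a minimal sketch with method names loosely modeled on the 5.1.1 Data Model Resource API; the exact signatures are illustrative assumptions, not the normative API.

```python
# Illustrative sketch of the scope stack: a single "session" scope exists
# after initialization; clients push and pop further scopes, and data
# operations default to the scope on top of the stack unless a scope is
# named explicitly. Method names are assumptions, not the normative API.

class DataModel:
    def __init__(self):
        self._stack = [("session", {})]  # created at initialization time

    def push_scope(self, name):
        self._stack.append((name, {}))

    def pop_scope(self):
        name, _ = self._stack.pop()
        return name

    def _find(self, scope):
        # Search from the top of the stack down for the named scope.
        for name, variables in reversed(self._stack):
            if name == scope:
                return variables
        raise KeyError(scope)  # would surface as error.semantic

    def create_variable(self, name, value, scope=None):
        # Default: the scope on top of the stack.
        variables = self._stack[-1][1] if scope is None else self._find(scope)
        variables[name] = value

    def read_variable(self, name, scope=None):
        variables = self._stack[-1][1] if scope is None else self._find(scope)
        return variables[name]
```

For example, a client can create a variable in the session scope, push an application scope, and then create and read variables there by default.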

At any given point in time, based on the VoiceXML document structure and the execution state, the stack may contain the following scopes, whose semantics in VoiceXML 3.0 are as follows (bottom to top):

Table 45: Variable Scopes
session These are read-only variables that pertain to an entire user session. They are declared and set by the interpreter context. New session variables cannot be declared by VoiceXML documents.
application These are declared with <var> and <script> elements that are children of the application root document's <vxml> element. They are initialized when the application root document is loaded. They exist while the application root document is loaded, and are visible to the root document and any other loaded application leaf document. Note that while executing inside the application root document, document.x is equivalent to application.x.
document These variables are declared with <var> and <script> elements that are children of the document's <vxml> element. They are initialized when the document is loaded. They exist while the document is loaded. They are visible only within that document, unless the document is an application root, in which case the variables are visible by leaf documents through the application scope only.
dialog Each dialog (<form> or <menu>) has a dialog scope that exists while the user is visiting that dialog, and which is visible to the elements of that dialog. Dialog scope contains the following variables: variables declared by <var> and <script> child elements of <form>, form item variables, and form item shadow variables. The child <var> and <script> elements of <form> are initialized when the form is first visited, as opposed to <var> elements inside executable content which are initialized when the executable content is executed.
(anonymous) Each <block>, <filled>, and <catch> element defines a new anonymous scope to contain variables declared in that element.
6.12.2.2 Relevance of scope stack to properties

Properties are discussed in detail in 8.2 Properties. Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Thus, access to properties is also controlled by means of the same scope stack that is used by named variables.

VoiceXML 3.0 provides a consistent mechanism to unambiguously read these properties in any scope using the data access and manipulation language in a manner similar to accessing and manipulating named variables. This is described in the two sections below.

6.12.2.3 Implicit variables

VoiceXML 3.0 provides several implicit variables in the data access and manipulation language to unambiguously identify the various scopes in the scope stack. Whenever the corresponding scopes are available, they can be referenced under specific names, which are always the same regardless of the location in the VoiceXML document. Additionally, an implicit variable "properties$" is available in each scope which points to the defined properties for that scope.

Table 46: Implicit Variables
session This implicit variable refers to the session scope.
application This implicit variable refers to the application scope.
document This implicit variable refers to the document scope.
dialog This implicit variable refers to the dialog scope.
properties$ This read-only implicit variable refers to the defined properties which affect platform behavior in a given scope. The value is an ECMAScript object with multiple ECMAScript properties as necessary where each ECMAScript property has the name of an existing platform property in that scope and value corresponding to the value of the platform property.

Note that in some data access expression languages (such as XPath), it may be necessary to expose the semantics of implicit variables as expression language functions instead of variables.

Also note that there is no implicit variable corresponding to the anonymous scope since it is not necessary given the variable resolution mechanism described in the next section. Where scope qualifiers are functions, a function to identify the anonymous scope may be necessary.

Finally, the use of the "properties$" implicit variable in VoiceXML 3.0 means that the variable "properties$" is now reserved in all scopes with the semantics described above.

6.12.2.4 Variable resolution

This section describes how named variables are resolved in VoiceXML 3.0. Named variables in expressions may be scope-qualified (using implicit variables) or scope-unqualified.

Some examples of scope-qualified variables that may occur in expressions are listed in the table below.

Table 47: Resolution examples (ECMAScript)
Expression Result
application.hello The value of the "hello" named variable in the application scope.
dialog.retries The value of the "retries" named variable in the dialog scope.
dialog.properties$.bargein The value of the "bargein" platform property defined at the current "dialog" scope.

The above table assumes that all the named variables used in the expressions exist. If any of the named variables do not exist, an error.semantic will result.

In cases where the named variables are unqualified, i.e., there is no implicit variable indicating the scope in use, the following variable resolution mechanism is used:

  • The anonymous scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the dialog scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the document scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the application scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the session scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, an error.semantic is thrown

The steps corresponding to any scopes that do not exist at the time of expression evaluation are ignored. The resolution mechanism begins with the closest enclosing scope in the given document structure.
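The resolution steps above amount to a lookup from the innermost scope outward. A minimal sketch, in which the function name and the scope map are illustrative assumptions:

```python
# Sketch of the unqualified-name resolution order: innermost (anonymous)
# scope first, session last, error.semantic if the name is found nowhere.
# Scopes absent from the stack are simply skipped. Illustrative only.

class SemanticError(Exception):
    """Stands in for the error.semantic event."""

RESOLUTION_ORDER = ("anonymous", "dialog", "document", "application", "session")

def resolve(name, scopes):
    """scopes maps an existing scope name to its dict of variables."""
    for scope in RESOLUTION_ORDER:
        variables = scopes.get(scope)  # skip scopes that do not exist
        if variables is not None and name in variables:
            return variables[name]
    raise SemanticError(name)
```

So a "retries" variable declared in the dialog scope shadows one of the same name in the session scope, while a name defined only in the session scope remains reachable from anywhere.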

6.12.2.5 Standard session variables

The following standard variables are available in the session scope:

session.connection.local.uri
This variable is a URI which addresses the local interpreter context device.
session.connection.remote.uri
This variable is a URI which addresses the remote caller device.
session.connection.protocol.name
This variable is the name of the connection protocol. The name also represents the subobject name for protocol specific information. For instance, if session.connection.protocol.name is 'q931', session.connection.protocol.q931.uui might specify the user-to-user information property of the connection.
session.connection.protocol.version
This variable is the version of the connection protocol.
session.connection.redirect
This variable is an array representing the connection redirection paths. The first element is the original called number, the last element is the last redirected number. Each element of the array contains a uri, pi (presentation information), si (screening information), and reason property. The reason property can be one of "unknown", "user busy", "no reply", "deflection during alerting", "deflection immediate response", or "mobile subscriber not reachable".
session.connection.aai
This variable is application-to-application information passed during connection setup.
session.connection.originator
This variable directly references either the local or remote property (for instance, the following ECMAScript expression returns true if the remote party initiated the connection: var caller_initiate = connection.originator == connection.remote).
6.12.2.6 Standard application variables

The following standard variables are available in the application scope:

application.lastresult$
This variable holds information about the last recognition to occur within this application. It is an array of elements where each element, application.lastresult$[i], represents a possible result through the following variables:
application.lastresult$.confidence
The whole utterance confidence level for this interpretation from 0.0-1.0. A value of 0.0 indicates minimum confidence, and a value of 1.0 indicates maximum confidence. More specific interpretation of a confidence value is platform-dependent.
application.lastresult$.utterance
The raw string of words that were recognized for this interpretation. The exact tokenization and spelling is platform-specific (e.g. "five hundred thirty" or "5 hundred 30" or even "530"). In the case of a DTMF grammar, this variable will contain the matched digit string.
application.lastresult$.inputmode
For this interpretation, the mode in which user input was provided: dtmf or voice.
application.lastresult$.interpretation
An ECMAScript variable containing the interpretation as described in the Semantic Interpretation for Speech Recognition specification [SISR].
application.lastresult$.markname
The name of the mark last executed by the SSML processor before barge-in occurred or the end of audio playback occurred. If no mark was executed, this variable is undefined.
application.lastresult$.marktime
The number of milliseconds that elapsed since the last mark was executed by the SSML processor until barge-in occurred or the end of audio playback occurred. If no mark was executed, this variable is undefined.
application.lastresult$.recording
The variable that stores a reference to the recording, or undefined if no audio is collected. Like the input item variable associated with a <record> element as described in section 2.3.6 of [VXML2], the implementation of this variable may vary between platforms.
application.lastresult$.recordingsize
The size of the recording in bytes, or undefined if no audio is collected.
application.lastresult$.recordingduration
The duration of the recording in milliseconds, or undefined if no audio is collected.

Interpretations are sorted by confidence score, from highest to lowest. Interpretations with the same confidence score are further sorted according to the precedence relationship among the grammars producing the interpretations. Different elements in application.lastresult$ will always differ in their utterance, interpretation, or both.

The number of application.lastresult$ elements is guaranteed to be greater than or equal to one and less than or equal to the system property "maxnbest". If no results have been generated by the system, then "application.lastresult$" shall be ECMAScript undefined.

Additionally, application.lastresult$ itself contains the properties confidence, utterance, inputmode, and interpretation corresponding to those of the 0th element in the ECMAScript array.

All of the shadow variables described above are set immediately after any recognition. In this context, a <nomatch> event counts as a recognition, and causes the value of "application.lastresult$" to be set, though the values stored in application.lastresult$ are platform dependent. In addition, the existing values of field variables are not affected by a <nomatch>. In contrast, a <noinput> event does not change the value of "application.lastresult$". After the value of "application.lastresult$" is set, the value persists (unless it is modified by the application) until the browser enters the next waiting state, when it is set to undefined. Similarly, when an application root document is loaded, this variable is set to the value undefined. The variable application.lastresult$ and all of its components are writeable and can be modified by the application.
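The ordering and mirroring rules above can be sketched as follows. The builder function is a hypothetical illustration of the described semantics, not a platform API.

```python
# Sketch of building application.lastresult$: interpretations sorted by
# confidence (highest first), truncated at maxnbest, with the array itself
# mirroring the 0th element's properties. Illustrative only.

def build_lastresult(interpretations, maxnbest):
    # Python's sort is stable, so interpretations with equal confidence
    # keep their relative order; if the input is ordered by grammar
    # precedence, the precedence tie-break described above is preserved.
    ordered = sorted(interpretations,
                     key=lambda r: r["confidence"], reverse=True)[:maxnbest]
    if not ordered:
        return None  # stands in for ECMAScript undefined
    lastresult = {"elements": ordered}
    # The array itself carries the properties of the 0th (best) element.
    for prop in ("confidence", "utterance", "inputmode", "interpretation"):
        lastresult[prop] = ordered[0][prop]
    return lastresult
```
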

6.12.2.7 Legal variable values and expressions

Any data language available on a VoiceXML 3.0 platform must specify the structure of the underlying data model. For example, with XPath, the variable values (and hence, the constituents of the data model) are XML trees. Such a specification of the data model implicitly defines a set of "legal variable values", namely the objects that can be part of such a data model.

Similarly, any data access and manipulation language available on a VoiceXML 3.0 platform must specify the complete set of valid value expressions via the expression language syntax.

6.12.3 Syntax

The syntax of the Data Access and Manipulation Module is described in terms of full support for CRUD operations (Create, Read, Update, Delete) on the Data layer in Sections 6.12.3.1 through 6.12.3.4. The relevance of this syntax for properties is described in Section 6.12.3.5.

6.12.3.1 Creating variables: the <var> element

The declaration of named variables is done using the <var> element. It can occur in executable content or as a child of <form> or <vxml>.

If it occurs in executable content, it declares a variable in the anonymous scope associated with the enclosing <block>, <filled>, or <catch> element. This declaration is made only when the <var> element is executed. If the variable is already declared in this scope, subsequent declarations act as assignments, as in ECMAScript.

If a <var> is a child of a <form> element, it declares a variable in the dialog scope of the <form>. This declaration is made during the form's initialization phase.

If a <var> is a child of a <vxml> element, it declares a variable in the document scope; and if it is the child of a <vxml> element in a root document then it also declares the variable in the application scope. This declaration is made when the document is initialized; initializations happen in document order.

Attributes of <var>

Table 48: <var> Attributes
name The name of the variable that will hold the result. This attribute must not specify a scope-qualified variable (if a variable is specified with a scope prefix, then an error.semantic event is thrown). The default scope in which the variable is defined is determined from the position in the document at which the element is declared.
expr The initial value of the variable (optional). If there is no expr attribute, the variable retains its current value, if any. Variables start out with the default value determined by the data access expression language in use if they are not given initial values (for example, with ECMAScript the initial value is undefined).
scope The scope within which the named variable must be created (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown.

The addition of the "scope" attribute in VoiceXML 3.0 adds more flexibility for the creation of variables, and allows creation to be decoupled from document location of the <var> element, if desired by the application.

Children of <var>

The children of the <var> element represent an in-line specification of the value of the variable.

If the "expr" attribute is present, then the element must not have any children. Thus "expr" and children are mutually exclusive for the <var> element.

<var> examples

This section is informative.

    <var name="phone" expr="'6305551212'"/>
    <var name="y" expr="document.z+1"/>
    <var name="foo" scope="application" expr="dialog.bar * 2"/>
    <var name="itinerary">
      <root xmlns="">
        <flight>SW123</flight>
        <origin>JFK</origin>
        <depart>2009-01-01T14:32:00</depart>
        <destination>SFO</destination>
        <arrive>2009-01-01T18:14:00</arrive>
      </root>
    </var>
    

The above examples have the following result, in order:

  1. Creates a variable with name "phone" and String value "6305551212" in the closest enclosing scope as determined by the position of this <var> element in the document. If a variable named "phone" is already present in the mentioned scope, its value is updated to the String value "6305551212" (since this is always true, the rest of this section will not repeat this for each example).
  2. Creates a variable with name "y" in the closest enclosing scope as determined by the position of this <var> element in the document and value corresponding to the result of the expression "document.z+1", evaluated when this <var> element is executed.
  3. Creates a variable with name "foo" in the application scope and value corresponding to the result of the expression "dialog.bar * 2", evaluated when this <var> element is executed.
  4. Creates a variable with name "itinerary" in the closest enclosing scope as determined by the position of this <var> element in the document and value specified by the following in-line XML tree (the internal representation may be the corresponding DOM node, for example):
        <root xmlns="">
          <flight>SW123</flight>
          <origin>JFK</origin>
          <depart>2009-01-01T14:32:00</depart>
          <destination>SFO</destination>
          <arrive>2009-01-01T18:14:00</arrive>
        </root>
              
    

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API.

The above examples result in the following Data Model Resource API calls, in order:

  1. At the time of <var> execution, first the value is obtained for the new variable with name "phone" by evaluating the expression "'6305551212'", and subsequently the variable is created in the scope on top of the stack:
    1. Obtain variable value by calling EvaluateExpression("'6305551212'")
    2. Create variable by calling CreateVariable("phone", value) where value is the result obtained in a. above. The optional scope parameter is not specified since the scope on the top of the stack is chosen by default.
  2. At the time of <var> execution, first the value is obtained for the new variable with name "y" by evaluating the expression "document.z+1", and subsequently the variable is created in the scope on top of the stack:
    1. Obtain variable value by calling EvaluateExpression("document.z+1")
    2. Create variable by calling CreateVariable("y", value) where value is the result obtained in a. above.
  3. At the time of <var> execution, first the value is obtained for the new variable with name "foo" by evaluating the expression "dialog.bar * 2", and subsequently the variable is created in application scope:
    1. Obtain variable value by calling EvaluateExpression("dialog.bar * 2")
    2. Create variable by calling CreateVariable("foo", value, "application") where value is the result obtained in a. above.
  4. At the time of <var> execution, the new variable with name "itinerary" is created in the scope on top of the stack using the in-line specification for the value in the body of the <var> element:
    1. Process the in-line specification below into an internal representation for the data model. For example, the assumed XML data model in this example may choose to internally represent this in-line specification as a DOM node.
          <root xmlns="">
            <flight>SW123</flight>
            <origin>JFK</origin>
            <depart>2009-01-01T14:32:00</depart>
            <destination>SFO</destination>
            <arrive>2009-01-01T18:14:00</arrive>
          </root>
                  
      
    2. Create variable by calling CreateVariable("itinerary", node) where node is the DOM node from a. above.
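As an informative illustration of the calls above, a minimal ECMAScript sketch of a scope-stack data model might look like the following. The class layout, Map storage, and pushScope helper are assumptions for illustration only; they are not part of the Data Model Resource API.

```javascript
// Informative sketch (not part of the specification): a toy scope-stack
// data model behind the CreateVariable/ReadVariable calls above.
class DataModel {
  constructor() {
    this.scopes = []; // last entry is the top of the scope stack
  }
  pushScope(name) { this.scopes.push({ name, vars: new Map() }); }
  findScope(name) { return this.scopes.find(s => s.name === name); }
  // CreateVariable(name, value [, scope]); scope defaults to the top of the stack
  CreateVariable(name, value, scope) {
    const target = scope ? this.findScope(scope)
                         : this.scopes[this.scopes.length - 1];
    if (!target) throw new Error("error.semantic"); // unknown scope
    target.vars.set(name, value);
  }
  // ReadVariable(name [, scope]); unqualified names search from the top down
  ReadVariable(name, scope) {
    if (scope) {
      const s = this.findScope(scope);
      if (!s || !s.vars.has(name)) throw new Error("error.semantic");
      return s.vars.get(name);
    }
    for (let i = this.scopes.length - 1; i >= 0; i--) {
      if (this.scopes[i].vars.has(name)) return this.scopes[i].vars.get(name);
    }
    throw new Error("error.semantic");
  }
}

const dm = new DataModel();
["session", "application", "document", "dialog"].forEach(s => dm.pushScope(s));
dm.CreateVariable("phone", "6305551212");    // example 1: lands in "dialog" (top)
dm.CreateVariable("foo", 84, "application"); // example 3: explicit scope
```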
6.12.3.2 Reading variables: "expr" and "cond" attributes and the <value> element

The values of the named variables in the existing scopes in the scope stack are available for introspection and for further computation. These values can be read wherever expressions can be specified in the VoiceXML 3.0 document. Important examples include the "expr" and "cond" attributes on various syntactic elements. The "expr" attribute values are legal expressions as defined by the syntax of the data access and manipulation language (see Section 2.2.7 for details). The "cond" attribute values function as predicates, and in addition to being expressions, must evaluate to a boolean value.

6.12.3.2.1 Inserting variable values in prompts: The <value> element

The <value> element is used to insert the value of an expression into a prompt. 6.4 Prompt Module specifies prompts in detail.

Attributes of <value>

Table 49: <value> attributes
expr The expression to render. See Section 2.2.7 for legal values of expressions.
scope The scope within which the named variables in the expression are resolved (optional). Must be one of session , application , document or dialog . If the specified scope does not exist, then an error.semantic event is thrown.

<value> examples

    <value expr="application.duration + dialog.duration"/>
    <value expr="foo * bar"/>
    <value expr="foo + bar + application.baz" scope="document"/>
    

The above examples render the following, in order:

  1. The value corresponding to the sum (or concatenation, as the case may be) of the "duration" named variable in the application scope and the "duration" named variable in the dialog scope.
  2. The value corresponding to the product of the "foo" and "bar" named variables in the closest enclosing scope (the top of the scope stack).
  3. The value corresponding to the sum (or concatenation, as the case may be) of the "foo" and "bar" named variables in the document scope, and the "baz" named variable in the application scope.

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The above examples result in the following Data Model Resource API calls:

  1. Evaluate the expression "application.duration + dialog.duration" in the closest enclosing scope:
    1. Evaluate the expression by calling EvaluateExpression("application.duration + dialog.duration")
    2. The expression evaluator in turn resolves the scope-qualified variables in the expression by calling ReadVariable("duration", "application") and ReadVariable("duration", "dialog")
  2. Evaluate the expression "foo * bar" in the closest enclosing scope:
    1. Evaluate the expression by calling EvaluateExpression("foo * bar")
    2. The expression evaluator in turn resolves the scope-unqualified variables in the expression by calling ReadVariable("foo") and ReadVariable("bar")
  3. Evaluate the expression "foo + bar + application.baz" in the document scope:
    1. Evaluate the expression by calling EvaluateExpression("foo + bar + application.baz", "document")
    2. The expression evaluator in turn resolves the variables by calling ReadVariable("foo", "document"), ReadVariable("bar", "document") and ReadVariable("baz", "application")
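As an informative illustration, the resolution above can be sketched as a tiny expression evaluator that rewrites variable references into ReadVariable results before evaluating the arithmetic. The store layout, the regex rewriting, and the numeric-only restriction are all assumptions for illustration.

```javascript
// Informative sketch: resolve scope-qualified and unqualified variable
// references in an "expr" value via ReadVariable, then evaluate the
// arithmetic. Only numeric variable values are handled in this sketch.
const store = {
  application: { duration: 12, baz: 3 },
  dialog: { duration: 30 },
  document: { foo: 2, bar: 5 },
};
function ReadVariable(name, scope) {
  if (!(scope in store) || !(name in store[scope])) throw new Error("error.semantic");
  return store[scope][name];
}
// Qualified names (e.g. application.duration) resolve in the named scope;
// unqualified names resolve in defaultScope (standing in for the scope stack).
function evaluate(expr, defaultScope) {
  const resolved = expr.replace(
    /\b(session|application|document|dialog)\.(\w+)|\b([A-Za-z_]\w*)\b/g,
    (m, scope, qname, name) =>
      scope ? ReadVariable(qname, scope) : ReadVariable(name, defaultScope));
  return Function('"use strict"; return (' + resolved + ');')();
}
```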
6.12.3.3 Updating variables: the <assign> and <data> elements
6.12.3.3.1 The <assign> element

The <assign> element assigns a value to a variable.

It is illegal to make an assignment to a variable that has not been explicitly declared using a <var> element or a var statement within a <script>. Attempting to assign to an undeclared variable causes an error.semantic event to be thrown.

Note that when an ECMAScript object, say "obj", has been properly initialized then its properties, for instance "obj.prop1", can be assigned without explicit declaration (in fact, an attempt to declare ECMAScript object properties such as "obj.prop1" would result in an error.semantic event being thrown).
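As an informative illustration, the key difference from <var> is that the target variable must already exist; a sketch of this check follows. The single-Map scope is an assumption standing in for the full scope stack.

```javascript
// Informative sketch: <assign> requires a prior declaration; updating an
// undeclared variable raises error.semantic.
const scope = new Map();
function CreateVariable(name, value) { scope.set(name, value); }  // <var>
function UpdateVariable(name, value) {                            // <assign>
  if (!scope.has(name)) throw new Error("error.semantic"); // undeclared variable
  scope.set(name, value);
}

CreateVariable("phone", undefined);      // <var name="phone"/>
UpdateVariable("phone", "6305551212");   // <assign name="phone" expr="'6305551212'"/>

let caught = null;
try { UpdateVariable("fax", "000"); }    // never declared: error.semantic
catch (e) { caught = e.message; }
```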

Attributes of <assign>

Table 50: <assign> attributes
name The name of the variable being assigned to. The corresponding variable must have been previously declared otherwise an error.semantic event is thrown. By default, the scope in which the variable is resolved is the closest enclosing scope of the currently active element. To remove ambiguity, the variable name may be prefixed with a scope name.
expr The expression evaluating to the new value of the variable (optional).

Children of <assign>

The children of the <assign> element represent an in-line specification of the new value of the variable.

If "expr" attribute is present, then the element must not have any children. Thus "expr" and children are mutually exclusive for the <assign> element.

<assign> examples

This section is informative.

    <assign name="phone" expr="'6305551212'"/>
    <assign name="y" expr="document.z+1"/>
    <assign name="application.foo" expr="dialog.bar * 2"/>
    <assign name="itinerary">
      <root xmlns="">
        <flight>SW123</flight>
        <origin>JFK</origin>
        <depart>2009-01-01T14:32:00</depart>
        <destination>SFO</destination>
        <arrive>2009-01-01T18:14:00</arrive>
      </root>
    </assign>
    

The above examples have the following result, in order:

  1. Updates the variable with name "phone" to a new String value "6305551212" in the closest enclosing scope as determined by the position of this <assign> element in the document. If a variable named "phone" is not already declared in that scope, an error.semantic event is thrown (attempting to update an undeclared variable with <assign> always throws this event, so the rest of this section does not repeat it for each example).
  2. Updates the variable with name "y" in the closest enclosing scope as determined by the position of this <assign> element in the document to the value corresponding to the result of the expression "document.z+1", evaluated when this <assign> element is executed.
  3. Updates the variable with name "foo" in the application scope to the value corresponding to the result of the expression "dialog.bar * 2", evaluated when this <assign> element is executed.
  4. Updates the variable with name "itinerary" in the closest enclosing scope as determined by the position of this <assign> element in the document to the value specified by the following in-line XML tree (the internal representation may be the corresponding DOM node, for example):
        <root xmlns="">
          <flight>SW123</flight>
          <origin>JFK</origin>
          <depart>2009-01-01T14:32:00</depart>
          <destination>SFO</destination>
          <arrive>2009-01-01T18:14:00</arrive>
        </root>
            
    

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The above examples result in the following Data Model Resource API calls, in order:

  1. At the time of <assign> execution, first the new value is obtained for the variable with name "phone" by evaluating the expression "'6305551212'", and subsequently the variable is updated in the scope on top of the stack:
    1. Obtain new variable value by calling EvaluateExpression("'6305551212'")
    2. Update variable value by calling UpdateVariable("phone", value) where value is the result obtained in a. above. The optional scope parameter is not specified since the scope on the top of the stack is chosen by default.
  2. At the time of <assign> execution, first the new value is obtained for the variable with name "y" by evaluating the expression "document.z+1", and subsequently the variable is updated in the scope on top of the stack:
    1. Obtain new variable value by calling EvaluateExpression("document.z+1")
    2. Update variable value by calling UpdateVariable("y", value) where value is the result obtained in a. above.
  3. At the time of <assign> execution, first the new value is obtained for the variable with name "foo" by evaluating the expression "dialog.bar * 2", and subsequently the variable is updated in application scope:
    1. Obtain new variable value by calling EvaluateExpression("dialog.bar * 2")
    2. Update variable value by calling UpdateVariable("foo", value, "application") where value is the result obtained in a. above.
  4. At the time of <assign> execution, the variable with name "itinerary" is updated in the scope on top of the stack using the in-line specification for the new value specified in the body of the <assign> element:
    1. Process the in-line specification below into an internal representation for the data model. For example, the assumed XML data model in this example may choose to internally represent this in-line specification as a DOM node.
          <root xmlns="">
            <flight>SW123</flight>
            <origin>JFK</origin>
            <depart>2009-01-01T14:32:00</depart>
            <destination>SFO</destination>
            <arrive>2009-01-01T18:14:00</arrive>
          </root>
                  
      
    2. Update variable value by calling UpdateVariable("itinerary", node) where node is the DOM node from a. above.
6.12.3.3.2 The <data> element

The <data> element allows a VoiceXML application to fetch an in-line specification of a new value for a named variable from a document server without transitioning to a new VoiceXML document. The data fetched is bound to the named variable.

Attributes of <data>

Table 51: <data> Attributes
src The URI specifying the location of the in-line data specification to retrieve (optional). This specification depends on the data language in use for the VoiceXML document (XML, JSON).
name The name of the variable that the data fetched will be bound to.
scope The scope within which the named variable to bind the data is found (optional). Must be one of session , application , document or dialog . If the specified scope does not exist, then an error.semantic event is thrown.
srcexpr Like src, except that the URI is dynamically determined by evaluating the given expression when the data needs to be fetched (optional). If srcexpr cannot be evaluated, an error.semantic event is thrown.
method The request method: get (the default) or post (optional).
namelist The list of variables to submit (optional). By default, no variables are submitted. If a namelist is supplied, it may contain individual variable references which are submitted with the same qualification used in the namelist. Declared VoiceXML variables can be referenced.
enctype The media encoding type of the submitted document (optional). The default is application/x-www-form-urlencoded. Interpreters must also support multipart/form-data [RFC2388] and may support additional encoding types.
fetchaudio See Section 6.1 of [VXML2] (optional). This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2].
fetchhint See Section 6.1 of [VXML2] (optional). This defaults to the datafetchhint property described in Section 2.3.3.2.3.
fetchtimeout See Section 6.1 of [VXML2] (optional). This defaults to the fetchtimeout property described in Section 6.3.5 of [VXML2].
maxage See Section 6.1 of [VXML2] (optional). This defaults to the datamaxage property described in Section 2.3.3.2.3.
maxstale See Section 6.1 of [VXML2] (optional). This defaults to the datamaxstale property described in Section 2.3.3.2.3.

Exactly one of "src" or "srcexpr" must be specified; otherwise, an error.badfetch event is thrown. If the content cannot be retrieved, the interpreter throws an error as specified for fetch failures in Section 5.2.6 of [VXML2].

If the value of the src or srcexpr attribute includes a fragment identifier, the processing of that fragment identifier is platform-specific.

Platforms should support parsing XML data into a DOM. If an implementation does not support DOM, the name attribute must not be set, and any retrieved content must be ignored by the interpreter. If the name attribute is present, these implementations will throw error.unsupported.data.name.

If the name attribute is present, and the returned document is XML as identified by [RFC3023], the VoiceXML interpreter must expose the retrieved content via a read-only subset of the DOM as specified in Appendix D of [VXML2.1]. An interpreter may support additional data formats by recognizing additional media types. If an interpreter receives a document in a data format that it does not understand, or the data is not well-formed as defined by the specification of that format, the interpreter throws error.badfetch. If the media type of the retrieved content is one of those defined in [RFC3023] but the content is not well-formed XML, the interpreter throws error.badfetch.

If use of the DOM causes an uncaught DOMException to be thrown, the VoiceXML interpreter throws error.semantic.

Before exposing the data in an XML document referenced by the <data> element via the DOM, the interpreter should check that the referring document is allowed to access the data. If access is denied the interpreter must throw error.noauthorization.

Note: One strategy commonly implemented in voice browsers to control access to data is the "access-control" processing instruction described in the WG Note: Authorizing Read Access to XML Content Using the <?access-control?> Processing Instruction 1.0 [DATA_AUTH].

Like the <var> element, the <data> element can occur in executable content or as a child of <form> or <vxml>. In addition, it shares the same default scoping rules as the <var> element. If a <data> element has the same name as a variable already declared in the same scope, the variable is assigned a reference to the new value exposed by the <data> element.

Like the <submit> element, when variable data is submitted to the server its value is first converted into a string before being submitted. If the variable is a DOM Object, it is serialized as the corresponding XML. If the variable is an ECMAScript Object, the mechanism by which it is submitted is not currently defined. If a <data> element's namelist contains a variable which references recorded audio but the element does not specify an enctype of multipart/form-data [RFC2388], the behavior is not specified. Attempting to URL-encode large quantities of data is discouraged.

<data> example

The example discussed in this section uses XML as the data language and fetches the following XML document using the <data> element:

    <?xml version="1.0" encoding="UTF-8"?>
    <quote xmlns="http://www.example.org">
      <ticker>F</ticker>
      <name>Ford Motor Company</name>
      <change>0.10</change>
      <last>3.00</last>
    </quote>
    

The above stock quote is retrieved in one dialog; the document element is cached in a variable at document scope and used to play back the quote in another dialog. The data access and manipulation language in the example is XPath 2.0 [XPATH20].

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml xmlns="http://www.w3.org/2001/vxml" 
      version="2.1"
      xmlns:ex="http://www.example.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2001/vxml 
      http://www.w3.org/TR/2007/REC-voicexml21-20070619/vxml.xsd">
      <var name="quote"/>
      <var name="tickers">
        <tickers xmlns="">
          <ford>f</ford>
          <!-- etc., the dialog below hardcodes ford -->
        </tickers>
      </var>
      <form id="get_quote">
        <block>
          <data name="quote" scope="document"
            srcexpr="'http://www.example.org/getquote?ticker=' + document('tickers')/ford"/>
          <goto next="#play_quote"/>         
        </block>
      </form>
      <form id="play_quote">
        <block>
          <var name="name" expr="document('quote')/ex:name"/>
          <var name="change" expr="document('quote')/ex:change"/>
          <var name="last" expr="document('quote')/ex:last"/>
          <var name="dollars" expr="fn:floor(last)"/>
          <var name="cents" expr="fn:substring(last,fn:string-length(last)-1)"/>
          <!--play the company name -->
          <audio expr="document('tickers')/ford + '.wav'"><value expr="name"/></audio>
          <!-- play 'unchanged, 'up', or 'down' based on zero, positive, or negative change -->
          <if cond="change = 0">
            <audio src="unchanged_at.wav"/>
          <else/>
            <if cond="change &gt; 0">
              <audio src="up.wav"/>
            <else/> <!-- negative -->
              <audio src="down.wav"/>
            </if>
            <audio src="by.wav"/>
            <!-- play change in value as positive number -->
            <audio expr="fn:abs(change) + '.wav'"><value expr="fn:abs(change)"/></audio>
            <audio src="to.wav"/>
          </if>
          <!-- play the current price per share -->
          <audio expr="dollars + '.wav'"><value expr="dollars"/></audio>
          <if cond="cents &gt; 0">
            <audio src="point.wav"/>
            <audio expr="cents + '.wav'"><value expr="cents"/></audio>
          </if>
        </block>
      </form>
    </vxml>
    

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The single <data> usage in the above example results in the following behavior and Data Model Resource API calls:

At the time of <data> execution, the variable with name "quote" is updated in the document scope using the in-line specification for the new value retrieved from the URI expression 'http://www.example.org/getquote?ticker=' + document('tickers')/ford, which evaluates to http://www.example.org/getquote?ticker=f:

  1. Obtain the URI to request the in-line specification of the new value from, by evaluating the "srcexpr" attribute value EvaluateExpression("'http://www.example.org/getquote?ticker=' + document('tickers')/ford")
  2. Request the in-line specification from the URI resulting from a. which happens to be http://www.example.org/getquote?ticker=f in this example.
  3. Process the response received (the in-line data specification below) into an internal representation for the data model. The XML data model internally represents this in-line specification as a DOM node (the document element).
        <quote xmlns="http://www.example.org">
          <ticker>F</ticker>
          <name>Ford Motor Company</name>
          <change>0.10</change>
          <last>3.00</last>
        </quote>
            
    
  4. Update the "quote" variable value in document scope by calling UpdateVariable("quote", node, "document") where node is the DOM node in c. above.
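As an informative illustration, the four steps above can be sketched as follows. fetchDocument stands in for the platform's HTTP fetch and parseToNode for real XML-to-DOM parsing; only the call sequence mirrors the walkthrough.

```javascript
// Informative sketch of steps a-d above (stand-in fetch and parse).
function EvaluateExpression(tickers) {
  // srcexpr: 'http://www.example.org/getquote?ticker=' + document('tickers')/ford
  return "http://www.example.org/getquote?ticker=" + tickers.ford;
}
function fetchDocument(uri) {
  // Canned response standing in for the stock-quote document in the example.
  return "<quote><ticker>F</ticker><name>Ford Motor Company</name>" +
         "<change>0.10</change><last>3.00</last></quote>";
}
function parseToNode(xml) {
  // Minimal stand-in, not a real XML parser: look up element text by name.
  return {
    text(name) {
      const m = xml.match(new RegExp("<" + name + ">([^<]*)</" + name + ">"));
      return m ? m[1] : undefined;
    },
  };
}
const scopes = { document: new Map() };
function UpdateVariable(name, value, scope) { scopes[scope].set(name, value); }

const uri = EvaluateExpression({ ford: "f" });  // a. evaluate "srcexpr"
const body = fetchDocument(uri);                // b. request the URI
const node = parseToNode(body);                 // c. build the internal representation
UpdateVariable("quote", node, "document");      // d. bind in document scope
```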

<data> Fetching Properties

These properties pertain to documents fetched by the <data> element.

Table 52: <data> Fetching Properties
datafetchhint Tells the platform whether or not data documents may be pre-fetched. The value is either prefetch (the default), or safe.
datamaxage Tells the platform the maximum acceptable age, in seconds, of cached documents. The default is platform-specific.
datamaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached data documents. The default is platform-specific.
6.12.3.4 Deleting variables: the <clear> element

The <clear> element resets one or more variables, including form items.

For each specified variable name, the variable is resolved relative to the current scope by default (to remove ambiguity, each variable name in the namelist may be prefixed with a scope name). Once a declared variable has been identified, its value is assigned the default initial value defined by the data access expression language in use (for example, when using ECMAScript, the variables are reset to the undefined value). In addition, if the variable name corresponds to a form item, then the form item's prompt counter and event counter are reset.

Attributes of <clear>

Table 53: <clear> attributes
namelist The list of variables to be reset; this can include variable names other than form items. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown. When not specified, all form items in the current form are cleared.
scope The scope within which the named variables must be resolved (optional). Must be one of session , application , document or dialog . If the specified scope does not exist, then an error.semantic event is thrown.

<clear> examples

This section is informative.

  <clear namelist="city state zip"/>
  <clear namelist="application.foo dialog.bar baz"/>
  <clear namelist="alpha beta application.gamma" scope="document"/>
  <clear/>
  <clear scope="dialog"/>
    

The above examples have the following result, in order:

  1. The variables "city", "state" and "zip" in the closest enclosing scope are reset. If any of these are form items, the associated prompt and event counters are also reset (this applies whenever a reset variable is a form item, so the rest of this section does not repeat it for each example).
  2. The variable "foo" is reset in application scope, the variable "bar" is reset in the dialog scope and the variable "baz" is reset in the closest enclosing scope.
  3. The scope-unqualified "alpha" and "beta" variables are reset in the document scope specified by the "scope" attribute, and the scope-qualified "gamma" variable is reset in the application scope.
  4. All variables in the closest enclosing scope are reset.
  5. All variables in the dialog scope are reset.

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The above examples result in the following Data Model Resource API calls, in order:

  1. The namelist "city state zip" is tokenized into variable names and each variable is reset in the scope on top of the stack (the optional scope parameter is not specified in the calls since the scope on the top of the stack is chosen by default):
    1. Tokenize namelist "city state zip" into tokens "city", "state" and "zip"
    2. Reset variable "city" by calling DeleteVariable("city")
    3. Reset variable "state" by calling DeleteVariable("state")
    4. Reset variable "zip" by calling DeleteVariable("zip")
  2. The namelist "application.foo dialog.bar baz" is tokenized into variable names and each scope-qualified variable is reset in the mentioned scope and scope-unqualified variables are reset in the scope on top of the stack:
    1. Tokenize namelist "application.foo dialog.bar baz" into tokens "application.foo", "dialog.bar" and "baz"
    2. Reset variable "foo" in application scope by calling DeleteVariable("foo", "application")
    3. Reset variable "bar" in dialog scope by calling DeleteVariable("bar", "dialog")
    4. Reset variable "baz" in current scope by calling DeleteVariable("baz")
  3. The namelist "alpha beta application.gamma" is tokenized into variable names and each scope-qualified variable is reset in the mentioned scope and scope-unqualified variables are reset in the document scope:
    1. Tokenize namelist "alpha beta application.gamma" into tokens "alpha", "beta" and "application.gamma"
    2. Reset variable "alpha" by calling DeleteVariable("alpha", "document")
    3. Reset variable "beta" by calling DeleteVariable("beta", "document")
    4. Reset variable "gamma" by calling DeleteVariable("gamma", "application")
  4. In the absence of a namelist, each variable in the scope on top of the stack is reset. For each variable var in the closest enclosing scope:
    1. Reset variable DeleteVariable(var)
    2. If var is a form item, reset prompt and event counters
  5. In the absence of a namelist, each variable in the dialog scope is reset. For each variable var in dialog scope:
    1. Reset variable DeleteVariable(var, "dialog")
    2. If var is a form item, reset prompt and event counters
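As an informative illustration, the namelist handling above can be sketched as follows. The store layout and the use of "dialog" as a stand-in for the top of the scope stack are assumptions for illustration.

```javascript
// Informative sketch of <clear>: tokenize the namelist, split an optional
// scope prefix from each token, and reset the variable to the language's
// initial value (undefined for ECMAScript).
const SCOPES = ["session", "application", "document", "dialog"];
const store = {
  application: { foo: 1 },
  dialog: { bar: 2, baz: 3 }, // "dialog" stands in for the top of the stack
};
function DeleteVariable(name, scope = "dialog") {
  if (!(scope in store) || !(name in store[scope])) throw new Error("error.semantic");
  store[scope][name] = undefined; // reset, as <clear> semantics require
}
function clear(namelist) {
  for (const token of namelist.trim().split(/\s+/)) {
    const dot = token.indexOf(".");
    const scope = dot >= 0 ? token.slice(0, dot) : undefined;
    const name = dot >= 0 ? token.slice(dot + 1) : token;
    if (scope !== undefined && !SCOPES.includes(scope)) throw new Error("error.semantic");
    DeleteVariable(name, scope);
  }
}

clear("application.foo dialog.bar baz"); // example 2 above
```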

Issue (ResetVar):

ResetVariable()?

While this section uses the DeleteVariable() method, a ResetVariable() method that better aligns with <clear> semantics should be considered for addition to the 5.1.1 Data Model Resource API . This would allow, for instance, resetting a variable to its initial in-line specification when one is present.

Resolution:

None recorded.

6.12.3.5 Relevance for properties

Platform properties are discussed in detail in 8.2 Properties . VoiceXML 3.0 provides a consistent mechanism to unambiguously read these properties in any scope using the data access and manipulation language in a manner similar to accessing and manipulating named variables as illustrated in section 2.3.2. However, properties cannot be created, updated or deleted using any of the syntax described in this module. The <property> element syntax must be used for such operations.

6.12.4 Backward compatibility with VoiceXML 2.1

VoiceXML 3.0 adds some new features to data access and manipulation but does not change any existing behavior. Thus, this module is backward compatible with VoiceXML 2.1.

Likewise, the VoiceXML 2.1 profile in VoiceXML 3.0 is not required to support any of the new features added in this module. In particular, the following features may be excluded by implementors supporting the VoiceXML 2.1 profile.

Table 54: VoiceXML 2.1 profile exclusions
properties$ implicit variable The VoiceXML 2.1 profile does not require the properties$ implicit variable to be supported in any scopes.
"scope" attribute The optional "scope" attribute is not required to be supported in the VoiceXML 2.1 profile for the <var>, <value>, <assign>, <data> and <clear> elements.
<var> and <assign> children The VoiceXML 2.1 profile does not require in-line specifications for the initial value using the children of the <var> element or in-line specifications for the new value using the children of the <assign> element to be supported.

6.12.5 Implicit functions using XPath

Implicit variables described in section 2.2.3 to qualify scope fit naturally into the syntax of certain data access and manipulation languages (such as ECMAScript) but are less elegant when incorporated into the syntax of others, such as XPath. VoiceXML 3.0 permits the use of functions rather than variables to address this. The following table illustrates how scope qualifiers are exposed as XPath functions.

Table 55: Implicit XPath functions and variables
session() This single String argument function retrieves the value of the variable named in the argument from the session scope.
application() This single String argument function retrieves the value of the variable named in the argument from the application scope.
document() This single String argument function retrieves the value of the variable named in the argument from the document scope.
dialog() This single String argument function retrieves the value of the variable named in the argument from the dialog scope.
anonymous() This single String argument function retrieves the value of the variable named in the argument from the anonymous scope.
properties$ This read-only implicit variable refers to the defined properties which affect platform behavior in a given scope. The value is an XML tree with a <properties> root element and multiple children as necessary where each child element has the name of an existing platform property in that scope and body content corresponding to the value of the platform property. CDATA sections are used if necessary.

The following table shows how these qualifier functions are used, and the examples are XPath variants of the examples illustrated in Table 47.

Table 56: Resolution examples (XPath)
Usage Result
application('hello') The value of the "hello" named variable in the application scope.
dialog('retries') The value of the "retries" named variable in the dialog scope.
dialog('properties$')/bargein The value of the "bargein" platform property defined at the current "dialog" scope.
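As an informative illustration, each qualifier function in Table 55 can be seen as a one-argument wrapper over ReadVariable. The registration object below is an assumption; a real XPath engine would bind host functions through its own extension API.

```javascript
// Informative sketch: expose each scope qualifier as a one-argument
// function delegating to ReadVariable.
const store = {
  application: { hello: "world" },
  dialog: { retries: 2 },
};
function ReadVariable(name, scope) {
  if (!(scope in store) || !(name in store[scope])) throw new Error("error.semantic");
  return store[scope][name];
}
const xpathFunctions = {};
for (const scope of ["session", "application", "document", "dialog", "anonymous"]) {
  xpathFunctions[scope] = name => ReadVariable(name, scope);
}
```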

Issue (DataProfiles):

Data profiles?

Consider moving contents into a profile that includes an XML Data Model and XPath Data Access and Manipulation. Generically, consider profiles for various data access and manipulation languages, including an ECMA profile.

Resolution:

None recorded.

6.13 External Communication Module

This module supports the sending and receiving of external messages by a voice application by introducing the <send> and <receive> elements into VoiceXML. The application developer chooses to send and receive external messages synchronously or asynchronously. When sending a message, the developer chooses whether or not it should represent a named event. The developer also chooses whether or not to include a payload. These choices can be made statically or dynamically at run-time.

Note that this section only covers receiving messages that the interpreter does not handle itself, in other words, application-level events. Some events, such as lifecycle events for creating or destroying sessions, are not targeted at the application author but are instead handled by the browser itself. The complete list of these interpreter-level events is TBD but might include events such as "create session", "pause", "resume", or "disconnect".

Although this section handles many of the easy and moderately difficult cases, for certain very complicated cases it may be appropriate to put a gatekeeper filter between the VXML interpreter and the external events so that only certain events may interrupt the processing of the VXML document. For example, if an "operator" event should interrupt the VXML document only when its data variable holds a certain value, or if the "operator" event should be allowed but not the "caller" event, then a filter might be appropriate. SCXML is one suitable method for providing these types of more advanced filters.

6.13.1 Receiving external messages within a voice application

Because external messages can arrive at any time, they can be disruptive to a voice application. A voice application developer decides whether these messages are delivered to the application synchronously or asynchronously using the "externalevents.enable" property. The property can be set to one of the following values:

Table 57: externalevents.enable values
true External messages are delivered asynchronously as VoiceXML events.
false External messages are delivered synchronously. This is the default.

When external messages are delivered synchronously, an application developer decides whether these messages are preserved or discarded by setting the "externalevents.queue" property. The property can be set to one of the following values:

Table 58: externalevents.queue values
true External messages are queued.
false An external message that is not delivered as a VoiceXML event is discarded. This is the default.
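As an informative illustration, the delivery policy in Tables 57 and 58 can be sketched as follows. The MessageRouter class and property bag are assumptions for illustration.

```javascript
// Informative sketch: with externalevents.enable true, messages are thrown
// asynchronously as VoiceXML events; otherwise they are queued or discarded
// according to externalevents.queue.
class MessageRouter {
  constructor(props) {
    this.props = props;   // e.g. { "externalevents.enable": false, ... }
    this.queue = [];      // messages preserved for synchronous delivery
    this.delivered = [];  // messages thrown asynchronously as VoiceXML events
  }
  onExternalMessage(msg) {
    if (this.props["externalevents.enable"]) {
      this.delivered.push(msg);             // asynchronous delivery
    } else if (this.props["externalevents.queue"]) {
      this.queue.push(msg);                 // queued
    }
    // otherwise: discarded (the default behavior)
  }
}

const router = new MessageRouter({
  "externalevents.enable": false,
  "externalevents.queue": true,
});
router.onExternalMessage({ event: "ready" });
```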
6.13.1.1 External Message Reflection

If "externalevents.enable" is set to true and an external message arrives, the external message is reflected to the application in the application.lastmessage$ variable. application.lastmessage$ is an ECMAScript object with the following properties:

Table 59: application.lastmessage$ properties
contenttype The media type of the external message.
event The event name, if any, or ECMAScript undefined if no event name was included in the external message.
content The content of the message, if any, or ECMAScript undefined. If the Content-Type of the message is one of the media types described in [RFC3023], the VoiceXML interpreter must expose the retrieved content via a read-only subset of the DOM as described in [VXML2.1]. An interpreter may support additional data formats by recognizing additional media types. If an interpreter receives an external message with a payload in a data format that it does not understand, or the payload is not well-formed as defined by the specification of that format, the interpreter throws "error.badfetch".

If no external messages have been received, application.lastmessage$ is ECMAScript undefined. Only the last received message is available. To preserve a message for future reference during the lifetime of the application, the application developer can copy the data to an application-scoped variable.

6.13.1.2 Receiving External Messages Asynchronously

To receive an external message asynchronously, an application defines an "externalmessage" event handler. The event handler must be declared within the appropriate scope since the user-defined <catch> handler is selected using the algorithm described in section 5.2.4 of [VXML2].

If the payload of an external message includes an event name, the name is appended to the name of the event that is thrown to the application separated by a dot (e.g. "externalmessage.ready"). This allows applications to handle external messages using different event handlers.
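For instance, an application might declare one handler for a specific message subtype and a fallback for all other external messages (the "ready" event name is illustrative, not normative). Because catch selection within a scope proceeds in document order, the more specific handler is declared first:

<catch event="externalmessage.ready">
  <log>backend signaled that it is ready</log>
</catch>
<catch event="externalmessage">
  <log>received some other external message</log>
</catch>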

Asynchronous external messages are processed in the same manner that a disconnect event is handled in [VXML2].

Events are dispatched to the application serially. Since the interpreter only reflects the data associated with a single external message at a time, it is the application's responsibility to manage the data associated with each external message once that message has been delivered.

The following example demonstrates asynchronous receipt of an external message. The catch handler copies the reflected external message into an array at application scope.

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <property name="externalevents.enable" value="true"/>
  <var name="myMessages" expr="new Array()"/>
  <catch event="externalmessage">
    <var name="lm" expr="application.lastmessage$"/>
    <if cond="lm.contenttype == 'text/xml' || lm.contenttype == 'application/xml'">
      <log>received XML with root document element
        <value expr="lm.content.documentElement.nodeName"/>
      </log>
    <elseif cond="typeof lm.content == 'string'"/>
      <log>received <value expr="lm.content"/></log>
    <else/>
      <log>received unknown external message type
        <value expr="typeof lm.content"/>
      </log>
    </if>
    <script>
      myMessages.push({'content' : lm.content, 'ctype' : lm.contenttype});
    </script>
  </catch>
  <form>
  <field name="num" type="digits">
    <prompt>pick a number any number</prompt>
    <catch event="noinput nomatch">
      sorry. didn't get that.
      <reprompt/>      
    </catch>
    <filled>
      you said <value expr="num"/>
      <clear/>
    </filled>
  </field>
  </form>
</vxml>
6.13.1.3 Receiving External Messages Synchronously

To receive an external message synchronously set the "externalevents.enable" property to false and the "externalevents.queue" property to true, and use the <receive> element to pull messages off the queue. <receive> blocks until an external message is received or the timeout specified by the maxtime attribute is exceeded.

6.13.1.3.1 <receive>

To support receipt of external messages within a voice application, use the <receive> element. <receive> is allowed wherever executable content is allowed in [VXML21], for example within a <block> element.

<receive> supports the following attributes:

Table 60: <receive> Attributes
Name Description Required Default
fetchaudio See Section 6.1 of [VXML2]. This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2]. No N/A
fetchaudioexpr An ECMAScript expression evaluating to the fetchaudio URI. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
maxtime A W3C time specifier indicating the maximum amount of time the interpreter waits to receive an external message. If the timeout is exceeded, the interpreter throws "error.badfetch". A value of "none" indicates the interpreter blocks indefinitely. No 0s
maxtimeexpr An ECMAScript expression evaluating to the maxtime value. If evaluation of the expression fails, the interpreter throws "error.semantic". No 0s

Only one of fetchaudio and fetchaudioexpr can be specified or "error.badfetch" is thrown.

Only one of maxtime and maxtimeexpr can be specified or "error.badfetch" is thrown.

When present, the attributes fetchaudioexpr and maxtimeexpr are evaluated when the <receive> is executed.

The following example demonstrates synchronously receiving an external message. In this example, the interpreter blocks for up to 15 seconds waiting for an external message to arrive. If no external message is received during that interval, the interpreter throws "error.badfetch". If a message is received, the interpreter proceeds by executing the <log> element.

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <property name="externalevents.queue" value="true"/>
  <form>
    <catch event="error.badfetch">
      <log>timed out waiting for external message</log>
    </catch>
  
    <block>
      Hold on ...
      <receive maxtime="15s" 
        fetchaudio="http://www.example.com/audio/fetching.wav"/>
      <log>got <value expr="application.lastmessage$.content"/></log>
    </block>  
  </form>
</vxml>
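The expression-valued attributes are useful when the wait interval or fetch audio URI is computed at run time. The following variant of the example above (the variable name is illustrative) behaves identically, but takes maxtime and fetchaudio from ECMAScript expressions:

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <property name="externalevents.queue" value="true"/>
  <form>
    <var name="waittime" expr="'15s'"/>
    <catch event="error.badfetch">
      <log>timed out waiting for external message</log>
    </catch>

    <block>
      Hold on ...
      <receive maxtimeexpr="waittime"
        fetchaudioexpr="'http://www.example.com/audio/fetching.wav'"/>
      <log>got <value expr="application.lastmessage$.content"/></log>
    </block>
  </form>
</vxml>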

6.13.2 Sending messages from a voice application

To send a message from a VoiceXML application to a remote endpoint, use the <send> element. <send> is allowed within executable content. Implementations must support the following attributes:

Table 61: <send> Attributes
Name Description Required Default
async A boolean indicating whether the message is sent asynchronously. If false, the interpreter blocks until the final response to the transaction created by sending the external event is received, or until the timeout expires. No true
asyncexpr An ECMAScript expression evaluating to the value of the async attribute. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
body A string representing the data to be sent in the body of the message. No N/A
bodyexpr An ECMAScript expression evaluating to the body of the message to be sent. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
contenttype A string indicating the media type of the body being sent, if any. The set of content types may be limited by the underlying platform. If an unsupported media type is specified, the interpreter throws "error.badfetch.<protocol>.400". The interpreter is not required to inspect the data specified in the body to validate that it conforms to the specified media type. No text/plain
contenttypeexpr An ECMAScript expression evaluating to the media type of the body. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
event The name of the event to send. The value is a string which only includes alphanumeric characters and the "." (dot) character. The first character must be a letter. If the value is invalid, then an "error.badfetch" event is thrown. No N/A
eventexpr An ECMAScript expression evaluating to the name of the event to be sent. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
fetchaudio See Section 6.1 of [VXML2]. This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2]. No N/A
fetchaudioexpr An ECMAScript expression evaluating to the fetchaudio URI. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
namelist A list of zero or more whitespace-separated variable names to send. By default, no variables are submitted. Values for these variables are evaluated when the <send> element is executed. Only declared variables can be referenced; otherwise, "error.semantic" is thrown. Variables must be submitted to the server with the same qualification used in the namelist. When an ECMAScript variable is submitted to the server, its value must be converted first into a string before being sent. If the variable is an ECMAScript object, the mechanism by which it is submitted is platform-specific. Instead of submitting an ECMAScript object directly, the application developer can explicitly submit the individual properties of the object (e.g. "date.month date.year"). No N/A
target Specifies the URI to which the event is sent. If the attribute is not specified, the event is sent to the component which invoked the VoiceXML session. No Invoking component
targetexpr An ECMAScript expression evaluating to the target URI. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
timeout See 6.13.2.1 sendtimeout . This defaults to the sendtimeout property. No N/A
timeoutexpr An ECMAScript expression evaluating to the timeout interval for a synchronous <send>. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A

Only one of async and asyncexpr can be specified or "error.badfetch" is thrown.

Only one of event or eventexpr can be specified or "error.badfetch" is thrown.

Exactly one of body, bodyexpr, namelist, event, or eventexpr must be specified or "error.badfetch" is thrown.

Only one of contenttype and contenttypeexpr can be specified or "error.badfetch" is thrown.

Only one of fetchaudio and fetchaudioexpr can be specified or "error.badfetch" is thrown.

Only one of target and targetexpr can be specified or "error.badfetch" is thrown.

Only one of timeout and timeoutexpr can be specified or "error.badfetch" is thrown.

When present, the attributes asyncexpr, bodyexpr, contenttypeexpr, eventexpr, fetchaudioexpr, targetexpr, and timeoutexpr are evaluated when the <send> is executed.

If a synchronous <send> succeeds, execution proceeds according to the Form Interpretation Algorithm. If the <send> times out, the interpreter throws "error.badfetch" to the application. If the interpreter encounters an error upon sending the external message, the interpreter throws "error.badfetch.<protocol>.<status_code>" to the application. If no status code is available, the interpreter throws "error.badfetch.<protocol>".

The following example demonstrates the use of <send> synchronously:

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <form>
  <field name="user_id" type="digits">
    <prompt>please type your five digit i d</prompt>
    <filled>
      <send async="false" 
            bodyexpr="'&lt;userinfo&gt;&lt;id&gt;' + user_id + '&lt;/id&gt;&lt;/userinfo&gt;'" 
            contenttype="text/xml"/>
      <goto next="mainmenu.vxml"/>
    </filled>
  </field>
  </form>
</vxml>

Upon executing an asynchronous <send>, the interpreter continues execution of the voice application immediately and disregards the disposition of the message that was sent.

The following example demonstrates the use of <send> asynchronously:

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <form>
    <var name="tasktarget" expr="'http://www.example.com/taskman.pl'"/>
    <var name="taskname" expr="'cc'"/>
    <var name="taskstate"/>
    <block>
      <assign name="taskstate" expr="'start'"/>     
      <send async="true" 
            targetexpr="tasktarget" 
            namelist="taskname taskstate"/>
    </block>
    <field name="ccnum"/>
    <field name="expdate"/>
    <block>
      <assign name="taskstate" expr="'end'"/>     
      <send async="true" 
            targetexpr="tasktarget" 
            namelist="taskname taskstate"/>
    </block>
  </form>
</vxml>
6.13.2.1 sendtimeout

The sendtimeout property controls the interval to wait for a synchronous <send> to return before an "error.badfetch" event is thrown. The value is a Time Designation as specified in Section 6.5 of [VXML2]. If not specified, the value is derived from the innermost sendtimeout property.
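As a sketch (the target URI and event name are illustrative), a document-level sendtimeout property lets every synchronous <send> in the document inherit the same interval:

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <property name="sendtimeout" value="10s"/>
  <form>
    <block>
      <!-- inherits the 10 second sendtimeout; "error.badfetch" is thrown if it is exceeded -->
      <send async="false"
            target="http://www.example.com/notify"
            event="order.placed"/>
    </block>
  </form>
</vxml>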

6.14 Session Root Module

The session root module allows a VXML document to persist across a VXML session (i.e., across transitions from one application to another), similar to the way an application root document allows a VXML document to persist across VXML document transitions.

6.14.1 Syntax

The syntax of the session root module defines the addition of two attributes on the root <vxml> element. These two new attributes are summarized below.

Table 62: Two new <vxml> Attributes for Session Root Module
Name Type Description Required Default Value
session URI URI location of document to be loaded as session root No N/A
requiresession Boolean Whether encountering a conflicting session root should cause the document to fail to load No false

6.14.2 Semantics

The session attribute is an optional attribute on the <vxml> tag. It is a URI reference, resolved in the same way as the application attribute. If a VXML session has not yet encountered a document with a session root, then upon encountering the first VXML document that has one, the session root document is loaded and parsed just as a normal VXML document loads and parses an application root document.

If a VXML session has already loaded a different session root, the behavior when a subsequent session attribute is encountered is controlled by the requiresession attribute. If the requiresession attribute is true, then encountering a session attribute with a different URI than the already loaded session root is an error, and an error.badfetch is generated. If the requiresession attribute is false, the new session attribute is ignored and the old one is used. The requiresession attribute defaults to false if not present.

The behavior of the session root is the same as the behavior of the application root, except that while executing in the session root the VXML browser is allowed to write to the ECMAScript session scope, and variables declared as children of the <vxml> tag thus become session scope variables. In particular, in VXML 2.0 section 5.1.2, when discussing the variable scopes, the text for application in table 40 is also appropriate for session (new text: "These are declared with <var> and <script> elements that are children of the session root document's <vxml> element. They are initialized when the session root document is loaded. They exist while the session document is loaded, and are visible to the session root document, the application root document, and any other loaded application leaf document.").

This session document is then loaded and active in the hierarchy of documents that follows the ECMAScript scope chaining (that is, a leaf document is below an application root, which is below a session root). This means that if a variable is declared in the session root and then again in some local form in the leaf document, the session root variable is shadowed (just as with shadowing by the application root).
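For illustration (the document URI "C" and variable names are hypothetical), a session root might declare a variable that leaf documents read through the scope chain, and that a local declaration shadows:

<!-- session root document "C" -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <var name="callerTier" expr="'gold'"/>
</vxml>

<!-- leaf document referencing "C": callerTier resolves through the scope chain -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" session="C">
  <form>
    <block>
      <var name="callerTier" expr="'silver'"/> <!-- shadows the session-scoped variable -->
      <log>tier is <value expr="callerTier"/></log>
    </block>
  </form>
</vxml>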

This also implies that the catch selection algorithm as described in VXML 2.0 section 5.2.4 would have to change to include the session root document as a potential source of catch handlers (new text: "Form an ordered list of catches consisting of all catches in the current scope and all enclosing scopes (form item, form, document, application root document, session root document, interpreter context), ordered first by scope (starting with the current scope), and then within each scope by document order."). All catch handling would otherwise remain the same; in particular, the as-if-by-copy semantics are retained, so if an event from a leaf document were handled by a catch handler from the session root, the catch handler would not execute within the context of the session root document but would instead execute as if copied into the local leaf document context.

This also implies that property lookup from section 6.3 of VXML 2.0 would have to change to say that property value lookup can also go to the session root if a more local value for the property isn't found (new text: "Properties may be defined for the whole session, for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item."). This doesn't change the usual way properties work, where a property at a lower level overrides one at a higher level.

This also implies that document-level links in session root documents are active, which would be a change to section 2.5 of VXML 2.0 (new text: "If an application root document has a document-level link, its grammars are active no matter what document of the application is being executed. If a session root document has a document-level link, its grammars are active no matter what document of the session is being executed. If execution is in a modal form item, then link grammars at the form, document, application or session level are not active.").

As with links, the scope of grammars from section 3.1.3 of VXML 2.0 would be changed to specify what happens when a grammar from a session root has document scope (new text: "Form grammars are by default given dialog scope, so that they are active only when the user is in the form. If they are given scope document, they are active whenever the user is in the document. If they are given scope document and the document is the application root document, then they are also active whenever the user is in another loaded document in the same application. If they are given scope document and the document is the session root document, then they are also active throughout the session."). Note that grammars active throughout the session can still be trumped by modal listen states (just as application root grammars can). Section 3.1.4 of VXML 2.0 also changes the grammar activation bulleted list to include the session root (new text: "grammars contained in links in its application root document or session root document, and grammars for menus and forms in its application root document or session root document which are given document scope.").
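As a hypothetical sketch (the event and grammar names are invented), a document-level link in a session root would remain active in every application of the session:

<!-- session root document: this link's grammar is active throughout the session,
     except while execution is in a modal form item -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <link event="help.operator">
    <grammar src="operator.grxml" type="application/srgs+xml"/>
  </link>
</vxml>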

6.14.3 Examples

For the sake of compactness assume throughout this example that the single letters used are actually fully qualified URIs. A VXML document "A" transitions to VXML document "B" which is partially represented below:

<vxml session="C" application="D" … >

Before "B" can finish its initialization, it loads, parses, and initializes the VXML documents at both "C" and "D". While executing in "B", any grammars, properties, links, and variables included from either "C" or "D" influence execution. Document "B" then transitions to document "E", which has no session attribute and is partially represented below:

<vxml application="D" … >

While executing in "E", having come from "B", everything from both "C" and "D" is still active: "D" because we haven't left the application yet, and "C" because we are part of the same session. Document "E" now transitions to document "F", partially represented below:

<vxml application="G" … >

Now, since we have changed applications, the application root document from "D" is unloaded, and grammars, variables, properties, etc. from "D" no longer influence our execution. Document "G" defines our application root and it, along with "C" (still active since we are in the same session), now influences our execution. Document "F" now transitions to "H", partially represented below:

<vxml session="I" … >

Now, since "C" is already defined as our session root document, we cannot load document "I" and treat it as our session root. In the absence of requiresession, "I" is ignored and "H" is executed using "C" as our session document. If instead "H" looked as below:

<vxml session="I" requiresession="true" … >

Now "H" would fail to load and execution would revert to document "F" where the appropriate error.badfetch for "H" would be thrown.

7 Profiles

VoiceXML 3.0, like SMIL, is a specification that contains a variety of functional modules. Not all implementers of VoiceXML will be interested in implementing all of the functionality defined in the document. For example, an implementer may have no interest in speech or DTMF recognition but still be interested in speech output. An example might be an implementer of book reading products for the visually impaired. Also, the syntax defined for each of the VoiceXML 3.0 modules is fairly low-level, and authors familiar with a more declarative language may wish to have higher-level syntax that is easier to program in.

To address these interests while maintaining sufficiently precise behavior definition to enhance portability, we encourage the use of profiles.

A profile is an implementation of VoiceXML 3 that

This specification defines the following profiles:

  • the VoiceXML 2.1 Legacy profile ( 7.1 VoiceXML 2.1 Legacy Profile )
  • the Basic profile ( 7.2 Basic Profile )
  • the Maximal profile ( 7.3 Maximal Profile )

It should be possible for other profiles to be created, perhaps by modifying an existing profile, combining different modules, or even adding new module functionality and syntax.

Implementers may differ in their choice of which profiles they implement. Implementers must support a designated set of modules in order to claim support for VoiceXML 3. That designated set is TBD.

[ISSUES:

7.1 VoiceXML 2.1 Legacy Profile

Editorial note   The name of this profile may change. [Motivation: tutorial, PoC, transitional, and that vxml3 is a superset of VoiceXML 2.1.]

The VoiceXML 2.1 Legacy profile is included to demonstrate how profiles are defined in VoiceXML 3.0. Using existing elements from the [VOICEXML21] specification is helpful because the semantics of these elements are already well defined and well understood. Thus any changes in how they are presented result from the module and profile style of VoiceXML 3.0 and from making the precise detailed semantics more explicit and formal.

The VoiceXML 2.1 Legacy profile also plays a transitional role as VoiceXML 3.0 as a whole is built on top of VoiceXML 2.1. VoiceXML 3.0 is a superset of VoiceXML 2.1 and includes the traditional 2.1 functionality plus some new modules. The VoiceXML 2.1 Legacy profile is the set of modules that were always present in VoiceXML 2.1 but that weren't expressed in the specification as individual modules. This also allows a clear path for the VoiceXML application developer as applications authored in version 2.1 of VoiceXML will continue to work and the application developer will not need to learn substantial new syntax or semantics when they develop in the VoiceXML 2.1 Legacy profile of VoiceXML 3.0.

The VoiceXML 2.1 Legacy profile also represents a proof of concept to ensure that the new modular profile method of describing the specification is in no way limiting. VoiceXML 3.0 in its entirety is not constrained by the use of profiles, modules, and formalized semantic models: anything that was standardized in VoiceXML 2.1 can be standardized in this new format, and the VoiceXML 2.1 Legacy profile demonstrates that.

This profile uses the prompt module ( 6.4 Prompt Module ) extended with the builtin SSML module ( 6.5 Builtin SSML Module ) and the foreach module ( 6.8 Foreach Module ).

7.1.1 Conformance

This section defines semantics of how different modules coexist with each other to simulate the behavior of VoiceXML 2.1. It outlines all the required modules and any additions/deletions from each of these modules to make it conform to this profile. It also talks about the interaction amongst various modules so that behavior similar to that in VoiceXML 2.1 is achieved.

To conform with this profile, processors must implement the following modules:

The schema for the Legacy Profile is given in D.8 Schema for Legacy Profile .
Editorial note  

The following content is missing from the VoiceXML 3.0 specification and needs to be defined:

  • Vxml Root Module
  • Event Handling and throwing (event handlers like <catch>, <noinput>,<nomatch>, etc.)
  • FIA? (although talked about in the Form module, it is not addressed completely)
  • Executable Content
  • <reprompt>
  • <disconnect>
  • <exit>
  • <if> <elseif> <else>
  • <goto>
  • <submit>
  • <log>
  • <return>
  • <script>
  • <throw>
  • All of the VoiceXML 2.0/2.1 properties (Should go in 6.12)
  • <script> Script Module
  • <transfer>
  • <initial>
  • <link>
  • <record>
  • <object>
  • Transitions: <goto>, <submit>, <subdialog>
  • Specify transitions amongst inter-module clearly
  • author controlled transitions within form
  • author controlled transitions outside form
  • automatic transition behavior outside form
  • <filled>

7.1.2 Vxml Root Module Requirements

Eliminate the need to specify Session Root and Platform Root.

7.1.3 Form Module Requirements

Eliminate Capture phase of DOM Level 3 eventing to support Legacy Profile.

7.1.4 Field Module Requirements

Eliminate Capture phase of DOM Level 3 eventing to support Legacy Profile.

7.1.5 Prompt Module Requirements

Eliminate Capture phase of DOM Level 3 eventing to support Legacy Profile.

7.1.6 Grammar Module Requirements

Eliminate Capture phase of DOM Level 3 eventing to support Legacy Profile.

7.1.7 Data Access and Manipulation Module Requirements

Eliminate Capture phase of DOM Level 3 eventing to support Legacy Profile.

7.2 Basic Profile

This profile provides full media capabilities, but omits higher-level flow control constructs such as <form> and <field>. It is intended for single-turn prompt and collect applications. Applications needing fine-grained media control may include the Parseq Module in this profile.

This profile serves two different purposes:

  • To provide core media services in high-volume applications, particularly in telecom networks and NGN, IMS, etc.
  • To support mobile devices with embedded speech technology.

This profile uses the prompt module ( 6.4 Prompt Module ) extended with the media module ( 6.6 Media Module ); the grammar module ( 6.1 Grammar Module ), which includes inline ( 6.2 Inline SRGS Grammar Module ) and external ( 6.3 External Grammar Module ) grammars; the foreach module ( 6.8 Foreach Module ); and, optionally, the parseq module ( 6.7 Parseq Module ).

This profile will define the amount of ECMAScript and data access capability.

[Issue: how much ECMAScript/data capability should be included?]

7.3 Maximal Profile

The maximal server profile represents the closure over the VoiceXML 3.0 platform feature set and is intended for applications providing feature-rich voice user interfaces. This profile provides data access and manipulation capabilities, full media capabilities, higher-level flow control constructs such as <form> and <field>, and full support for environment properties. We believe that control flow capabilities such as those provided by SCXML and CCXML are necessary to take full advantage of the features in the maximal server profile.

Specifically, the maximal server profile provides support for [... enumerate all modules in Section 6 of WD2 draft ...]

[Issue: Should support for SCXML be required when implementing the profile? Should support for CCXML be required when implementing the profile? What are the interoperability ramifications of requiring one, multiple, or no flow control languages for support of this profile?]

7.4 Convenience Syntax (Syntactic Sugar)

Profiles can provide convenience syntax to simplify authoring for that profile without decreasing portability. Convenience Syntax, as we define it here, can be implemented via a straightforward text mapping from the convenience syntax to profile code that uses only the syntax defined by the modules in the profile. Convenience syntax cannot add functionality. It only makes existing functionality easier to code.

A Convenience syntax definition must include

  • one or more new XML attributes and/or elements
  • for each possible use of the new attributes and elements, a non- cyclical mapping from the code containing the element(s) or attribute(s) to code containing other convenience syntax or module syntax, such that the behavior of the original code can be completely described in terms of module syntax.
  • (optional) "initial" code to be executed before each application begins that sets up variables, etc. needed by the mapped code.

The existence and definition of the mapping above means that an author can write VoiceXML applications using the (presumably simpler) convenience syntax, while being assured that the code will execute *as if* the code had been replaced by the (presumably more complex but well-defined) module syntax. This allows authors to code simple cases in the convenience syntax, and make use of other VoiceXML syntax elements and attributes only as needed.

The following two examples show how the VoiceXML 2.1 <menu> and pre-defined catch handlers could be coded as convenience syntax. The third example shows how convenience syntax definitions can be based upon other convenience syntax.

[Examples TBD.]
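Pending those examples, the following non-normative sketch suggests how a simple <menu> might map onto module syntax: a <form> with a single <field> whose grammar is generated from the choice phrases (the grammar URI is a placeholder):

<menu>
  <prompt>Say sports or weather.</prompt>
  <choice next="sports.vxml">sports</choice>
  <choice next="weather.vxml">weather</choice>
</menu>

could map to

<form>
  <field name="menuchoice">
    <prompt>Say sports or weather.</prompt>
    <grammar src="generated-choices.grxml" type="application/srgs+xml"/>
    <filled>
      <if cond="menuchoice == 'sports'">
        <goto next="sports.vxml"/>
      <elseif cond="menuchoice == 'weather'"/>
        <goto next="weather.vxml"/>
      </if>
    </filled>
  </field>
</form>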

8 Environment

8.1 Resource Fetching

8.1.1 Fetching

A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as media files, grammars, scripts, and XML data. Each fetch of the content associated with a URI is governed by the following attributes:

Table 63: Fetch Attributes
fetchtimeout The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation . If not specified, a value derived from the innermost fetchtimeout property is used.
fetchhint Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used.
maxage Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616] ). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used.
maxstale Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616] ). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used.

When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.

The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.

When transitioning from one dialog to another, through either a <subdialog>, <goto>, <submit>, <link>, or <choice> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, an intermediate cache, or an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or first dialog if none is specified) is then initialized and execution of the dialog begins.

Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.
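For example (dialog names are illustrative, and a variable x is assumed to be declared), the first transition below performs no fetch, while the second and third do:

<goto next="#my_dialog"/>                   <!-- same document: no fetch, no reinitialization -->
<submit next="#my_dialog"/>                 <!-- <submit> always results in a fetch -->
<subdialog src="#my_dialog" namelist="x"/>  <!-- fragment plus namelist: also fetched -->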

Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see 4.5.2.2 Application Root and [RFC2396] ). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.

Elements that fetch VoiceXML documents also support the following additional attribute:

Table 64: Additional Fetch Attribute
fetchaudio The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.

The fetchaudio attribute is useful for enhancing the user experience when there may be noticeable delays while the next document is retrieved. It can be used to play background music or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
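As an illustration, a transition element might specify fetchaudio as follows (the URIs and grammar name here are hypothetical):

```xml
<!-- Illustrative: play hold music while the next document is fetched -->
<field name="city">
  <grammar src="city.grxml" type="application/srgs+xml"/>
  <prompt>Which city?</prompt>
  <filled>
    <submit next="lookup.vxml" namelist="city"
            fetchaudio="holdmusic.wav"/>
  </filled>
</field>
```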

8.1.2 Caching

The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage, and maxstale attributes.

The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ( [RFC2616] ). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:

  • If the resource is not present in the cache, fetch it from the server using get.
  • If the resource is in the cache,
    • If a maxage value is provided,
      • If age of the cached resource <= maxage,
        • If the resource has expired,
          • Perform maxstale check.
        • Otherwise, use the cached copy.
      • Otherwise, fetch it from the server using get.
    • Otherwise,
      • If the resource has expired,
        • Perform maxstale check.
      • Otherwise, use the cached copy.

The "maxstale check" is:

  • If maxstale is provided,
    • If cached copy has exceeded its expiration time by no more than maxstale seconds, then use the cached copy.
    • Otherwise, fetch it from the server using get.
  • Otherwise, fetch it from the server using get.

Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.

The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).

While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.

8.1.2.1 Controlling the Caching Policy

VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document).

Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero.

Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.
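The following illustrative fragments (with hypothetical URIs) show both controls: maxage="0" unconditionally requests a fresh copy of a frequently changing document, while maxstale permits a grammar that is up to an hour past its expiration to be reused without a fetch:

```xml
<!-- Always fetch a fresh copy of volatile content -->
<goto next="stockquotes.vxml" maxage="0"/>

<!-- Accept a cached grammar up to 3600 seconds past its expiration -->
<grammar src="cities.grxml" type="application/srgs+xml" maxstale="3600"/>
```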

8.1.3 Prefetching

Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables.
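The distinction can be sketched as follows (illustrative VoiceXML 2.0-style markup; the URIs and the variable name are hypothetical). The static clip may be prefetched at any time, but the expr-computed clip must not be fetched before promptURI receives its value:

```xml
<block>
  <!-- Static URI: the platform may prefetch this clip -->
  <prompt>
    <audio src="welcome.wav" fetchhint="prefetch"/>
  </prompt>
  <!-- Computed URI: the fetch must not be moved before this assignment -->
  <var name="userLang" expr="'en'"/>
  <var name="promptURI" expr="'greetings/' + userLang + '.wav'"/>
  <prompt>
    <audio expr="promptURI" fetchhint="safe"/>
  </prompt>
</block>
```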

The expiration status of a resource must be checked on each use of the resource, and, if its fetchhint attribute is "prefetch", then it is prefetched. The check must follow the caching policy specified in 8.1.2 Caching .

8.1.4 Protocols

The "http" URI scheme must be supported by VoiceXML platforms; the "https" scheme should be supported, and other URI schemes may be supported.

8.2 Properties

Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.

The following types of properties are defined: speech recognition ( 8.2.1 Speech Recognition Properties ), DTMF recognition ( 8.2.2 DTMF Recognition Properties ), prompt and collect ( 8.2.3 Prompt and Collect Properties ), media ( 8.2.4 Media Properties ), fetching ( 8.2.5 Fetch Properties ) and miscellaneous ( 8.2.6 Miscellaneous Properties ) properties.

Editorial note  

Open issue: should the specification provide specific default values rather than platform-specific?

Open issue: Should we add a 'type' column for all properties?

8.2.1 Speech Recognition Properties

The following generic speech recognition properties are defined.

Table 65: Speech Recognition Properties
Name Description Default
confidencelevel The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see 8.4 Value Designations ). 0.5
sensitivity Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see 8.4 Value Designations ). 0.5
speedvsaccuracy A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see 8.4 Value Designations ). 0.5
completetimeout The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar. A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see 8.4 Value Designations ). See 8.3 Speech and DTMF Input Timing Properties . Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below. platform-dependent
incompletetimeout The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event). The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken. A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately. The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See 8.3 Speech and DTMF Input Timing Properties . Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout. The value is a Time Designation (see 8.4 Value Designations ). undefined?
maxspeechtimeout The maximum duration of user speech. If this time elapses before the user stops speaking, a maxspeechtimeout event is thrown. The value is a Time Designation (see 8.4 Value Designations ). platform-dependent
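As an illustrative sketch (the specific values are hypothetical tuning choices, not defaults), these properties might be set at document level to allow brief mid-utterance pauses while keeping responses prompt once a complete match is heard:

```xml
<property name="completetimeout"   value="0.5s"/>
<property name="incompletetimeout" value="1.5s"/>
<property name="maxspeechtimeout"  value="30s"/>
```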

8.2.2 DTMF Recognition Properties

The following generic DTMF recognition properties are defined.

Table 66: DTMF Recognition Properties
Name Description Default
interdigittimeout The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations ). See 8.3 Speech and DTMF Input Timing Properties . platform-dependent
termtimeout The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations ). See 8.3 Speech and DTMF Input Timing Properties . 0s
termchar The terminating DTMF character for DTMF input recognition. See 8.3 Speech and DTMF Input Timing Properties . #
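For instance, a DTMF-collecting field might tailor these properties as follows (the grammar URI and values are illustrative):

```xml
<!-- Collect a PIN terminated by "#", allowing up to 3 seconds
     between digits -->
<field name="pin">
  <grammar src="pin.grxml" type="application/srgs+xml"/>
  <property name="interdigittimeout" value="3s"/>
  <property name="termchar" value="#"/>
  <prompt>Please enter your PIN, followed by the pound key.</prompt>
</field>
```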

8.2.3 Prompt and Collect Properties

The following properties are defined to apply to the fundamental platform prompt and collect cycle.

Table 67: Prompt and Collect Properties
Name Description Default
bargein The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. true
bargeintype Sets the type of bargein to be speech or hotword. See "Bargein type" (link TBD). platform-specific
timeout The time after which a noinput event is thrown by the platform. The value is a Time Designation (see 8.4 Value Designations ). See 8.3 Speech and DTMF Input Timing Properties . platform-dependent

8.2.4 Media Properties

The following properties are defined to apply to output media.

Table 68: Media Properties
Name Description Default
outputmodes

Determines which modes may be used for media output. The value is a space separated list of media types (see media 'type' in TBD).

This property is typically used with container file formats, such as "video/3gpp", which support storage of multiple media types. For example, to play both audio and video to the remote connection, the property would be set to "audio video". To play only the video, the property is set to "video".

If the value contains a media type which is not supported by the platform, the connection or the value of the <media> element type property, then that media type is ignored.

The default value depends on the negotiated media between the local and remote devices. It is the space separated list of media types specified in the session.connection.media array elements' type property where the associated direction property is sendrecv or recvonly .
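As an illustrative sketch (assuming, hypothetically, that the V3 <media> element takes src and type attributes), an author could suppress the video track of a container-format clip by restricting outputmodes:

```xml
<!-- Play only the audio track of a 3gpp clip (URI illustrative) -->
<property name="outputmodes" value="audio"/>
<prompt>
  <media src="promo.3gp" type="video/3gpp"/>
</prompt>
```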

8.2.5 Fetch Properties

The following properties pertain to the fetching of new documents and resources.

Note that maxage and maxstale properties may have no default value - see 8.1.2 Caching .

Table 69: Fetch Properties
Name Description Default
audiofetchhint This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe, meaning that audio is fetched only when it is needed, never before; or prefetch, meaning that the platform may, but need not, pre-fetch the audio. prefetch
audiomaxage Tells the platform the maximum acceptable age, in seconds, of cached audio resources. platform-specific
audiomaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. platform-specific
documentfetchhint Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch. safe
documentmaxage Tells the platform the maximum acceptable age, in seconds, of cached documents. platform-specific
documentmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. platform-specific
grammarfetchhint Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe. prefetch
grammarmaxage Tells the platform the maximum acceptable age, in seconds, of cached grammars. platform-specific
grammarmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. platform-specific
objectfetchhint Tells the platform whether the URI contents for <object> may be pre-fetched or not. The values are prefetch, or safe. prefetch
objectmaxage Tells the platform the maximum acceptable age, in seconds, of cached objects. platform-specific
objectmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. platform-specific
scriptfetchhint Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe. prefetch
scriptmaxage Tells the platform the maximum acceptable age, in seconds, of cached scripts. platform-specific
scriptmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. platform-specific
fetchaudio The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. undefined
fetchaudiodelay The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see 8.4 Value Designations ). The default interval is platform-dependent, e.g. "2s".  The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off. platform-specific
fetchaudiominimum The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see 8.4 Value Designations ). The default is platform-dependent, e.g., "5s".  The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly. platform-specific
fetchtimeout The timeout for fetches. The value is a Time Designation (see 8.4 Value Designations ). platform-specific
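The fetchaudio timing properties work together; as an illustrative sketch (URI and values hypothetical), the following waits two seconds before starting hold music and, once started, plays it for at least five seconds even if the fetch completes sooner:

```xml
<property name="fetchaudio"        value="holdmusic.wav"/>
<property name="fetchaudiodelay"   value="2s"/>
<property name="fetchaudiominimum" value="5s"/>
```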

8.2.6 Miscellaneous Properties

The following miscellaneous properties are defined.

Table 70: Miscellaneous Properties
Name Description Default
inputmodes This property determines which input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active. ???
universals Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items - see "Activation of Grammars" (link TBD)) and which generate specific events. Note that relying on platform-provided grammars is not good practice for production-grade applications (see 6.11 Builtin Grammar Module ). Applications choosing to migrate from universal grammars to more robust developer-specified grammars should replace the universals <property> with one or more <link> (TODO, hyperlink) element(s). Because <link>s can also generate the same events as universal grammars, and because the <catch> handlers for the universal grammars persist outside the universals <property>, the migration should be seamless. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help". none
maxnbest This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. 1
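The universals-to-<link> migration described above might look as follows (an illustrative sketch with a hypothetical grammar URI and prompt text):

```xml
<!-- Instead of <property name="universals" value="help"/>, a
     developer-specified equivalent: -->
<link event="help">
  <grammar src="help.grxml" type="application/srgs+xml"/>
</link>
<catch event="help">
  <prompt>Say a city name, or press star to start over.</prompt>
</catch>
```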

8.3 Speech and DTMF Input Timing Properties

The various timing properties for speech and DTMF recognition work together to define the user experience. The ways in which these timing parameters function are outlined in the timing diagrams below. In these diagrams, the wait for DTMF input or for user speech begins at the time the last prompt has finished playing.

8.3.1 DTMF Grammars

DTMF grammars use timeout, interdigittimeout, termtimeout and termchar as described in 8.2.2 DTMF Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.

8.3.1.1 timeout, No Input Provided

The timeout parameter determines when the noinput event is thrown because the user has failed to enter any DTMF. Once the first DTMF has been entered, this parameter has no further effect.

Timing diagram for timeout when no input provided

Figure 14: Timing diagram for timeout when no input provided.

8.3.1.2 interdigittimeout, Grammar is Not Ready to Terminate

In the following diagram, the interdigittimeout determines when the nomatch event is thrown because a DTMF grammar is not yet recognized, and the user has failed to enter additional DTMF.

Timing diagram for interdigittimeout, grammar is not ready to terminate

Figure 15: Timing diagram for interdigittimeout, grammar is not ready to terminate.

8.3.1.3 interdigittimeout, Grammar is Ready to Terminate

The example below shows the situation when a DTMF grammar could terminate, or extend by the addition of more DTMF input, and the user has elected not to provide any further input.

Timing diagram for interdigittimeout, grammar is ready to terminate

Figure 16: Timing diagram for interdigittimeout, grammar is ready to terminate.

8.3.1.4 termchar and interdigittimeout, Grammar Can Terminate

In the example below, the termchar is non-empty and is entered by the user before an interdigittimeout expires, to signify that the user's DTMF input is complete; the termchar is not included as part of the recognized value.

Timing diagram for termchar and interdigittimeout, grammar can terminate

Figure 17: Timing diagram for termchar and interdigittimeout, grammar can terminate.

8.3.1.5 termchar Empty When Grammar Must Terminate

In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is expected. Since termchar is empty, there is no optional terminating character permitted, thus the recognition ends and the recognized value is returned.

Timing diagram for termchar empty when grammar must terminate

Figure 18: Timing diagram for termchar empty when grammar must terminate.

8.3.1.6 termchar Non-Empty and termtimeout When Grammar Must Terminate

In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. If the termchar is non-empty, then the user can enter an optional termchar DTMF. If the user fails to enter this optional DTMF within termtimeout, the recognition ends and the recognized value is returned. If the termtimeout is 0s (the default), then the recognized value is returned immediately after the last DTMF allowed by the grammar, without waiting for the optional termchar. Note: the termtimeout applies only when no additional input is allowed by the grammar; otherwise, the interdigittimeout applies.

Timing diagram for termchar non-empty and termtimeout when grammar must terminate

Figure 19: Timing diagram for termchar non-empty and termtimeout when grammar must terminate.

8.3.1.7 termchar Non-Empty and termtimeout When Grammar Must Terminate

In this example, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. Since the termchar is non-empty, the user enters the optional termchar within termtimeout causing the recognized value to be returned (excluding the termchar).

Timing diagram for termchar non-empty when grammar must terminate

Figure 20: Timing diagram for termchar non-empty when grammar must terminate.

8.3.1.8 Invalid DTMF Input

While waiting for the first or additional DTMF, three different timeouts may determine when the user's input is considered complete. If no DTMF has been entered, the timeout applies; if some DTMF has been entered but additional DTMF is valid, then the interdigittimeout applies; and if no additional DTMF is legal, then the termtimeout applies. At each point, the user may enter DTMF which is not permitted by the active grammar(s). This causes the collected DTMF string to be invalid. Additional digits will be collected until either the termchar is pressed or the interdigittimeout has elapsed. A nomatch event is then generated.

8.3.2 Speech Grammars

Speech grammars use timeout, completetimeout, and incompletetimeout as described in 8.2.3 Prompt and Collect Properties and 8.2.1 Speech Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.

8.3.2.1 timeout When No Speech Provided

In the example below, the timeout parameter determines when the noinput event is thrown because the user has failed to speak.

Timing diagram for timeout when no speech provided

Figure 21: Timing diagram for timeout when no speech provided.

8.3.2.2 completetimeout With Speech Grammar Recognized

In the example below, the user provides an utterance that is recognized by the speech grammar. After a silence period of completetimeout has elapsed, the recognized value is returned.

Timing diagram for completetimeout with speech grammar recognized

Figure 22: Timing diagram for completetimeout with speech grammar recognized.

8.3.2.3 incompletetimeout with Speech Grammar Unrecognized

In the example below, the user provides an utterance that is not yet recognized by the speech grammar but is the prefix of a legal utterance. After a silence period of incompletetimeout has elapsed, a nomatch event is thrown.

Timing diagram for incompletetimeout with speech grammar unrecognized

Figure 23: Timing diagram for incompletetimeout with speech grammar unrecognized.

8.4 Value Designations

Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2] .

8.4.1 Integers

Integers are specified in decimal notation only. Integers may be preceded by a "-" or "+" to indicate the sign.

An integer consists of one or more digits "0" to "9".

8.4.2 Real Numbers

Real numbers are specified in decimal notation only. Real numbers may be preceded by a "-" or "+" to indicate the sign.

A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits.

8.4.3 Times

Time designations consist of a non-negative real number followed by a time unit identifier. The time unit identifiers are:

  • ms: milliseconds
  • s: seconds

Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s".

9 Integration with Other Markup Languages

This section presents some initial thoughts on how VoiceXML might be embedded within SCXML and how flow control languages such as SCXML and CCXML might be integrated into VoiceXML.

9.1 Embedding of VoiceXML within SCXML

The following bank application example demonstrates how an external VoiceXML application can be invoked by an SCXML script and vice versa. The state machine and flow control are implemented in BankApp.scxml. The call starts in BankApp.scxml, which first invokes the BankApp.vxml form "getAccountNum" to collect the account number and then queries the database for the checking and saving balances. BankApp.scxml then invokes the form "playBalance". If this form finds that accountType is not defined, it invokes AccountType.scxml, which calls the BankApp.vxml form "getAccountType" to obtain the accountType. "playBalance" then plays the balance on the corresponding account and returns control to BankApp.scxml.

BankApp.scxml

<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" xmlns:my="http://scxml.example.org/" version="1.0" initial="getAccountNum" profile="ecmascript" >
 
  <state id="getAccountNum">
      <invoke targettype="vxml3" src="BankApp.vxml#getAccountNum" />
      <transition event="vxml3.gotAccountNum" target="getBalance"/>
  </state>
 
  <state id="getBalance">
      <datamodel>
           <data name="method" expr="'getBalance'"/>
           <data name="accountNum" expr="_data.accountNum"/>
      </datamodel>
      <send targettype="basichttp" target="BankDB.do" namelist="method accountNum" />
      <transition event="basichttp.gotBalance" target="playBalance"/>
  </state>
 
  <state id="playBalance">
      <datamodel>
           <data name="checking_balance" expr="_data.checking.balance" />
           <data name="saving_balance" expr="_data.saving.balance" />
      </datamodel>
      <invoke targettype="vxml3" src="BankApp.vxml#playBalance" namelist="checking_balance saving_balance" />
      <transition event="vxml3.playedBalance" target="exit" />
  </state>
 
  <final id="exit"/>
</scxml>

AccountType.scxml

<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" xmlns:my="http://scxml.example.org/" version="1.0" initial="getAccountType" profile="ecmascript" >
 
  <state id="getAccountType">
      <invoke targettype="vxml3" src="BankApp.vxml#getAccountType" />
  <transition event="vxml3.gotAccountType" target="exit"/>
  </state>
 
  <final id="exit"/>
</scxml>

BankApp.vxml

<?xml version="1.0" encoding="UTF-8"?>
 
<!-- TODO: need to add final namespace, schema, etc. for vxml element. -->
<vxml version="3.0">
 
      <form id="getAccountNum">
            <field name="accountNum">
                  <grammar src="accountNum.grxml" type="application/grammar+xml"/>
                  <prompt>
                        Please tell me your account number.
                  </prompt>
                  <filled>
                        <exit namelist="accountNum"/>
                  </filled>
            </field>
      </form>
 
      <form id="getAccountType">
            <field name="accountType">
                  <grammar src="accountType.grxml" type="application/grammar+xml"/>
                  <prompt>
                        Do you want the balance on checking or saving account?
                  </prompt>
                  <filled>
                        <exit namelist="accountType"/>
                  </filled>
            </field>
      </form>
     
      <form id="playBalance">
            <var name="checking_balance"/>
            <var name="saving_balance"/>
            <block>
            <if cond="accountType == undefined">
 
<!--Here we are trying to invoke the external scxml script. At the time this example is written,
    the syntax to do this has not yet been decided. -->
                  <goto next="AccountType.scxml#getAccountType"/>
            </if>
                 
            <if cond="accountType == 'checking'">
                  <prompt>
                    The checking account balance is <value expr="checking_balance"/>.
                  </prompt>
            <else/>
                  <prompt>
                    The saving account balance is <value expr="saving_balance"/>.
                  </prompt>
            </if>
                 
            </block>
      </form>
</vxml>

9.2 Integrating Flow Control Languages into VoiceXML

State Chart XML (SCXML) could be used as the controller for managing the dialog in VoiceXML 3.0 applications. A recursive MVC technique allows SCXML controllers to be placed at session, document and form levels. Examples of resulting compound documents (containing V3 and SCXML namespaced elements) appear below for illustration. A graceful degradation / fallback approach could be used to ensure backwards compatibility with V2 applications. Note that the examples below use a new v3:scxmlform element.

[ISSUE: It has been suggested that using the existing v3:form element instead of a new v3:scxmlform element would be simpler and more elegant. Although the working group currently knows of no particular reason why the existing v3:form couldn't be used instead of a new v3:scxmlform element, the group has not yet discussed this in detail or agreed that using v3:form in this way is desirable. The group plans to discuss this and is interested in receiving public feedback on this possibility.]

9.2.1 SCXML for Dialog Management

Example application scenario:

  • This is an airline travel itinerary modification application
  • First order of business is to retrieve an itinerary to be modified
  • Itinerary may be identified using either a record locator or the traveler's last name and other information

Below are two flavors of this application using SCXML as the form-level controller: a system-driven and a user-driven approach. These use a similar set of fields in the form but different dialog management styles. In these simple examples the VUI may appear similar, though one flavor is system-driven and the other user-driven.

9.2.1.1 System-driven Dialog
  • Starts off by asking if the record locator is available
  • If the locator is available, it's requested
  • If the locator isn't available, the last name and some other pieces of information are requested to uniquely identify the itinerary
  • Once the itinerary is identified, we proceed with application functions

Consider the following sketch of a V3 form for this purpose:

<v3:scxmlform>
  <scxml:scxml initial="choice">
    <scxml:state id="choice">
      <scxml:invoke type="vxml3field" src="#choicefield"/>
      <scxml:transition event="filled.choice" cond="choicefield"
                    target="locator"/>
      <scxml:transition event="filled.choice" cond="!choicefield"
                    target="lastname"/>
    </scxml:state>
    <scxml:state id="locator">
      <scxml:invoke type="vxml3field" src="#locatorfield"/>
      <!-- Retrieve record, transition to app menu -->
    </scxml:state>
    <scxml:state id="lastname">
      <scxml:invoke type="vxml3field" src="#lastnamefield"/>
      <!-- Collect other information needed to retrieve record,
           then retrieve record and go to app menu -->
    </scxml:state>
    <!-- Remaining dialog control flow logic omitted -->
  </scxml:scxml>
  <v3:field name="choicefield">
    <v3:grammar src="boolean.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Welcome. Do you have the record locator for your itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.choice"/>
    </v3:filled>
  </v3:field>
  <v3:field name="locatorfield">
    <v3:grammar src="locator.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      What is the record locator for the itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.locator"/>
    </v3:filled>
  </v3:field>
  <v3:field name="lastnamefield">
    <v3:grammar src="lastname.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Please say or spell your last name.
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.lastname"/>
    </v3:filled>
  </v3:field>
  <!-- Other form items, such as the subsequent application menu omitted
-->
</v3:scxmlform>
9.2.1.2 User-driven Dialog
  • Starts off by asking what information the user would like to supply to identify the itinerary
  • If the user indicates the record locator will be provided, it's retrieved
  • If the user indicates the last name will be provided, it's retrieved (some other pieces of information may be retrieved to uniquely identify the itinerary)
  • Once the itinerary is identified, we proceed with application functions

Consider the following sketch of a V3 form for this purpose:

<v3:scxmlform>
  <scxml:scxml initial="choice">
    <scxml:state id="choice">
      <scxml:invoke type="vxml3field" src="#choicefield"/>
      <scxml:transition event="filled.choice" cond="choicefield == 'locator'"
                    target="locator"/>
      <scxml:transition event="filled.choice" cond="choicefield == 'lastname'"
                    target="lastname"/>
    </scxml:state>
    <scxml:state id="locator">
      <scxml:invoke type="vxml3field" src="#locatorfield"/>
      <!-- Retrieve record, transition to app menu -->
    </scxml:state>
    <scxml:state id="lastname">
      <scxml:invoke type="vxml3field" src="#lastnamefield"/>
      <!-- Collect other information needed to retrieve record,
           then retrieve record and go to app menu -->
    </scxml:state>
    <!-- Remaining dialog control flow logic omitted -->
  </scxml:scxml>
  <v3:field name="choicefield">
    <v3:grammar src="choice.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Welcome. How would you like to look up your itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.choice"/>
    </v3:filled>
  </v3:field>
  <v3:field name="locatorfield">
    <v3:grammar src="locator.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      What is the record locator for the itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.locator"/>
    </v3:filled>
  </v3:field>
  <v3:field name="lastnamefield">
    <v3:grammar src="lastname.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Please say or spell your last name.
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.lastname"/>
    </v3:filled>
  </v3:field>
  <!-- Other form items, such as the subsequent application menu omitted
-->
</v3:scxmlform>

9.2.2 Graceful Degradation

One possibility with this approach is that, in the absence of an <scxml:scxml> child element, a <v3:scxmlform> could behave identically to a <v2:form> element, with the V2 Form Interpretation Algorithm in charge. In the presence of an <scxml:scxml> child, the FIA would be suppressed and the more expressive SCXML controller used instead, allowing application developers to design the form VUI in a very flexible manner. In other words, the following <v3:scxmlform>:

<v3:scxmlform>
  <!-- No SCXML child -->
  <!-- Various form items etc. -->
</v3:scxmlform>

behaves as would:

<v2:form>
  <!-- Various form items etc. -->
</v2:form>

9.2.3 SCXML as Basis for Recursive MVC

The above examples illustrated a form-level SCXML controller. SCXML could perhaps also be used as a document-level controller, managing the interaction across v3:forms rather than v3:fields. To illustrate:

<v3:vxml>
  <scxml:scxml ...>
    <!-- document level controller managing interaction
         across form1, form2 and form3 -->
  </scxml:scxml>
  <v3:form id="form1">
    <!-- form1 content, might also have a form level SCXML controller -->
  </v3:form>
  <v3:form id="form2">
    <!-- form2 content, might also have a form level SCXML controller -->
  </v3:form>
  <v3:form id="form3">
    <!-- form3 content, might also have a form level SCXML controller -->
  </v3:form>
</v3:vxml>

A Acknowledgements

This version of VoiceXML was written with the participation of members of the W3C Voice Browser Working Group. The work of the following members has significantly facilitated the development of this specification:

The W3C Voice Browser Working Group would like to thank the W3C team, especially Kazuyuki Ashimura and Matt Womer, for their invaluable administrative and technical support.

B References

B.1 Normative References

DFP
The Voice Browser DFP Framework W3C Informative Note, February 2006. (See http://www.w3.org/Voice/2006/DFP.)
MMI
Multimodal Architecture and Interfaces W3C Working Draft, October 2008. (See http://www.w3.org/TR/mmi-arch/.)
DOM3Events
Document Object Model (DOM) Level 3 Events Specification Höhrmann, Le Hégaret and Pixley. W3C Working Draft, April 2006. (See http://www.w3.org/TR/2006/WD-DOM-Level-3-Events-20060413/.)
RFC2119
Key words for use in RFCs to Indicate Requirement Levels IETF RFC 2119, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
ECMASCRIPT
ECMAScript Language Specification, Standard ECMA-262, December 1999. (See http://www.ecma-international.org/publications/standards/Ecma-262.htm.)
VOICEXML20
Voice Extensible Markup Language (VoiceXML) Version 2.0 McGlashan et al. W3C Recommendation, March 2004. (See http://www.w3.org/TR/voicexml20/.)
VOICEXML21
Voice Extensible Markup Language (VoiceXML) Version 2.1 Oshry et al. W3C Recommendation, May 2007. (See http://www.w3.org/TR/voicexml21/.)
SSML
Speech Synthesis Markup Language Version 1.0 Burnett, Walker and Hunt. W3C Recommendation, September 2004. (See http://www.w3.org/TR/speech-synthesis/.)
SRGS
Speech Recognition Grammar Specification Version 1.0 Hunt and McGlashan. W3C Recommendation, March 2004. (See http://www.w3.org/TR/speech-grammar/.)
RFC2616
Hypertext Transfer Protocol -- HTTP/1.1 IETF RFC 2616, 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)
RFC2396
Uniform Resource Identifiers (URI): Generic Syntax IETF RFC 2396, 1998. (See http://www.ietf.org/rfc/rfc2396.txt.)

B.2 Informative References

C Glossary of Terms

active grammar
A speech or DTMF grammar that is currently active, as determined by the currently executing element and the scopes of the currently defined grammars.
application
A collection of VoiceXML documents that are tagged with the same application name attribute.
ASR
Automatic speech recognition.
author
The creator of a VoiceXML document.
catch element
A <catch> block or one of its abbreviated forms. Certain default catch elements are defined by the VoiceXML interpreter .
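For illustration, a sketch of a catch element and its abbreviated form in VoiceXML 2.0 syntax (the prompt wording is invented):

```xml
<!-- Full form: handle the nomatch event explicitly -->
<catch event="nomatch">
  <prompt>Sorry, I didn't understand.</prompt>
  <reprompt/>
</catch>

<!-- Equivalent abbreviated form -->
<nomatch>
  <prompt>Sorry, I didn't understand.</prompt>
  <reprompt/>
</nomatch>
```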
control item
A form item whose purpose is either to contain a block of procedural logic (<block>) or to allow initial prompts for a mixed initiative dialog (<initial>).
CSS
The W3C Cascading Style Sheets specification. See [CSS2]
dialog
An interaction with the user specified in a VoiceXML document . Types of dialogs include forms and menus .
DTMF (Dual Tone Multi-Frequency)
Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.
ECMAScript
A standard version of JavaScript backed by Ecma International (formerly the European Computer Manufacturers Association). See [ECMASCRIPT]
event
A notification "thrown" by the implementation platform , VoiceXML interpreter context , VoiceXML interpreter , or VoiceXML code. Events include exceptional conditions (semantic errors), normal errors (user did not say something recognizable), normal events (user wants to exit), and user defined events.
executable content
Procedural logic that occurs in <block>, <filled>, and event handlers .
form
A dialog that interacts with the user in a highly flexible fashion with the computer and the user sharing the initiative.
FIA (Form Interpretation Algorithm)
An algorithm implemented in a VoiceXML interpreter which drives the interaction between the user and a VoiceXML form or menu. See [VOICEXML20] Section 2.1.6 and Appendix C.
form item
An element of <form> that can be visited during form execution: <initial>, <block>, <field>, <record>, <object>, <subdialog>, and <transfer>.
form item variable
A variable, either implicitly or explicitly defined, associated with each form item in a form . If the form item variable is undefined, the form interpretation algorithm will visit the form item and use it to interact with the user.
implementation platform
A computer with the requisite software and/or hardware to support the types of interaction defined by VoiceXML.
input item
A form item whose purpose is to fill an input item variable with user input. Input items include <field>, <record>, <object>, <subdialog>, and <transfer>.
language identifier
A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML] , a legal language identifier is identified by an RFC 3066 [RFC3066] code. A language code is required by RFC 3066. A country code or other subtag identifier is optional by RFC 3066.
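A sketch of language identifiers carried on the xml:lang attribute, as in VoiceXML 2.0 (prompt text is invented):

```xml
<!-- xml:lang carries an RFC 3066 identifier: a required language code,
     optionally followed by a country code or other subtag -->
<prompt xml:lang="en-US">Welcome.</prompt>
<prompt xml:lang="fr">Bienvenue.</prompt>
```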
link
A set of grammars that when matched by something the user says or keys in, either transitions to a new dialog or document or throws an event in the current form item.
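A sketch of the two behaviors in VoiceXML 2.0 syntax (document and grammar file names are hypothetical):

```xml
<!-- Transition to a new document when the link grammar is matched -->
<link next="operator.vxml">
  <grammar src="operator.grxml" type="application/srgs+xml"/>
</link>

<!-- Or throw an event in the current context instead -->
<link event="help">
  <grammar src="help.grxml" type="application/srgs+xml"/>
</link>
```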
menu
A dialog that presents the user with a set of choices and takes action based on the one selected.
mixed initiative
A computer-human interaction in which either the computer or the human can take initiative and decide what to do next.
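In VoiceXML this is typically expressed with a form-level grammar and an <initial> item, so that one user utterance can fill several fields at once. A sketch, with hypothetical grammar and field names:

```xml
<form id="travel">
  <!-- Form-level grammar can fill both fields from one utterance -->
  <grammar src="flight.grxml" type="application/srgs+xml"/>
  <initial name="start">
    <prompt>Where would you like to travel from and to?</prompt>
  </initial>
  <field name="from">
    <prompt>Which city are you leaving from?</prompt>
  </field>
  <field name="to">
    <prompt>Which city are you going to?</prompt>
  </field>
</form>
```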
JSGF
Java Speech Grammar Format. A proposed standard for representing speech grammars. See [JSGF]
object
A platform-specific capability with an interface available via VoiceXML.
request
A collection of data including: a URI specifying a document server for the data, a set of name-value pairs of data to be processed (optional), and a method of submission for processing (optional).
script
A fragment of logic written in a client-side scripting language, in particular ECMAScript , a scripting language that must be supported by any VoiceXML interpreter .
session
A connection between a user and an implementation platform , e.g. a telephone call to a voice response system. One session may involve the interpretation of more than one VoiceXML document .
SRGS (Speech Recognition Grammar Specification)
A standard format for context-free speech recognition grammars developed by the W3C Voice Browser Working Group. Both ABNF and XML formats are defined [SRGS] .
SSML (Speech Synthesis Markup Language)
A standard format for speech synthesis developed by the W3C Voice Browser Working Group [SSML] .
subdialog
A VoiceXML dialog (or document) invoked from the current dialog in a manner analogous to function calls.
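A sketch of the call-and-return pattern in VoiceXML 2.0 syntax (the dialog and variable names are hypothetical):

```xml
<!-- Invoke a dialog elsewhere in this document; its return values are
     bound as properties of the variable "lookup" -->
<subdialog name="lookup" src="#getlocator">
  <filled>
    <prompt>Retrieving itinerary <value expr="lookup.locator"/>.</prompt>
  </filled>
</subdialog>
```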
tapered prompts
A set of prompts used to vary a message given to the human. Prompts may be tapered to be more terse with use (field prompting), or more explicit (help prompts).
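A sketch of tapering via the count attribute, in VoiceXML 2.0 syntax (field name and wording are invented):

```xml
<field name="city">
  <!-- Terse prompt on the first visit, more explicit help by the third -->
  <prompt count="1">Which city?</prompt>
  <prompt count="3">
    Please say the name of the destination city, for example Boston.
  </prompt>
  <grammar src="city.grxml" type="application/srgs+xml"/>
</field>
```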
throw
An element that fires an event .
TTS
text-to-speech; speech synthesis.
user
A person whose interaction with an implementation platform is controlled by a VoiceXML interpreter .
URI
Uniform Resource Identifier.
URL
Uniform Resource Locator.
VoiceXML document
An XML document conforming to the VoiceXML specification.
VoiceXML interpreter
A computer program that interprets a VoiceXML document to control an implementation platform for the purpose of conducting an interaction with a user.
VoiceXML interpreter context
A computer program that uses a VoiceXML interpreter to interpret a VoiceXML Document and that may also interact with the implementation platform independently of the VoiceXML interpreter .
W3C
World Wide Web Consortium http://www.w3.org/

D VoiceXML 3.0 XML Schema

D.1 Schema for VXML Root Module

D.2 Schema for Form Module

D.3 Schema for Field Module

D.4 Schema for Prompt Module

D.5 Schema for Builtin SSML Module

D.6 Schema for Foreach Module

D.7 Schema for Data Access and Manipulation Module

D.8 Schema for Legacy Profile


<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="TBD"
                   targetNamespace="TBD" blockDefault="#all">
    <xsd:annotation>
        <xsd:documentation>
              This is the XML Schema driver for the Legacy Profile of the VoiceXML 3.0 specification.
              Please use this namespace for the Legacy Profile:
              "TBD:URL to schema"
        </xsd:documentation>
        <xsd:documentation source="vxml3-copyright.xsd"/>
    </xsd:annotation>
 
    <xsd:annotation>
        <xsd:documentation>
            This is the Schema Driver file for the Legacy Profile of the VoiceXML 3.0 Specification.
            This schema
                + sets the namespace for Legacy Profile of Vxml 3.0 Specification
                + imports external schemas (xml.xsd)
                + imports schema modules 
 
                  Legacy Profile includes the following Modules
 
                   * Vxml Root module 
                   * Form module 
                   * Field module 
                   * Prompt module 
                   * Grammar module 
                   * Data Access and Manipulation Module
          </xsd:documentation>
    </xsd:annotation>
    <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
        schemaLocation="http://www.w3.org/2001/xml.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This import brings in the XML namespace attributes
                The XML attributes are used by various modules.
            </xsd:documentation>
        </xsd:annotation>
    </xsd:import>
 
    <xsd:include schemaLocation="vxml-datatypes.xsd">
        <xsd:annotation>
            <xsd:documentation>
                    This import brings in the common datatypes for Vxml.
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml-attribs.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This import brings in the common attributes for Vxml.
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-vxmlroot.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Vxml Root module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-form.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Form module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-field.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Field module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-prompt.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Prompt module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-grammar.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Grammar module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-dataacces.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Data Access and Manipulation module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
</xsd:schema>
Editorial note  
The schema is incomplete. It merely imports the schemas for various modules, but doesn't contain parent/child relationships between modules or constraints on them. These all need to be specified in the future.

E Major changes since the last Working Draft