W3C

Voice Extensible Markup Language (VoiceXML) 3.0

W3C Working Draft 31 August 2010

This version:
http://www.w3.org/TR/2010/WD-voicexml30-20100831/
Latest version:
http://www.w3.org/TR/voicexml30/
Previous version:
http://www.w3.org/TR/2010/WD-voicexml30-20100617/
Editors:
Scott McGlashan, Hewlett-Packard (co-Editor-in-Chief)
Daniel C. Burnett, Voxeo (co-Editor-in-Chief)
Rahul Akolkar, IBM
RJ Auburn, Voxeo
Paolo Baggia, Loquendo
Jim Barnett, Genesys Telecommunications Laboratories
Michael Bodell, Microsoft
Jerry Carter, Nuance
Matt Oshry, Microsoft
Kenneth Rehor, Cisco
Xu Yang, Aspect
Milan Young, Nuance
Rafah Hosn (until 2008, when at IBM)

Abstract

This document specifies VoiceXML 3.0, a modular XML language for creating interactive media dialogs that feature synthesized speech, recognition of spoken and DTMF key input, telephony, mixed-initiative conversations, and recording and presentation of a variety of media formats, including digitized audio and digitized video.

Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is the 31 August 2010 Seventh Public Working Draft of "Voice Extensible Markup Language (VoiceXML) 3.0". The main differences from the previous draft are described in Appendix F Major changes since the last Working Draft. A diff-marked version of this document is also available for comparison purposes.

This document is very much a work in progress. Many sections are incomplete, only stubbed out, or missing entirely. To get early feedback, the group focused on defining enough functionality, modules, and profiles to demonstrate the general framework. To complete the specification, the group expects to introduce additional functionality (for example speaker identification and verification, external eventing) and describe the existing functionality at the level of detail given for the Prompt and Field modules. We explicitly request feedback on the framework, particularly any concerns about its implementability or suitability for expected applications. By late 2010 the group expects all key capabilities to be present in the specification, with details worked out by early 2011.

Applications written as 2.1 documents can be run on a 3.0 processor using the 2.1 profile. As an example, the Implementation Report tests for 2.1 (which include the IR tests for 2.0) will be supported on a 3.0 processor. Exceptions will be limited to clarifications and changes needed to improve interoperability.

This document is a W3C Working Draft. It has been produced as part of the Voice Browser Activity. The authors of this document are participants in the Voice Browser Working Group. For more information see the Voice Browser FAQ. The Working Group expects to advance this Working Draft to Recommendation status.

Comments are welcome on www-voice@w3.org (archive). See the W3C mailing list and archive usage guidelines.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Terminology
2 Overview
    2.1 Structure of VoiceXML 3.0
    2.2 Structure of this document
    2.3 How to read this document
3 Data Flow Presentation (DFP) Framework
    3.1 Data
    3.2 Flow
    3.3 Presentation
4 Core Concepts
    4.1 Syntactic and Semantic descriptions
    4.2 Resources, Resource Controllers, and Events
        4.2.1 Top Level Controller
    4.3 Syntax
    4.4 Event Model
        4.4.1 Internal Events
            4.4.1.1 Event Interfaces
                4.4.1.1.1 Event
                4.4.1.1.2 EventTarget
                4.4.1.1.3 EventListener
            4.4.1.2 Event Flow
                4.4.1.2.1 Event Listener Registration
                4.4.1.2.2 Event Listener Activation
            4.4.1.3 Event Categories
        4.4.2 External Events
    4.5 Document Initialization and Execution
        4.5.1 Initialization
            4.5.1.1 DOM Processing
            4.5.1.2 Preparation for Execution
        4.5.2 Execution
            4.5.2.1 Subdialogs
            4.5.2.2 Application Root
            4.5.2.3 Summary of Syntax/Semantics Interaction
        4.5.3 Transition Controllers
5 Resources
    5.1 Datamodel Resource
        5.1.1 Data Model Resource API
    5.2 Prompt Queue Resource
        5.2.1 State Chart Representation
        5.2.2 SCXML Representation
        5.2.3 Defined Events
        5.2.4 Device Events
        5.2.5 Open Issue
    5.3 Recognition Resources
        5.3.1 Definition
        5.3.2 Defined Events
        5.3.3 Device Events
        5.3.4 State Chart Representation
        5.3.5 SCXML Representation
    5.4 Connection Resource
        5.4.1 Definition
        5.4.2 Final Processing State
        5.4.3 Defined Events
        5.4.4 State Chart Representation
        5.4.5 SCXML Representation
    5.5 Timer Resource
        5.5.1 Definition
        5.5.2 Defined Events
        5.5.3 Device Events
        5.5.4 State Chart Representation
6 Modules
    6.1 Grammar Module
        6.1.1 Syntax
            6.1.1.1 Attributes
            6.1.1.2 Content Model
        6.1.2 Semantics
            6.1.2.1 Definition
            6.1.2.2 Defined Events
            6.1.2.3 External Events
            6.1.2.4 State Chart Representation
            6.1.2.5 SCXML Representation
        6.1.3 Events
        6.1.4 Examples
    6.2 Inline SRGS Grammar Module
        6.2.1 Syntax
        6.2.2 Semantics
            6.2.2.1 Definition
            6.2.2.2 Defined Events
            6.2.2.3 External Events
            6.2.2.4 State Chart Representation
            6.2.2.5 SCXML Representation
        6.2.3 Events
        6.2.4 Examples
    6.3 External Grammar Module
        6.3.1 Syntax
            6.3.1.1 Attributes
            6.3.1.2 Content Model
        6.3.2 Semantics
            6.3.2.1 Definition
            6.3.2.2 Defined Events
            6.3.2.3 External Events
            6.3.2.4 State Chart Representation
            6.3.2.5 SCXML Representation
        6.3.3 Events
        6.3.4 Examples
    6.4 Prompt Module
        6.4.1 Syntax
            6.4.1.1 Attributes
            6.4.1.2 Content Model
        6.4.2 Semantics
            6.4.2.1 Definition
            6.4.2.2 Defined Events
            6.4.2.3 External Events
            6.4.2.4 State Chart Representation
            6.4.2.5 SCXML Representation
        6.4.3 Events
        6.4.4 Examples
    6.5 Builtin SSML Module
        6.5.1 Syntax
        6.5.2 Semantics
        6.5.3 Examples
    6.6 Media Module
        6.6.1 Syntax
            6.6.1.1 Attributes
            6.6.1.2 Content Model
                6.6.1.2.1 Tips (informative)
        6.6.2 Semantics
        6.6.3 Examples
    6.7 Parseq Module
        6.7.1 Syntax
        6.7.2 Semantics
        6.7.3 Examples
    6.8 Foreach Module
        6.8.1 Syntax
            6.8.1.1 Attributes
            6.8.1.2 Content Model
        6.8.2 Semantics
        6.8.3 Examples
    6.9 Form Module
        6.9.1 Syntax
        6.9.2 Semantics
            6.9.2.1 Form RC
                6.9.2.1.1 Definition
                6.9.2.1.2 Defined Events
                6.9.2.1.3 External Events
                6.9.2.1.4 State Chart Representation
                6.9.2.1.5 SCXML Representation
    6.10 Field Module
        6.10.1 Syntax
        6.10.2 Semantics
            6.10.2.1 Field RC
                6.10.2.1.1 Definition
                6.10.2.1.2 Defined Events
                6.10.2.1.3 External Events
                6.10.2.1.4 State Chart Representation
                6.10.2.1.5 SCXML Representation
            6.10.2.2 PlayandRecognize RC
                6.10.2.2.1 Definition
                6.10.2.2.2 Defined Events
                6.10.2.2.3 External Events
                6.10.2.2.4 State Chart Representation
                6.10.2.2.5 SCXML Representation
    6.11 Builtin Grammar Module
        6.11.1 Usage of Platform Grammars
        6.11.2 Platform Requirements
        6.11.3 Syntax and Semantics
        6.11.4 Examples
    6.12 Data Access and Manipulation Module
        6.12.1 Overview
        6.12.2 Semantics
            6.12.2.1 The scope stack
            6.12.2.2 Relevance of scope stack to properties
            6.12.2.3 Implicit variables
            6.12.2.4 Variable resolution
            6.12.2.5 Standard session variables
            6.12.2.6 Standard application variables
            6.12.2.7 Legal variable values and expressions
        6.12.3 Syntax
            6.12.3.1 Creating variables: the <var> element
            6.12.3.2 Reading variables: "expr" and "cond" attributes and the <value> element
                6.12.3.2.1 Inserting variable values in prompts: The <value> element
            6.12.3.3 Updating variables: the <assign> and <data> elements
                6.12.3.3.1 The <assign> element
                6.12.3.3.2 The <data> element
            6.12.3.4 Deleting variables: the <clear> element
            6.12.3.5 Relevance for properties
        6.12.4 Backward compatibility with VoiceXML 2.1
        6.12.5 Implicit functions using XPath
    6.13 External Communication Module
        6.13.1 Receiving external messages within a voice application
            6.13.1.1 External Message Reflection
            6.13.1.2 Receiving External Messages Asynchronously
            6.13.1.3 Receiving External Messages Synchronously
                6.13.1.3.1 <receive>
        6.13.2 Sending messages from a voice application
            6.13.2.1 sendtimeout
    6.14 Session Root Module
        6.14.1 Syntax
        6.14.2 Semantics
        6.14.3 Examples
    6.15 Run Time Control Module
        6.15.1 <rtc>
            6.15.1.1 Syntax
        6.15.2 <cancelrtc>
            6.15.2.1 Syntax
        6.15.3 Semantics
        6.15.4 Examples
    6.16 SIV Module
        6.16.1 SIV Core Functions
        6.16.2 Syntax
        6.16.3 Semantics
            6.16.3.1 Definition
            6.16.3.2 Defined Events
            6.16.3.3 External Events
            6.16.3.4 State Chart Representation
        6.16.4 Events
        6.16.5 Examples
    6.17 Subdialog Module
        6.17.1 Syntax
        6.17.2 Semantics
        6.17.3 Examples
    6.18 Disconnect Module
        6.18.1 Syntax
            6.18.1.1 Attributes
            6.18.1.2 Content Model
        6.18.2 Semantics
            6.18.2.1 Definition
            6.18.2.2 Defined Events
            6.18.2.3 External Events
            6.18.2.4 State Chart Representation
            6.18.2.5 SCXML Representation
        6.18.3 Example
    6.19 Play Module
        6.19.1 Semantics
            6.19.1.1 Definition
            6.19.1.2 Defined Events
            6.19.1.3 External Events
            6.19.1.4 State Chart Representation
            6.19.1.5 SCXML Representation
    6.20 Record Module
        6.20.1 Syntax
            6.20.1.1 Attributes
            6.20.1.2 Content Model
            6.20.1.3 Data Model Variables
        6.20.2 Semantics
            6.20.2.1 RecordInputItem RC
                6.20.2.1.1 Definition
                6.20.2.1.2 Defined Events
                6.20.2.1.3 External Events
                6.20.2.1.4 State Chart Representation
                6.20.2.1.5 SCXML Representation
            6.20.2.2 Record RC
                6.20.2.2.1 Definition
                6.20.2.2.2 Defined Events
                6.20.2.2.3 External Events
                6.20.2.2.4 State Chart Representation
                6.20.2.2.5 SCXML Representation
    6.21 Property Module
        6.21.1 Syntax
            6.21.1.1 Attributes
            6.21.1.2 Content Model
        6.21.2 Semantics
            6.21.2.1 Definition
            6.21.2.2 Defined Events
            6.21.2.3 External Events
            6.21.2.4 State Chart Representation
            6.21.2.5 SCXML Representation
        6.21.3 Events
        6.21.4 Examples
    6.22 Transition Controller Module
        6.22.1 Syntax
            6.22.1.1 Attributes
            6.22.1.2 Content Model
        6.22.2 Semantics
            6.22.2.1 Definition
            6.22.2.2 Defined Events
            6.22.2.3 External Events
            6.22.2.4 State Chart Representation
            6.22.2.5 SCXML Representation
        6.22.3 Events
        6.22.4 Examples
7 Profiles
    7.1 Legacy Profile
        7.1.1 Conformance
            7.1.1.1 Vxml Root Module Requirements
            7.1.1.2 Form Module Requirements
            7.1.1.3 Field Module Requirements
            7.1.1.4 Prompt Module Requirements
            7.1.1.5 Grammar Module Requirements
            7.1.1.6 Data Access and Manipulation Module Requirements
        7.1.2 Convenience Syntax
        7.1.3 Default Handlers and Transition Controllers
    7.2 Basic Profile
        7.2.1 Introduction
        7.2.2 What the Basic Profile includes
            7.2.2.1 SIV functions
            7.2.2.2 Presentation functions
            7.2.2.3 Capture functions
            7.2.2.4 Other modules
        7.2.3 Returned results
        7.2.4 What the Basic Profile does not include
        7.2.5 Examples
    7.3 Maximal Profile
    7.4 Enhanced Profile
    7.5 Convenience Syntax (Syntactic Sugar)
8 Environment
    8.1 Resource Fetching
        8.1.1 Fetching
        8.1.2 Caching
            8.1.2.1 Controlling the Caching Policy
        8.1.3 Prefetching
        8.1.4 Protocols
    8.2 Properties
        8.2.1 Speech Recognition Properties
        8.2.2 DTMF Recognition Properties
        8.2.3 Prompt and Collect Properties
        8.2.4 Media Properties
        8.2.5 Fetch Properties
        8.2.6 Miscellaneous Properties
    8.3 Speech and DTMF Input Timing Properties
        8.3.1 DTMF Grammars
            8.3.1.1 timeout, No Input Provided
            8.3.1.2 interdigittimeout, Grammar is Not Ready to Terminate
            8.3.1.3 interdigittimeout, Grammar is Ready to Terminate
            8.3.1.4 termchar and interdigittimeout, Grammar Can Terminate
            8.3.1.5 termchar Empty When Grammar Must Terminate
            8.3.1.6 termchar Non-Empty and termtimeout When Grammar Must Terminate
            8.3.1.7 termchar Non-Empty and termtimeout When Grammar Must Terminate
            8.3.1.8 Invalid DTMF Input
        8.3.2 Speech Grammars
            8.3.2.1 timeout When No Speech Provided
            8.3.2.2 completetimeout With Speech Grammar Recognized
            8.3.2.3 incompletetimeout with Speech Grammar Unrecognized
    8.4 Value Designations
        8.4.1 Integers
        8.4.2 Real Numbers
        8.4.3 Times
9 Integration with Other Markup Languages
    9.1 Embedding of VoiceXML within SCXML
    9.2 Integrating Flow Control Languages into VoiceXML
        9.2.1 SCXML for Dialog Management
            9.2.1.1 System-driven Dialog
            9.2.1.2 User-driven Dialog
        9.2.2 Graceful Degradation
        9.2.3 SCXML as Basis for Recursive MVC

Appendices

A Acknowledgements
B References
    B.1 Normative References
    B.2 Informative References
C Glossary of Terms
D VoiceXML 3.0 XML Schema
    D.1 Schema for VXML Root Module
    D.2 Schema for Form Module
    D.3 Schema for Field Module
    D.4 Schema for Prompt Module
    D.5 Schema for Builtin SSML Module
    D.6 Schema for Foreach Module
    D.7 Schema for Data Access and Manipulation Module
    D.8 Schema for Legacy Profile
E Convenience Syntax in VoiceXML 2.x
    E.1 Simplified Dialog Structure
    E.2 Examples
        E.2.1 <menu> with <choice>
        E.2.2 Equivalent <form>, <field>, <option>
        E.2.3 Equivalent <form>, <field>, <grammar>
F Major changes since the last Working Draft


1 Terminology

In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate required levels for compliant VoiceXML 3.0 implementations.

Terms used in this specification are defined in Appendix C Glossary of Terms.

2 Overview

How does one build a successor to VoiceXML 2.0/2.1? Requests for improvements to VoiceXML fell into two main categories: extensibility and new functionality.

To accommodate both, the Voice Browser Working Group

  1. Developed the detailed semantic descriptions of VoiceXML functionality that versions 2.0 and 2.1 lacked. The semantic descriptions clarify the meaning of the VoiceXML 2.0 and 2.1 functionalities and how they relate to each other. The semantic descriptions are represented in this document as English text, UML state chart visual diagrams [UML 2 State Machine Diagrams], and/or textual SCXML representations [State Chart XML (SCXML): State Machine Notation for Control Abstraction]. Figure 1 illustrates some abstract UML state chart visual diagrams representing existing VoiceXML functionality.

    Figure 1: VoiceXML 3.0 Framework - The red-filled cells indicate some functionality from VoiceXML 2.0 expressed as state charts

  2. Described the detailed semantics for new functionality. New functions include, for example, speaker identification and verification, video capture and replay, and a more powerful prompt queue. The semantic descriptions of these new functions are also represented in this document as English text, UML state chart visual diagrams, and/or textual SCXML representations. Figure 2 contains some abstract UML state chart visual diagrams representing new functionality.

    Figure 2: VoiceXML 3.0 Framework - The red-filled cells indicate new functionality

  3. Organized the functionality into modules, with each module implementing different functions. One reason for the introduction of a more rigorous semantic definition is that it allows us to assign semantics to individual modules. This makes it easier to understand what happens when modules are combined or new ones are defined. In contrast, VoiceXML 2.0 and 2.1 had a single global semantic definition (the FIA), which made it difficult to understand what would happen if certain elements were removed from the language or if new ones were added. Figure 3 illustrates some modules, each containing VoiceXML 3.0 functionality. Vendors may extend VoiceXML functionality by creating additional modules with functionality not described in this document. For example, a vendor might create a new GPS input module. Application developers should be cautious about using vendor-specific modules because the resulting application may not be portable.

    Figure 3: VoiceXML 3.0 Framework - The red bolded rectangles indicate modules

  4. Restructured and revised the syntax of each module to incorporate any new functionality. Application developers use the syntax of each module as an API to invoke the module's functions. Figure 4 illustrates some simplified syntax associated with modules.

    Figure 4: VoiceXML 3.0 Framework - The bolded red text indicates syntax

  5. Introduced the concept of a profile (language), which incorporates the syntax of several modules. Figure 5 illustrates two profiles. For example, a VoiceXML 2.1 profile incorporates the syntax of most of the modules corresponding to VoiceXML 2.1 functionality and will support most existing VoiceXML 2.1 applications; thus most VoiceXML 2.1 applications can be easily ported to VoiceXML 3.0 using the VoiceXML 2.1 profile. Another profile omits the VoiceXML 2.1 Form Interpretation Algorithm (FIA); it may be used by developers who want to define their own flow control rather than using the FIA. Multiple profiles enable developers to use just the profile (language) needed for a platform or class of applications: for example, a lean profile for portable devices, or a full-function profile for server-based applications using all of the new functionality of VoiceXML 3.0.

    Figure 5: VoiceXML 3.0 Framework - The dotted red area and the dashed green area indicate two profiles

One of the benefits of detailed semantic descriptions is improved portability within VoiceXML. Two vendors may implement the same functionality differently; however, the functionality must be consistent with the semantic meanings described in this document so that application authors are isolated from the different implementations. This increases portability among platforms that support the same syntax. Note that many other factors affecting portability are outside the scope of this document (e.g. speech recognition capabilities, telephony).
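As a concrete illustration of the kind of application the 2.1 profile is intended to keep portable, consider a minimal document written in VoiceXML 2.1 syntax. This is an illustrative sketch only; the form name, field name, and grammar URI are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="pizza">
    <field name="size">
      <prompt>What size pizza would you like?</prompt>
      <!-- hypothetical external SRGS grammar -->
      <grammar src="sizes.grxml" type="application/srgs+xml"/>
      <filled>
        <prompt>You chose <value expr="size"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Under the detailed semantic descriptions, two platforms may implement the recognition and prompting behind this document quite differently, yet the author-visible behavior must be the same.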

2.2 Structure of this document

The remainder of this document is structured as follows:

3 Data Flow Presentation (DFP) Framework presents the Data-Flow-Presentation Framework, its importance for the development of VoiceXML 3.0 and how VoiceXML 3.0 fits into the model.

4 Core Concepts explains the core concepts underlying the new structure for VoiceXML, including resources, resource controllers, the relationship between syntax and semantics, DOM eventing, modules and profiles.

5 Resources presents the resources defined for the language. These provide the key presentation-related functionality in the language.

6 Modules presents the modules defined for the language. Each module consists of a syntax piece (with its user-visible events), a semantics piece (with its behind-the-scenes events) and a description of how the two are connected.

7 Profiles presents two profiles. The first, the VoiceXML 2.1 profile, shows how a language similar to VoiceXML 2.1 can be created using the structure and functionality of VoiceXML 3.0. The second, the Basic profile, leaves out higher-level flow control constructs such as <form> and the associated Form Interpretation Algorithm.

The Appendices provide useful references and a glossary of terms used in the specification.

2.3 How to read this document

For everyone: Please first read 3 Data Flow Presentation (DFP) Framework. The data-flow-presentation distinction applies not only to VoiceXML 3.0 but to many of the W3C's specifications. Understanding VoiceXML's role as a presentation language is crucial context for understanding the rest of the specification.

For application authors: we recommend that you begin with syntax and only gradually explore details of the semantics as you need to understand behavioral specifics.

  1. If you are familiar with VoiceXML 2 you might want to begin with the Legacy profile in 7.1 Legacy Profile to see an example of all the syntactic pieces in the finished profile.
  2. You should then review the syntax sections of each of the modules in 6 Modules, along with the Basic profile in 7.2 Basic Profile. When you need to understand how a bit of syntax is implemented, read the semantics section corresponding to that syntax.
  3. Along the way you will definitely want to review the parts of 4 Core Concepts that are relevant to your other reading (profiles, modules, syntax, semantics, and DOM eventing).

For VoiceXML platform developers: we recommend that you begin with the functionality and framework and only focus on syntax later.

  1. If you are familiar with VoiceXML 2 you might want to begin with the Legacy profile in 7.1 Legacy Profile to see the user-visible differences between the original VoiceXML 2.1 language and the new Legacy profile. A brief review of the Basic profile in 7.2 Basic Profile would be good as well.
  2. Next you should review 4 Core Concepts in detail, since the rest of the language is built upon the framework described there.
  3. 5 Resources and 6 Modules (the semantics part) should be the bulk of your focus. Remember that they are semantic descriptions only and that you can implement the functionality any way you wish as long as the semantics remain the same.
  4. One significant difference from VoiceXML 2.1 is support for the DOM event model described in 4.4 Event Model.

3 Data Flow Presentation (DFP) Framework

Unlike VoiceXML 2.0/2.1, VoiceXML 3.0 focuses almost exclusively on the user interface portions of the language. By choice, very little work has gone into the development of data storage and manipulation or control flow capabilities. In short, VoiceXML 3.0 has been designed from the ground up as a *presentation* language, according to the definition presented in the Data Flow Presentation ([DFP]) Framework.

Although VoiceXML 3.0 is a presentation language, it also contains within it all three levels of the DFP framework (Figure 6).

DFP Architecture

Figure 6: DFP Architecture

The Data Flow Presentation (DFP) Framework is an instance of the Model-View-Controller paradigm, in which computation and control flow are kept distinct from application data and from the way in which the application communicates with the outside world. This partitioning of an application allows any one layer to be replaced independently of the other two. In addition, it is possible to make simultaneous use of more than one Data (Model), Flow (Controller), and/or Presentation (View) language.

4 Core Concepts

4.1 Syntactic and Semantic descriptions

This document specifies the VoiceXML 3.0 language as a collection of modules. Each module is described at two levels:

  1. Syntax level -- The syntax is a set of XML elements, attributes, and events used by VoiceXML 3.0 application developers to specify applications. The VoiceXML 3.0 elements and attributes are specified within each module and in the XML schema in appendix TBD. The events are DOM Level 3 events. This document provides a textual description of each element, attribute, and event.
  2. Semantics level -- The semantics of each module is described in terms of resources, resource controllers, and semantic events that the resource controllers may generate and consume. Semantics is described by both UML state chart visual diagrams and SCXML representations.

The visual UML state chart diagrams are informative. They are included for ease of reading and quick understanding. The more detailed textual SCXML representations are normative.
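To illustrate the normative notation, a resource controller's SCXML representation takes the general shape sketched below. This is schematic only; the state names and event names are illustrative and are not defined by this specification:

```xml
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="idle">
  <!-- illustrative two-state controller -->
  <state id="idle">
    <transition event="start" target="active"/>
  </state>
  <state id="active">
    <transition event="done" target="idle"/>
  </state>
</scxml>
```

The corresponding UML diagram would show the same two states and two labeled transitions; the SCXML text is what carries normative weight.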

It is important to note that this model places no requirement on a VoiceXML interpreter to implement behavior exactly as described in the model. Rather, the requirement is that the observable behavior must be the same as if it were implemented as described; implementations may use optimizations or a different internal architecture when interpreting the markup.

The semantic descriptions are important for reasons including the following:

4.2 Resources, Resource Controllers, and Events

The resources, resource controllers, and the events they generate are intended only to describe the semantics of VoiceXML 3 modules. Implementations are not required to use SCXML to implement VoiceXML 3 modules, nor must they create objects corresponding to resources, resource controllers, and the SCXML events they raise.

The logical SCXML events must be distinguished from the author-visible DOM events that are a mandatory part of the VoiceXML 3 language. Implementations MUST raise these DOM events and process them in the manner described in 4.4 Event Model. The interaction between actual DOM events and logical SCXML events is described in 4.5 Document Initialization and Execution, below.

Each VoiceXML 3.0 module is described using SCXML notation and optionally a UML state chart representation of its underlying behavior expressed in terms of resources and resource controllers. While the resources and resource controllers are not exposed directly in the markup, they are used to define the semantics of VoiceXML 3.0 markup elements. Figure 7 illustrates the relationship among resource controllers, resources, and media devices. The arrows represent events exchanged among components. A more concrete example is represented in Figure 8, which illustrates the Prompt Resource Controller (further defined in 6.4.2 Semantics), the PromptQueue Resource, and the SSML Media Player.

Semantic model overview

Figure 7: Semantic model with Resources and Resource Controllers

Semantic model details

Figure 8: Semantic model with Specific Examples
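In SCXML terms, the event exchange shown by the arrows in Figures 7 and 8 amounts to a resource controller sending events to a resource and reacting to the events the resource raises in return. The following sketch is illustrative only; the event names and target name are hypothetical, not the ones defined in 6.4.2 Semantics:

```xml
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="queuing">
  <state id="queuing">
    <onentry>
      <!-- illustrative: ask the PromptQueue resource to play queued prompts -->
      <send event="play" target="promptQueue"/>
    </onentry>
    <!-- illustrative completion event raised back by the resource -->
    <transition event="playDone" target="finished"/>
  </state>
  <final id="finished"/>
</scxml>
```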

4.4 Event Model

Editorial note  
Open Issue: We really need to dig in on how the traditional throw/catch from VXML 2.0 works. How does the event handler selection work (complete with partial event name matching, generic count, best count, cond, document order, as-if-by-copy, etc.)? All of that has to be implemented as the default behavior of the vxmlevent event, which means the vxmlevent needs to be targeted at the right node (the one where the as-if-by-copy occurs) and the payload of the vxmlevent needs to include the event name and the count. We may also want to expose (in JS bindings) the creation of the list of candidate catch handlers as the algorithm is run.

4.4.1 Internal Events

The event model for VoiceXML 3.0 builds upon the DOM Level 3 Events [DOM3Events] specification. DOM Level 3 Events offers a robust set of interfaces for managing listener registration, dispatching, propagation, and handling of events, as well as a description of how events flow through an XML tree.

The DOM Level 3 event model offers VoiceXML developers a rich set of interfaces that allow them to easily add behavior to their applications. In addition, conforming to the standard DOM event model enables authors to integrate their voice applications into next-generation multimodal or multi-namespace frameworks such as MMI and CDF with minimal effort. Note that VXML 2.0 style events are supported through a new DOM event named 'vxmlevent'; if this vxmlevent is not canceled, the default action is to run the VXML 2.0 event handling.
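For instance, a traditional VoiceXML 2.0 style handler such as the one sketched below continues to work: a noinput occurrence is dispatched as a 'vxmlevent' DOM event, and if no listener cancels that event, its default action selects and runs the <catch> handler as in VoiceXML 2.0. The field name and prompt text are illustrative:

```xml
<field name="city">
  <prompt>Which city?</prompt>
  <!-- VXML 2.0 style handler; run as the default action of the
       'vxmlevent' DOM event when that event is not canceled -->
  <catch event="noinput">
    <prompt>Sorry, I didn't hear you.</prompt>
  </catch>
</field>
```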

Within the VoiceXML 3.0 semantic model, the DOM Level 3 Events APIs are available to all Resource Controllers that have markup elements associated with them. Indeed, this section covers the eventing APIs as available to VoiceXML 3.0 markup elements. The following section describes how the semantic model ties in with the DOM eventing model.

4.4.1.1 Event Interfaces

All VoiceXML 3.0 markup elements implement interfaces that support the following:

  • Subscription to events by event listeners and, symmetrically, the removal of event listeners.
  • Publishing of the events emitted by their resources.
  • Event handling.
4.4.1.1.1 Event

The VoiceXML 3.0 Event interface extends the DOM Level 3 Event interface to support voice-specific event information. In particular, the VoiceXML 3.0 Event interface supports a count integer that stores the number of times a resource emits a particular event type. The semantic model manages the count field by incrementing its value and resetting it as described in the section that follows.

Open Issue: Because we are now using the 'vxmlevent' DOM event, we do not need to add a count to the generic DOM events (and thus change the generic DOM events). Instead, we need to specify the count as one of the properties of the vxmlevent event.
4.4.1.1.2 EventTarget

VoiceXML 3.0 markup elements implement the DOM Level 3 EventTarget interface. This interface allows registration and removal of event listeners as well as dispatching of events.

4.4.1.1.3 EventListener

The VoiceXML 3.0 markup elements implement the DOM Level 3 EventListener interface. This interface allows the activation of handlers associated with a particular event. When a listener is activated, the event handler execution is done in the semantic model as described in the section that follows.

4.4.1.2 Event Flow
[To be updated by Michael Bodell (members only) due April 1 2008] Events propagate through markup elements as per the DOM event flow. Event listeners may be registered on any VoiceXML markup element. When processing a VoiceXML 2.0 profile, event listeners must not be registered for the capture phase, as this contradicts the as-if-by-copy event semantics of VoiceXML 2.0. If a listener is registered with the capture phase set to true in a VoiceXML 2.0 document, an error.event.illegalphase event will be dispatched onto the root document and the listener registration will be ignored (does that sound reasonable to people?).
4.4.1.2.1 Event Listener Registration

The DOM Level 3 Event specification supports the notion of partial ordering using the event listener group; all events within a group are ordered. As such, in VoiceXML 3.0, event listeners are registered as they are encountered in the document. Furthermore, all event listeners registered on an element belong to the same default group. Both of these provisions ensure that event handlers will execute in document order.

4.4.1.3 Event Categories

The VoiceXML 3.0 specification extends the DOM 3 Event specification to support partial name matching on events. VoiceXML 3.0 creates categories of events (the list of categories needs to be specified in the VoiceXML 3.0 spec) and allows authors and the platform to register listeners for either a specific event type or for all events within a particular category or subcategory. For example, VoiceXML 3.0 may create a connection category such as:

{"http://www.example.org/2007/v3","connection"}

The spec may also declare a subcategory of connection or a specific event type that belongs to this category:

{"http://www.example.org/2007/v3","connection.disconnect"}
{"http://www.example.org/2007/v3","connection.disconnect.hangup"}

Following this declaration, the VoiceXML 3.0 Event specification uses partial name matching to associate events propagating through the DOM with listeners registered on the tree. The VoiceXML 3.0 Event specification follows the prefix matching used in VoiceXML 2.0 for associating events with their categories.

Note: It might be useful to introduce the "*" notation to specify a catch for all events irrespective of their type and/or category.
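As a non-normative illustration, the VoiceXML 2.0-style prefix matching described above might be sketched as follows; the helper name is an assumption of the sketch.

```python
def event_matches(pattern, event_type):
    """Hypothetical helper: prefix matching of a listener's registered
    pattern against a dotted event type. A listener registered for
    'connection.disconnect' matches 'connection.disconnect.hangup';
    '*' (the notation proposed above) matches every event."""
    if pattern == "*":
        return True
    # Match the exact type, or any subcategory separated by a dot.
    return event_type == pattern or event_type.startswith(pattern + ".")
```

Note that the dot separator matters: a listener for "connection" should not match an unrelated type such as "connections.open".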

4.4.2 External Events

VoiceXML 3.0 interpreters may receive events from external sources, for example SCXML engines. In particular, an interpreter may receive the life cycle events specified as part of the Multimodal Architecture and Interfaces specification [MMI] . These life cycle events allow the flow component of the DFP architecture to control the presentation layer by starting and stopping the processing of markup. By handling these events, the VoiceXML interpreter acts as a 'modality component' in the multimodal architecture, while the flow component acts as an 'interaction manager'. As a result, VoiceXML 3 applications can be easily extended into multimodal applications. However, it is important to note that support for the life cycle events is required by the DFP framework in all applications, whether uni- or multimodal.

The interpreter must handle the following life cycle events automatically:

  • PrepareRequest. This event instructs the interpreter to prepare to run a VoiceXML script. The event contains: a) either the URI of the markup to run or the actual markup itself, b) a context ID which will be used in subsequent messages referring to the same markup, and c) a specification of media channel to use. Note that the interpreter does not actually start running the markup in question when it receives this message. Once the interpreter has finished its preparation, it sends a PrepareResponse event in reply. The PrepareResponse event contains the context ID that was sent in the PrepareRequest plus a status field containing 'success' or 'failure'. This message with a status of 'success' thus indicates that the interpreter is now ready to run the specified markup.
  • StartRequest. This event instructs the interpreter to run the specified script. It will contain the same context ID as the preceding PrepareRequest, and may optionally contain a new specification of the markup to run and the media channel to use, overriding those contained in the PrepareRequest. This event may also be sent without a preceding PrepareRequest. When the interpreter receives this event, it must start running the specified markup using the specified media channel. It will then send a StartResponse event in reply.
  • CancelRequest with 'immediate' flag set to true. This event instructs the interpreter to disconnect from the media and to stop processing the markup. This event contains the context ID. The interpreter may continue processing clean-up handlers etc. after it receives this event, but it should not send any events back to the sender other than the CancelResponse, which acknowledges the receipt of this command. If the 'immediate' flag is set to false, the interpreter passes the event up to be handled by author code, as described below.
  • ClearContextRequest???

All other life cycle events and all other external events are ignored unless the External Communications Module ( 6.13 External Communication Module ) is included in the profile. If the External Communications Module is present, all other external events are passed up to the application, placed in the application event queue, and then handled as specified by the developer using the functionality defined in that module.
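A non-normative sketch of the automatic life cycle handling above; the dict-based event shape and the StubInterpreter are assumptions of the sketch, not defined by this specification or by [MMI].

```python
class StubInterpreter:
    """Minimal stand-in for the VoiceXML interpreter (illustrative only)."""
    def __init__(self):
        self.contexts = {}
    def prepare(self, ctx, markup, media):
        self.contexts[ctx] = ("prepared", markup, media)
    def start(self, ctx, markup=None, media=None):
        prev = self.contexts.get(ctx, ("new", None, None))
        # StartRequest may override the markup/media from PrepareRequest.
        self.contexts[ctx] = ("running", markup or prev[1], media or prev[2])
    def cancel(self, ctx):
        self.contexts[ctx] = ("cancelled", None, None)

def handle_lifecycle(event, interp, has_external_comms=False):
    """Dispatch one life cycle event; returns the response event (a dict),
    the string 'to-application' for events handled by author code, or
    None when the event is ignored."""
    name, ctx = event.get("name"), event.get("contextID")
    if name == "PrepareRequest":
        interp.prepare(ctx, event.get("markup"), event.get("media"))
        return {"name": "PrepareResponse", "contextID": ctx, "status": "success"}
    if name == "StartRequest":
        interp.start(ctx, event.get("markup"), event.get("media"))
        return {"name": "StartResponse", "contextID": ctx, "status": "success"}
    if name == "CancelRequest":
        if event.get("immediate"):
            interp.cancel(ctx)
            return {"name": "CancelResponse", "contextID": ctx}
        return "to-application"   # immediate=false: author code handles it
    # All other external events: application queue only if the module is present.
    return "to-application" if has_external_comms else None
```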

Editorial note  
Open Issue: Should ClearContextRequest be handled automatically? Should Done be sent automatically when the document is finished? Where do these response events get sent?

4.5 Document Initialization and Execution

4.5.1 Initialization

VoiceXML 3.0 document initialization takes place over two phases: "DOM Processing" and "Preparation for Execution". Both of these phases assume the required resources have already been created. Any errors in the initialization of the document or the creation of these resources MUST be thrown in the calling context. If that context was a VoiceXML document, then this MUST be an error.badfetch.

Note that while these phases are ordered, and the steps within the phases ordered, this is only a logical ordering. Implementations are allowed to use a different ordering as long as they behave as if they were following the specified ordering.

4.5.1.1 DOM Processing

The first step in initializing a VoiceXML 3.0 document (root document or child) is generating the Level-3 DOM. This task involves both checking that the document is well-formed XML and performing full schema and syntax validation to ensure proper tag/attribute relationships.

Once complete, the interpreter invokes the semantic constructor for the root <vxml> node in the DOM. In this context, the term "semantic constructor" represents whatever mechanism is used to create the Resource Controllers for a given node. No particular implementation is implied or required. The root <vxml> node constructor is responsible for invoking the constructors for all nodes in the document that have them. When it does this, it will call the semantic constructor routine passing it

  1. a pointer to the node that has the constructor
  2. a pointer to the root of the DOM
  3. an arbitrary data structure
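The constructor walk described above can be sketched non-normatively as a depth-first traversal; the 'registry' mapping tag names to constructor functions is an assumption of this sketch (the specification deliberately does not mandate any particular mechanism).

```python
def run_constructors(node, dom_root, data, registry):
    """Sketch of the root constructor's walk: depth-first over the DOM,
    calling the semantic constructor of every node that has one with the
    three arguments listed above (node, DOM root, arbitrary data)."""
    created = []
    ctor = registry.get(node["tag"])
    if ctor is not None:
        created.append(ctor(node, dom_root, data))
    for child in node.get("children", []):
        created.extend(run_constructors(child, dom_root, data, registry))
    return created
```

Nodes without a registered constructor are simply skipped, but their children are still visited.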
Editorial note  

Open Issue: we must specify the operation of the root node constructor in more detail as part of the V3 specification. (Other people can define modules, but we must specify how they are assembled into a full semantic representation of the application.) If there is an application root document specified, the root node constructor will have to construct its RCs as well, by calling its root node constructor.

Also, this needs to happen after creation of the RCs and before general semantic initialization. After the creation of the RCs is when the mapping from syntax to RCs will occur, and that is when the list would be known.

Note that the initial construction process creates the RCs but does not necessarily fully configure them. Further initialization, including in particular the creation of variables and variable scopes, will happen only when the RCs are activated at runtime (e.g. by visiting a Form). However, at this point the list of children for each element (and thus each RC) is known. For each RC this list of children will be populated into the appropriate place in the RC data model before semantic initialization of the RC.

Once the RCs are constructed, they are independent of the DOM, except for the interactions specified below. However, while they are running the RCs often make use of what appears to be syntactic information. For example, the concept of 'next item' relies heavily on document order, while <goto> can take a specific syntactic label as its target. We provide for this by assuming that RCs can maintain a shadow copy of relevant syntactic information, where "shadow copy" is intended to allow a variety of implementations. In particular, platforms may make an actual copy of the information or may maintain pointers back into the DOM. The construction process may create multiple RCs for a given node. In that case, one of the RCs will be marked as the primary RC. It is the one that will be invoked when the flow of control reaches that (shadow) node.

4.5.1.2 Preparation for Execution

If the document being initialized is a child of a root document, then the root document of that child must fully complete its initialization before the child can be prepared. In other words, the root document must both process its DOM and prepare for execution before child initialization proceeds.

Once in the preparation phase, static properties (i.e. those NOT a function of ECMAScript) are available for lookup. Although this is not an explicit step, it is mentioned here as this is the first opportunity for their retrieval. Note that even if documentmaxage/stale properties were to be specified in the child document, they would not be available for retrieval when downloading the root document. Rather, these values would be taken from the system defaults or context. For example, consider the case of a first call into a system which lands on a child document called A. The default values for documentmaxage/stale would be used when fetching both this child A and the root document of the child, called A-root. Should A transition to child document B which references root B-root, the <property> values of documentmaxage/stale in A would be used to fetch B. However, the implicit fetch of B-root would use the system defaults for documentmaxage/stale.
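The resolution rule in the example above can be sketched non-normatively as follows; the default values and the function shape are assumptions of the sketch, while documentmaxage/documentmaxstale are the property names used above.

```python
# Illustrative defaults; real platforms supply their own values.
SYSTEM_DEFAULTS = {"documentmaxage": "0s", "documentmaxstale": "0s"}

def fetch_properties(initiating_doc_props):
    """Return the fetch-related properties governing a document fetch.
    The fetch is governed by the <property> values of the document that
    initiates the transition; when there is no such document (the first
    call into the system, or the implicit fetch of a root document),
    the system defaults apply."""
    if initiating_doc_props is None:
        return dict(SYSTEM_DEFAULTS)
    merged = dict(SYSTEM_DEFAULTS)
    merged.update({k: v for k, v in initiating_doc_props.items()
                   if k in SYSTEM_DEFAULTS})
    return merged
```

Mirroring the example above: A and A-root are fetched with the system defaults, B with A's `<property>` values, and B-root (an implicit fetch) again with the defaults.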

With the ability to read <property> values comes the first opportunity to act on any prefetching directives supplied by the application. Prefetching is an optional step, and could be postponed temporarily or indefinitely. The only requirement on a conformant processor is that prefetching cannot take place before this step.

Next, document-level variables and scripts are initialized in document order. Note that conformant processors MUST NOT locally handle any semantic errors generated during this step. Such errors MUST be thrown to the calling document or context (e.g. error.badfetch), because the present document is not yet fully initialized and thus cannot reliably handle errors locally.

The final step in preparation is for the controller to select the first <form> to execute. If either the local controller is malformed or the optional URI fragment points to a non-existent <form>, an error MUST be generated in the calling document or context (e.g. error.badfetch). A conformant processor MUST NOT handle this locally.

4.5.2 Execution

After initialization, the semantic control flow does a <goto> to the initial Resource Controller. Once a RC is running, it invokes Resources and other RCs by sending them events. The DOM is not involved in this process. At various points in the processing, however, an RC may decide to raise an author-visible event. It does this by creating an event targeted at a specific DOM node and sending it back to the DOM. When the DOM receives the event, it performs the standard bubble/capture cycle with the target specified in the event. In the course of the bubble/capture cycle, various event handlers may fire. Their execution is a semantic action and occurs back in the semantic 'side' of the environment. The DOM sends messages back to the appropriate semantic objects to cause this to happen. Note that this means that the DOM must store some sort of link to the appropriate RCs. The event handlers may update the data model, execute script, or raise other DOM events. When the handler finishes processing on the semantic side, it sends a notification back to the DOM so that it can resume the bubble/capture phase. (N.B. This notification is NOT a DOM event.) When the DOM finishes the bubble/capture processing of the event, it sends a notification back to the RC that raised the event so that it can continue processing.
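The pause/dispatch/resume round trip described above can be sketched non-normatively; the class shapes below are assumptions, and the capture/bubble traversal itself is elided.

```python
class SimpleDom:
    """Stand-in for the DOM side: dispatch() yields the handlers selected
    by the capture/target/bubble traversal, in firing order (the traversal
    itself is elided in this sketch)."""
    def __init__(self, handlers):
        self.handlers = handlers
    def dispatch(self, event):
        for h in self.handlers:
            yield h

class ResourceController:
    def __init__(self):
        self.paused = False

def raise_author_event(rc, dom, event, log):
    """The RC pauses, the DOM runs its dispatch cycle calling back into
    semantic handlers, and a (non-DOM) notification then resumes the RC."""
    rc.paused = True                      # RC processing pauses during dispatch
    for handler in dom.dispatch(event):
        log.append(rc.paused)             # handlers run while the RC is paused
        handler(event)                    # handler executes on the semantic side
    rc.paused = False                     # notification back to the RC: resume
```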

Editorial note  
Open Issue: Is this notification a standard semantic event? Note that RC processing must pause during the bubble/capture phase to avoid concurrency problems.

4.5.3 Transition Controllers

Transition controllers provide the basis for managing flow in VoiceXML applications. Resource controllers for some elements like <form> have associated transition controllers which influence how form items get selected and executed. In addition to form, there is a transition controller for each of the following higher VoiceXML scopes:

  • A document-level transition controller associated with the <vxml> element's resource controller that is responsible for starting document processing and deciding which resource controller to execute next, i.e., for a <form> or other interaction element in the document.
  • An application-level transition controller that is responsible for starting application processing and deciding which document-level resource controller to execute next.
  • A top-level or session transition controller that manages the flow for a VoiceXML session. This top-level transition controller is responsible for starting session processing and also holds session level properties.

The above transition controllers influence the selection of the first form item resource controller to execute, and subsequent ones through the session.

Whenever a form item or form or document finishes execution, the relevant transition controller is consulted for selecting the subsequent one for execution. To find the relevant transition controller, begin at the current resource controller and navigate along the associated VoiceXML element's parent axis until you reach a resource controller with an associated transition controller. For example, for a form item, the parent form's resource controller has the relevant transition controller.
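The parent-axis walk described above can be sketched non-normatively; the RC class here is an assumption of the sketch.

```python
class RC:
    """Minimal resource controller node with an optional transition
    controller and a link along the parent axis (illustrative only)."""
    def __init__(self, name, parent=None, transition_controller=None):
        self.name = name
        self.parent = parent
        self.transition_controller = transition_controller

def relevant_transition_controller(rc):
    """Walk up from the current resource controller along the parent axis
    until a resource controller with an associated transition controller
    is found."""
    node = rc
    while node is not None:
        if node.transition_controller is not None:
            return node.transition_controller
        node = node.parent
    return None
```

For a form item, the walk stops at the parent form's resource controller, which holds the relevant transition controller.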

When a transition controller runs to completion, control is returned to the next higher transition controller along with any results that need to be passed up. For example, when the last form item in a form is filled, the transition controller associated with the form returns control to the document level transition controller along with the results for the filled form. Control may be returned to the parent transition controller in case of such run to completion semantics as well as error semantics.

5 Resources

This section describes semantic models for common VoiceXML resources. Resources have a life cycle of creation and destruction. Specific resources may specify detailed requirements on these phases. All resources must be created prior to their use by a VoiceXML interpreter.

Editorial note  
Standard lifecycle events are expected to be defined in later versions: create event: from idle to created; destroy event: from created to idle.

Resources are defined in terms of a state model and the events which the resource processes within defined states. Events may be divided into those which are defined by the resource itself and events defined by other conceptual entities which the resource receives or sends within these states. These conceptual entities include resource controllers and a 'device' which provides an implementation of the services defined by the resource.

The semantic model is specified in both UML state chart diagrams and SCXML representations. In case of ambiguity, the SCXML representation takes precedence over the UML diagrams. Note that SCXML is used here to define the states and events for resources, and this definitional usage should not be confused with the use of SCXML to specify application flow (see 3.2 Flow ). Furthermore, these resource events are conceptual, not DOM events: they are used to define the relationship with other conceptual entities and are not exposed at the markup level. The relationship between conceptual events and DOM events is described in XXX.

The following resources are defined: data model ( 5.1 Datamodel Resource ), prompt queue ( 5.2 Prompt Queue Resource ), recognition -- DTMF, ASR, and SIV ( 5.3 Recognition Resources ), connection (), and timer ( 5.5 Timer Resource ).

[Later versions will define the following resources: recorder, SIV. Later versions may define the following resources: session recorder, ...]

5.1 Datamodel Resource

Editorial note  

Later versions of this document will clarify that different datamodels may be instanced, such as ECMAScript, XML, etc. Conformance requirements will be stated at a later stage.

The datamodel is a repository for both user- and system-defined data and properties. To simplify variable lookup, we define the datamodel with a synchronous function-call API, rather than an asynchronous one based on events. The data model API does not assume any particular underlying representation of the data or any specific access language, thus allowing implementations to plug in different concrete data model languages.

There is a single global data model that is created when the system is first initialized. Access to data is controlled by means of scopes, which are stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "Global" is created. Thereafter scopes are explicitly created and destroyed by the data model's clients.

Editorial note  
Resource and Resource controller description to be updated with API calls rather than events.

5.1.1 Data Model Resource API

Table 1: Data Model API
Function Arguments Return Value Sequencing Description
CreateScope name(optional) Success or Failure Creates a new scope object and pushes it on top of the scope stack. If no name is provided the scope is anonymous and may be accessed only when it is on the top of the scope stack. A Failure status is returned if a scope already exists with the specified name.
DeleteScope name(optional) Success or Failure Removes a scope from the scope stack. If no name is provided, the topmost scope is removed. Otherwise the scope with provided name is removed. A Failure status is returned if the stack is empty or no scope with the specified name exists.
CreateVariable variableName, value(optional), scopeName(optional) Success or Failure Creates a variable. If scopeName is not specified, the variable is created in the topmost scope on the scope stack. If no value is provided, the variable is created with the default value specified by the underlying datamodel. A Failure status is returned if a variable of the same name already exists in the specified scope.
DeleteVariable variableName, scopeName(optional) Success or Failure Deletes the variable with the specified name from the specified scope. If no scopeName is provided, the variable is deleted from the topmost scope on the stack. The status Failure is returned if no variable with the specified name exists in the scope.
UpdateVariable variableName, newValue, scopeName(optional) Success or Failure Assigns a new value to the variable specified. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. A Failure status is returned if the specified variable or scope cannot be found.
ReadVariable variableName, scopeName(optional) value Returns the value of the variable specified. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. An error is raised if the specified variable or scope cannot be found.
EvaluateExpression expr, scopeName(optional) value Evaluates the specified expression and returns its value. If scopeName is not specified, the expression is evaluated in the topmost scope on the stack. An error is raised if the specified scope cannot be found.
Issue (): Do we need event listeners on the data model, e.g., to notify when the value of a variable changes? Resolution: None recorded.
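A non-normative Python sketch of the scope-stack semantics in Table 1; the Failure exception stands in for the Failure status, and DeleteVariable and EvaluateExpression are omitted for brevity.

```python
class Failure(Exception):
    """Stands in for the Failure status of Table 1."""

class DataModel:
    """Sketch of the scope-stack semantics (illustrative, non-normative)."""
    def __init__(self):
        self._stack = [("Global", {})]   # single scope created at initialization

    def _find(self, name):
        # Search from the top of the stack down for a scope with this name.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == name:
                return i
        return None

    def create_scope(self, name=None):
        if name is not None and self._find(name) is not None:
            raise Failure("scope exists: %s" % name)
        self._stack.append((name, {}))   # anonymous when name is None

    def delete_scope(self, name=None):
        if name is None:
            if not self._stack:
                raise Failure("stack empty")
            self._stack.pop()
            return
        i = self._find(name)
        if i is None:
            raise Failure("no such scope: %s" % name)
        del self._stack[i]

    def _scope(self, scope_name):
        # Default to the topmost scope when no scope name is given.
        if scope_name is None:
            return self._stack[-1][1]
        i = self._find(scope_name)
        if i is None:
            raise Failure("no such scope: %s" % scope_name)
        return self._stack[i][1]

    def create_variable(self, var, value=None, scope_name=None):
        s = self._scope(scope_name)
        if var in s:
            raise Failure("variable exists: %s" % var)
        s[var] = value

    def update_variable(self, var, value, scope_name=None):
        s = self._scope(scope_name)
        if var not in s:
            raise Failure("no such variable: %s" % var)
        s[var] = value

    def read_variable(self, var, scope_name=None):
        s = self._scope(scope_name)
        if var not in s:
            raise Failure("no such variable: %s" % var)
        return s[var]
```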

5.2 Prompt Queue Resource

5.2.1 State Chart Representation

Here is a UML representation of the prompt queue. This state machine assumes that "queue" and "play" are separate commands and that a separate "play" will always be issued to trigger the play. When the "play" is issued, the system plays any queued prompts, up to and including the first fetch audio in the queue. Then it halts, even if there are additional prompts or fetch audio in the queue, and waits for another "play" command.

Editorial note   Open issue: Can queued prompt commands, either audio or TTS, be left un-fetched or un-rendered until a play command is issued to the prompt resource? This may result in delays or gaps in the production of the actual audio, as the rendering or fetching may not produce playable audio fast enough to avoid inter-prompt delays.

The prompt structure assumed here is fairly abstract. It consists of a specification of the audio along with optional parameters controlling playback (for example, speed or volume.) The audio may be presented in-line, as SSML or some other markup language, or as a pointer to a file or streaming audio source. Logically, URLs are dereferenced at the time the prompt is queued, but implementations are not required to fetch the actual media until the prompt in question is sent to the player device. Note that the player device is assumed to be able to handle both recorded prompts and TTS, and to be able to interpret SSML. Platforms are free to optimize their implementations as long as they conform to the state machine specified here. In particular, platforms may prefetch audio or begin TTS processing in the background before the prompt is sent to the player device. For applications that make use of VCR controls (speed up, skip forward, etc.), actual performance may depend on whether the platform has implemented such optimizations. For example, a request to skip forward on a platform that does not prefetch prompts may result in a long delay. Such performance issues are outside the scope of this specification.

This diagram assumes that SSML mark information is delivered in the Player.Done event, and that the player returns a Player.Done event when it is sent a 'halt' event (otherwise mark information would get lost on barge-in and hangup, etc).

Note that the "FetchAudio" state is shown stubbed out for reasons of space, and is expanded in a separate diagram below the main one.

Semantic model for prompt queue semantics

Figure X: Prompt Queue Model

Semantic model for fetch audio

Figure Y: Fetch audio Model

5.2.2 SCXML Representation

<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data name="queue"/>
    <data name="markName"/>
    <data name="markTime"/>
    <data name="bargeInType"/>
  </datamodel>
   <state id="Created">
    <initial id="Idle"/>
    <transition event="QueuePrompt">
     <insert  pos="after" loc = "datamodel/data[@name='queue']/prompt" val="_eventData/prompt"/>
    </transition>
     <transition event="QueueFetchAudio">
       <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt"> 
         <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
         <else/>
          <assign loc="$node[@bargeInType]" val="unbargeable"/>
         </if>
       </foreach>
    <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/audio"/>
    </transition>
    <transition event="setParameter">
     <send target="player" event="setParameter" namelist="_eventData.paramName, _eventData.newValue"/>
    </transition>
    <transition event="Cancel" target="Idle">
     <send target="player" event="halt"/>
     <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
     <delete loc="datamodel/data[@name='queue']/prompt"/>
    </transition>
    <transition event="CancelFetchAudio">
       <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt"> 
         <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
         </if>
       </foreach>
    </transition>
    <state id="Idle">
     <onentry>
      <assign loc="/datamodel/data[@name='markName']" val=""/>
      <assign loc="/datamodel/data[@name='markTime']" val="-1"/>
      <assign loc="/datamodel/data[@name='bargeInType']" val=""/>
     </onentry>
     <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'false'" target="PlayingPrompt"/>
     <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state>
    <state id="PlayingPrompt">
     <datamodel>
      <data name="currentPrompt"/>
     </datamodel>
     <onentry>
      <assign loc="/datamodel/data[@name='currentPrompt']/prompt" val="/datamodel/data[@name='queue']/prompt[1]"/>
      <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
      <if cond="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType] != /datamodel/data[@name='bargeInType']">
       <send event="BargeInChange" namelist="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
       <assign loc="/datamodel/data[@name='bargeInType']" expr="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
      </if>
     </onentry>
     <invoke targettype="player" srcexpr="/datamodel/data[@name='currentPrompt']/prompt"/>
     <finalize>
       <if cond="_eventData/MarkTime neq '-1'">
         <assign name="/datamodel/data[@name='markName']/" val="_eventData/markName.text()"/>
         <assign name="/datamodel/data[@name='markTime']/" val="_eventData/markTime.text()"/>
       </if>
     </finalize>
     <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[last()] le '1'" target="Idle">
      <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
     </transition>
     <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] neq 'true'" target="PlayingPrompt"/>
     <transition event="player.Done"
         cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state> <!-- end PlayingPrompt -->
    <state id="FetchAudio">
      <initial id="WaitFetchAudio"/>
      <transition event="player.Done" target="FetchAudioFinal"/>
      <state id="WaitFetchAudio">
        <onentry>
          <send target="self" event="fetchAudioDelay"
          delay="/datamodel/data[@name='queue']/prompt[1][@fetchaudiodelay]"/>
        </onentry>
       <transition event="fetchAudioDelay" target="StartFetchAudio"/>
       <transition event="cancelFetchAudio" target="FetchAudioFinal"/>
      </state>
     <state id="StartFetchAudio">
      <datamodel>
       <data name="fetchAudio"/>
      </datamodel>
      <onentry>
       <assign loc="/datamodel/data[@name='fetchAudio']" expr="/datamodel/data[@name='queue']/prompt[1]"/>
       <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
       <send target="self" event="fetchAudioMin" delay="/datamodel/data[@name='fetchAudio'][@fetchaudiominimum]"/>
       <send target="player" event="Play" namelist="/datamodel/data[@name='fetchAudio']"/>
       <if cond="/datamodel/data[@name='bargeInType'].text() ne 'fetchAudio'">
         <send event="BargeInChange" namelist="fetchAudio"/>
       </if>
      </onentry>
      <transition event="CancelFetchAudio" target="WaitFetchMinimum"/>
      <transition event="fetchAudioMin" target="WaitFetchCancel"/>
     </state>
     <state id="WaitFetchMinimum">
       <transition event="fetchAudioMin" target="FetchAudioFinal">
         <send target="player" event="halt"/>
       </transition>
     </state>
     <state id="WaitFetchCancel">
       <transition event="CancelFetchAudio" target="FetchAudioFinal">
         <send target="player" event="halt"/>
       </transition>
     </state>
     <state id="FetchAudioFinal" final="true" />
     <!-- could put cleanup handling here -->
    </state> <!-- end FetchAudio -->
   </state> <!-- end Created -->
</scxml>

5.2.3 Defined Events

The prompt queue resource can be controlled by means of the following events:

Table 2: Events received by prompt queue resource
Event Source Payload Sequencing Description
queuePrompt any prompt (M), properties(O) adds prompt to queue, but does not cause it to be played
queueFetchAudio any prompt (M) adds fetch audio to queue, removing any existing fetch audio from queue. Does not cause it to be played.
play any Causes any queued prompts or fetch audio to be played
changeParameter any paramName, newValue Sets the value of paramName to newValue, which may be either an absolute or relative value. The new setting takes effect immediately, even if there is already a prompt playing.
cancelFetchAudio any Deletes any queued fetch audio. Also cancels any fetch audio that is already playing, unless fetchAudioMin has been specified and not yet reached.
cancel any Immediately cancels any prompt or fetch audio that is playing and clears the queue.

The prompt queue resource returns the following events to its invoker:

Table 3: Events sent by prompt queue resource
Event Target Payload Sequencing Description
prompt.Done controller markName(O), markTime(O) Indicates prompt queue has played to completion and is now empty
bargeintypeChange controller one of: unbargeable, hotword, energy, fetchAudio sent at start of prompt play and whenever a new prompt or fetch audio is played whose bargeinType differs from the preceding one.

Issue (): Do we need 'fetchAudio' as a distinct bargein type? Resolution: None recorded.

5.3 Recognition Resources

Three types of recognition resources are defined: DTMF recognition for recognition of DTMF input, ASR recognition for recognition of speech input, and SIV for speaker identification and verification. Each recognition resource is associated with a device which implements their respective recognition services. Each device represents one or more actual recognizer instances. In case of a device implemented with multiple recognizers - for example two different speech recognition engines - it is the responsibility of the interpreter implementation to ensure that they adhere to the semantic model defined in this section.

DTMF and ASR recognition resources and SIV resources are semantically similar. They share the same state and eventing model as well as recognition processing, timing and result handling. However, the resources differ in the following respects:

  • Properties: the DTMF resource uses DTMF properties (vxml20, 6.3.3), while the ASR uses speech recognition properties (vxml20, 6.3.2).
  • Mode: the DTMF resource has the mode value 'dtmf' and the ASR resource has the value 'voice' (vxml20, inputmodes, 6.3.6)
  • Buffering: only the DTMF resource may buffer input when the resource is not active (e.g. in the FIA transition state, vxml20: 4.1.8).
  • SIV resources use voice models, not grammars.

Otherwise, these resources share the same semantic model.

If a resource controller activates both DTMF and ASR recognition resources, then that resource controller is responsible for managing the resources so that only a single recognition result is produced per recognition cycle. If a resource controller activates ASR and SIV resources, these may produce multiple results, timed either so that the results arrive within the same cycle or delivered independently.

5.3.1 Definition

The recognition resource works as follows: in its created state, grammars (or a voice model) are added to the resource and subsequently prepared on the recognition device. Recognition with these grammars (or voice model) can be activated and suspended, and recognition results are returned.

When the recognition resource is ready to recognize (at least one active grammar and/or voice model), one or more recognition cycles may occur in sequence.

  • A recognition cycle is initiated when the resource sends the device an event instructing it to listen to the input stream.
  • A recognition cycle is terminated if the device sends the resource an error event, or the device is instructed to stop recognition by the resource. When terminated, the device removes partially or wholly processed input from its buffer, and the resource awaits grammars (and/or a voice model) to prepare.
  • During the recognition cycle, the device may send events to the resource indicating ongoing recognition status, and recognition results describing one or more input sequences which match active grammars (and/or the voice model).
  • During the recognition cycle, the device may receive instructions to suspend recognition. When the device is suspended, input is not buffered and the device must not send any events until it receives instructions to re-start or terminate recognition.
  • When the resource receives recognition results from the device during the recognition cycle, it passes them to its controller. A recognition cycle is now complete and the resource awaits instructions either to start another recognition cycle or to terminate recognition.

Thus a recognition resource may enter multiple recognition cycles (as required for 'hotword' recognition), while requiring that a device, even if it has multiple instantiations, only produces one set of recognition results per recognition cycle.
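The cycle described in the bullets above can be summarized as a transition table. The following is a non-normative sketch; the state and event names follow the prose and the SCXML representation in 5.3.5, but the table form itself is purely illustrative.

```python
# Illustrative transition table for the recognition resource state model.
# An event not listed for a state leaves the state unchanged (the real
# model may instead treat some of these as errors).
TRANSITIONS = {
    ("idle", "prepare"): "preparingGrammars",
    ("preparingGrammars", "prepared"): "readyToRecognize",
    ("preparingGrammars", "notPrepared"): "idle",
    ("readyToRecognize", "listen"): "recognizing",
    ("readyToRecognize", "stop"): "idle",
    ("recognizing", "suspend"): "suspendedRecognizing",
    ("recognizing", "inputStarted"): "waitingForResults",
    ("recognizing", "stop"): "idle",
    ("suspendedRecognizing", "listen"): "recognizing",
    ("suspendedRecognizing", "stop"): "idle",
    ("waitingForResults", "recoResults"): "readyToRecognize",
    ("waitingForResults", "error"): "idle",
    ("waitingForResults", "stop"): "idle",
}

def step(state, event):
    """Return the next state for (state, event)."""
    return TRANSITIONS.get((state, event), state)
```

Because recoResults returns the resource to readyToRecognize rather than idle, a further 'listen' starts another cycle with the same active grammars, which is how multiple cycles (e.g. for 'hotword' recognition) arise.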

The recognition resource is defined in terms of a data model and state model.

The data model is composed of the following elements:

  • activeGrammars: an ordered list of grammars with which to recognize. Each item in the list contains the following information:
    • content: a URI or inline content to the grammar itself
    • properties: grammar-specific properties (vxml20: weight, mode, type, maxage, maxstale, etc)
    • listener: a resource controller associated with this grammar
  • or activeVoiceModel: a voice model with which to perform verification or identification. It contains the following information:
    • content: a URI to the voice model
    • properties: model-specific properties
    • listener: a resource controller associated with this voice model
  • properties: properties pertaining to the recognition process. These properties differ depending on the type of the recognition resource: for a DTMF recognition resource, the properties include DTMF properties; for ASR recognition resource, speech recognition properties; for SIV resource, SIV properties. The properties may also include platform-specific properties.
  • controller: the resource controller to which recognition status, results and error events are sent.
  • mode: the recognition resource's inputmode: 'voice' for an ASR recognition resource, 'dtmf' for a DTMF recognition resource, and 'siv' for an SIV resource.
Editorial note  

List of properties for active grammars needs to be aligned with the list of items sent in the AddGrammar event.

The state model is composed of states corresponding to functional state: idle, preparing grammars/preparing voice model, ready to recognize, recognizing, suspended recognizing and waiting for results.

In the idle state, the resource awaits events from resource controllers to activate grammars or a voice model for recognition on the device. The data model - activeGrammars or activeVoiceModel, properties, controller and mode - is (re-)initialized upon entry to this state: activeGrammars and activeVoiceModel are cleared, and properties and controller are set to null. If the resource receives an 'addGrammar' event, a new item is added to activeGrammars using the grammar, properties and listener data in the event payload. If the resource receives a 'prepare' event, it updates its data model with event data: 'properties' with the properties event data and 'controller' with the controller event data. Subsequent event notifications and responses are sent to the resource controller identified as the 'controller'. The recognition resource then moves into the preparing grammars (or preparing voice model) state.

In the preparing grammars state, the resource behavior depends on whether activeGrammars is empty or not. If activeGrammars is empty (i.e. no active grammars are defined for this recognition resource), the resource sends the controller a 'notPrepared' event and returns to the idle state. If activeGrammars is non-empty, the resource sends a 'prepare' event to the device. The event payload includes 'grammars' and 'properties' parameters. The 'grammars' value is an ordered list where each list item is a grammar's content and its properties extracted from activeGrammars. The order of grammars in the 'grammars' parameter must follow the order in the activeGrammars data model. If the device sends a 'prepared' event, the resource sends a 'prepared' event to the controller and transitions into the ready to recognize state.

In the preparing voice model state, the resource behavior depends on whether activeVoiceModel is empty or not. If activeVoiceModel is empty (i.e. no voice model is defined for this resource), the resource sends the controller a 'notPrepared' event and returns to the idle state. If activeVoiceModel is non-empty, the resource sends a 'prepare' event to the device. The event payload includes 'voicemodel' and 'properties' parameters. The 'voicemodel' value is a URI to the voice model, and its properties are extracted from activeVoiceModel. If the device sends a 'prepared' event, the resource sends a 'prepared' event to the controller and transitions into the ready to recognize state.

When the recognition resource is in the ready to recognize state, it may receive a 'stop' event. In this case, the resource sends a 'stop' event to the device and returns to the idle state. If the resource receives a 'listen' event, it sends a 'listen' event to the device and moves into the recognizing state.

When the resource is in the recognizing state, it can toggle between this state and the suspended recognizing state. If the resource receives a 'suspend' event, then it moves into the suspended recognizing state and sends the device a 'suspend' event which causes the device to suspend recognition and delete any buffered input. No input is buffered while the device is in a suspended state. If the resource then receives a 'listen' event, it moves back into the recognizing state.

When in the recognizing state, the resource may receive an 'inputStarted' event from the device, indicating that user input has been detected. The resource then moves into the waiting for results state. The device may send an 'error' event (for example, if the maximum time has been exceeded), causing the resource to return to the idle state and send the controller an 'error' event. Alternatively, the device may send a 'recoResults' event, which contains a results parameter, a data structure representing recognition results. In the case of DTMF or ASR, the results can be in VoiceXML 2.0 or EMMA format. For SIV, the results must be in EMMA format. The structure may contain zero or more recognition results. Each result must specify the grammar (or voice model) associated with the recognition (using the same grammar/voicemodel name as used in the payload of the 'prepare' event), its recognition confidence and its input mode. The resource sends its controller a 'recoResults' event with event data containing the device's results parameter together with a listener parameter whose value is the listener associated with the grammar of the first result with the highest confidence (if there are no results, then the listener parameter is not defined). The resource then returns to the ready to recognize state, awaiting either a 'stop' event to terminate recognition or a 'listen' event to start another recognition cycle using the same active grammars and recognition properties.
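The listener-selection rule above ("the listener associated with the grammar of the first result with the highest confidence") can be sketched as follows. This is a non-normative illustration; the dictionary field names are assumptions, not part of the results format.

```python
def select_listener(results):
    """Return the listener of the first result with the highest
    confidence, or None when there are no results (in which case the
    listener parameter is not defined)."""
    if not results:
        return None
    best = max(r["confidence"] for r in results)
    # Scan in document order so that, on a tie, the FIRST result with
    # the highest confidence wins.
    for r in results:
        if r["confidence"] == best:
            return r["listener"]
```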

5.3.2 Defined Events

A recognition resource is defined by the events it receives:

Table 6: Events received by recognition resource
Event Source Payload Sequencing Description
addGrammar any grammar (M), listener (M), properties (O) creates a grammar item composed of the grammar, listener and properties, and adds it to the activeGrammars
addVoiceModel any voicemodel (M), listener (M), properties (O) creates a VoiceModel item composed of the voice model, listener and properties, and adds it to the activeVoiceModel
prepare any controller (M), properties (M) prepares the device for recognition using activeGrammars or activeVoiceModel and properties
listen any initiates/resumes recognition
suspend any suspends recognition
stop any terminates recognition

and the events it sends:

Editorial note  

Need to add a prepareGrammar event (from the grammar RC).

Table 7: Events sent by recognition resource
Event Target Payload Sequencing Description
prepared controller one-of: prepared, notPrepared positive response to prepare (activeGrammars or activeVoiceModel prepared)
notPrepared controller one-of: prepared, notPrepared negative response to prepare (no activeGrammars or activeVoiceModel defined)
inputStarted controller notification that onset of input has been detected
inputFinished controller notification that the end of input has been detected
partialResult controller results (M), listener (O) notification of a partial recognition result
recoResult controller results (M), listener (O) notification of complete recognition result, including the results structure and a listener
error controller error status (M) notification that an error has occurred

5.3.5 SCXML Representation

<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data name="activeGrammars"/>
    <data name="properties"/>
    <data name="controller"/>
    <data name="mode"/>
  </datamodel>
  <state id="Created">
    <initial id="idle"/>
    <state id="idle">
      <onentry>
        <foreach var="node" nodeset="datamodel/data[@name='activeGrammars']">
          <delete loc="$node"/>
        </foreach>
        <assign loc="/datamodel/data[@name='properties']" val="null"/>
        <assign loc="/datamodel/data[@name='controller']" val="null"/>
        <assign loc="/datamodel/data[@name='mode']" val="voice"/>
      </onentry>
      <transition event="AddGrammar">
        <datamodel>
          <data name = "gram"/>
        </datamodel>
        <assign loc="/datamodel/data[@name='gram']/grammar" expr="_eventData/grammar" />
        <assign loc="/datamodel/data[@name='gram']/properties" expr="_eventData/properties" />
        <assign loc="/datamodel/data[@name='gram']/listener" expr="_eventData/listener" />
        <insert pos="after" name="/datamodel/data[@name='activeGrammars']" val="gram"/>
      </transition>
      <transition event="prepare" target="preparingGrammars">
        <assign loc="/datamodel/data[@name='properties']" expr="_eventData/properties"/>
        <assign loc="/datamodel/data[@name='controller']" expr="_eventData/controller"/>
      </transition>
    </state>    <!-- end idle -->
    <state id="preparingGrammars">
      <onentry>
        <if cond="isEmpty(/datamodel/data[@name='activeGrammars']) eq 'false'">
          <send target="device" event="dev:clear"/>
          <send target="device" event="dev:prepare" namelist="/datamodel/data[@name='activeGrammars'], /datamodel/data[@name='properties']"/>
        </if>
      </onentry>
      <transition cond="isEmpty(/datamodel/data[@name='activeGrammars']) eq 'true'" target="idle">
        <send target="controller" event="notPrepared"/>
      </transition>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
      <transition event="dev:prepared" target="readyToRecognize">
        <send target="controller" event="Prepared"/>
      </transition>
    </state>    <!-- end preparingGrammars -->
    <state id="readyToRecognize">
      <transition event="listen" target="recognizing" />
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end readyToRecognize -->
    <state id="recognizing">
      <onentry>
        <send target="device" event="dev:listen"/>
      </onentry>
      <transition event="suspend" target="suspendedRecognizing"/>
      <transition event="dev:inputStarted" target="waitingForResult"/>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end recognizing -->
    <state id="suspendedRecognizing">
      <onentry>
        <send target="device" event="dev:suspend"/>
      </onentry>
      <transition event="listen" target="recognizing"/>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end suspendedRecognizing -->
    <state id="waitingForResult">
      <onentry>
        <send target="controller" event="inputStarted"/>
      </onentry>
      <transition event="dev:inputFinished">
        <send target="controller" event="inputFinished"/>
      </transition>
      <transition event="dev:partialResult">
        <send target="controller" event="partialResult" namelist="_eventData/results,_eventData/grammar/listener"/>
      </transition>
      <transition event="dev:recoResults" target="readyToRecognize">
        <send target="controller" event="recoResult" namelist="_eventData/results,_eventData/grammar/listener"/>
      </transition>
      <transition event="dev:error" target="idle">
        <send target="controller" event="error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="stop" target="idle">
        <send target="device" event="dev:stop"/>
      </transition>
    </state>    <!-- end waitingForResult -->
  </state>  <!-- end Created -->
</scxml>

5.4 Connection Resource

5.5 Timer Resource

5.5.2 Defined Events

A timer resource is defined by the events it receives:

Table 12: Events received by timer resource
Event Source Payload Sequencing Description
start any owner (M), timeout (M), handle (O) Sending a start event to the timer resource starts a new timer that, after timeout, fires a timerExpired event to the owner given in the event. The handle must be used to correlate timers if cancel is supported.
cancel any owner (M), handle (O) The cancel event cancels any previous timeout that has been set with the timer resource and matches the handle and owner. If no previous timer was set (or the previous timer has already fired), cancel still succeeds; the semantics of cancel are that once cancel has succeeded, no further timerExpired event will be received.

and the events it sends:

Table 13: Events sent by timer resource
Event Target Payload Sequencing Description
timerExpired controller handle (O) timerExpired must precede any cancelSuccess This event means the timer has fired. The controller that receives the event is the owner given in the start event. If a handle was passed in the start event, it is passed back when the timer expires.
cancelSuccess controller handle (O) timerExpired must precede any cancelSuccess This event means that the timer in question is canceled and no new timerExpired events may be received.

It is possible to receive both a timerExpired event and then a cancelSuccess event, as the events may have crossed paths.
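The start/cancel semantics above, including cancel succeeding after the timer has already fired (the "crossed paths" case), can be modeled without real clocks. The following is a non-normative sketch in which the timeout elapsing is represented by an explicit fire() call; all names are illustrative.

```python
class TimerResource:
    """Illustrative model of the timer resource's event semantics.
    Timers are keyed by (owner, handle); cancel always succeeds and
    guarantees no further timerExpired for that key."""

    def __init__(self):
        self.timers = {}  # (owner, handle) -> timeout value
        self.sent = []    # events delivered to controllers, in order

    def start(self, owner, timeout, handle=None):
        # start: a new timer that will fire timerExpired to the owner
        self.timers[(owner, handle)] = timeout

    def fire(self, owner, handle=None):
        # Stands in for the timeout elapsing on a still-active timer.
        if self.timers.pop((owner, handle), None) is not None:
            self.sent.append(("timerExpired", owner, handle))

    def cancel(self, owner, handle=None):
        # cancel succeeds whether or not the timer exists or has fired.
        self.timers.pop((owner, handle), None)
        self.sent.append(("cancelSuccess", owner, handle))
```

In the crossed-paths case, fire() runs before cancel(), so the controller observes timerExpired followed by cancelSuccess, matching the sequencing constraint in Table 13.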

6 Modules

In VoiceXML 3.0, the language is partitioned into independent modules which can be combined in various ways. In addition to the modules defined in this section, it is also possible for third parties to define their own modules (see Section XXX).

Each module is assigned a schema, which defines its syntax, plus one or more Resource Controllers (RCs), which define its semantics, plus a "constructor" that knows how to create them from the syntactic representation at initialization time. Only DOM nodes that have schemas and constructors (and hence RCs) assigned to them can be modules in VoiceXML 3.0. However, we may choose to define constructors and RCs for nodes that are not modules. Nodes that do not have constructors and RCs ultimately depend on some module for their interpretation. (Those modules are usually ancestor nodes, but we do not require this.) There can be multiple modules associated with the same VoiceXML element. They may set properties differently, add different child elements, etc. In many cases, some of the modules will be extensions of the others, but we don't require this.

Note there is not necessarily a one-to-one relationship between semantic RCs and syntactic markup elements. It may take several RCs to implement the functionality of a single markup element.

6.1 Grammar Module

This module describes the syntactic and semantic features of a <grammar> element which defines grammars used in ASR and DTMF recognition. Grammars defined via this module are used by other modules.

The attributes and content model of <grammar> are specified in 6.1.1 Syntax. Its semantics are specified in 6.1.2 Semantics.

Editorial note  

Issue: Grammar processing will need to know the Base URI to resolve relative references.

6.1.1 Syntax

[See XXX for schema definitions].

6.1.1.1 Attributes

The <grammar> element has the attributes specified in Table 16.

Table 16: <grammar> Attributes
Name Type Description Required Default Value
mode The only allowed values are "voice" and "dtmf" Defines the mode of the grammar following the modes of the W3C Speech Recognition Grammar Specification [SRGS]. No The value of the document property "grammarmode"
weight Weights are simple positive floating point values without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or many digits. Specifies the weight of the grammar. See vxml2: Section 3.1.1.3 No 1.0
fetchhint One of the values "safe" or "prefetch" Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. No None
fetchtimeout Time Designation The interval to wait for the content to be returned before throwing an error.badfetch event. No None
maxage An unsigned integer Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. No None
maxstale An unsigned integer Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. No None
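The weight formats listed in the table ("n", "n.", ".n" and "n.n", no exponentials) can be checked with a simple pattern. This sketch is illustrative, not part of the specification:

```python
import re

# Accepts "n", "n." and "n.n" (first alternative) and ".n" (second);
# rejects signs, exponentials, and a bare ".".
WEIGHT_RE = re.compile(r"^(\d+\.?\d*|\.\d+)$")

def is_valid_weight(s):
    """Return True if s is a legal <grammar> weight value."""
    return bool(WEIGHT_RE.match(s))
```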

6.1.2 Semantics

The grammar RC is the primary RC for the <grammar> element.

6.1.2.1 Definition

The grammar RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this grammar RC
  • properties: weight attribute value, fetchtimeout, maxage, maxstale, charset, encoding, and language
  • mode: mode
  • fetchhint: fetchhint

The grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.

While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the grammar RC first initializes its child.

  • The values of the fetchtimeout attribute and the grammarfetchtimeout property (**REF**) are used to determine the fetchtimeout property value according to section XXXX.
  • The values of the maxage attribute and the grammarmaxage property (**REF**) are used to determine the maxage property value according to section XXXX.
  • The values of the maxstale attribute and the grammarmaxstale property (**REF**) are used to determine the maxstale property value according to section XXXX.
  • The values of the fetchhint attribute and the grammarfetchhint property (**REF**) are used to determine the fetchhint parameter value according to section XXXX.

Next, the language, charset, and encoding parameters are set to the values in effect at this point in the document. If the fetchhint parameter value is "prefetch", the RC sends the Prefetch event to the DTMF or ASR Recognizer resource, as appropriate (see below), with the following data: the child RC, fetchtimeout, maxage, maxstale. Finally, the RC sends the controller an 'initialized' event and transitions to the Ready state.
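The attribute/property determinations in the bullets above all follow the same shape: an attribute specified on the element takes precedence, otherwise the corresponding document property (e.g. grammarfetchtimeout) supplies the value. A minimal sketch, assuming that precedence rule (the exact resolution is deferred to the elided section reference, so this is an assumption, not normative):

```python
def resolve(attr_value, properties, prop_name, default=None):
    """Hypothetical attribute/property resolution: the element's
    attribute wins if present; otherwise the active document property;
    otherwise a platform default."""
    if attr_value is not None:
        return attr_value
    return properties.get(prop_name, default)
```

For example, a <grammar> with no fetchtimeout attribute would take the grammarfetchtimeout property value under this rule.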

In the Ready state, when the grammar RC receives an 'execute' event it transitions to the Executing state.

In the Executing state,

  • The values of the fetchtimeout attribute and the grammarfetchtimeout property (**REF**) are used to determine the fetchtimeout property value according to section XXXX.
  • The values of the maxage attribute and the grammarmaxage property (**REF**) are used to determine the maxage property value according to section XXXX.
  • The values of the maxstale attribute and the grammarmaxstale property (**REF**) are used to determine the maxstale property value according to section XXXX.

If the child RC is an External Grammar, the grammar RC sends an 'execute' event to the child RC and waits for it to complete.

Then, the grammar RC sends an AddGrammar event to the DTMF Recognizer Resource if mode="dtmf" or to the ASR Recognizer Resource if mode="voice", with the following as event data: the child RC, the fetchhint, language, charset, and encoding parameter values, and the controller RC (e.g., link, field, or form) as the handler for recognition results.

Finally, the grammar RC sends the controller an executed event and transitions to the Ready state.

Editorial note  

The currently-active value of fetchhint, fetchtimeout, maxage, and maxstale properties may be different at execution than at initialization, so the determination of these values should be done by the RCs that initialize or execute the grammar RC rather than by the grammar RC itself. The document text needs to be updated to reflect this. (From Nov 2009 f2f)

Initializing: Validate the behavior of sending a pointer to the child RC to the ASR resource. Is this acceptable, or do we need to extract the grammar data from the child RC and then send that data? The advantage of sending the RC pointer is that it makes clear what kind of grammar info it is -- inline SRGS or external reference.

Execute issues:

  • Note that a mismatch between the "mode" attribute value and any mode param returned in the media type cannot be detected at this stage because the document hasn't been fetched yet.
  • Still need to add a 'cond' capability as we have for prompts.
  • Should we allow explicit scope indication a la VoiceXML 2's "scope" attribute? How do we handle document-scoped grammars defined syntactically at a lower level? In this case should the handler be the controller RC or the controller for the document? Which RC actually executes the grammar RC?
  • How does 'as if by copy' change this for <link> grammars?

Editor will write new section 4.5 "Other" and subsections 4.5.1 "property/attribute resolution" and 4.5.2 "language resolution". Depending on the text, we may need to update the semantics to refer to section 4.5.2 when describing how xml:lang is used.

6.1.2.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="child"/>
    <data id="content"/> <!-- not used -->
    <data id="properties"/>
    <data id="mode"/>
    <data id="fetchhint"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <!-- ISSUE: Is this needed? -->
        <assign location="$controller" expr="null"/>
        <assign location="$child" expr="null"/>
        <assign location="$content" expr="null"/>
        <assign location="$properties/weight" expr="1.0"/>
        <assign location="$properties/fetchtimeout"/>
        <assign location="$properties/maxage"/>
        <assign location="$properties/maxstale"/>
        <assign location="$properties/charset"/>
        <assign location="$properties/encoding"/>
        <assign location="$properties/language"/>
        <assign location="$mode"/>
        <assign location="$fetchhint"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
        <assign location="$child" expr="_eventData/child"/>
      </transition>
    </state>    <!-- end Idle -->
    <state id="Initializing">
      <onentry>
        <!-- ISSUE: complete the initialization -->
        <assign location="$properties/weight" expr="1.0"/>
        <assign location="$properties/fetchtimeout" expr="InitDefault();"/>
        <assign location="$properties/maxage" expr="InitDefault();"/>
        <assign location="$properties/maxstale" expr="InitDefault();"/>
        <assign location="$properties/charset" expr="InitDefault();"/>
        <assign location="$properties/encoding" expr="InitDefault();"/>
        <assign location="$properties/language" expr="InitDefault();"/>
        <assign location="$mode" expr="InitDefault();"/>
        <assign location="$fetchhint" expr="InitDefault();"/>
        <send target="$child/controller" event="initialize"
          namelist="$child"/>
      </onentry>
      <transition event="Initializing.error" target="Idle">
        <send target="controller" event="initialize.error"
           namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" target="Ready">
        <if cond="$fetchhint eq 'prefetch'">
           <if cond="$mode eq 'voice'">
              <send target="ASRRecognizer" event="Prefetch"
                 namelist="$child $properties/fetchtimeout 
                           $properties/maxage $properties/maxstale"/>
           <else/>
              <send target="DTMFRecognizer" event="Prefetch"
                 namelist="$child $properties/fetchtimeout 
                           $properties/maxage $properties/maxstale"/>
           </if>
        </if>
        <send target="controller" event="initialized"/>
      </transition>
    </state>    <!-- end Initializing -->
    <state id="Ready">
      <transition event="execute" target="Executing"/>
    </state>    <!-- end Ready -->
    <state id="Executing">
      <onentry>
        <!-- ISSUE: Initialization function to be completed -->
        <assign location="$properties/fetchtimeout" expr="InitDefault();"/>
        <assign location="$properties/maxage" expr="InitDefault();"/>
        <assign location="$properties/maxstale" expr="InitDefault();"/>
    <!-- ISSUE: Add condition if $child is externalgrammar element -->
        <if cond="???">
          <send target="$child/controller" event="execute"
             namelist="$child"/>
        <else/>
           <!-- ISSUE: Missing in sendcontroller RC (e.g. link, field, etc) 
                as the handler for recognition results  -->
           <if cond="$mode eq 'voice'">
              <send target="ASRRecognizer" event="AddGrammar"
                 namelist="$child $fetchhint $properties/language
                           $properties/charset $properties/encoding"/>
           <else/>
              <send target="DTMFRecognizer" event="AddGrammar"
                 namelist="$child $fetchhint $properties/language
                           $properties/charset $properties/encoding"/>
           </if>
           <send target="controller" event="executed"/>
        </if>
      </onentry>
      <transition event="Executing.done">
         <!-- ISSUE: Missing in send controller RC (e.g. link, field, etc) 
              as the handler for recognition results  -->
         <if cond="$mode eq 'voice'">
            <send target="ASRRecognizer" event="AddGrammar"
               namelist="$child $fetchhint $properties/language
                         $properties/charset $properties/encoding"/>
         <else/>
            <send target="DTMFRecognizer" event="AddGrammar"
               namelist="$child $fetchhint $properties/language
                         $properties/charset $properties/encoding"/>
         </if>
         <send target="controller" event="executed"/>
      </transition>
    </state>    <!-- end Executing -->
  </state>
  <!-- end Created -->
</scxml>
Editorial note  
  • Is double initialization in Idle and again in Initializing necessary?
  • Initialization function should be expanded and specialized for different parameters
  • How to get charset, encoding and language?
  • xml:lang might become present in many elements including grammar
  • Base URI might be known at the point of grammar processing
  • consistent syntax to refer to $child
  • How to test if $child is 'externalgrammar' element? How to access DOM? How to access controllers for recognition results (i.e. link, field, etc)?
  • possible optimization to avoid duplicated <if> code in Executing

6.2 Inline SRGS Grammar Module

This module describes the syntactic and semantic features of inline SRGS grammars used in ASR and DTMF recognition.

Editorial note  

Issue: Do we need to support inline ABNF SRGS?

The attributes and content model of Inline SRGS grammars are specified in 6.2.1 Syntax. Their semantics are specified in 6.2.2 Semantics.

6.2.1 Syntax

[See XXX for schema definitions].

The syntax of the Inline SRGS Grammar Module is precisely all of the XML markup for a legal stand-alone XML form grammar as described in SRGS ( [SRGS] ), minus the XML Prolog. Note that both elements and attributes must be in the SRGS namespace (http://www.w3.org/2001/06/grammar).
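
For illustration, inline content of this module might look like the following hypothetical grammar (the URIs and rule names are assumptions; per this section, the markup is a legal stand-alone XML form SRGS grammar minus the XML Prolog, with elements and attributes in the SRGS namespace):

```xml
<!-- Hypothetical inline XML-form SRGS grammar. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="drink" mode="voice">
  <rule id="drink" scope="public">
    <one-of>
      <item>coffee</item>
      <item>tea</item>
    </one-of>
  </rule>
</grammar>
```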

6.2.2 Semantics

6.3 External Grammar Module

This module describes the syntactic and semantic features of an <externalgrammar> element which defines external grammars used in ASR and DTMF recognition.

Editorial note  

The name of this element is still under discussion.

The attributes and content model of <externalgrammar> are specified in 6.3.1 Syntax . Its semantics are specified in 6.3.2 Semantics .

6.3.1 Syntax

[See XXX for schema definitions].

6.3.1.1 Attributes

The <externalgrammar> element has the attributes specified in Table 23.

Table 23: <externalgrammar> Attributes
Name Type Description Required Default Value
src anyURI The URI specifying the location of the grammar and optionally a rulename within that grammar, if it is external. The URI is interpreted as a rule reference as defined in Section 2.2 of the Speech Recognition Grammar Specification [SRGS] but not all forms of rule reference are permitted from within VoiceXML. The rule reference capabilities are described in detail below this table. No
srcexpr A data model expression Equivalent to src, except that the URI is dynamically determined by evaluating the content as a data model expression. No
type A data model expression

The preferred media type of the grammar. A resource indicated by the URI reference in the src attribute may be available in one or more media types. The author may specify the preferred media-type via the type attribute. When the content represented by a URI is available in many data formats, a VoiceXML platform may use the preferred media-type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media-type to order the preferences in the negotiation.

The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'application/srgs+xml' document). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative.

Three special cases may arise. The declared media-type may not be supported by the processor; in this case, an error.unsupported.format is thrown by the platform. The declared media-type may be supported but the actual media-type may not match; an error.badfetch is thrown by the platform. Finally, there may be no declared media-type; the behavior depends on the specific URI scheme and the capabilities of the grammar processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616] , section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines. The following informative examples illustrate these cases, assuming in each case that the actual media-type is application/srgs+xml:

  • HTTP 1.1 request, resource owner returns 'text/plain': any preferred media-type in the grammar is not applicable, since the returned type takes precedence, so the declared media-type is 'text/plain'. An error.badfetch is thrown; the declared and actual types do not match.
  • HTTP 1.1 request, resource owner returns 'application/srgs+xml': the preferred media-type is again not applicable; the declared media-type is 'application/srgs+xml'. The declared and actual types match; success if application/srgs+xml is supported by the processor; otherwise an error.unsupported.format is thrown.
  • HTTP 1.1 request, no media-type returned, preferred media-type 'application/srgs+xml' appears in the grammar: the preferred media-type becomes the declared media-type. As above, the declared and actual types match; success if supported, otherwise error.unsupported.format.
  • Local file access, no media-type returned and no preferred media-type specified: there is no declared media-type. Behavior is scheme specific; the processor might introspect the document to determine the type.

(For the type attribute in Table 23: Required: No; Default Value: None.)
Editorial note  

Error messages for "type" attribute need to be updated.

See 6.3.1.2 Content Model for restrictions on occurrence of src and srcexpr attributes.

The value of the src attribute is a URI specifying the location of the grammar with an optional fragment for the rulename. Section 2.2 of the Speech Recognition Grammar Specification [SRGS] defines several forms of rule reference. The following are the forms that are permitted on the <externalgrammar> element in VoiceXML:

  • Reference to a named rule in an external grammar: src attribute is an absolute or relative URI reference to a grammar which includes a fragment with a rulename. This form of rule reference to an external grammar follows the behavior defined in Section 2.2.2 of [SRGS] . If the URI cannot be fetched or if the rulename is not defined in the grammar or is not a public (activatable) rule of that grammar then an error.badfetch is thrown.
  • Reference to the root rule of an external grammar: src attribute is an absolute or relative URI reference to a grammar but does not include a fragment identifying a rulename. This form implicitly references the root rule of the grammar as defined in Section 2.2.2 of [SRGS] . If the URI cannot be fetched or if the grammar cannot be referenced by its root (see Section 4.7 of [SRGS] ) then an error.badfetch is thrown.
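
The two permitted forms can be sketched as follows (the URIs and the rulename 'city' are hypothetical; the element name <externalgrammar> is as currently proposed in this draft):

```xml
<!-- Reference to a named rule in an external grammar: the fragment
     identifies a public (activatable) rule of that grammar. -->
<externalgrammar src="http://www.example.com/flight.grxml#city"/>

<!-- Reference to the root rule of an external grammar: no fragment,
     so the grammar's root rule is implicitly referenced. -->
<externalgrammar src="http://www.example.com/flight.grxml"
                 type="application/srgs+xml"/>

<!-- srcexpr variant: the URI is determined dynamically by evaluating
     a data model expression (the variable name is hypothetical). -->
<externalgrammar srcexpr="grammarURI"/>
```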

The following are the forms of rule reference defined by [SRGS] that are not supported in VoiceXML 3.0:

  • Local rule reference: a fragment-only URI is not permitted. (See definition in Section 2.2.1 of [SRGS] ). A fragment-only URI value for the src attribute causes an error.semantic event.
  • Reference to special rules: there is no support for special rule references (NULL, VOID, GARBAGE) on the <grammar> element itself. In the XML form of the SRGS specification, the only way to include NULL, VOID, and GARBAGE is via the use of the "special" attribute on the <ruleref> element. Thus, it is not possible to reference individual uses of NULL, VOID, and GARBAGE rules in a separate SRGS document, since that would require a fragment identifier to place on the end of the URI referencing the document, which in turn would require an id within that document for the given use of NULL, VOID, or GARBAGE. Note that the external grammar referenced from the <grammar> element may itself be an SRGS grammar that contains a <ruleref> element with a special attribute to reference NULL, VOID, or GARBAGE.

The <externalgrammar> element has the following co-occurrence constraints:

  • Exactly one of the "src" or "srcexpr" attributes must be specified; otherwise, an error.badfetch event is thrown.
Editorial note  

Editor: please remove the "otherwise, an error.badfetch ..." from the above and all other co-occurrence text and write general text somewhere describing what happens when a co-occurrence constraint is violated.

6.3.2 Semantics

6.4 Prompt Module

This module defines the syntactic and semantic features of a <prompt> element which controls media output. The content model of this element is empty: content is defined in other modules which extend this element's content model (for example 6.5 Builtin SSML Module , 6.6 Media Module and 6.7 Parseq Module ).

The attributes and content model of <prompt> are specified in 6.4.1 Syntax . Its semantics are specified in 6.4.2 Semantics , including how the final prompt content is determined and how the prompt is queued for playback using the PromptQueue Resource ( 5.2 Prompt Queue Resource ).

6.4.2 Semantics

The prompt RC is the primary RC for the <prompt> element.

6.4.2.1 Definition

The prompt RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this prompt RC
  • children: array of children's (primary) RC
  • count: count attribute value
  • cond: cond attribute expression
  • properties: bargein, bargeintype and timeout attribute values
  • xml:lang: xml:lang attribute value
  • xml:base: xml:base attribute value

The prompt RC's state model consists of the following states: Idle, Initializing, Ready, FormReady, and Executing. The initial state is the Idle state.

While in the Idle state, the prompt RC may receive an 'initialize' event, whose controller event data is used to update the data model. The prompt RC then transitions into Initializing state.

In the Initializing state, the prompt RC initializes its children: this is modeled as a separate RC (see XXX). The children may return an error for initialization. If a child sends an error, then the prompt RC returns an error. When all children are initialized, the prompt RC sends the controller an 'initialized' event and transitions to the Ready state.

In the Ready state, the prompt RC can receive a 'checkStatus' event to check whether this prompt is eligible for execution or not. The value of the cond parameter in its data model is checked against the data model resource: the status is true if the value of the cond parameter evaluates to true. The status, together with its count data, is sent in a 'checkedStatus' event to the controller RC. The controller RC then determines if the prompt is selected for execution ([vxml20: 4.1.6], see PromptSelectionRC, Section XXX). The prompt RC will then transition to the FormReady state. If the prompt RC receives an 'execute' event and the cond parameter evaluates to true, it transitions to the Executing state; if the cond parameter evaluates to false, it will send the controller the executed event and stay in the Ready state.

In the FormReady State, if the prompt RC receives a 'checkStatus' event, it will again check the cond parameter and send the 'checkedStatus' event to the controller RC as in the Ready State. In this state, if the RC receives an 'execute' event it transitions to the Executing state.

In the Executing state, the prompt RC sends an evaluate event to its children. Each child returns either an error, or content (which may include parameters) for playback. If a child sends an error, then the prompt RC returns an error. Once evaluation is complete, the RC sends a queuePrompt event to the Prompt Queue Resource with the <prompt> parameters (bargein, bargeintype, timeout) with event data consisting of the list of content returned by its children. The prompt RC then sends the controller an executed event and transitions to the Ready state.

Editorial note  

SSML validation issue: what if evaluation results in a non-valid structure?

6.4.2.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="properties"/>
    <data id="children"/>
    <data id="content"/>
    <data id="controller"/>
    <data id="count"/>
    <data id="cond"/>
    <data id="xml:lang"/>
    <data id="xml:base"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" expr="null"/>
        <assign location="$children" expr="null"/>
        <assign location="$content" expr="null"/>
        <assign location="$properties/bargein" expr="true"/>
        <assign location="$properties/bargeintype" expr="speech"/>
        <assign location="$properties/timeout" expr="5s"/>
        <assign location="$count" expr="1"/>
        <assign location="$cond" expr="true"/>
        <assign location="$xml:lang" expr=""/>
        <assign location="$xml:base" expr=""/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
        <assign location="$children" expr="_eventData/children"/>
      </transition>
    </state>    <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <assign location="$childcounter" expr="0"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize"
          namelist="$child/child"/>
        </foreach>
      </onentry>
      <!-- The conditional transition is listed first so that it is
           selected when the final child completes initialization. -->
      <transition event="Initializing.done" cond="$childcounter eq
      $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
      <transition event="Initializing.done">
        <assign location="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error" target="Idle">
        <assign location="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error"
        namelist="_eventData/error_status"/>
      </transition>
    </state>    <!-- end Initializing -->
    <state id="Ready">
      <datamodel>
        <data id="status"/>
      </datamodel>
      <transition event="checkStatus" target="FormReady">
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus"
        namelist="$status, $count"/>
      </transition>
      <transition event="execute" cond="checkStatus() eq 'true'"
      target="Executing"/>
      <transition event="execute" cond="checkStatus() eq 'false'">
        <send target="controller" event="executed"/>
      </transition>
    </state>    <!-- end Ready -->
    <state id="FormReady">
      <datamodel>
        <data id="status"/>
      </datamodel>
      <transition event="checkStatus">
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus"
        namelist="$status, $count"/>
      </transition>
      <transition event="execute" target="Executing"/>
    </state>    <!-- end FormReady -->
    <state id="Executing">
      <datamodel>
        <data id="counter"/>
        <data id="prompt"/>
      </datamodel>
      <onentry>
        <assign location="$counter" expr="0"/>
        <assign location="$prompt" expr="null"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="evaluateChild"/>
        </foreach>
      </onentry>
      <!-- The conditional transition is listed first so that it is
           selected when the final child completes evaluation. -->
      <transition event="Executing.done" cond="$counter eq
      $children.size()-1" target="Ready">
        <insert pos="after" name="$prompt" expr="_eventData/prompts"/>
        <send target="PromptQueue" event="queuePrompt"
        namelist="$prompt, $properties"/>
        <send target="controller" event="executed"/>
      </transition>
      <transition event="Executing.done">
        <assign location="$counter" expr="$counter + 1"/>
        <insert pos="after" name="$prompt" expr="_eventData/prompts"/>
      </transition>
      <transition event="Executing.error" target="Idle">
        <send target="controller" event="Executing.error"
        namelist="_eventData/error_status"/>
      </transition>
    </state>    <!-- end Executing -->
  </state>
  <!-- end Created -->
</scxml>

6.4.3 Events

The events in Table 32 may be raised during initialization and execution of the <prompt> element.

Table 32: <prompt> Events
Event Description State
error.unsupported.language indicates that an unsupported language was encountered. The unsupported language is indicated in the event message variable. execution
error.unsupported.element indicates that an element within the <prompt> element is not supported initialization
error.badfetch indicates that the prompt content is malformed ... initialization, execution
error.noresource indicates that a Prompt Queue resource is not available for rendering the prompt content. execution
error.semantic indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. execution
Editorial note  

The relationship between the user-visible events defined in the above table and the semantic event model has yet to be defined.

Can we really determine whether errors are raised in initialization (syntax) or execution (evaluation) states? How does this fit in with errors returned when prompts are played in PromptQueue player implementation? ACTION: Clarify which specific cases are affected by 'error.badfetch' ambiguity re. initialization versus execution states.

Clarify that error.semantic doesn't apply to evaluation of src/expr with <audio> (e.g. fallback).

Clarify that errors are recorded? (vxml21??)

Should media control properties (e.g. clipBegin, speed, etc) of <media> be also available on <prompt>?

We should clarify where the error.badfetch gets thrown. For instance, if we are loading a document with malformed prompt elements, the error.badfetch may get thrown back to the calling document. If we are throwing error.badfetch during execution, then it will be thrown back to the malformed document itself?

6.5 Builtin SSML Module

This module describes the syntactic and semantic features of SSML elements built into VoiceXML.

This module is designed to extend the content model of the <prompt> element defined in 6.4 Prompt Module .

The attributes and content model of SSML elements are specified in 6.5.1 Syntax . Its semantics are specified in 6.5.2 Semantics , including how elements are evaluated to yield final content for playback.

6.5.2 Semantics

When the RC receives an evaluate event, its children are evaluated in order to return a Conforming Stand-Alone Speech Synthesis Markup Language Document which can be processed by a Conforming Speech Synthesis Markup Language Processor.

Evaluation comprises the following steps:

  • Data model expressions are evaluated against the data model.
  • If a <foreach> element is present, it is evaluated so as to yield content for each defined item in the array.
  • If an <audio> element has a specified expr attribute, the attribute value is evaluated to provide a URI value for the src attribute. If the expr evaluation results in an ECMAScript undefined value, the <audio> element, including its alternate content, is ignored.
  • If a <value> element is present, its expr attribute is evaluated to return a CDATA value.
  • A <speak> element is constructed with appropriate version, namespace, and xml:lang attributes. The xml:lang attribute value is inherited from the <prompt> element (see 6.4 Prompt Module ).
  • If an unsupported language is encountered, the platform throws an error.unsupported.language event which specifies the language in its message variable.
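
As a sketch of the evaluation steps above, assume hypothetical data model variables flightcount (evaluating to 2) and welcomeURI (evaluating to ECMAScript undefined); the <prompt> content and the resulting SSML are illustrative only:

```xml
<!-- Before evaluation (hypothetical prompt content): -->
<prompt xml:lang="en-US">
  You have <value expr="flightcount"/> flights.
  <audio expr="welcomeURI">Welcome.</audio>
</prompt>

<!-- After evaluation: flightcount yields the CDATA value "2", and
     since welcomeURI evaluates to undefined, the <audio> element and
     its alternate content are ignored. The result is a stand-alone
     SSML document: -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  You have 2 flights.
</speak>
```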
Editorial note  

We may want to refine the description that the output of evaluation is an SSML Document. One rationale is that we do not want to mandate that SSML extensions are lost during evaluation. The output may be another Fragment rather than a Document.

Clarify exact nature of <audio> expr value for skipping - undefined vs. null?

Need to specify further error cases

Do these elements have RCs? They are in the VoiceXML namespace but are just enhanced SSML elements.

Need to clarify unsupported languages and external (e.g. MRCP) SSML processors.

6.6 Media Module

The media module defines the syntax and semantics of <media> element.

The module is designed to extend the content model of <prompt> in the prompt module ( 6.4 Prompt Module ).

The <media> element can be seen as an enhanced and generalized version of the VoiceXML <audio> element. It is enhanced in that it provides additional attributes describing the type of media, conditional selection, as well as control over playback. It is a generalization of the <audio> element in that it permits media other than audio to be played; for example, media formats which contain audio and video tracks.

6.6.1 Syntax

[See XXX for schema definitions].

6.6.1.1 Attributes

The <media> element has the attributes specified in Table 34.

Table 34: <media> Attributes
Name Type Description Required Default Value
src anyURI The URI specifying the location of the media source. No None
srcexpr A data model expression which evaluates to a URI indicating the location of the media resource. No undefined
cond A data model expression that must evaluate to true after conversion to boolean in order for the media to be played. No true
type

The preferred media type of the output resource. A resource indicated by the URI reference in the src or srcexpr attributes may be available in one or more media types. The author may specify the preferred media type via the type attribute. When the content represented by a URI is available in many data formats, a VoiceXML platform may use the preferred media-type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media-type to order the preferences in the negotiation.

The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'audio/x-wav' or 'video/3gpp' resource). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative.

Three special cases may arise.

  1. The declared media-type may not be supported by the processor. No error is thrown by the platform and the content of the media element is played instead.
  2. The declared media-type may be supported but the actual media-type may not match. No error is thrown by the platform and the content of the media element is played instead.
  3. Finally, there may be no declared media-type; the behavior depends on the specific URI scheme and the media capabilities of the VoiceXML processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616] , section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines.
(For the type attribute in Table 34: Required: No; Default Value: undefined.)
clipBegin Time Designation offset from start of media to begin rendering. This offset is measured in normal media playback time from the beginning of the media. No 0s
clipEnd Time Designation offset from start of media to end rendering. This offset is measured in normal media playback time from the beginning of the media. No None
repeatDur Time Designation total duration for repeatedly rendering media. This duration is measured in normal media playback time from the beginning of the media. No None
repeatCount positive real number The number of iterations of media to render. A fractional value describes a portion of the rendered media. No 1
soundLevel signed ("+" or "-") CSS2 numbers immediately followed by "dB" Decibel values are interpreted as a ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0) and are defined in terms of dB: soundLevel(dB) = 20 log10 (a1 / a0) A setting of a large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half the amplitude of its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice the amplitude of its current signal amplitude (subject to hardware limitations). The absolute sound level of media perceived is further subject to system volume settings, which cannot be controlled with this attribute. No +0.0dB
speed x% (where x is a positive real value) the speed at which to play the referenced media, relative to the original speed. The speed is set to the requested percentage of the speed of the original media. For audio, a change in the speed will change the rate at which recorded samples are played back and this will affect the pitch. No 100%
outputmodes space separated list of media types Determines the modes used for media output. See 8.2.4 Media Properties for further details. No outputmodes property

See occurrence constraints for restrictions on occurrence of src and srcexpr attributes.

Calculations of rendered durations and interaction with other timing properties follow SMIL 2.1 Computing the active duration where

  • <media> is a time container
  • Time Designation values for clipBegin, clipEnd, and repeatDur are a subset of SMIL Clock-value
  • If the length of a media clip is not known in advance then it is treated as indefinite. Consequently repeatCount will have no effect.
  • If clipEnd is after the end of the media, then rendering ends at the media end.
  • If clipBegin is after clipEnd, no audio will be produced.
  • If clipBegin equals clipEnd, the behavior depends upon the kind of media. For media with a concept of frames, such as video, the single frame at the beginning of clipBegin is rendered for repeatDur.
  • repeatDur takes precedence over repeatCount in determining the maximum duration of the media.

Note that not all SMIL 2.1 Timing features are supported.
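
As a worked example of the timing rules above (the URI is hypothetical), the following clip renders the interval from 2s to 6s of the recording twice, for a total rendered duration of 8s; had a smaller repeatDur also been specified, it would take precedence and cap that duration:

```xml
<!-- clipEnd - clipBegin = 4s per iteration; repeatCount="2"
     yields 2 x 4s = 8s of rendered media. -->
<media type="audio/x-wav" src="http://www.example.com/resource.wav"
       clipBegin="2s" clipEnd="6s" repeatCount="2"/>
```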

Editorial note  

Use SMIL 3.0 or SMIL 2.1 reference?

should trimming and media attributes also be defined in <prompt>?

do we need expr values for type, clipBegin, clipEnd, repeatDur, repeatCount, etc? (Perhaps add implied expr for every attribute?)

when is a property evaluation error thrown?

Add fetchtimeout, fetchhint, maxage and maxstale attributes

Major attribute candidate: errormode (flexible error handling which controls whether errors are thrown or fallback is used).

Other candidate attributes: id/idref (use case?)

6.6.1.2 Content Model

The <media> element content model consists of:

  • Inline content: SSML <speak> (0 or 1). Note that this content may include <value> and <foreach> elements from the VoiceXML namespace.
  • <desc> element (0 or more)
  • <media>: for fallback in the case where the resource referenced by the parent <media> element is unavailable (0 or more)
  • <property>: so media related properties can be set (0 or more)

The <media> has the following co-occurrence constraints:

  • One of the src attribute or the srcexpr attribute or inline content must be specified; otherwise, an error.badfetch event is thrown.

Note that the type attribute does not affect inline content. The handling of inline XML content is in accordance with the namespace of the root element (such as SSML <speak>, SMIL <smil>, and so forth). CDATA content, or mixed content with VoiceXML <foreach> or <value> elements, must be treated as an SSML Fragment and evaluated as described in 6.6.2 Semantics .

Editorial note  

Permit other types of inline content apart from SSML?

Are child <property> elements necessary? Alternative: extended <prompt> so that <property> children are allowed?

6.6.3 Examples

Playback of external audio media resource.

<media type="audio/x-wav" src="http://www.example.com/resource.wav"/>

Application of media operations to an audio resource. The soundLevel setting of '+6.0dB' plays the media at approximately twice its original signal amplitude, and the speed is reduced to 50%.

<media type="audio/x-wav" soundLevel="+6.0dB" speed="50%" 
       src="http://www.example.com/resource.wav"/>

Playback of 3GPP media resource.

<media type="video/3gpp" src="http://www.example.com/resource.3gp"/>

Playback of 3GPP media resource with the speed doubled and playback ending after 5 seconds.

<media type="video/3gpp" clipEnd="5s" speed="200%" 
       src="http://www.example.com/resource.3gp"/>

Playback of external SSML document.

<media type="application/ssml+xml" 
       src="http://www.example.com/resource.ssml"/>

Inline CDATA content with a <value> element

<media>
    Ich bin ein Berliner, said <value expr="speaker"/>
</media>

which is syntactically equivalent to

<media>
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner, said <value expr="speaker"/>
    </speak>
</media>

Inline SSML content to which gain and clipping operations are applied.

<media soundLevel="+4.0dB" clipBegin="4s">
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
    </speak>
</media>

Inline SSML with audio media fallback.

<media soundLevel="+4.0dB" clipBegin="4s">
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
    </speak>
    <media type="audio/x-wav" src="ichbineinberliner.wav"/>
</media>

6.7 Parseq Module

This module defines the syntax and semantics of <par> and <seq> elements. The <par> element specifies playback of media in parallel, while <seq> specifies playback in sequence.

The module is designed to extend the content model of the <prompt> element ( 6.4 Prompt Module ).

This module is dependent upon the media module ( 6.6 Media Module ).

With connections which support multiple media streams, it is possible to play back multiple media types simultaneously. For media container formats like 3GPP, audio and video media can be generated simultaneously from the same media resource.

There are established use cases for simultaneous playback of multiple media which are specified in separate resources:

  • Video mail: an audio message has been left using a conventional audio only system. For playback on a system with video support, a video resource can be played simultaneously with an image of the person, or an avatar.
  • Enterprise: a video stream resource from a security camera with TTS voiceover providing additional information.
  • Education: a video resource showing a medical procedure with commentary provided by a lecturer in the student's language.
  • Talking heads: an animated avatar together with audio or TTS voiceover.

The intention is to provide support for basic use cases where audio or TTS output from one resource can be complemented with output from another resource as permitted by the connection and platform capabilities.
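
A minimal sketch of the video-mail use case, using the <par> and <seq> elements of this module (the resource URIs are hypothetical):

```xml
<prompt>
  <par>
    <!-- Video of an avatar plays in parallel with the audio. -->
    <media type="video/3gpp" src="http://www.example.com/avatar.3gp"/>
    <!-- Two audio clips play in sequence within the parallel block. -->
    <seq>
      <media type="audio/x-wav" src="http://www.example.com/intro.wav"/>
      <media type="audio/x-wav" src="http://www.example.com/message.wav"/>
    </seq>
  </par>
</prompt>
```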

6.7.2 Semantics

Editorial note  

Issue: how should parallel playback interact with the PromptQueue resource? The simplest assumption would be that if this module is supported, then prompt queue needs to be able to handle parallel playback.

For example, when a bargein event occurs during parallel execution, synchronization between the prompt and, for example, video playback should be handled. This information should be explained in the prompt queue resource section.

This module requires a PromptQueue resource which supports playback of parallel and sequential media. The following defines its playback completion, termination and error handling.

Completion of playback of the <par> element is determined according to the value of its endsync attribute. For instance, assume a <par> element containing <media> (or <seq>) elements A and B, and that B finishes before A. If endsync has the value first, then completion is reported upon B's completion. If endsync has the value last, then completion is reported upon A's completion.
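
Assuming endsync is expressed as an attribute of <par> with the values described above (its exact syntax is not yet specified in this draft), the two completion behaviors could be written as:

```xml
<!-- Completion is reported as soon as the shorter child finishes. -->
<par endsync="first">
  <media type="video/3gpp" src="http://www.example.com/long.3gp"/>
  <media type="audio/x-wav" src="http://www.example.com/short.wav"/>
</par>

<!-- Completion is reported only when the longer child finishes. -->
<par endsync="last">
  <media type="video/3gpp" src="http://www.example.com/long.3gp"/>
  <media type="audio/x-wav" src="http://www.example.com/short.wav"/>
</par>
```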

Completion of playback of the <seq> element occurs when the last <media> is complete.

If the <par> element playback is terminated, then playback of its <media> and <seq> children is terminated. Likewise, if the <seq> element playback is terminated, then playback of its (active) <media> elements is terminated.

If mark information is provided by <media> elements (for example with SSML), then the mark information associated with the last element played in sequence or parallel is exposed as described in XXX.

Editorial note  

Open issue: Clarify interaction with VCR media control model(s).

The <reposition> approach would require that <par> and <seq> be able to restart from a specific position indicated by the markname/time of a <media> element contained within them.

The RTC approach would require that, for <par>, media operations be applied in parallel.

Error handling policy is inherited from the element of which the <par> or <seq> element is a child.

For instance if the policy is to ignore errors, then the following applies:

  • If an error occurs when playing a <media> element in <par>, then the error is ignored.
  • Likewise, if there is an error playing back a <media> element in <seq>, the error is ignored and the next <media> element in the sequence, if there is one, is played.
  • If the <media> element in which the error occurs is the final one in the <par> element, then completion of <par> playback is signaled when the error is detected.

If the policy is to terminate playback and report the error, then any error causes immediate termination of all playback and the error is reported.

If execution of the <par> and <seq> elements requires media capabilities which are not supported by the platform or the connection, or there is an error fetching or playing any <media> element within <par> or <seq>, then error handling follows the defined policy.

6.8 Foreach Module

This module describes the syntactic and semantic features of the <foreach> element.

This module is designed to extend the content model of an element in another module: for example, the SSML elements in 6.5 Builtin SSML Module , or the <prompt> element defined in 6.4 Prompt Module .

The attributes and content model of the element are specified in 6.8.1 Syntax . Its semantics are specified in 6.8.2 Semantics .

6.8.1 Syntax

[See XXX for schema definitions].

6.8.3 Examples

Editorial note  

These examples may be moved to the respective profile section later.

The vxml21 profile defines the content model for the <foreach> element so that it may appear in executable content and within <prompt> elements.

Within executable content, except within a <prompt>, the <foreach> element may contain any elements of executable content; this introduces basic looping functionality by which executable content may be repeated for each element of an array.
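For instance, within executable content the loop can repeat arbitrary executable elements for each member of an array (a sketch assuming a hypothetical dialog-scope array variable named cities):

```xml
<!-- Sketch: assumes a dialog-scope array variable named "cities". -->
<block>
  <foreach item="city" array="cities">
    <log>Known city: <value expr="city"/></log>
  </foreach>
</block>
```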

When <foreach> appears within a <prompt> element as part of Builtin SSML content, it may contain only those elements valid within <enumerate> (i.e. the same elements allowed within <prompt>, less <meta>, <metadata>, and <lexicon>); this allows for sophisticated concatenation of prompts.

In this example using Builtin SSML, each item in the array has an audio property with a URI value and a tts property with SSML content. The element loops through the array, playing the audio URI (or the SSML content as fallback), with a 300 millisecond break between iterations.

<prompt>
 <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
 </foreach>
</prompt>

In the mediaserver profile, <foreach> may occur within <prompt> elements and has a content model of zero or more <media> elements.

Play each media resource in the array.

  <foreach item="item" array="array">
   <media type="audio/x-wav" src="item.audio"/>
  </foreach>

Play each media resource in the array, falling back to SSML content when the audio cannot be played.

<foreach item="item" array="array">
  <media type="audio/x-wav" src="item.wav">
    <media type="application/ssml+xml">
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
        <value expr="item.tts"/>
        <break time="300ms"/>
      </speak>
    </media>
  </media>
</foreach>

6.9 Form Module

6.9.2 Semantics

6.9.2.1 Form RC

The Form RC is the primary RC for the <form> element.

The Form RC interacts with resource controllers of other modules so as to provide the behavior of the VoiceXML 2.1/2.0 <form> tag. Input and control form items are modeled as resource controllers: for example, the <field> RC ( 6.10.2.1 Field RC ) of the Field Module.

The behavior of the Form RC follows the VoiceXML FIA, although some aspects are not modeled directly in this RC: external transition handling is not part of the Form RC; input items use separate RCs to manage coordination between media resources, while recognition results can be received directly by form, field or other RCs.

[This initial version does not address all aspects of FIA behavior; for example, event handling, error handling and external transitions are not covered.]

6.9.2.1.1 Definition

The form RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this form RC
  • children: array of the children's (primary) RCs
  • activeItem: the current form item being executed
  • active: Boolean indicating whether this form is active. Default: false.
  • justFilled: array of child RCs which have just been filled
  • recoResult: the recognition result of the previously executed form item.
  • previousItem: the previous form item already executed.
  • nextItem: the next form item scheduled for execution.
  • modal: the modality of the current form item being executed.
  • id: the form identifier.

The form RC's state model consists of the following states: Idle, Initializing, Ready, SelectingItem, PreparingItem, PreparingFormGrammars, PreparingOtherGrammars, Executing, Active, ProcessingFormResult, Evaluating and Exit.

In the Idle state, the form RC can receive an 'initialize' event whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the RC creates a dialog scope in the Datamodel Resource and then initializes its children: this is modeled as a separate RC. When all children are initialized, the RC sends an 'initialized' event to its controller and transitions to the Ready state.

In the Ready state, the form RC sets its active status to false. It can receive one of two events: 'prepareGrammars' or 'execute'. A 'prepareGrammars' event indicates that another form is active, but this form's form-level grammars may be activated; an 'execute' event indicates that this form is active. If the RC receives a 'prepareGrammars' event, it transitions to the PreparingFormGrammars state. If the RC receives an 'execute' event, it sets its active data to true and transitions to the SelectingItem state.

In the SelectingItem state, the RC determines which form item to select as the active item. This is defined by a FormItemSelection RC which iterates over the children, sending each a 'checkStatus' event. If a child returns a true status (indicating that it is ready for execution), the activeItem is set to this child RC and the RC transitions to the PreparingItem state. If no child returns this status, then the RC is complete and transitions to the Exit state.

In the PreparingItem state, the activeItem is sent a 'prepare' event causing it to prepare itself; for example, the field RC prepares its prompts and grammars for execution. When the activeItem returns a 'prepared' event, the event data indicates whether the item is modal or not. If the item is modal, then the form RC transitions to the Executing state. If the item is not modal (other grammars can be activated), then the form RC transitions to the PreparingFormGrammars state.

In the PreparingFormGrammars state, the RC prepares form-level grammars. This is defined by a separate RC which iterates through and executes grammar children. When this is complete, the RC transitions to the Active state if the form is not active (per the active data parameter), and to the PreparingOtherGrammars state if the form is active.

In the PreparingOtherGrammars state, the RC sends a 'prepareGrammars' event to its controller RC (which in turn sends the event to the appropriate form, document and application level RCs with grammars). When it receives a 'prepared' event from its controller, the RC transitions to the Executing state.

In the Executing state, the form RC sends an 'execute' event to the active form item. If the form item is a field, then this causes prompts to be played and recognition to take place. The RC then transitions to the Active state awaiting a result.

In the Active state, the RC re-initializes the justFilled data to a new array and waits for a recognition result (as active or non-active form), or for a signal from its selected form item that it has received the recognition result. Recognition results are divided into two types: form item level results, received and processed by the form item; and form level results, which are received by the form RC that caused the grammar to be added. If a 'recoResult' event is received by the form RC, the RC transitions into the ProcessingFormResult state. If the active form item receives the recognition result (and has locally updated itself), then the form RC receives a 'formItemResult' event, adds the active item to the justFilled array, and transitions into the Evaluating state.

In the ProcessingFormResult state, the recognition result is processed by iterating through the form item children, obtaining their name and slotname, and then attempting to match the slotname to the results. If the match is successful, the child's name variable in the data model is updated with the value from the recognition result and the child is added to the justFilled data array. When this process is complete, the form RC transitions to the Evaluating state.
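As a hedged illustration in the VoiceXML 2.x syntax that the Form RC models (the grammar URI and slot names are hypothetical), a form-level grammar may fill several fields from a single utterance:

```xml
<!-- Sketch: flight.grxml and the slot names "from"/"to" are hypothetical. -->
<form id="travel">
  <grammar src="flight.grxml" type="application/srgs+xml"/>
  <field name="origin" slot="from">
    <prompt>Where are you leaving from?</prompt>
  </field>
  <field name="destination" slot="to">
    <prompt>Where are you going?</prompt>
  </field>
</form>
```

If a single recognition result carries both the "from" and "to" slots, the iteration described above updates both fields' variables and adds both field RCs to the justFilled array.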

In the Evaluating state, the form RC iterates through its children; if a child is a member of the justFilled array, it sends an 'evaluate' event to the form item RC, causing the appropriate filled RCs to be executed. If the child is a filled RC, then it is executed if appropriate. When evaluation is complete, the form RC transitions to the SelectingItem state so that the next form item can be selected for execution.

6.9.2.1.2 Defined Events
Table 38: Events received by <form> RC
Event Source Payload Description
initialize any controller(M) Update the data model
prepareGrammars controller Another form is active, but the current form's form-level grammars may be activated.
execute controller Current form is active

Table 39: Events sent by <form> RC
Event Target Payload Description
initialized controller Notification that initialization is complete
prepareGrammars controller Sent so that grammars are prepared at the appropriate form, document and application level RCs.
execute activeItem Causes the active form item to be executed.
6.9.2.1.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="activeItem"/>
    <data id="active"/>
    <data id="previousItem"/>
    <data id="nextItem"/>
    <data id="recoResult"/>
    <data id="name"/>
    <data id="JustFilled"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign loc="$controller" val="null"/>
        <assign loc="$children" val="null"/>
        <assign loc="$activeItem" val="null"/>
        <assign loc="$active" val="false"/>
        <assign loc="$previousItem" val="null"/>
        <assign loc="$nextItem" val="null"/>
        <assign loc="$recoResult" val="null"/>
        <assign loc="$name" val="null"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign loc="$controller" expr="_eventData/controller"/>
      </transition>
    </state>    <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <assign loc="$childcounter" val="0"/>
        <send target="datamodel" event="createScope" namelist="dialog"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize" namelist="$child/child"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done">
        <assign loc="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error">
        <assign loc="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" cond="$childcounter eq $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
    </state>    <!-- end Initializing -->
    <state id="Ready">
      <onentry>
        <assign loc="$active" val="false"/>
      </onentry>
      <transition event="execute" target="SelectingItem:FormItemSelection">
        <assign loc="$active" val="true"/>
      </transition>
      <transition event="prepareGrammars" target="PreparingFormGrammars"/>
    </state>    <!-- end Ready -->
    <state id="SelectingItem:FormItemSelection">
      <onentry>
        <send target="FormItemSelection" event="checkStatus" namelist="$children"/>
      </onentry>
      <transition event="SelectedFormItem.done" cond="activeItem eq 'null'" target="Exit"/>
      <transition event="SelectedFormItem.done" cond="activeItem neq 'null'" target="PreparingItem"/>
    </state>    <!-- end SelectingItem:FormItemSelection -->
    <state id="PreparingItem">
      <onentry>
        <send target="activeItem" event="prepare"/>
      </onentry>
      <transition event="prepared" cond="_eventData/modal eq 'true'" target="Executing"/>
      <transition event="prepared" cond="_eventData/modal eq 'false'" target="PreparingFormGrammars"/>
    </state>    <!-- end PreparingItem-->
    <state id="Exit">
      <onentry>
        <send target="datamodel" event="destroyScope" namelist="dialog"/>
        <send target="parent" event="done"/>
      </onentry>
    </state>    <!-- end Exit-->
    <state id="PreparingFormGrammars">
      <transition event="PrepareFormGrammars.done" cond="active eq 'true'" target="PreparingOtherGrammars"/>
      <transition event="PrepareFormGrammars.done" cond="active eq 'false'" target="Active">
        <send target="controller" event="prepared"/>
      </transition>
    </state>    <!-- end PreparingFormGrammars -->
    <state id="PreparingOtherGrammars">
      <onentry>
        <send target="controller" event="prepareGrammars"/>
      </onentry>
      <transition event="PrepareOtherGrammars.done" target="Executing"/>
    </state>    <!-- end PreparingOtherGrammars -->
    <state id="Executing">
      <onentry>
        <send target="activeItem" event="execute"/>
      </onentry>
      <transition event="Executing.done" target="Active"/>
    </state>    <!-- end Executing -->
    <state id="Active">
      <onentry>
        <assign loc="$JustFilled" expr="new Array()"/>
      </onentry>
      <transition event="fieldResult" target="Evaluating">
        <insert pos="after" name="$JustFilled" val="$activeItem"/>
      </transition>
      <transition event="PlayAndRecognize:RecogResult" target="ProcessingFormResult"/>
    </state>    <!-- end Active -->
    <state id="ProcessingFormResult">
      <onentry>
        <foreach var="child" array="$children">
          <if cond="$child.slotname eq _eventData/RecogResult/slotname">
            <assign loc="$name" expr="_eventData/RecogResult/name"/>
            <insert pos="after" name="$JustFilled" val="$child"/>
          </if>
        </foreach>
        <raise event="ProcessingFormResult.done"/>
      </onentry>
      <transition event="ProcessingFormResult.done" target="Evaluating"/>
    </state>    <!-- end ProcessingFormResult -->
    <state id="Evaluating">
      <onentry>
        <send target="activeItem" event="evaluate"/>
      </onentry>
      <transition event="Evaluating.done" target="SelectingItem:FormItemSelection"/>
    </state>    <!-- end Evaluating -->
  </state>  <!-- end Created -->
</scxml>

6.10 Field Module

6.10.2 Semantics

The semantics of field elements are defined using the following resource controllers: Field ( 6.10.2.1 Field RC ), PlayandRecognize ( 6.10.2.2 PlayandRecognize RC ), ...

6.10.2.1 Field RC

The Field Resource Controller is the primary RC for the field element.

6.10.2.1.1 Definition

The field RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this field RC
  • children: array of the children's (primary) RCs
  • includePrompts: Boolean indicating whether prompts are to be played. Default: true.
  • counter: prompt counter. Default: 1.
  • recoResult: the recognition result received by this field.
  • name: the name of the field's form item variable.
  • expr: expression providing the initial value of the form item variable.
  • cond: expression evaluated, together with name, to determine whether the field can be selected.
  • modal: Boolean indicating whether only the field's grammars are active when the field is executed.
  • slot: the name of the slot in the recognition result used to fill the form item variable.

The field RC's state model consists of the following states: Idle, Initializing, Ready, Preparing, Prepared, Executing and Evaluating.

While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the RC creates a variable in the Datamodel Resource: the variable name corresponds to the name in the RC's data model, and the variable value is set to the value of the RC's data model expr, if this is defined. The field RC then initializes its children: this is modeled as a separate RC (see XXX). When all children are initialized, the RC transitions to the Ready state.

In the Ready state, the field RC can receive a 'checkStatus' event to check whether it can be executed. The values of name and cond in its data model are checked: the status is true if the name variable is undefined and the value of cond evaluates to true. The status is returned in a 'checkedStatus' event sent back to the controller RC. If the RC receives a 'prepare' event, it updates includePrompts in its data model using the event data, and transitions to the Preparing state.

In the Preparing state, the field prepares its prompts and grammars. Prompts are prepared only if the includePrompts data is true; otherwise, prompts within the field are not prepared (e.g. field prompts aren't queued following a <reprompt>). Preparation of prompts is modeled as a separate RC (see XXX), as is preparation of grammars (see YYY). These RCs are summarized below.

Prompts are prepared by iterating through the children array. In the iteration, each prompt RC child is sent a 'checkStatus' event. If the prompt child returns true (its cond parameter evaluates to true), then it is added to a 'correct count' list together with its count. Once the iteration is complete, the RC determines the highest count on the 'correct count' list: the highest count among those on the list less than or equal to the current counter value. All children on the 'correct count' list whose count is not the highest count are removed. The RC then iterates through the 'correct count' list and sends an 'execute' event to each prompt RC, causing it to be queued on the PromptQueue Resource.
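This selection of the highest applicable count can be illustrated with tapered prompts in VoiceXML 2.x syntax (a sketch; the grammar URI is hypothetical):

```xml
<!-- Sketch: with the field's prompt counter at 3, both prompts pass
     checkStatus; the highest count <= 3 is 3, so only the second
     prompt is queued. city.grxml is hypothetical. -->
<field name="city">
  <prompt count="1">Which city?</prompt>
  <prompt count="3">Please say the name of a city, for example Paris.</prompt>
  <grammar src="city.grxml"/>
</field>
```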

Grammars are prepared by recursing through the children array and sending each grammar RC child an 'execute' event. The grammar RC then, if appropriate, sends an 'addGrammar' event to the DTMF or ASR Recognizer Resource where the grammar itself, its properties and the field RC is sent as the handler for recognition results.

When prompts and grammars have been prepared, the prompt counter is incremented and the field RC sends a 'prepared' event to its controller with event data indicating its modal status, and then transitions into the Prepared state.

In the Prepared state, the field RC may receive an 'execute' event from its controller. The RC sends an 'execute' event to the PlayAndRecognize RC ( 6.10.2.2 PlayandRecognize RC ), causing any queued prompts to be played and recognition to be initiated. In the event data, the controller is set to this RC, and other data is derived from data model properties. The RC transitions to the Executing state.

In the Executing state, the PlayAndRecognize RC must send recoResults (or error events: noinput, nomatch, error.semantic) to the field RC.

If the field RC receives the recoResults, then it updates its name variable in the Datamodel Resource. The field RC then sends a 'fieldResult' event to its controller indicating that a field result has been received and processed.

If the recoResult is received by the field RC's controller, then the field receives an 'evaluate' event which causes it to transition to the Evaluating state.

In the Evaluating state, the field RC iterates through its children executing each filled RC: this is modeled by a separate RC (see XXX). When evaluation is complete, the RC sends an 'evaluated' event to its controller and transitions to the Ready state.

6.10.2.1.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="children"/>
    <data id="counter"/>
    <data id="recoResult"/>
    <data id="cond"/>
    <data id="name"/>
    <data id="expr"/>
    <data id="modal"/>
    <data id="includePrompts"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" val="null"/>
        <assign location="$children" expr="new Array()"/>
        <assign location="$counter" val="1"/>
        <assign location="$recoResult" val="null"/>
        <assign location="$cond" val="null"/>
        <assign location="$expr" val="null"/>
        <assign location="$modal" val="false"/>
        <assign location="$includePrompts" val="true"/>
      </onentry>
      <transition event="initialize" target="Initializing">
        <assign location="$controller" expr="_eventData/controller"/>
      </transition>
    </state>
    <!-- end Idle -->
    <state id="Initializing">
      <datamodel>
        <data id="childcounter"/>
      </datamodel>
      <onentry>
        <if cond="$expr neq 'null'">
          <send target="datamodel" event="assign" namelist="$name, $expr"/>
        <else/>
          <send target="datamodel" event="create" namelist="$name"/>
        </if>
        <assign location="$childcounter" val="0"/>
        <foreach var="child" array="$children">
          <send target="$child/controller" event="initialize"/>
        </foreach>
      </onentry>
      <transition event="Initializing.done">
        <assign location="$childcounter" expr="$childcounter + 1"/>
      </transition>
      <transition event="Initializing.error">
        <assign location="$childcounter" expr="$childcounter + 1"/>
        <send target="controller" event="initialize.error" namelist="_eventData/error_status"/>
      </transition>
      <transition event="Initializing.done" cond="$childcounter eq $children.size()-1" target="Ready">
        <send target="controller" event="initialized"/>
      </transition>
    </state>
    <!-- end Initializing -->
    <state id="Ready">
      <transition event="checkStatus" >
        <assign location="$status" expr="checkStatus()"/>
        <send target="controller" event="checkedStatus" namelist="_eventData/status"/>
      </transition>
      <transition event="prepare" target="Preparing">
        <assign location="$includePrompts" expr="_eventData/includePrompts"/>
      </transition>
    </state>
    <!-- end Ready -->
    <state id="Preparing">
      <onentry>
        <if cond="$includePrompts eq 'true'">
          <send target="Prompts RC" event="initialize"/>
        </if>
        <send target="Grammars RC" event="initialize"/>
      </onentry>
      <transition event="preparing.done" target="Prepared">
        <send target="controller" event="prepared" namelist="modal"/>
      </transition>
    </state>
    <!-- end Preparing -->
    <state id="Prepared">
      <transition event="execute" target="Executing">
        <send target="PlayAndRecognize" event="execute" namelist="self, inputmodes"/>
      </transition>
    </state>
    <!-- end Prepared-->
    <state id="Executing">
      <datamodel>
        <data id="value"/>
      </datamodel>
      <transition event="playAndReco:recoResult">
        <assign location="$value" expr="processResults($name, slot, _eventData/result)"/>
        <send target="datamodel" event="assign" namelist="$name, $value"/>
        <send target="parent" event="fieldResult"/>
      </transition>
    </state>
    <!-- end Executing-->
    <state id="Evaluating">
      <onentry>
        <send target="filled RC" event="executeFilleds"/>
      </onentry>
      <transition event="evaluating.done" target="Ready">
        <send target="controller" event="evaluated"/>
      </transition>
    </state>
    <!-- end Evaluating-->
  </state>
  <!-- end Created -->
</scxml>
6.10.2.2 PlayandRecognize RC

The PlayandRecognize RC coordinates media input with Recognizer resources and media output with the PromptQueue Resource.

The following use cases are covered:

  1. Bargein is not active and bargeintype is speech. Prompts are played to completion and the user provides positive input, negative input or no input.
  2. Bargein is active and bargeintype is speech. Prompts are played to completion and the user provides positive input, negative input or no input.
  3. Bargein is active and bargeintype is speech. User interrupts prompts and the user provides positive input, negative input or no input.
  4. Bargein is not active and bargeintype is hotword. Prompts are played to completion and the user provides positive input, negative input or no input. User may provide a positive input after one or more negative inputs. The 'nomatch' event is never generated.
  5. Bargein is active and bargeintype is hotword. User interrupts prompts and the user provides positive input, negative input or no input. User may provide a positive input after one or more negative inputs. The 'nomatch' event is never generated.
  6. Prompt sequences alternating between bargein and no bargein.
  7. Prompt sequences alternating between speech and hotword bargeintype.
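A prompt sequence covering use case 6, alternating between bargein and no bargein, might look as follows in VoiceXML 2.x syntax (a sketch):

```xml
<!-- Sketch: the first prompt cannot be barged into; the second allows
     hotword bargein, so a 'nomatch' is never generated for it. -->
<prompt bargein="false">Please listen carefully to the following options.</prompt>
<prompt bargein="true" bargeintype="hotword">
  Say "operator" at any time to reach an agent.
</prompt>
```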
Editorial note  
Open issue: should we remove the possibility for alternating speech and hotword bargein modes within the recognition cycle?
6.10.2.2.1 Definition

The PlayandRecognize RC coordinates media input with recognition resources and media output with the PromptQueue Resource on behalf of a form item.

This RC activates prompt queue playback, activates recognition resources, manages bargein behavior and handles results from recognition resources.

The RC is defined in terms of a data model and a state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this RC
  • bargein: Boolean indicates whether bargein is active or not. Default: true.
  • bargeintype: indicates the type of bargein, if active. Default: speech.
  • inputmodes: active recognition input modes. Default: voice and dtmf.
  • inputtimeout: timeout to wait for input. Default: 0s. (Required since the prompt queue may be empty).
  • dtmfProps: DTMF properties
  • asrProps: Speech recognition properties
  • maxnbest: maximum number of nbest results. Default: 1.
  • recoActive: boolean indicating whether recognition is active. Default: false.
  • markname: string indicating current markname. Default: null
  • marktime: time designator indicating current marktime. Default: 0s.
  • recoResult: the recognition results, truncated to maxnbest.
  • recoListener: the listener associated with the best recognition result.
  • activeGrammars: Boolean indicating whether grammars are active. Default: false.

The RC's state model consists of the following states: idle, prepare recognition resources, start playing, playing prompts with bargein, playing prompts without bargein, recognizing with a timer, waiting for input, waiting for speech result and update results. The complexity of this model is partly a consequence of supporting the relationship between hotword bargein and recognition result processing.

While in the idle state, the RC may receive an 'execute' event, whose event data is used to update the data model. The event information includes: controller, inputmodes, inputtimeout, dtmfProps, asrProps and maxnbest. The RC then transitions to the prepare recognition resources state.

In the prepare recognition resources state, the RC sends 'prepare' events to the ASR and DTMF recognition resources. Both events specify this RC as the controller parameter, while the properties parameter differs. In this state, the RC can receive 'prepared' or 'notPrepared' events from either recognition resource. If neither resource returns a 'prepared' event, then activeGrammars is false (i.e. no active DTMF or speech grammar) and the RC sends an 'error.semantic' event to the controller and exits. If at least one resource returns a 'prepared' event, then the RC moves into the start playing state.

In the start playing state, the RC begins by sending the PromptQueue Resource a 'play' event. The PromptQueue responds with a 'playDone' event if there are no prompts in the prompt queue; as a result, this RC moves into the start recognizing with timer state. If there is at least one prompt in the queue, the PromptQueue sends this RC a 'playStarted' event whose data contains the bargein and bargeintype values for the first prompt, and the input timeout value for the last prompt in the queue. The data model is updated with this information.

Editorial note  
Open issue: PromptQueue Resource doesn't currently have playStarted event. If we don't add playStarted event, then is there a better way to get the bargein, bargeintype, and timeout information from the prompts in the PromptQueue?

Interaction with the recognizer during prompt playback is determined by the data model's bargein value. If bargein is true, then this RC transitions to the playing with bargein state. If bargein is false, the RC transitions to the playing without bargein state.
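In the style of the SCXML representations elsewhere in this document, this branching might be sketched as follows (state and event names are illustrative, not normative):

```xml
<!-- Sketch only: mirrors the prose above; names are illustrative. -->
<state id="StartPlaying">
  <onentry>
    <send target="PromptQueue" event="play"/>
  </onentry>
  <transition event="playDone" target="StartRecognizingWithTimer"/>
  <transition event="playStarted" cond="_eventData/bargein eq 'true'" target="PlayingWithBargein">
    <assign loc="$bargein" expr="_eventData/bargein"/>
    <assign loc="$bargeintype" expr="_eventData/bargeintype"/>
    <assign loc="$inputtimeout" expr="_eventData/timeout"/>
  </transition>
  <transition event="playStarted" cond="_eventData/bargein eq 'false'" target="PlayingWithoutBargein"/>
</state>
```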

Editorial note  
Open Issue: The event "bargeinChange" as a one way notification could pose a problem, as it takes finite time for recognizer to suspend or resume. This might work better if PromptQueue Resource waited for an event "bargeinChangeAck" (or similar) from PlayandRecognize RC before starting the next play. PlayandRecognize RC will send the event "bargeinChangeAck" after it completed suspend or resume action on the recognizer.

In the playing without bargein state, recognition is suspended if it has previously been activated (the recoActive parameter of the data model tracks this). Suspending recognition is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is suspended; if 'voice' is in inputmodes, then ASR recognition is suspended. In this state, the PromptQueue can report to this RC changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event with the value 'hotword' or 'speech' causes the data model parameter 'bargein' to be set to 'true' and the 'bargeintype' parameter to be updated with the event data value. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.

In the playing with bargein state, recognition is activated if it has not previously been activated (determined by the recoActive parameter in the data model). Activating recognition is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. In this state, the PromptQueue can report changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event whose event data value is not 'unbargeable' causes the data model 'bargeintype' parameter to be updated with the event data ('hotword' or 'speech'), while a 'bargeintypeChange' event whose event data value is 'unbargeable' causes the data model 'bargein' parameter to be set to false and the RC to transition to the playing without bargein state. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.

Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine whether the recognition result is positive or negative.

Editorial note
Further details on recognition processing to be added in later versions.

The recoResults data parameter is updated with the recognition results (truncated to maxnbest). A speech result is positive if and only if there is at least one result whose confidence level is equal to or greater than the recognition confidence level; otherwise the result is negative. DTMF results are always positive. The recoListener data parameter is defined as the listener associated with the best result if the result is positive.

If positive, the RC sends the PromptQueue a 'halt' event, and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.
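The positive/negative decision described above can be sketched as follows. This is a non-normative Python sketch; the function and parameter names (such as confidence_threshold) are illustrative and do not appear in the specification, and the result structure is a simplification.

```python
def classify_reco_results(results, confidence_threshold, maxnbest, inputmode="voice"):
    """Truncate results to maxnbest and decide positive vs. negative.

    Each result is modeled as a dict with a 'confidence' key (0.0-1.0).
    DTMF results are always positive, per the prose above.
    """
    # recoResults is updated with the results truncated to maxnbest.
    truncated = results[:maxnbest]
    if inputmode == "dtmf":
        return "positive", truncated
    # A speech result is positive iff at least one result meets the
    # recognition confidence level.
    positive = any(r["confidence"] >= confidence_threshold for r in truncated)
    return ("positive" if positive else "negative"), truncated
```

A positive result would then lead the RC to halt the PromptQueue and update results; a negative result re-issues 'listen' to the originating recognition resource.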

In the start recognizing with timer state, an input timer is started with the value of the inputtimeout data parameter and, if recognition is not already active (determined by the recoActive data parameter), recognition is activated. Recognition activation is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. The RC then transitions into the waiting for input state.

In the waiting for input state, the RC waits for user input. If it receives a 'timerExpired' event, then the RC sends a 'stop' event to all recognition resources, sends a 'noinput' event to its controller and exits. Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine if the recognition result is positive or negative. If positive, the RC cancels the timer, and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.

In the waiting for speech result state, the RC waits for a 'recoResult' event whose data is used to update the recoResult data parameter and to set the recoListener data parameter if the recognition result is positive. The RC then transitions to the update results state.

In the update results state, the RC sends 'assign' events to the data model resource, so that the lastresult object in application scope is updated with recognition results as well as markname and marktime information. If the recoListener data parameter is defined, then the RC sends a 'recoResult' event to the recognition listener RC; otherwise, it sends a 'nomatch' event to its controller. The RC then exits.

Editorial note  

Open issue: Behavior if one reco resource sends 'inputStarted' but other sends 'recoResults'? Race conditions between recognizers returning results? (This problem is inherent to the presence of two recognizers. For the sake of clear semantics, we could restrict only one recognizer to respond with 'inputStarted' and 'recoResults'. The other recognizer is always 'stopped'. But a better choice might be to have only one recognizer that handles both DTMF and speech, since semantically both recognizers are very similar.)

6.10.2.2.5 SCXML Representation
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data id="controller"/>
    <data id="bargein"/>
    <data id="bargeintype"/>
    <data id="inputmodes"/>
    <data id="inputtimeout"/>
    <data id="dtmfProps"/>
    <data id="asrProps"/>
    <data id="maxnbest"/>
    <data id="recoActive"/>
    <data id="markname"/>
    <data id="marktime"/>
    <data id="recoResult"/>
    <data id="recoListener"/>
    <data id="activeGrammars"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <state id="Idle">
      <onentry>
        <assign location="$controller" val="null"/>
        <assign location="$bargein" expr="true"/>
        <assign location="$bargeintype" val="speech"/>
        <assign location="$inputmodes" val="voice"/>
        <assign location="$inputtimeout" val="0s"/>
        <assign location="$dtmfProps" val="null"/>
        <assign location="$asrProps" val="null"/>
        <assign location="$maxnbest" val="1"/>
        <assign location="$recoActive" val="false"/>
        <assign location="$markname" val="null"/>
        <assign location="$marktime" val="0"/>
        <assign location="$recoResult" val="null"/>
        <assign location="$recoListener" val="null"/>
        <assign location="$activeGrammars" val="false"/>
      </onentry>
      <transition event="execute" target="PrepareRecognitionResources">
        <assign location="$controller" expr="_eventdata/controller"/>
        <assign location="$inputmodes" expr="_eventdata/modes"/>
        <assign location="$inputtimeout" expr="_eventdata/timeout"/>
        <assign location="$dtmfProps" expr="_eventdata/dtmfProps"/>
        <assign location="$asrProps" expr="_eventdata/asrProps"/>
        <assign location="$maxnbest" expr="_eventdata/maxnbest"/>
      </transition>
    </state>
    <!-- end Idle -->
    <state id="PrepareRecognitionResources">
      <transition target="StartPlaying" cond="$activeGrammars eq 'true'"/>
      <transition target="Exit" cond="$activeGrammars eq 'false'">
        <send target="controller" event="error.semantic"/>
      </transition>
    </state>
    <!-- end PrepareRecognitionResources -->
    <state id="StartPlaying">
      <onentry>
        <send target="PromptQueue" event="pq:play"/>
      </onentry>
      <transition event="pq:playStarted" cond="$bargein eq 'true'" target="PlayingWithBargein">
        <assign location="$bargein" expr="_eventdata/bargein"/>
      </transition>
      <transition event="pq:playStarted" cond="$bargein eq 'false'" target="PlayingWithoutBargein">
        <assign location="$bargein" expr="_eventdata/bargein"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer"/>
    </state>
    <!-- end StartPlaying -->
    <state id="PlayingWithoutBargein">
      <onentry>
        <if cond="$recoActive eq 'true'">
          <if cond="in('dtmf',$inputmodes) ">
            <send target="DTMFRecognizer" event="rec:suspend"/>
          </if>
          <if cond="in('voice',$inputmodes) ">
            <send target="ASRRecognizer" event="rec:suspend"/>
          </if>
        </if>
      </onentry>
      <transition event="bargeintypeChange" cond="_eventdata/value neq 'unbargeable'" target="PlayingWithBargein">
        <assign location="$bargein" val="true"/>
        <assign location="$bargeintype" expr="_eventdata/value"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer">
        <assign location="$markname" expr="_eventdata/markname"/>
        <assign location="$marktime" expr="_eventdata/marktime"/>
      </transition>
    </state>
    <!-- end PlayingWithoutBargein -->
    <state id="PlayingWithBargein">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <onentry>
        <if cond="in('dtmf',$inputmodes) ">
          <send target="DTMFRecognizer" event="rec:listen"/>
        </if>
        <if cond="in('voice',$inputmodes)">
          <send target="ASRRecognizer" event="rec:listen"/>
        </if>
        <assign location="$recoActive" val="true"/>
      </onentry>
      <transition event="bargeintypeChange" cond="_eventdata/value neq 'unbargeable'">
        <assign location="$bargeintype" expr="_eventdata/value"/>
      </transition>
      <transition event="rec:recoResult">
        <assign location="$negorpos" expr="processRecoResult()"/>
        <send target="parent" event="negorpos"/>
      </transition>
      <transition event="negativeRecoResult">
        <send target="rec_source" event="listen"/>
      </transition>
      <transition event="pq:playDone" target="StartRecognizingWithTimer">
        <assign location="$markname" expr="_eventdata/markname"/>
        <assign location="$marktime" expr="_eventdata/marktime"/>
      </transition>
      <transition event="positiveRecoResult">
        <send target="PromptQueue" event="pq:halt"/>
      </transition>
      <transition event="rec:inputStarted" cond="$bargeintype eq 'speech'">
        <send target="PromptQueue" event="pq:halt"/>
      </transition>
    </state>
    <!-- end PlayingWithBargein -->
    <state id="StartRecognizingWithTimer">
      <onentry>
        <send target="Timer" event="start" namelist="$inputtimeout"/>
        <if cond="$recoActive eq 'false'">
          <if cond="in('dtmf',$inputmodes) ">
            <send target="DTMFRecognizer" event="rec:listen"/>
          </if>
          <if cond="in('voice',$inputmodes)">
            <send target="ASRRecognizer" event="rec:listen"/>
          </if>
          <assign location="$recoActive" val="true"/>
        </if>
      </onentry>
      <transition target="WaitingForInput"/>
    </state>
    <!-- end StartRecognizingWithTimer-->
    <state id="WaitingForInput">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <transition event="rec:recoResult">
        <assign location="$negorpos" expr="processResults()"/>
        <send target="parent" event="negorpos"/>
      </transition>
      <transition event="negativeRecoResult">
        <send target="rec_source" event="listen"/>
      </transition>
      <transition event="timerExpired">
        <send target="Recognizer" event="rec:stop"/>
        <send target="controller" event="noinput"/>
      </transition>
      <transition event="rec:inputStarted" cond="$bargeintype eq 'speech'" target="WaitingForSpeechResult">
        <send target="Timer" event="cancel"/>
      </transition>
    </state>
    <!-- end WaitingForInput-->
    <state id="WaitingForSpeechResult">
      <datamodel>
        <data id="negorpos"/>
      </datamodel>
      <!--TBD: the original diagram seems to put the event in the wrong place-->
      <transition event="rec:recoResult" target="UpdateResults">
        <assign location="$negorpos" expr="processResults()"/>
        <send target="parent" event="negorpos"/>
      </transition>
    </state>
    <!-- end WaitingForSpeechResult-->
    <state id="UpdateResults">
      <onentry>
        <send target="datamodel" event="assign" namelist="application, lastresult$, recoResults"/>
        <if cond="$recoListener neq 'null'">
          <send target="recoListener" event="recoResult" namelist="recoResults"/>
          <else/>
          <send target="controller" event="nomatch"/>
        </if>
      </onentry>
      <transition target="Exit"/>
    </state>
    <!-- end UpdateResults-->
  </state>
  <!-- end Created -->
</scxml>

6.11 Builtin Grammar Module

6.11.3 Syntax and Semantics

Builtin grammars may be specified in one of two ways:

  • Through the "type" attribute on the <field> tag (see Table 41). For example, <field type="boolean">. Note that when a builtin is specified in this way, it is in addition to any <grammar> elements under the <field>.
  • Using the "builtin" protocol for the grammar URI. For example: <grammar src="builtin:boolean"/>

Each builtin type has a convention for the format of the value returned. These are independent of language and of the implementation. The return type for builtin fields is a string except for the boolean field type. To access the actual recognition result, the author can reference the <field> shadow variable "name$.utterance". Alternatively, the developer can access application.lastresult$, where application.lastresult$.interpretation has the same string value as application.lastresult$.utterance.

Table 49: Builtin Grammar Types
Type Description
boolean Inputs include affirmative and negative phrases appropriate to the current language. DTMF 1 is affirmative and 2 is negative. The result is ECMAScript true for affirmative or false for negative. The value will be submitted as the string "true" or the string "false". If the field value is subsequently used in <say-as> with the interpret-as value "vxml:boolean", it will be spoken as an affirmative or negative phrase appropriate to the current language.
date Valid spoken inputs include phrases that specify a date, including a month day and year. DTMF inputs are: four digits for the year, followed by two digits for the month, and two digits for the day. The result is a fixed-length date string with format yyyymmdd, e.g. "20000704". If the year is not specified, yyyy is returned as "????"; if the month is not specified mm is returned as "??"; and if the day is not specified dd is returned as "??". If the value is subsequently used in <say-as> with the interpret-as value "vxml:date", it will be spoken as a date phrase appropriate to the current language.
digits Valid spoken or DTMF inputs include one or more digits, 0 through 9. The result is a string of digits. If the result is subsequently used in <say-as> with the interpret-as value "vxml:digits", it will be spoken as a sequence of digits appropriate to the current language. A user can say for example "two one two seven", but not "twenty one hundred and twenty-seven". A platform may support constructs such as "two double-five eight".
currency Valid spoken inputs include phrases that specify a currency amount. For DTMF input, the "*" key will act as the decimal point. The result is a string with the format UUUmm.nn, where UUU is the three character currency indicator according to ISO standard 4217 [ISO4217], or mm.nn if the currency is not spoken by the user or if the currency cannot be reliably determined (e.g. "dollar" and "peso" are ambiguous). If the field is subsequently used in <say-as> with the interpret-as value "vxml:currency", it will be spoken as a currency amount appropriate to the current language.
number Valid spoken inputs include phrases that specify numbers, such as "one hundred twenty-three", or "five point three". Valid DTMF input includes positive numbers entered using digits and "*" to represent a decimal point. The result is a string of digits from 0 to 9 and may optionally include a decimal point (".") and/or a plus or minus sign. ECMAScript automatically converts result strings to numerical values when used in numerical expressions. The result must not use a leading zero (which would cause ECMAScript to interpret as an octal number). If the field is subsequently used in <say-as> with the interpret-as value "vxml:number", it will be spoken as a number appropriate to the current language.
phone Valid spoken inputs include phrases that specify a phone number. DTMF asterisk "*" represents "x". The result is a string containing a telephone number consisting of a string of digits and optionally containing the character "x" to indicate a phone number with an extension. For North America, a result could be "8005551234x789". If the field is subsequently used in <say-as> with the interpret-as value "vxml:phone", it will be spoken as a phone number appropriate to the current language.
time Valid spoken inputs include phrases that specify a time, including hours and minutes. The result is a five character string in the format hhmmx, where x is one of "a" for AM, "p" for PM, "h" to indicate a time specified using 24 hour clock, or "?" to indicate an ambiguous time. Input can be via DTMF. Because there is no DTMF convention for specifying AM/PM, in the case of DTMF input, the result will always end with "h" or "?". If the field is subsequently used in <say-as> with the interpret-as value "vxml:time", it will be spoken as a time appropriate to the current language.
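The fixed-length result format of the "date" builtin can be illustrated with a small non-normative sketch; the helper name date_result is invented for illustration, and unspecified components are filled with "?" characters as the table describes.

```python
def date_result(year=None, month=None, day=None):
    """Format a (possibly partial) date as the builtin yyyymmdd string.

    Unspecified parts become "????" (year) or "??" (month/day).
    """
    y = "%04d" % year if year is not None else "????"
    m = "%02d" % month if month is not None else "??"
    d = "%02d" % day if day is not None else "??"
    return y + m + d
```

For example, a fully specified Fourth of July 2000 yields "20000704", while a date with no year yields "????0704".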

Both the "boolean" and "digits" types can be parameterized as follows:

Table 50: Digit and Boolean Grammar Parameterization
digits?minlength=n A string of at least n digits. Applicable to speech and DTMF grammars. If minlength conflicts with either the length or maxlength attributes then an error.badfetch event is thrown.
digits?maxlength=n A string of at most n digits. Applicable to speech and DTMF grammars. If maxlength conflicts with either the length or minlength attributes then an error.badfetch event is thrown.
digits?length=n A string of exactly n digits. Applicable to speech and DTMF grammars. If length conflicts with either the minlength or maxlength attributes then an error.badfetch event is thrown.
boolean?y=d A grammar that treats the keypress d as an affirmative answer. Applicable only to the DTMF grammar.
boolean?n=d A grammar that treats the keypress d as a negative answer. Applicable only to the DTMF grammar.

Note that more than one parameter may be specified separated by the ";" character. This is illustrated in the last example below.
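The type?param=value;param=value form described above can be illustrated with a small non-normative parsing sketch. The helper name parse_builtin is invented for illustration; real platforms resolve builtin URIs internally.

```python
def parse_builtin(ref):
    """Split a builtin reference like "digits?minlength=3;maxlength=5"
    into a type name and a dict of parameters.

    Multiple parameters are separated by ";" as noted above.
    """
    type_name, _, query = ref.partition("?")
    params = {}
    if query:
        for pair in query.split(";"):
            key, _, value = pair.partition("=")
            params[key] = value
    return type_name, params
```

Under this sketch, "boolean?y=7;n=9" yields the type "boolean" with parameters y=7 and n=9, while an unparameterized reference such as "boolean" yields an empty parameter set.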

6.11.4 Examples

A <field> element with a builtin grammar type. In this example, the boolean type indicates that inputs are various forms of true and false. The value actually put into the field is either true or false. The field would be read out using the appropriate affirmative or negative response in prompts.

            <field name="lo_fat_meal" type="boolean">
                
                <prompt>
                    Do you want a low fat meal on this flight?
                </prompt>
                <help>
                    Low fat means less than 10 grams of fat, and under
                    250 calories.
                </help>
                <filled>
                    <prompt>
                        I heard <emphasis><say-as interpret-as="vxml:boolean">
                            <value expr="lo_fat_meal"/></say-as></emphasis>.
                    </prompt>
                </filled>
            </field>            
           

In the next example, digits indicates that input will be spoken or keyed digits. The result is stored as a string, and rendered as digits using the <say-as> with "vxml:digits" as the value for the interpret-as attribute, i.e., "one-two-three", not "one hundred twenty-three". The <filled> action tests the field to see if it has 12 digits. If not, the user hears the error message.

                <field name="ticket_num" type="digits">
                    <prompt>
                        Read the 12 digit number from your ticket.
                    </prompt>
                    <help>The 12 digit number is to the lower left.</help>
                    <filled>
                        <if cond="ticket_num.length != 12">
                            <prompt>
                                Sorry, I didn't hear exactly 12 digits.
                            </prompt>
                            <assign name="ticket_num" expr="undefined"/>
                            <else/>
                            <prompt>I heard <say-as interpret-as="vxml:digits"> 
                                <value expr="ticket_num"/></say-as>
                            </prompt>
                        </if>
                    </filled>
                </field>            
           

The builtin boolean grammar and builtin digits grammar can be parameterized. This is done by explicitly referring to builtin grammars using a platform-specific builtin URI scheme and using a URI-style query syntax of the form type?param=value in the src attribute of a <grammar> element, or in the type attribute of a <field>. In this example, the <grammar> parameterizes the builtin DTMF grammar; the first <field> parameterizes the builtin DTMF grammar (the speech grammar will be activated as normal); and the second <field> parameterizes both the builtin DTMF and speech grammars. Parameters which are undefined for a given grammar type will be ignored; for example, "builtin:grammar/boolean?y=7".

            <grammar src="builtin:dtmf/boolean?y=7;n=9"/>
            
            <field type="boolean?y=7;n=9">
                <prompt>
                    If this is correct say yes or press seven, if not, say no or press nine.
                </prompt>
            </field>
            
            <field type="digits?minlength=3;maxlength=5">
                <prompt>Please enter your passcode</prompt>
            </field>            
       

6.12 Data Access and Manipulation Module

6.12.2 Semantics

The semantics of Data Access and Manipulation can be described in terms of the various scopes in VoiceXML 3.0, the relevance to platform properties, the corresponding implicit variables that platforms must support, the variable resolution mechanism, standard session and application variables, and the set of legal data values and expressions.

6.12.2.1 The scope stack

Access to data is controlled by means of scopes, which are conceptually stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "session" is created. Thereafter scopes are explicitly created and destroyed by the data model resource's clients as necessary. Likewise, during the lifetime of each scope, data is added, read, updated and deleted by the data model resource's clients as necessary.

Implementation note: The API is defined in 5.1.1 Data Model Resource API .

At any given point in time, based on the VoiceXML document structure and the execution state, the stack may contain the following scopes whose semantics are described in VoiceXML 3.0 as follows (bottom to top):

Table 51: Variable Scopes
session These are read-only variables that pertain to an entire user session. They are declared and set by the interpreter context. New session variables cannot be declared by VoiceXML documents.
application These are declared with <var> and <script> elements that are children of the application root document's <vxml> element. They are initialized when the application root document is loaded. They exist while the application root document is loaded, and are visible to the root document and any other loaded application leaf document. Note that while executing inside the application root document, document.x is equivalent to application.x.
document These variables are declared with <var> and <script> elements that are children of the document's <vxml> element. They are initialized when the document is loaded. They exist while the document is loaded. They are visible only within that document, unless the document is an application root, in which case the variables are visible by leaf documents through the application scope only.
dialog Each dialog (<form> or <menu>) has a dialog scope that exists while the user is visiting that dialog, and which is visible to the elements of that dialog. Dialog scope contains the following variables: variables declared by <var> and <script> child elements of <form>, form item variables, and form item shadow variables. The child <var> and <script> elements of <form> are initialized when the form is first visited, as opposed to <var> elements inside executable content which are initialized when the executable content is executed.
(anonymous) Each <block>, <filled>, and <catch> element defines a new anonymous scope to contain variables declared in that element.
6.12.2.2 Relevance of scope stack to properties

Properties are discussed in detail in 8.2 Properties . Properties may be defined for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Thus, access to properties is also controlled by means of the same scope stack that is used by named variables.

VoiceXML 3.0 provides a consistent mechanism to unambiguously read these properties in any scope using the data access and manipulation language in a manner similar to accessing and manipulating named variables. This is described in the two sections below.

6.12.2.3 Implicit variables

VoiceXML 3.0 provides several implicit variables in the data access and manipulation language to unambiguously identify the various scopes in the scope stack. Whenever the corresponding scopes are available, they can be referenced under specific names, which are always the same regardless of the location in the VoiceXML document. Additionally, an implicit variable "properties$" is available in each scope which points to the defined properties for that scope.

Table 52: Implicit Variables
session This implicit variable refers to the session scope.
application This implicit variable refers to the application scope.
document This implicit variable refers to the document scope.
dialog This implicit variable refers to the dialog scope.
properties$ This read-only implicit variable refers to the defined properties which affect platform behavior in a given scope. The value is an ECMAScript object with multiple ECMAScript properties as necessary, where each ECMAScript property has the name of an existing platform property in that scope and a value corresponding to the value of that platform property.

Note that in some data access expression languages (such as XPath), it may be necessary to expose the semantics of implicit variables as expression language functions instead of variables.

Also note that there is no implicit variable corresponding to the anonymous scope since it is not necessary given the variable resolution mechanism described in the next section. Where scope qualifiers are functions, a function to identify the anonymous scope may be necessary.

Finally, the use of the "properties$" implicit variable in VoiceXML 3.0 means that the variable "properties$" is now reserved in all scopes with the semantics described above.

6.12.2.4 Variable resolution

This section describes how named variables are resolved in VoiceXML 3.0. Named variables in expressions may be scope-qualified (using implicit variables) or scope-unqualified.

Some examples of scope-qualified variables that may occur in expressions are listed in the table below.

Table 53: Resolution examples (ECMAScript)
Expression Result
application.hello The value of the "hello" named variable in the application scope.
dialog.retries The value of the "retries" named variable in the dialog scope.
dialog.properties$.bargein The value of the "bargein" platform property defined at the current "dialog" scope.

The above table assumes that all the named variables used in the expressions exist. If any of the named variables do not exist, an error.semantic will result.

In cases where the named variables are unqualified, i.e., there is no implicit variable indicating the scope in use, the following variable resolution mechanism is used:

  • The anonymous scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the dialog scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the document scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the application scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, the session scope is checked for the named variable, and its value is returned if the variable is found
  • Otherwise, an error.semantic is thrown

The steps corresponding to any scopes that do not exist at the time of expression evaluation are ignored. The resolution mechanism begins with the closest enclosing scope in the given document structure.
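The resolution walk above can be sketched as follows. This is a non-normative Python sketch: the ScopeStack class and its method names are invented for illustration, and throwing error.semantic is modeled as raising an exception.

```python
class SemanticError(Exception):
    """Stands in for throwing error.semantic."""

class ScopeStack:
    def __init__(self):
        # At initialization time, a single "session" scope exists.
        self.stack = [("session", {})]

    def push(self, name):
        # Clients create scopes (application, document, dialog, anonymous).
        self.stack.append((name, {}))

    def pop(self):
        # Destroying a scope discards its variables.
        self.stack.pop()

    def declare(self, name, value, scope=None):
        # Data is written to the named scope, defaulting to the top scope.
        for sname, vars_ in reversed(self.stack):
            if scope is None or sname == scope:
                vars_[name] = value
                return
        raise SemanticError("no such scope: %s" % scope)

    def resolve(self, name):
        # Check each scope from the closest enclosing one outward;
        # scopes that do not currently exist are simply not on the stack.
        for _, vars_ in reversed(self.stack):
            if name in vars_:
                return vars_[name]
        raise SemanticError("undefined variable: %s" % name)
```

For instance, with application, document, and dialog scopes pushed, an unqualified reference to a dialog variable resolves in the dialog scope first; once the dialog scope is popped, the same reference raises the modeled error.semantic.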

6.12.2.5 Standard session variables

The following standard variables are available in the session scope:

session.connection.local.uri
This variable is a URI which addresses the local interpreter context device.
session.connection.remote.uri
This variable is a URI which addresses the remote caller device.
session.connection.protocol.name
This variable is the name of the connection protocol. The name also represents the subobject name for protocol specific information. For instance, if session.connection.protocol.name is 'q931', session.connection.protocol.q931.uui might specify the user-to-user information property of the connection.
session.connection.protocol.version
This variable is the version of the connection protocol.
session.connection.redirect
This variable is an array representing the connection redirection paths. The first element is the original called number and the last element is the last redirected number. Each element of the array contains a uri, pi (presentation information), si (screening information), and reason property. The reason property can be one of "unknown", "user busy", "no reply", "deflection during alerting", "deflection immediate response", or "mobile subscriber not reachable".
session.connection.aai
This variable is application-to-application information passed during connection setup.
session.connection.originator
This variable directly references either the local or remote property (For instance, the following ECMAScript would return true if the remote party initiated the connection: var caller_initiate = connection.originator == connection.remote).
6.12.2.6 Standard application variables

The following standard variables are available in the application scope:

application.lastresult$
This variable holds information about the last recognition to occur within this application. It is an array of elements where each element, application.lastresult$[i], represents a possible result through the following variables:
application.lastresult$.confidence
The whole utterance confidence level for this interpretation from 0.0-1.0. A value of 0.0 indicates minimum confidence, and a value of 1.0 indicates maximum confidence. More specific interpretation of a confidence value is platform-dependent.
application.lastresult$.utterance
The raw string of words that were recognized for this interpretation. The exact tokenization and spelling is platform-specific (e.g. "five hundred thirty" or "5 hundred 30" or even "530"). In the case of a DTMF grammar, this variable will contain the matched digit string.
application.lastresult$.inputmode
For this interpretation, the mode in which user input was provided: dtmf or voice.
application.lastresult$.interpretation
An ECMAScript variable containing the interpretation as described in the Semantic Interpretation for Speech Recognition specification [SISR].
application.lastresult$.markname
The name of the mark last executed by the SSML processor before barge-in occurred or the end of audio playback occurred. If no mark was executed, this variable is undefined.
application.lastresult$.marktime
The number of milliseconds that elapsed since the last mark was executed by the SSML processor until barge-in occurred or the end of audio playback occurred. If no mark was executed, this variable is undefined.
application.lastresult$.recording
The variable that stores a reference to the recording, or undefined if no audio is collected. Like the input item variable associated with a <record> element as described in section 2.3.6 of [VXML2], the implementation of this variable may vary between platforms.
application.lastresult$.recordingsize
The size of the recording in bytes, or undefined if no audio is collected.
application.lastresult$.recordingduration
The duration of the recording in milliseconds, or undefined if no audio is collected.

Interpretations are sorted by confidence score, from highest to lowest. Interpretations with the same confidence score are further sorted according to the precedence relationship among the grammars producing the interpretations. Different elements in application.lastresult$ will always differ in their utterance, interpretation, or both.

The number of application.lastresult$ elements is guaranteed to be greater than or equal to one and less than or equal to the system property "maxnbest". If no results have been generated by the system, then "application.lastresult$" shall be ECMAScript undefined.

Additionally, application.lastresult$ itself contains the properties confidence, utterance, inputmode, and interpretation corresponding to those of the 0th element in the ECMAScript array.

All of the shadow variables described above are set immediately after any recognition. In this context, a <nomatch> event counts as a recognition and causes the value of "application.lastresult$" to be set, though the values stored in application.lastresult$ are platform-dependent. In addition, the existing values of field variables are not affected by a <nomatch>. In contrast, a <noinput> event does not change the value of "application.lastresult$". After the value of "application.lastresult$" is set, it persists (unless it is modified by the application) until the browser enters the next waiting state, when it is set to undefined. Similarly, when an application root document is loaded, this variable is set to undefined. The variable application.lastresult$ and all of its components are writable and can be modified by the application.
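The ordering rules above can be sketched in ECMAScript, the data language used throughout this module's examples. This is a non-normative illustration: sortLastResult and the grammarPrecedence field (a per-result rank derived from the precedence relationship among active grammars, lower meaning higher precedence) are hypothetical, not part of the specification.

```javascript
// Non-normative sketch: order n-best results the way the specification
// describes for application.lastresult$.
function sortLastResult(results) {
  return results.slice().sort(function (a, b) {
    if (b.confidence !== a.confidence) {
      return b.confidence - a.confidence;       // highest confidence first
    }
    // Equal confidence: fall back to grammar precedence (lower rank wins).
    return a.grammarPrecedence - b.grammarPrecedence;
  });
}

var sorted = sortLastResult([
  { utterance: "five hundred thirty", interpretation: 530,
    confidence: 0.62, grammarPrecedence: 2 },
  { utterance: "530", interpretation: 530,
    confidence: 0.91, grammarPrecedence: 1 }
]);
// sorted[0] is the "530" result, mirroring the properties that
// application.lastresult$ itself exposes for the 0th element.
```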

6.12.3 Syntax

The syntax of the Data Access and Manipulation Module is described in terms of full support for CRUD operations (Create, Read, Update, Delete) on the Data layer in sections 2.3.1 through 2.3.4. The relevance of this syntax for properties is described in section 2.3.5.

6.12.3.1 Creating variables: the <var> element

The declaration of named variables is done using the <var> element. It can occur in executable content or as a child of <form> or <vxml>.

If it occurs in executable content, it declares a variable in the anonymous scope associated with the enclosing <block>, <filled>, or catch element. This declaration is made only when the <var> element is executed. If the variable is already declared in this scope, subsequent declarations act as assignments, as in ECMAScript.

If a <var> is a child of a <form> element, it declares a variable in the dialog scope of the <form>. This declaration is made during the form's initialization phase.

If a <var> is a child of a <vxml> element, it declares a variable in the document scope; and if it is the child of a <vxml> element in a root document then it also declares the variable in the application scope. This declaration is made when the document is initialized; initializations happen in document order.

Attributes of <var>

Table 54: <var> Attributes
name The name of the variable that will hold the result. This attribute must not specify a scope-qualified variable (if a variable is specified with a scope prefix, then an error.semantic event is thrown). The default scope in which the variable is defined is determined from the position in the document at which the element is declared.
expr The initial value of the variable (optional). If there is no expr attribute, the variable retains its current value, if any. Variables start out with the default value determined by the data access expression language in use if they are not given initial values (for example, with ECMAScript the initial value is undefined).
scope The scope within which the named variable must be created (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown.

The "scope" attribute, new in VoiceXML 3.0, adds flexibility to variable creation and allows creation to be decoupled from the document location of the <var> element, if desired by the application.

Children of <var>

The children of the <var> element represent an in-line specification of the value of the variable.

If "expr" attribute is present, then the element must not have any children. Thus "expr" and children are mutually exclusive for the <var> element.

<var> examples

This section is informative.

    <var name="phone" expr="'6305551212'"/>
    <var name="y" expr="document.z+1"/>
    <var name="foo" scope="application" expr="dialog.bar * 2"/>
    <var name="itinerary">
      <root xmlns="">
        <flight>SW123</flight>
        <origin>JFK</origin>
        <depart>2009-01-01T14:32:00</depart>
        <destination>SFO</destination>
        <arrive>2009-01-01T18:14:00</arrive>
      </root>
    </var>
   

The above examples have the following result, in order:

  1. Creates a variable with name "phone" and String value "6305551212" in the closest enclosing scope as determined by the position of this <var> element in the document. If a variable named "phone" is already present in the mentioned scope, its value is updated to the String value "6305551212" (since this is always true, the rest of this section will not repeat this for each example).
  2. Creates a variable with name "y" in the closest enclosing scope as determined by the position of this <var> element in the document and value corresponding to the result of the expression "document.z+1", evaluated when this <var> element is executed.
  3. Creates a variable with name "foo" in the application scope and value corresponding to the result of the expression "dialog.bar * 2", evaluated when this <var> element is executed.
  4. Creates a variable with name "itinerary" in the closest enclosing scope as determined by the position of this <var> element in the document and value specified by the following in-line XML tree (the internal representation may be the corresponding DOM node, for example):
        <root xmlns="">
          <flight>SW123</flight>
          <origin>JFK</origin>
          <depart>2009-01-01T14:32:00</depart>
          <destination>SFO</destination>
          <arrive>2009-01-01T18:14:00</arrive>
        </root>
             
    

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The above examples result in the following Data Model Resource API calls, in order:

  1. At the time of <var> execution, first the value is obtained for the new variable with name "phone" by evaluating the expression "'6305551212'", and subsequently the variable is created in the scope on top of the stack:
    1. Obtain variable value by calling EvaluateExpression("'6305551212'")
    2. Create variable by calling CreateVariable("phone", value) where value is the result obtained in a. above. The optional scope parameter is not specified since the scope on the top of the stack is chosen by default.
  2. At the time of <var> execution, first the value is obtained for the new variable with name "y" by evaluating the expression "document.z+1", and subsequently the variable is created in the scope on top of the stack:
    1. Obtain variable value by calling EvaluateExpression("document.z+1")
    2. Create variable by calling CreateVariable("y", value) where value is the result obtained in a. above.
  3. At the time of <var> execution, first the value is obtained for the new variable with name "foo" by evaluating the expression "dialog.bar * 2", and subsequently the variable is created in application scope:
    1. Obtain variable value by calling EvaluateExpression("dialog.bar * 2")
    2. Create variable by calling CreateVariable("foo", value, "application") where value is the result obtained in a. above.
  4. At the time of <var> execution, the new variable with name "itinerary" is created in the scope on top of the stack using the in-line specification for the value in the body of the <var> element:
    1. Process the in-line specification below into an internal representation for the data model. For example, the assumed XML data model in this example may choose to internally represent this in-line specification as a DOM node.
          <root xmlns="">
            <flight>SW123</flight>
            <origin>JFK</origin>
            <depart>2009-01-01T14:32:00</depart>
            <destination>SFO</destination>
            <arrive>2009-01-01T18:14:00</arrive>
          </root>
                 
      
    2. Create variable by calling CreateVariable("itinerary", node) where node is the DOM node from a. above.
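As one illustration, the CreateVariable and EvaluateExpression calls above could be backed by a scope stack along these lines. This is a minimal non-normative sketch: the DataModel class, its stack layout, and the literal-only expression stub are assumptions, not the normative Data Model Resource API.

```javascript
// Non-normative sketch of a scope-stack data model.
function DataModel() {
  this.stack = [];                          // innermost scope last
}
DataModel.prototype.pushScope = function (name) {
  this.stack.push({ name: name, vars: {} });
};
DataModel.prototype.findScope = function (scopeName) {
  for (var i = this.stack.length - 1; i >= 0; i--) {
    if (this.stack[i].name === scopeName) { return this.stack[i]; }
  }
  throw new Error("error.semantic");        // scope does not exist
};
DataModel.prototype.CreateVariable = function (name, value, scope) {
  var s = scope ? this.findScope(scope)
                : this.stack[this.stack.length - 1];  // top of stack
  s.vars[name] = value;                     // redeclaration acts as assignment
};
DataModel.prototype.EvaluateExpression = function (expr) {
  // A real implementation hands expr to the data language evaluator;
  // this stub only recognizes the string literal from example 1.
  if (expr === "'6305551212'") { return "6305551212"; }
  throw new Error("unsupported in this sketch");
};

var dm = new DataModel();
dm.pushScope("application");
dm.pushScope("dialog");
// Example 1: value first, then creation in the scope on top of the stack.
dm.CreateVariable("phone", dm.EvaluateExpression("'6305551212'"));
// Explicit scope parameter, as in example 3.
dm.CreateVariable("foo", 42, "application");
```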
6.12.3.2 Reading variables: "expr" and "cond" attributes and the <value> element

The values of the named variables in the existing scopes in the scope stack are available for introspection and for further computation. These values can be read wherever expressions can be specified in the VoiceXML 3.0 document. Important examples include the "expr" and "cond" attributes on various syntactic elements. The "expr" attribute values are legal expressions as defined by the syntax of the data access and manipulation language (see Section 2.2.7 for details). The "cond" attribute values function as predicates, and in addition to being expressions, must evaluate to a boolean value.

6.12.3.2.1 Inserting variable values in prompts: The <value> element

The <value> element is used to insert the value of an expression into a prompt. 6.4 Prompt Module specifies prompts in detail.

Attributes of <value>

Table 55: <value> attributes
expr The expression to render. See Section 2.2.7 for legal values of expressions.
scope The scope within which the named variables in the expression are resolved (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown.

<value> examples

    <value expr="application.duration + dialog.duration"/>
    <value expr="foo * bar"/>
    <value expr="foo + bar + application.baz" scope="document"/>
   

The above examples render the following, in order:

  1. The value corresponding to the sum (or concatenation, as the case may be) of the "duration" named variable in the application scope and the "duration" named variable in the dialog scope.
  2. The value corresponding to the product of the "foo" and "bar" named variables in the closest enclosing scope (the top of the scope stack).
  3. The value corresponding to the sum (or concatenation, as the case may be) of the "foo" and "bar" named variables in the document scope, and the "baz" named variable in the application scope.

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The above examples result in the following Data Model Resource API calls:

    1. Evaluate the expression "application.duration + dialog.duration" in the closest enclosing scope by calling EvaluateExpression("application.duration + dialog.duration")
    2. The expression evaluator in turn resolves the scope-qualified variables in the expression by calling ReadVariable("duration", "application") ReadVariable("duration", "dialog")
    1. Evaluate the expression "foo * bar" in the closest enclosing scope by calling EvaluateExpression("foo * bar")
    2. The expression evaluator in turn resolves the scope-unqualified variables in the expression by calling ReadVariable("foo") ReadVariable("bar")
    1. Evaluate the expression "foo + bar + application.baz" in the document scope in the expression by calling EvaluateExpression("foo + bar + application.baz", "document")
    2. The expression evaluator in turn resolves the variables by calling ReadVariable("foo", "document") ReadVariable("bar", "document") ReadVariable("baz", "application")
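A non-normative sketch of the ReadVariable resolution rule used above: a scope-qualified read goes straight to the named scope, while an unqualified read walks the scope stack from the top down until the variable is found. The stack layout is an assumption carried over for illustration.

```javascript
// Non-normative sketch of ReadVariable(name[, scope]).
// stack: array of { name, vars } objects, innermost scope last.
function ReadVariable(stack, name, scope) {
  if (scope !== undefined) {
    for (var i = stack.length - 1; i >= 0; i--) {
      if (stack[i].name === scope) { return stack[i].vars[name]; }
    }
    throw new Error("error.semantic");      // no such scope
  }
  for (var j = stack.length - 1; j >= 0; j--) {
    if (name in stack[j].vars) { return stack[j].vars[name]; }
  }
  return undefined;                         // ECMAScript default value
}

var stack = [
  { name: "application", vars: { duration: 30 } },
  { name: "dialog",      vars: { duration: 12 } }
];
// Example 1 above: both qualified reads resolve independently.
var total = ReadVariable(stack, "duration", "application")
          + ReadVariable(stack, "duration", "dialog");
// An unqualified read finds the "dialog" copy first (top of the stack).
var nearest = ReadVariable(stack, "duration");
```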
6.12.3.3 Updating variables: the <assign> and <data> elements
6.12.3.3.1 The <assign> element

The <assign> element assigns a value to a variable.

It is illegal to make an assignment to a variable that has not been explicitly declared using a <var> element or a var statement within a <script>. Attempting to assign to an undeclared variable causes an error.semantic event to be thrown.

Note that when an ECMAScript object, say "obj", has been properly initialized then its properties, for instance "obj.prop1", can be assigned without explicit declaration (in fact, an attempt to declare ECMAScript object properties such as "obj.prop1" would result in an error.semantic event being thrown).

Attributes of <assign>

Table 56: <assign> attributes
name The name of the variable being assigned to. The corresponding variable must have been previously declared otherwise an error.semantic event is thrown. By default, the scope in which the variable is resolved is the closest enclosing scope of the currently active element. To remove ambiguity, the variable name may be prefixed with a scope name.
expr The expression evaluating to the new value of the variable (optional).

Children of <assign>

The children of the <assign> element represent an in-line specification of the new value of the variable.

If "expr" attribute is present, then the element must not have any children. Thus "expr" and children are mutually exclusive for the <assign> element.

<assign> examples

This section is informative.

    <assign name="phone" expr="'6305551212'"/>
    <assign name="y" expr="document.z+1"/>
    <assign name="application.foo" expr="dialog.bar * 2"/>
    <assign name="itinerary">
      <root xmlns="">
        <flight>SW123</flight>
        <origin>JFK</origin>
        <depart>2009-01-01T14:32:00</depart>
        <destination>SFO</destination>
        <arrive>2009-01-01T18:14:00</arrive>
      </root>
    </assign>
   

The above examples have the following result, in order:

  1. Updates the variable with name "phone" to a new String value "6305551212" in the closest enclosing scope as determined by the position of this <assign> element in the document. If a variable named "phone" is not already defined in the mentioned scope, an error.semantic is thrown (since this is always true if variables that are not already defined are attempted to be updated using <assign>, the rest of this section will not repeat this for each example).
  2. Updates the variable with name "y" in the closest enclosing scope as determined by the position of this <assign> element in the document to the value corresponding to the result of the expression "document.z+1", evaluated when this <assign> element is executed.
  3. Updates the variable with name "foo" in the application scope to the value corresponding to the result of the expression "dialog.bar * 2", evaluated when this <assign> element is executed.
  4. Updates the variable with name "itinerary" in the closest enclosing scope as determined by the position of this <assign> element in the document to the value specified by the following in-line XML tree (the internal representation may be the corresponding DOM node, for example):
        <root xmlns="">
          <flight>SW123</flight>
          <origin>JFK</origin>
          <depart>2009-01-01T14:32:00</depart>
          <destination>SFO</destination>
          <arrive>2009-01-01T18:14:00</arrive>
        </root>
           
    

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The above examples result in the following Data Model Resource API calls, in order:

  1. At the time of <assign> execution, first the new value is obtained for the variable with name "phone" by evaluating the expression "'6305551212'", and subsequently the variable is updated in the scope on top of the stack:
    1. Obtain new variable value by calling EvaluateExpression("'6305551212'")
    2. Update variable value by calling UpdateVariable("phone", value) where value is the result obtained in a. above. The optional scope parameter is not specified since the scope on the top of the stack is chosen by default.
  2. At the time of <assign> execution, first the new value is obtained for the variable with name "y" by evaluating the expression "document.z+1", and subsequently the variable is updated in the scope on top of the stack:
    1. Obtain new variable value by calling EvaluateExpression("document.z+1")
    2. Update variable value by calling UpdateVariable("y", value) where value is the result obtained in a. above.
  3. At the time of <assign> execution, first the new value is obtained for the variable with name "foo" by evaluating the expression "dialog.bar * 2", and subsequently the variable is updated in application scope:
    1. Obtain new variable value by calling EvaluateExpression("dialog.bar * 2")
    2. Update variable value by calling UpdateVariable("foo", value, "application") where value is the result obtained in a. above.
  4. At the time of <assign> execution, the variable with name "itinerary" is updated in the scope on top of the stack using the in-line specification for the new value specified in the body of the <assign> element:
    1. Process the in-line specification below into an internal representation for the data model. For example, the assumed XML data model in this example may choose to internally represent this in-line specification as a DOM node.
          <root xmlns="">
            <flight>SW123</flight>
            <origin>JFK</origin>
            <depart>2009-01-01T14:32:00</depart>
            <destination>SFO</destination>
            <arrive>2009-01-01T18:14:00</arrive>
          </root>
                 
      
    2. Update variable value by calling UpdateVariable("itinerary", node) where node is the DOM node from a. above.
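The UpdateVariable behavior above, including the error.semantic thrown when <assign> targets an undeclared variable, can be sketched as follows. The function signature and stack layout are assumed shapes, not the normative Data Model Resource API.

```javascript
// Non-normative sketch of UpdateVariable(name, value[, scope]).
// stack: array of { name, vars } objects, innermost scope last.
function UpdateVariable(stack, name, value, scope) {
  for (var i = stack.length - 1; i >= 0; i--) {
    var inScope = (scope === undefined) || (stack[i].name === scope);
    if (inScope && Object.prototype.hasOwnProperty.call(stack[i].vars, name)) {
      stack[i].vars[name] = value;
      return;
    }
  }
  // Undeclared variable (or nonexistent scope): error.semantic.
  throw new Error("error.semantic");
}

var stack = [
  { name: "application", vars: { foo: 0 } },
  { name: "dialog",      vars: { bar: 21 } }
];
UpdateVariable(stack, "foo", 42, "application");  // scope-qualified update
UpdateVariable(stack, "bar", 7);                  // resolved from top of stack
```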
6.12.3.3.2 The <data> element

The <data> element allows a VoiceXML application to fetch an in-line specification of a new value for a named variable from a document server without transitioning to a new VoiceXML document. The data fetched is bound to the named variable.

Attributes of <data>

Table 57: <data> Attributes
src The URI specifying the location of the in-line data specification to retrieve (optional). This specification depends on the data language in use for the VoiceXML document (XML, JSON).
name The name of the variable that the data fetched will be bound to.
scope The scope within which the named variable to bind the data is found (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown.
srcexpr Like src, except that the URI is dynamically determined by evaluating the given expression when the data needs to be fetched (optional). If srcexpr cannot be evaluated, an error.semantic event is thrown.
method The request method: get (the default) or post (optional).
namelist The list of variables to submit (optional). By default, no variables are submitted. If a namelist is supplied, it may contain individual variable references which are submitted with the same qualification used in the namelist. Declared VoiceXML variables can be referenced.
enctype The media encoding type of the submitted document (optional). The default is application/x-www-form-urlencoded. Interpreters must also support multipart/form-data [RFC2388] and may support additional encoding types.
fetchaudio See Section 6.1 of [VXML2] (optional). This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2].
fetchhint See Section 6.1 of [VXML2] (optional). This defaults to the datafetchhint property described in Section 2.3.3.2.3.
fetchtimeout See Section 6.1 of [VXML2] (optional). This defaults to the fetchtimeout property described in Section 6.3.5 of [VXML2].
maxage See Section 6.1 of [VXML2] (optional). This defaults to the datamaxage property described in Section 2.3.3.2.3.
maxstale See Section 6.1 of [VXML2] (optional). This defaults to the datamaxstale property described in Section 2.3.3.2.3.

Exactly one of "src" or "srcexpr" must be specified; otherwise, an error.badfetch event is thrown. If the content cannot be retrieved, the interpreter throws an error as specified for fetch failures in Section 5.2.6 of [VXML2].

If the value of the src or srcexpr attribute includes a fragment identifier, the processing of that fragment identifier is platform-specific.

Platforms should support parsing XML data into a DOM. If an implementation does not support DOM, the name attribute must not be set, and any retrieved content must be ignored by the interpreter. If the name attribute is present, these implementations will throw error.unsupported.data.name.

If the name attribute is present, and the returned document is XML as identified by [RFC3023], the VoiceXML interpreter must expose the retrieved content via a read-only subset of the DOM as specified in Appendix D of [VXML2.1]. An interpreter may support additional data formats by recognizing additional media types. If an interpreter receives a document in a data format that it does not understand, or the data is not well-formed as defined by the specification of that format, the interpreter throws error.badfetch. If the media type of the retrieved content is one of those defined in [RFC3023] but the content is not well-formed XML, the interpreter throws error.badfetch.

If use of the DOM causes an uncaught DOMException to be thrown, the VoiceXML interpreter throws error.semantic.

Before exposing the data in an XML document referenced by the <data> element via the DOM, the interpreter should check that the referring document is allowed to access the data. If access is denied the interpreter must throw error.noauthorization.

Note: One strategy commonly implemented in voice browsers to control access to data is the "access-control" processing instruction described in the WG Note: Authorizing Read Access to XML Content Using the <?access-control?> Processing Instruction 1.0 [DATA_AUTH].

Like the <var> element, the <data> element can occur in executable content or as a child of <form> or <vxml>. In addition, it shares the same default scoping rules as the <var> element. If a <data> element has the same name as a variable already declared in the same scope, the variable is assigned a reference to the new value exposed by the <data> element.

Like the <submit> element, when variable data is submitted to the server, its value is first converted into a string. If the variable is a DOM Object, it is serialized as the corresponding XML. If the variable is an ECMAScript Object, the mechanism by which it is submitted is not currently defined. If a <data> element's namelist contains a variable which references recorded audio but the element does not specify an enctype of multipart/form-data [RFC2388], the behavior is not specified. Attempting to URL-encode large quantities of data is discouraged.
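For simple scalar values, the default application/x-www-form-urlencoded conversion described above can be sketched in ECMAScript. encodeNamelist is a hypothetical helper; DOM values, which would be serialized as XML, and the undefined ECMAScript Object case are deliberately not handled.

```javascript
// Non-normative sketch: convert a namelist of simple variable values
// into an application/x-www-form-urlencoded request body.
function encodeNamelist(vars, namelist) {
  return namelist.trim().split(/\s+/).map(function (name) {
    return encodeURIComponent(name) + "=" +
           encodeURIComponent(String(vars[name]));   // string conversion first
  }).join("&");
}

var body = encodeNamelist({ ticker: "f", max: 10 }, "ticker max");
// body === "ticker=f&max=10"
```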

<data> example

The example discussed in this section uses XML as the data language and fetches the following XML document using the <data> element:

    <?xml version="1.0" encoding="UTF-8"?>
    <quote xmlns="http://www.example.org">
      <ticker>F</ticker>
      <name>Ford Motor Company</name>
      <change>0.10</change>
      <last>3.00</last>
    </quote>
   

The above stock quote is retrieved in one dialog; the document element is cached in a variable at document scope and used to play back the quote in another dialog. The data access and manipulation language in the example is XPath 2.0 [XPATH20].

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml xmlns="http://www.w3.org/2001/vxml" 
      version="2.1"
      xmlns:ex="http://www.example.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:schemaLocation="http://www.w3.org/2001/vxml 
      http://www.w3.org/TR/2007/REC-voicexml21-20070619/vxml.xsd">
      <var name="quote"/>
      <var name="tickers">
        <tickers xmlns="">
          <ford>f</ford>
          <!-- etc., the dialog below hardcodes ford -->
        </tickers>
      </var>
      <form id="get_quote">
        <block>
          <data name="quote" scope="document"
            srcexpr="'http://www.example.org/getquote?ticker=' + document('tickers')/ford"/>
          <goto next="#play_quote"/>         
        </block>
      </form>
      <form id="play_quote">
        <block>
          <var name="name" expr="document('quote')/ex:name"/>
          <var name="change" expr="document('quote')/ex:change"/>
          <var name="last" expr="document('quote')/ex:last"/>
          <var name="dollars" expr="fn:floor(last)"/>
          <var name="cents" expr="fn:substring(last,fn:string-length(last)-1)"/>
          <!--play the company name -->
          <audio expr="document('tickers')/ford + '.wav'"><value expr="name"/></audio>
          <!-- play 'unchanged, 'up', or 'down' based on zero, positive, or negative change -->
          <if cond="change = 0">
            <audio src="unchanged_at.wav"/>
          <else/>
            <if cond="change &gt; 0">
              <audio src="up.wav"/>
            <else/> <!-- negative -->
              <audio src="down.wav"/>
            </if>
            <audio src="by.wav"/>
            <!-- play change in value as positive number -->
            <audio expr="fn:abs(change) + '.wav'"><value expr="fn:abs(change)"/></audio>
            <audio src="to.wav"/>
          </if>
          <!-- play the current price per share -->
          <audio expr="dollars + '.wav'"><value expr="dollars"/></audio>
          <if cond="cents &gt; 0">
            <audio src="point.wav"/>
            <audio expr="cents + '.wav'"><value expr="cents"/></audio>
          </if>
        </block>
      </form>
    </vxml>
   

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The single <data> usage in the above example results in the following behavior and Data Model Resource API calls:

At the time of <data> execution, the variable with name "quote" is updated in the document scope using the in-line specification for the new value retrieved from the URI expression 'http://www.example.org/getquote?ticker=' + document('tickers')/ford, which evaluates to http://www.example.org/getquote?ticker=f

  1. Obtain the URI to request the in-line specification of the new value from, by evaluating the "srcexpr" attribute value EvaluateExpression("'http://www.example.org/getquote?ticker=' + document('tickers')/ford")
  2. Request the in-line specification from the URI resulting from a. which happens to be http://www.example.org/getquote?ticker=f in this example.
  3. Process the response received (the in-line data specification below) into an internal representation for the data model. The XML data model internally represents this in-line specification as a DOM node (the document element).
        <quote xmlns="http://www.example.org">
          <ticker>F</ticker>
          <name>Ford Motor Company</name>
          <change>0.10</change>
          <last>3.00</last>
        </quote>
           
    
  4. Update the "quote" variable value in document scope by calling UpdateVariable("quote", node, "document") where node is the DOM node in c. above.

<data> Fetching Properties

These properties pertain to documents fetched by the <data> element.

Table 58: <data> Fetching Properties
datafetchhint Tells the platform whether or not data documents may be pre-fetched. The value is either prefetch (the default), or safe.
datamaxage Tells the platform the maximum acceptable age, in seconds, of cached documents. The default is platform-specific.
datamaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached data documents. The default is platform-specific.
6.12.3.4 Deleting variables: the <clear> element

The <clear> element resets one or more variables, including form items.

For each specified variable name, the variable is resolved relative to the current scope by default (to remove ambiguity, each variable name in the namelist may be prefixed with a scope name). Once a declared variable has been identified, its value is assigned the default initial value defined by the data access expression language in use (for example, when using ECMAScript, the variables are reset to the undefined value). In addition, if the variable name corresponds to a form item, then the form item's prompt counter and event counter are reset.

Attributes of <clear>

Table 59: <clear> attributes
namelist The list of variables to be reset; this can include variable names other than form items. If an undeclared variable is referenced in the namelist, then an error.semantic is thrown. When not specified, all form items in the current form are cleared.
scope The scope within which the named variables must be resolved (optional). Must be one of session, application, document or dialog. If the specified scope does not exist, then an error.semantic event is thrown.

<clear> examples

This section is informative.

  <clear namelist="city state zip"/>
  <clear namelist="application.foo dialog.bar baz"/>
  <clear namelist="alpha beta application.gamma" scope="document"/>
  <clear/>
  <clear scope="dialog"/>
   

The above examples have the following result, in order:

  1. The variables "city", "state" and "zip" in the closest enclosing scope are reset. If any of these are form items, the associated prompt and event counters are also reset (since this is always true if variables are form items, the rest of this section will not repeat this for each example).
  2. The variable "foo" is reset in application scope, the variable "bar" is reset in the dialog scope and the variable "baz" is reset in the closest enclosing scope.
  3. The scope-unqualified variables "alpha" and "beta" are reset in the document scope (the default specified by the "scope" attribute), and the "gamma" variable is reset in the application scope.
  4. All variables in the closest enclosing scope are reset.
  5. All variables in the dialog scope are reset.

Translating to the Data Model Resource API

Implementation Notes: This section illustrates how the above examples translate to the 5.1.1 Data Model Resource API .

The above examples result in the following Data Model Resource API calls, in order:

  1. The namelist "city state zip" is tokenized into variable names and each variable is reset in the scope on top of the stack (the optional scope parameter is not specified in the calls since the scope on the top of the stack is chosen by default):
    1. Tokenize namelist "city state zip" into tokens "city", "state" and "zip"
    2. Reset variable "city" by calling DeleteVariable("city")
    3. Reset variable "state" by calling DeleteVariable("state")
    4. Reset variable "zip" by calling DeleteVariable("zip")
  2. The namelist "application.foo dialog.bar baz" is tokenized into variable names and each scope-qualified variable is reset in the mentioned scope and scope-unqualified variables are reset in the scope on top of the stack:
    1. Tokenize namelist "application.foo dialog.bar baz" into tokens "application.foo", "dialog.bar" and "baz"
    2. Reset variable "foo" in application scope by calling DeleteVariable("foo", "application")
    3. Reset variable "bar" in dialog scope by calling DeleteVariable("bar", "dialog")
    4. Reset variable "baz" in current scope by calling DeleteVariable("baz")
  3. The namelist "alpha beta application.gamma" is tokenized into variable names and each scope-qualified variable is reset in the mentioned scope and scope-unqualified variables are reset in the document scope:
    1. Tokenize namelist "alpha beta application.gamma" into tokens "alpha", "beta" and "application.gamma"
    2. Reset variable "alpha" by calling DeleteVariable("alpha", "document")
    3. Reset variable "beta" by calling DeleteVariable("beta", "document")
    4. Reset variable "gamma" by calling DeleteVariable("gamma", "application")
  4. In the absence of a namelist, each variable in the scope on top of the stack is reset. For each variable var in the closest enclosing scope:
    1. Reset variable DeleteVariable(var)
    2. If var is a form item, reset prompt and event counters
  5. In the absence of a namelist, each variable in the dialog scope is reset. For each variable var in dialog scope:
    1. Reset variable DeleteVariable(var, "dialog")
    2. If var is a form item, reset prompt and event counters
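The namelist tokenization and scope resolution steps above can be sketched as follows. clearNamelist is a hypothetical helper; DeleteVariable, despite the name, resets a variable to the data language's default value, as described above.

```javascript
// Non-normative sketch of <clear> namelist processing: tokenize the
// namelist, split off any scope prefix, and reset each variable via
// the Data Model Resource API's DeleteVariable(name[, scope]) call.
function clearNamelist(dataModel, namelist) {
  namelist.trim().split(/\s+/).forEach(function (token) {
    var dot = token.indexOf(".");
    if (dot >= 0) {
      // Scope-qualified: "application.foo" -> ("foo", "application")
      dataModel.DeleteVariable(token.slice(dot + 1), token.slice(0, dot));
    } else {
      // Unqualified: resolved from the top of the scope stack.
      dataModel.DeleteVariable(token);
    }
  });
}

// Record the calls made for example 2 above.
var calls = [];
clearNamelist({ DeleteVariable: function (n, s) { calls.push([n, s]); } },
              "application.foo dialog.bar baz");
// calls: [["foo","application"], ["bar","dialog"], ["baz",undefined]]
```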
6.12.3.5 Relevance for properties

Platform properties are discussed in detail in 8.2 Properties . VoiceXML 3.0 provides a consistent mechanism to unambiguously read these properties in any scope using the data access and manipulation language in a manner similar to accessing and manipulating named variables as illustrated in section 2.3.2. However, properties cannot be created, updated or deleted using any of the syntax described in this module. The <property> element syntax must be used for such operations.

6.12.5 Implicit functions using XPath

Implicit variables, described in section 2.2.3, provide scope qualification that fits naturally into some data access and manipulation languages (such as ECMAScript) but incorporates less elegantly into the syntax of others, such as XPath. To address this, VoiceXML 3.0 permits the use of functions rather than variables. The following table illustrates how scope qualifiers are exposed as XPath functions.

Table 61: Implicit XPath functions and variables
session() This single String argument function retrieves the value of the variable named in the argument from the session scope.
application() This single String argument function retrieves the value of the variable named in the argument from the application scope.
document() This single String argument function retrieves the value of the variable named in the argument from the document scope.
dialog() This single String argument function retrieves the value of the variable named in the argument from the dialog scope.
anonymous() This single String argument function retrieves the value of the variable named in the argument from the anonymous scope.
properties$ This read-only implicit variable refers to the defined properties which affect platform behavior in a given scope. The value is an XML tree with a <properties> root element and multiple children as necessary where each child element has the name of an existing platform property in that scope and body content corresponding to the value of the platform property. CDATA sections are used if necessary.

The following table shows how these qualifier functions are used, and the examples are XPath variants of the examples illustrated in Table 53.

Table 62: Resolution examples (XPath)
Usage Result
application('hello') The value of the "hello" named variable in the application scope.
dialog('retries') The value of the "retries" named variable in the dialog scope.
dialog('properties$')/bargein The value of the "bargein" platform property defined at the current "dialog" scope.

6.13 External Communication Module

This module supports the sending and receiving of external messages by a voice application by introducing the <send> and <receive> elements into VoiceXML. The application developer chooses to send and receive external messages synchronously or asynchronously. When sending a message, the developer chooses whether or not it should represent a named event. The developer also chooses whether or not to include a payload. These choices can be made statically or dynamically at run-time.

Note that this section only covers messages that the interpreter itself does not handle, in other words, application-level events. Some events, such as lifecycle events targeted at creating or destroying sessions, are not intended for the application author but are instead handled by the browser itself. The complete list of these interpreter-level events is TBD but might include events such as "create session", "pause", "resume", or "disconnect".

Although this section handles many of the easy and moderately difficult cases, for certain very complicated cases it may be appropriate to put a gatekeeper filter between the VXML interpreter and the external events, so that only certain events are allowed to interrupt the processing of the VXML document. For example, if an "operator" event should interrupt the VXML document only when its data variable holds a certain value, or if the "operator" event is wanted but not the "caller" event, then a filter might be appropriate. SCXML is one method suitable for providing these kinds of more advanced filters.

6.13.1 Receiving external messages within a voice application

Because external messages can arrive at any time, they can be disruptive to a voice application. A voice application developer decides whether these messages are delivered to the application synchronously or asynchronously using the "externalevents.enable" property. The property can be set to one of the following values:

Table 63: externalevents.enable values
true External messages are delivered asynchronously as VoiceXML events.
false External messages are delivered synchronously. This is the default.

When external messages are delivered synchronously, an application developer decides whether these messages are preserved or discarded by setting the "externalevents.queue" property. The property can be set to one of the following values:

Table 64: externalevents.queue values
true External messages are queued.
false An external message that is not delivered as a VoiceXML event is discarded. This is the default.
6.13.1.2 Receiving External Messages Asynchronously

To receive an external message asynchronously, an application defines an "externalmessage" event handler. The event handler must be declared within the appropriate scope since the user-defined <catch> handler is selected using the algorithm described in section 5.2.4 of [VXML2].

If the payload of an external message includes an event name, the name is appended to the name of the event that is thrown to the application separated by a dot (e.g. "externalmessage.ready"). This allows applications to handle external messages using different event handlers.
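The event name composition described above, and the prefix matching used when selecting a handler, can be sketched as follows. The function names are illustrative; the prefix rule follows the VXML2 catch selection algorithm referenced in the previous paragraph:

```javascript
// Informative sketch: composing the thrown event name from a
// message payload that may carry an event name.
function externalEventName(payload) {
  return payload && payload.event
    ? 'externalmessage.' + payload.event
    : 'externalmessage';
}

// A <catch event="externalmessage"> handler matches the event
// itself and any dotted sub-event of it.
function catchMatches(catchEvent, thrownEvent) {
  return thrownEvent === catchEvent ||
         thrownEvent.indexOf(catchEvent + '.') === 0;
}
```

A handler declared for "externalmessage" thus also catches "externalmessage.ready", while a handler declared for "externalmessage.ready" catches only that sub-event.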

Asynchronous external messages are processed in the same manner that a disconnect event is handled in [VXML2].

Events are dispatched to the application serially. Since the interpreter only reflects the data associated with a single external message at a time, it is the application's responsibility to manage the data associated with each external message once that message has been delivered.

The following example demonstrates asynchronous receipt of an external message. The catch handler copies the reflected external message into an array at application scope.

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <property name="externalevents.enable" value="true"/>
  <var name="myMessages" expr="new Array()"/>
  <catch event="externalmessage">
    <var name="lm" expr="application.lastmessage$"/>
    <if cond="lm.contenttype == 'text/xml' || lm.contenttype == 'application/xml'">
      <log>received XML with root document element
        <value expr="lm.content.documentElement.nodeName"/>
      </log>
    <elseif cond="typeof lm.content == 'string'"/>
      <log>received <value expr="lm.content"/></log>
    <else/>
      <log>received unknown external message type
        <value expr="typeof lm.content"/>
      </log>
    </if>
    <script>
      myMessages.push({'content' : lm.content, 'ctype' : lm.contenttype});
    </script>
  </catch>
  <form>
  <field name="num" type="digits">
    <prompt>pick a number any number</prompt>
    <catch event="noinput nomatch">
      sorry. didn't get that.
      <reprompt/>      
    </catch>
    <filled>
      you said <value expr="num"/>
      <clear/>
    </filled>
  </field>
  </form>
</vxml>
6.13.1.3 Receiving External Messages Synchronously

To receive an external message synchronously set the "externalevents.enable" property to false and the "externalevents.queue" property to true, and use the <receive> element to pull messages off the queue. <receive> blocks until an external message is received or the timeout specified by the maxtime attribute is exceeded.

6.13.1.3.1 <receive>

To support receipt of external messages within a voice application, use the <receive> element. <receive> is allowed wherever executable content is allowed in [VXML21], for example a <block> element.

<receive> supports the following attributes:

Table 66: <receive> Attributes
Name Description Required Default
fetchaudio See Section 6.1 of [VXML2]. This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2]. No N/A
fetchaudioexpr An ECMAScript expression evaluating to the fetchaudio URI. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
maxtime A W3C time specifier indicating the maximum amount of time the interpreter waits to receive an external message. If the timeout is exceeded, the interpreter throws "error.badfetch." A value of "none" indicates the interpreter blocks indefinitely. No 0s
maxtimeexpr An ECMAScript expression evaluating to the maxtime value. If evaluation of the expression fails, the interpreter throws "error.semantic". No 0s

Only one of fetchaudio and fetchaudioexpr can be specified or "error.badfetch" is thrown.

Only one of maxtime and maxtimeexpr can be specified or "error.badfetch" is thrown.

When present, the attributes fetchaudioexpr and maxtimeexpr are evaluated when the <receive> is executed.

The following example demonstrates synchronously receiving an external message. In this example, the interpreter blocks for up to 15 seconds waiting for an external message to arrive. If no external message is received during that interval, the interpreter throws "error.badfetch". If a message is received, the interpreter proceeds by executing the <log> element.

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <property name="externalevents.queue" value="true"/>
  <form>
    <catch event="error.badfetch">
      <log>timed out waiting for external message</log>
    </catch>
  
    <block>
      Hold on ...
      <receive maxtime="15s" 
        fetchaudio="http://www.example.com/audio/fetching.wav"/>
      <log>got <value expr="application.lastmessage$.content"/></log>
    </block>  
  </form>
</vxml>

6.13.2 Sending messages from a voice application

To send a message from a VoiceXML application to a remote endpoint, use the <send> element. <send> is allowed within executable content . Implementations must support the following attributes:

Table 67: <send> Attributes
Name Description Required Default
async A boolean indicating whether the message is sent asynchronously. If false, the interpreter blocks until the final response to the transaction created by sending the external event is received or a timeout occurs. No true
asyncexpr An ECMAScript expression evaluating to the value of the async attribute. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
body A string representing the data to be sent in the body of the message. No N/A
bodyexpr An ECMAScript expression evaluating to the body of the message to be sent. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
contenttype A string indicating the media type of the body being sent, if any. The set of content types may be limited by the underlying platform. If an unsupported media type is specified, the interpreter throws "error.badfetch.<protocol>.400." The interpreter is not required to inspect the data specified in the body to validate that it conforms to the specified media type. No text/plain
contenttypeexpr An ECMAScript expression evaluating to the media type of the body. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
event The name of the event to send. The value is a string which only includes alphanumeric characters and the "." (dot) character. The first character must be a letter. If the value is invalid, then an "error.badfetch" event is thrown. No N/A
eventexpr An ECMAScript expression evaluating to the name of the event to be sent. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
fetchaudio See Section 6.1 of [VXML2]. This defaults to the fetchaudio property described in Section 6.3.5 of [VXML2]. No N/A
fetchaudioexpr An ECMAScript expression evaluating to the fetchaudio URI. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
namelist A list of zero or more whitespace-separated variable names to send. By default, no variables are submitted. Values for these variables are evaluated when the <send> element is executed. Only declared variables can be referenced; otherwise, "error.semantic" is thrown. Variables must be submitted to the server with the same qualification used in the namelist. When an ECMAScript variable is submitted to the server, its value must be converted first into a string before being sent. If the variable is an ECMAScript object, the mechanism by which it is submitted is platform-specific. Instead of submitting an ECMAScript object directly, the application developer can explicitly submit the individual properties of the object (e.g. "date.month date.year"). No N/A
target Specifies the URI to which the event is sent. If the attribute is not specified, the event is sent to the component which invoked the VoiceXML session. No Invoking component
targetexpr An ECMAScript expression evaluating to the target URI. If evaluation of the expression fails, the interpreter throws "error.semantic". No N/A
timeout See 6.13.2.1 sendtimeout . This defaults to the sendtimeout property. No N/A
timeoutexpr An ECMAScript expression evaluating to the timeout interval for a synchronous <send>. If evaluation of the expression fails, the interpreter throws "error.semantic" No N/A

Only one of async and asyncexpr can be specified or "error.badfetch" is thrown.

Only one of event or eventexpr can be specified or "error.badfetch" is thrown.

Only one of body, bodyexpr, namelist, event, or eventexpr can be specified or "error.badfetch" is thrown.

Only one of contenttype and contenttypeexpr can be specified or "error.badfetch" is thrown.

Only one of fetchaudio and fetchaudioexpr can be specified or "error.badfetch" is thrown.

Only one of target and targetexpr can be specified or "error.badfetch" is thrown.

Only one of timeout and timeoutexpr can be specified or "error.badfetch" is thrown.
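The naming rule for the event attribute (alphanumeric characters and dots, with a leading letter) can be expressed as a validation check. The regular expression below is an assumption consistent with the prose, not normative syntax:

```javascript
// Informative sketch of the event attribute naming rule: only
// alphanumerics and "." are allowed, and the first character
// must be a letter.
function isValidEventName(name) {
  return /^[A-Za-z][A-Za-z0-9.]*$/.test(name);
}
```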

When present, the attributes asyncexpr, bodyexpr, contenttypeexpr, eventexpr, fetchaudioexpr, targetexpr, and timeoutexpr are evaluated when the <send> is executed.

If a synchronous <send> succeeds, execution proceeds according to the Form Interpretation Algorithm. If the <send> times out, the interpreter throws "error.badfetch" to the application. If the interpreter encounters an error upon sending the external message, the interpreter throws "error.badfetch.<protocol>.<status_code>" to the application. If no status code is available, the interpreter throws "error.badfetch.<protocol>".
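The error event naming convention above, where the protocol is always appended and the status code only when available, can be sketched as follows (function and parameter names are illustrative):

```javascript
// Informative sketch of the error event naming rule for a failed
// <send>: "error.badfetch.<protocol>" plus ".<status_code>" when
// a status code is available.
function sendErrorEvent(protocol, statusCode) {
  var name = 'error.badfetch.' + protocol;
  if (statusCode != null) {
    name += '.' + statusCode;
  }
  return name;
}
```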

The following example demonstrates the use of <send> synchronously:

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <form>
  <field name="user_id" type="digits">
    <prompt>please type your five digit i d</prompt>
    <filled>
      <send async="false" 
            bodyexpr="'&lt;userinfo&gt;&lt;id&gt;' + user_id + '&lt;/id&gt;&lt;/userinfo&gt;'" 
            contenttype="text/xml"/>
      <goto next="mainmenu.vxml"/>
    </filled>
  </field>
  </form>
</vxml>

Upon executing an asynchronous <send>, the interpreter continues execution of the voice application immediately and disregards the disposition of the message that was sent.

The following example demonstrates the use of <send> asynchronously:

<vxml version="2.1"
  xmlns="http://www.w3.org/2001/vxml">
  <form>
    <var name="tasktarget" expr="'http://www.example.com/taskman.pl'"/>
    <var name="taskname" expr="'cc'"/>
    <var name="taskstate"/>
    <block>
      <assign name="taskstate" expr="'start'"/>     
      <send async="true" 
            targetexpr="tasktarget" 
            namelist="taskname taskstate"/>
    </block>
    <field name="ccnum"/>
    <field name="expdate"/>
    <block>
      <assign name="taskstate" expr="'end'"/>     
      <send async="true" 
            targetexpr="tasktarget" 
            namelist="taskname taskstate"/>
    </block>
  </form>
</vxml>

6.14 Session Root Module

The session root module allows a VXML document to persist across a VXML session (i.e., across transitions from one application to another), similar to the way an application root document persists across VXML document transitions.

6.14.2 Semantics

The session attribute is an optional attribute on the vxml tag. It is a URI reference, just like the application attribute (with the same URI resolution). If a VXML session has not yet encountered a document with a session root, then upon encountering the first vxml document that has a session root, the session root document is loaded and parsed just as a normal vxml document loads and parses an application root document. If a VXML session has already loaded a different session root, then the behavior when a subsequent session attribute is encountered is controlled by the requiresession attribute. If the requiresession attribute is true, then encountering a session root attribute with a different URL than the already loaded session root is an error, and an error.badfetch is generated. If the requiresession attribute is false, then the new session attribute is ignored and the old one is used. The requiresession attribute defaults to false if not present. The behavior of the session root is exactly the same as the behavior of the application root, except that while executing in the session root the vxml browser is allowed to write to the ECMAScript session scope, and variables declared as children of the vxml tag thus become session scope variables. In particular, in VXML 2.0 section 5.1.2, when discussing variable scopes, the text for application in table 40 is also appropriate for session (new text: "These are declared with <var> and <script> elements that are children of the session root document's <vxml> element. They are initialized when the session root document is loaded. They exist while the session document is loaded, and are visible to the session root document, the application root document, and any other loaded application leaf document.").

This session document is then loaded and active in the hierarchy of documents that follows ECMAScript scope chaining (that is, a document is below an application root, which is below a session root). This means that if a variable is declared in the session root and then in some local form in the leaf document, the variable is shadowed (just as with shadowing from the application root).
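The lookup and shadowing behavior described above can be sketched as a walk along the scope chain, innermost scope first, with the session root at the far end. The data shapes here are illustrative, not a definition of the data model:

```javascript
// Informative sketch of variable resolution along the scope chain,
// ordered innermost first, e.g.
// [anonymous, dialog, document, application, session].
function resolveVariable(name, scopeChain) {
  for (var i = 0; i < scopeChain.length; i++) {
    if (name in scopeChain[i]) {
      return scopeChain[i][name];
    }
  }
  return undefined; // not declared in any scope
}
```

A variable declared both in the session root and in a local form resolves to the local declaration, shadowing the session-scoped one.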

This also implies that the catch selection algorithm described in VXML 2.0 section 5.2.4 would have to change to include the session root document as a potential source of catch handlers (new text: "Form an ordered list of catches consisting of all catches in the current scope and all enclosing scopes (form item, form, document, application root document, session root document, interpreter context), ordered first by scope (starting with the current scope), and then within each scope by document order."). All catch handling would then remain the same; in particular, the as-if-by-copy semantics are retained, so if an event from a leaf document were handled by a catch handler from the session root, the catch handler would not execute within the context of the session root document but would instead execute as if copied into the local leaf document context.

This also implies that property lookup from section 6.3 of VXML 2.0 would have to change to say that property value lookup can also go to the session root if a more local value for the property is not found (new text: "Properties may be defined for the whole session, for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item."). This doesn't change the usual way properties work, where a property at a lower level overrides one at a higher level.

This also implies that document-level links of session roots are active, which would be a change to section 2.5 of VXML 2.0 (new text: "If an application root document has a document-level link, its grammars are active no matter what document of the application is being executed. If a session root document has a document-level link, its grammars are active no matter what document of the session is being executed. If execution is in a modal form item, then link grammars at the form, document, application or session level are not active.").

As with links, the scope of grammars from section 3.1.3 of VXML 2.0 would be changed to specify what happens when a grammar from a session root has document scope (new text: "Form grammars are by default given dialog scope, so that they are active only when the user is in the form. If they are given scope document, they are active whenever the user is in the document. If they are given scope document and the document is the application root document, then they are also active whenever the user is in another loaded document in the same application. If they are given scope document and the document is the session root document, then they are also active throughout the session."). Note that being active throughout the session can still be trumped by modal listen states (just as with the application root). Section 3.1.4 of VXML 2.0 also changes the grammar activation bulleted list to include the session root (new text: "grammars contained in links in its application root document or session root document, and grammars for menus and forms in its application root document or session root document which are given document scope.").

6.14.3 Examples

For the sake of compactness assume throughout this example that the single letters used are actually fully qualified URIs. A VXML document "A" transitions to VXML document "B" which is partially represented below:

<vxml session="C" application="D" … >

Before "B" can finish initialization of "B" it loads, parses, and initializes the VXML documents at both "C" and "D". While executing in "B" any grammars, properties, links, and variables included from either "C" or "D" influence execution. Document "B" then transitions to document "E", with no session attribute, partially represented below:

<vxml application="D" … >

While executing in "E" having come from "B", everything from both "C" and "D" are still active. "D" is still active as we haven't left the application yet. "C" is still active as we are part of the same session. Document "E" now transitions to document "F" partially represented below:

<vxml application="G" … >

Now, since we have changed applications, the application root document "D" is unloaded, and grammars, variables, properties, etc. from "D" no longer influence our execution. Document "G" defines our application root, and it, along with "C" (still active since we are in the same session), now influences our execution. Document "F" now transitions to "H", partially represented below:

<vxml session="I" … >

Now, since "C" is already defined as our session root document, we cannot load document "I" and treat it as our session root. In the absence of requiresession, "I" is ignored and "H" is executed using "C" as our session document. If instead "H" looked as below:

<vxml session="I" requiresession="true" … >

Now "H" would fail to load and execution would revert to document "F" where the appropriate error.badfetch for "H" would be thrown.

6.15 Run Time Control Module

Run time controls (rtcs) are represented by voice or DTMF grammars that are always active, even when the interpreter is not waiting for user input (e.g., when transitioning between documents). When the grammar representing the rtc is matched, the action specified by the rtc is taken. When an rtc grammar completes recognition, it is immediately restarted whether or not it matched the input. Other grammars, including standard recognition grammars, may be active at the same time as an rtc.

Both <rtc> and <cancelrtc> are scoped to the nearest enclosing control element (<item>, <form>, ...). If not within a control element, they are scoped to the document they are in. A given rtc may be defined multiple times within an application. At any point during execution, the most narrowly scoped <rtc> or <cancelrtc> element will be in effect. If an active rtc is turned off by a <cancelrtc> tag, it will be reactivated when the interpreter leaves the scope of the <cancelrtc> tag (unless it comes within the scope of another <cancelrtc> tag). For example, if an <rtc> tag is scoped to a <form> and a <cancelrtc> tag is scoped to a field within the <form>, the rtc will be active while the form is executing, except while the field in question is executing. Application authors may thus use the <rtc> and <cancelrtc> tags along with the scoping rules to exercise fine-grained control over the activity of rtcs.
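The narrowest-scope rule described above can be sketched as a walk outward from the current scope, where the first <rtc> or <cancelrtc> declaration for a given rtc name wins. The data shapes here are illustrative only:

```javascript
// Informative sketch: determine whether a named rtc is active.
// scopes is ordered innermost first; each scope maps rtc names to
// 'rtc' (declared) or 'cancelrtc' (cancelled) at that scope.
function rtcActive(name, scopes) {
  for (var i = 0; i < scopes.length; i++) {
    if (name in scopes[i]) {
      return scopes[i][name] === 'rtc';
    }
  }
  return false; // no declaration in any enclosing scope
}
```

In the form/field example above, a form-level <rtc> with a field-level <cancelrtc> is inactive while the field executes and active again once the interpreter leaves the field's scope.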

6.16 SIV Module

Speaker biometric resources supported in VoiceXML 3.0 provide three types of functions:

  • Speaker Verification
  • Speaker Identification
  • Speaker Enrollment

The following figure shows an overview of the flow of information in SIV processing:

Overview of SIV processing information flow

The SIV engine computes a match based on one or more utterances from the user, a voice model or reference voice model, thresholds, and other configuration parameters. The results are presented to VoiceXML in EMMA 1.0 format.

Verification is the process by which a user's utterances are matched against a pre-computed reference voice model. The next figure details this process.

Verification process

Verification decisions are based upon a wide variety of criteria, and applications must choose how to evaluate trade-offs between application-specific factors:

Table 71: Decision/Match Evaluation
Good Enough
  • Individual and cumulative matches are in the range above the specified threshold
More Data Needed
  • Individual utterance or match was low quality
  • Below threshold, above abort criteria
  • Reprompt to collect more data:
    • discard utterance, collect replacement
    • keep utterance, add additional
Abort
  • Below lower specified threshold
  • Too many attempts or bad matches

6.16.1 SIV Core Functions

The core functions of SIV processing are divided into creating a voice model and enrolling a user in the system (Enrollment) and using a pre-existing voice model (either Verification or Identification). Additionally, some engines support adaptation of an existing voice model based on new user utterances. An SIV dialog may consist of a sequence of one or more SIV dialog turns.

The SIV resource is defined in terms of a data model and state model. The data model is composed of the following elements:

  • activeVoiceModel:
    • content: a URI reference to the voice model (or the voice model itself) with which to perform verification or identification.
    • properties: type, vendor, etc.
    • listener: a resource controller associated with this SIV resource
  • properties: properties pertaining to the SIV process. These properties differ depending on the type of the SIV resource type and function
    • Resource type
      • text-dependent
      • text-independent
      • other types
    • Function
      • Enrollment
      • Verification
      • Identification
    • Platform-specific
  • controller: the resource controller to which status, results and error events are sent.

An SIV Resource manages SIV devices during a processing cycle:

  • An SIV cycle is initiated when the resource sends the device an event instructing it to listen to the input stream.
  • An SIV cycle is terminated if the device is instructed to stop by the resource, if the device sends the resource an error event, or the device ends processing early. When terminated, the device removes partially or wholly processed input from its buffer and discards the voicemodel.
  • During an SIV cycle, the device may send events to the resource indicating ongoing status.
  • When the resource receives results from the device during the SIV cycle, it passes them to its controller. At this point the SIV cycle is complete and the resource awaits instructions either to start another SIV cycle or to terminate.

6.16.2 Syntax

[Schema definition TBD]

Table 72: <voicemodel> Attributes
Name Type Description Required Default Value
mode One of "enroll", "verify" or "identify" Defines the mode of SIV function. If no mode is provided, throw error.badfetch. Yes None
type One of "text-independent", "text-dependent" or [other] Type of speech technology used by the SIV engine. [Should V3 support "other" types?] No None
identity URI Claim of identity passed to the SIV engine to select a voice model. If mode="enroll" and an identity URI is provided, a voice model will be created at the URI specified; otherwise, the form item variable will contain the URI of the created voice model. If mode="verify" and no identity is supplied, throw error.badfetch. Yes when mode="verify"; must not be supplied for mode="identify" None
fetchhint One of the values "safe" or "prefetch" Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. No None
fetchtimeout Time Designation The interval to wait for the content to be returned before throwing an error.badfetch event. No None
maxage An unsigned integer Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. No None
maxstale An unsigned integer Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. No None

6.16.3 Semantics

The voicemodel RC is the primary RC for the <voicemodel> element.

6.18 Disconnect Module

This module defines the syntactic and semantic features of a <disconnect> element. The <disconnect> element causes the interpreter context to disconnect from the user. It provides the interpreter context a way to enter into final processing state by throwing the "connection.disconnect.hangup" event. [See 5.4 Connection Resource ].

Processing the <disconnect> element also causes the interpreter context to flush the prompt queue before disconnecting the interpreter context from the user and subsequently throwing "connection.disconnect.hangup" event.

The attributes and content model of <disconnect> are specified in 6.18.1 Syntax . Its semantics are specified in 6.18.2 Semantics .

6.18.2 Semantics

The disconnect RC is the primary RC for the <disconnect> element.

6.18.2.1 Definition

The disconnect RC is defined in terms of a data model and state model.

The data model is composed of the following parameters:

  • controller: the RC controlling this disconnect RC
  • namelist: list of ECMAScript variables

The disconnect RC's state model consists of the following states: Idle, Initializing, Ready, Executing, and Disconnecting. The initial state is the Idle state.

While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.

In the Initializing state, the disconnect RC essentially does nothing. The RC sends the controller an 'initialized' event and transitions to the Ready state.

In the Ready state, when the disconnect RC receives an 'execute' event it sends an 'execute' event to the Play RC ( 6.19 Play Module ), causing any queued prompts to be played. Note that the event data passed to Play RC must have:

  • the controller set to this RC
  • bargein = false

This RC transitions to the Executing state after sending the event request to the Play RC.

In the Executing state, when the disconnect RC receives the "playDone" event, it instructs the connection resource to disconnect the interpreter context from the user and enters into the "Disconnecting" state.

In the "Disconnecting" state, when the disconnect RC receives "userDisconnected" event,

  • The namelist variables are tokenized into individual variable names and their values are obtained using ReadVariable. The way the values of these individual variables are made available to the interpreter context is platform-specific
  • It throws the "connection.disconnect.hangup" event to the parent element
  • It transitions to the Ready state
Editorial note  

Play RC is not yet defined.
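The sequence described above can be sketched in VoiceXML 2.1-style syntax. This is a non-normative illustration; the variable names are hypothetical, and VoiceXML 3.0 profile syntax may differ:

<form id="goodbye">
  <catch event="connection.disconnect.hangup">
    <!-- Thrown once the user has been disconnected -->
    <exit/>
  </catch>
  <block>
    <prompt>Thank you for calling. Goodbye.</prompt>
    <!-- The queued prompt is played to completion (bargein disabled)
         before the connection resource disconnects the user -->
    <disconnect namelist="accountNumber callResult"/>
  </block>
</form>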

6.19 Play Module

This module defines the semantic features of a Play capability. Note that there is no XML syntax associated with this RC. It is only used by the Disconnect module ( 6.18 Disconnect Module ).

6.19.1 Semantics

The Play RC coordinates media output with the PromptQueue Resource ( 5.2 Prompt Queue Resource ).

6.20 Record Module

This module defines the syntactic and semantic features of a <record> element which collects and stores a recording from the user.

The attributes and content model of <record> are specified in 6.20.1 Syntax . Its semantics are specified in 6.20.2 Semantics .

6.20.1 Syntax

[See XXX for schema definitions].

6.20.1.1 Attributes

The <record> element has the attributes specified in Table 85.

Table 85: <record> Attributes
Name Type Description Required Default Value
name The name must conform to the variable naming conventions in (TODO). The form item variable in the dialog scope that will hold the recording. The name must be unique among form items in the form; if it is not unique, an error.badfetch event is thrown when the document is fetched. Note that how this variable is implemented may vary between platforms (although all platforms must support its behavior in <audio> and <submit> as described in this specification). Yes
expr TBD The initial value of the form item variable. If initialized to a value, then the form item will not be visited unless the form item variable is cleared. No ECMAScript undefined
cond data model expression A data model expression that must evaluate to true after conversion to boolean in order for the form item to be visited. No true
modal boolean If this is true, all non-local speech and DTMF grammars are not active while making the recording. If this is false, non-local speech and DTMF grammars are active. No true
beep boolean If true, a tone is emitted just prior to recording. No false
maxtime Time Designation The maximum duration to record. No Platform-specific value
finalsilence Time Designation The interval of silence that indicates end of speech. No Platform-specific value
dtmfterm boolean If true, any DTMF keypress will be treated as a match of an active (anonymous) local DTMF grammar. No true
type Required audio file formats specified in Appendix E of VoiceXML 2 (other formats may also be supported) The media format of the resulting recording. No Platform-specific (one of the required formats)
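The attributes above can be combined as in the following non-normative sketch, using VoiceXML 2.1-style syntax (the target URI is hypothetical):

<form id="voicemail">
  <record name="msg" beep="true" maxtime="30s" finalsilence="3s"
          dtmfterm="true" type="audio/x-wav">
    <prompt>Record your message after the beep.</prompt>
    <filled>
      <prompt>You said <audio expr="msg"/></prompt>
      <submit next="save_message" method="post"
              enctype="multipart/form-data" namelist="msg"/>
    </filled>
  </record>
</form>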

6.20.2 Semantics

The semantics of the record module are defined using the following resource controllers: RecordInputItem ( 6.20.2.1 RecordInputItem RC ), PlayandRecognize ( 6.10.2.2 PlayandRecognize RC ), Record ( 6.20.2.2 Record RC ).

6.20.2.1 RecordInputItem RC

The Record Input Item Resource Controller is the primary RC for the record element.

6.20.2.2 Record RC

The Record RC coordinates media with the Record resource and a form item. Other related or simultaneous functionality, such as prompt playing or recognition, is expected to be handled by the form item's RC. The Record RC only handles the recording.

6.21 Property Module

The <property> element sets a property value. Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc. For a more complete list of individual property values see 8.2 Properties .

Properties may be defined for the session, for the whole application, for the whole document at the <vxml> level, for a particular dialog at the <form> or <menu> level, or for a particular form item. Properties apply to their parent element and all the descendants of the parent. A property at a lower level overrides a property at a higher level. When different values for a property are specified at the same level, the last one in document order applies. Properties specified in the session root document provide default values for properties throughout the session; properties specified in the application root document provide default values for properties in every document in the application; properties specified in an individual document override property values specified in the application root document.
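The scoping rules can be illustrated with the following non-normative sketch, in which the bargein value set at the <form> level overrides the document-level value, and the timeout value applies only within the field:

<vxml version="3.0" ...>
  <property name="bargein" value="true"/>          <!-- document level -->
  <form id="order">
    <property name="bargein" value="false"/>       <!-- overrides for this dialog -->
    <field name="quantity">
      <property name="timeout" value="5s"/>        <!-- this form item only -->
      <prompt>How many would you like?</prompt>
    </field>
  </form>
</vxml>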

6.21.2 Semantics

Whenever a particular property needs to be used, its value must be looked up. The lookup mechanism depends on both the syntactic structure of the authored document and the current location in execution. When a property needs to be used depends on the property in question. For instance, the speedvsaccuracy recognizer property must have its value looked up whenever a recognition is to take place (e.g., in a <field> element), but it does not need to be looked up at other times, such as when a generic prompt is queued in a block or when a script file is fetched. In contrast, a property like bargein must be looked up whenever a prompt is queued in a block, and the scripttimeout property must be looked up whenever a script file is loaded. For a more complete list of individual properties and when they are evaluated, see section 8 on environment, which lists the various properties and details about them.

When a property is looked up, it must first be determined whether the property is defined at the current execution level. If it is, the last instance of the property element in document order is the one that is evaluated and used. If no property elements for this property exist at the current level, the next higher-level XML container is checked, including the application root and session root documents. This walk up the parent chain halts as soon as the property is found. The chain of parents to follow for a lookup can span a number of different document elements: the current level (i.e., is there a property defined in the field?), the dialog level (in the form?), the document level (in the vxml element?), the application level (in the application root?), and the session level (in the session root?). If the property is not found at any of these levels, then either a specification-defined default or a platform default must be used (see the platform root). For instance, the speedvsaccuracy default value is defined by this specification to be 0.5. The completetimeout property, however, has no specification default and must instead use a platform default if it is not present.

If the platform has trouble evaluating the value of a property (e.g., an expr failure) or the value of a property is invalid (e.g., a completetimeout of "orange"), then the platform should throw error.semantic.

6.22 Transition Controller Module

This module defines the syntactic and semantic features of a <controller> element. Transition controllers provide the basis for managing flow in VoiceXML applications. Resource controllers for some elements like <form> have associated transition controllers which influence how form items get selected and executed. In addition to form, there is a transition controller for each of the following higher VoiceXML scopes:

  • A document-level transition controller associated with the <vxml> element's resource controller that is responsible for starting document processing and deciding which resource controller to execute next, i.e., for a <form> or other interaction element in the document.
  • An application-level transition controller that is responsible for starting application processing and deciding which document-level resource controller to execute next.
  • A top-level or session transition controller that manages the flow for a VoiceXML session. This top-level transition controller is responsible for starting session processing and also holds session level properties.

The above transition controllers influence the selection of the first form item resource controller to execute, and of subsequent ones throughout the session.

The attributes and content model of <controller> are specified in 6.22.1 Syntax . Its semantics are specified in 6.22.2 Semantics .

6.22.4 Examples

EXAMPLE 1: If the transition controller is described using an XML type or vocabulary, the description is a direct child of the <controller> element.

<v3:vxml version="3.0" ...>
  <!-- document level transition controller, type defaults to SCXML -->
  <v3:controller id="controller" type="application/scxml+xml">
    <scxml:scxml version="1.0" ...>
      <!-- Transition controller as a state chart -->
    </scxml:scxml>
  </v3:controller>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>

EXAMPLE 2: If the transition controller is described using a non-XML type or vocabulary, the description is the text content of the <controller> element. A CDATA section may be used if needed.

<v3:vxml version="3.0" ...>
  <!-- document level transition controller -->
  <v3:controller id="controller" type="text/some-controller-notation">
    <![CDATA[
      // Some text-based transition controller description
    ]]>
  </v3:controller>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>

EXAMPLE 3: [Support for this convenience syntax not yet decided -- here's the tentative text: If the transition controller is described using SCXML, then a convenience syntax of placing the <scxml> root element as a direct child of the <vxml> or <form> element is supported without the need of a <controller> wrapper. Thereby, the following two variants are equivalent:]

Variant 1:
<v3:vxml version="3.0" ...>
  <v3:controller>
    <scxml:scxml version="1.0" ...>
      <!-- Transition controller as a state chart -->
    </scxml:scxml>
  </v3:controller>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>
Variant 2:
<v3:vxml version="3.0" ...>
  <scxml:scxml version="1.0" ...>
    <!-- Transition controller as a state chart -->
  </scxml:scxml>
  <!-- Remainder of VoiceXML 3.0 document -->
</v3:vxml>
For further examples on using SCXML to describe transition controllers, see 9.2 Integrating Flow Control Languages into VoiceXML .

7 Profiles

VoiceXML 3.0, like SMIL, is a specification that contains a variety of functional modules. Not all implementers of VoiceXML will be interested in implementing all of the functionality defined in the document. For example, an implementer may have no interest in speech or DTMF recognition but still be interested in speech output. An example might be an implementer of book reading products for the visually impaired. Also, the syntax defined for each of the VoiceXML 3.0 modules is fairly low-level, and authors familiar with a more declarative language may wish to have higher-level syntax that is easier to program in.

To address these interests while maintaining sufficiently precise behavior definition to enhance portability, we encourage the use of profiles.

A profile is an implementation of VoiceXML 3 that

  • implements a specified set of VoiceXML 3.0 modules
  • precisely defines any necessary interaction between the modules
  • has its own schema that defines the complete syntax of the profile
  • defines conformance requirements for the profile
  • optionally introduces new elements whose behavior is completely defined in terms of the core elements defined in the specified set of modules. These new elements can provide the higher-level "syntactic sugar" needed to simplify authoring in this language profile.

This specification defines the following profiles:

  • VoiceXML21: supports VoiceXML 2.1 functionality
  • Media Server: supports advanced media functions (including DTMF, speech recognition, media playback and control, speech synthesis and SIV) where results are returned to the control layer.
  • Maximal profile: includes all VoiceXML 3.0 modules.

It should be possible for other profiles to be created, perhaps by modifying an existing profile, combining different modules, or even adding new module functionality and syntax.

Implementers may differ in their choice of which profiles they implement. Implementers must support a designated set of modules in order to claim support for VoiceXML 3. That designated set is TBD.

[ISSUES:

  • who can create profiles? More discussion is needed on whether profiles should only be created within the Voice Browser Working Group or by which classes of authors/programmers/implementers outside of the VBWG.
  • different dimensions for creation of profiles, e.g. subsets of core functionality, addition of syntactic sugar/macros, modification of underlying modules or creation of new ones.
  • How much modification of module behavior is needed just to piece together existing modules?
]

7.1 Legacy Profile

The Legacy profile is included to demonstrate how profiles are defined in VoiceXML 3.0. Using existing elements from the [VOICEXML21] specification is helpful because the semantics of these elements are already well defined and well understood. Thus changes in how they are presented result from the module and profile style of VoiceXML 3.0 and from making the precise detailed semantics more explicit and formal.

The Legacy profile also plays a transitional role, as VoiceXML 3.0 as a whole is built on top of VoiceXML 2.1. VoiceXML 3.0 is a superset of VoiceXML 2.1 and includes the traditional 2.1 functionality plus some new modules. The Legacy profile is the set of modules that were always present in VoiceXML 2.1 but that were not expressed in the specification as individual modules. This also provides a clear path for VoiceXML application developers, who will not need to learn substantial new syntax or semantics when they develop in the Legacy profile of VoiceXML 3.0.

The Legacy profile also represents a proof of concept to ensure that the new modular profile method of describing the specification is in no way limiting. VoiceXML 3.0 in its entirety is in no way limited or constrained by the use of profiles, modules, and formalized semantic models. Anything that was standardized in VoiceXML 2.1 can be standardized in this new format, and the Legacy profile demonstrates that.

This profile can be best described in the following three sections:

  • Conformance
  • Convenience Syntax
  • Default Handlers and Transition Controllers

7.1.1 Conformance

This section defines semantics of how different modules coexist with each other to simulate the behavior of VoiceXML 2.1. It outlines all the required modules and any additions/deletions from each of these modules to make it conform to this profile. It also talks about the interaction amongst various modules so that behavior similar to that in VoiceXML 2.1 is achieved.

To conform to this profile, processors must implement the following modules:

The schema for the Legacy Profile is given in D.8 Schema for Legacy Profile .
Editorial note  

The following content is missing from the VoiceXML 3.0 specification and needs to be defined:

  • Vxml Root Module
  • Event Handling and throwing (event handlers like <catch>, <noinput>,<nomatch>, etc.)
  • FIA? (although talked about in the Form module, it is not addressed completely)
  • Executable Content
  • <reprompt>
  • <disconnect>
  • <exit>
  • <if> <elseif> <else>
  • <goto>
  • <submit>
  • <log>
  • <return>
  • <script>
  • <throw>
  • All of the VoiceXML 2.0/2.1 properties (Should go in 6.12)
  • Script Module
  • <transfer>
  • <initial>
  • <link>
  • <record>
  • <object>
  • Transitions: <goto>, <submit>, <subdialog>
  • Specify transitions amongst inter-module clearly
  • author controlled transitions within form
  • author controlled transitions outside form
  • automatic transition behavior outside form
  • <filled>

7.2 Basic Profile

7.2.2 What the Basic Profile includes

The following enumerates the modules included in the Basic Profile.

7.2.2.1 SIV functions

The SIV Module (See 6.16 SIV Module ) includes verification, identification, and enrollment functions.

7.2.2.2 Presentation functions

The Builtin-SSML Module ( 6.5 Builtin SSML Module ), the Media Module ( 6.6 Media Module ), and the Parseq Module ( 6.7 Parseq Module ) provide functions for presenting information to the user.

7.2.2.3 Capture functions

Capture functions include

7.2.2.4 Other modules

The Basic Profile also includes the Data Access and Manipulation module ( 6.12 Data Access and Manipulation Module ) for accessing local variables, parameters, returned values, etc. This module is not intended to access external databases.

7.5 Convenience Syntax (Syntactic Sugar)

Profiles can provide convenience syntax to simplify authoring for that profile without decreasing portability. Convenience Syntax, as we define it here, can be implemented via a straightforward text mapping from the convenience syntax to profile code that uses only the syntax defined by the modules in the profile. Convenience syntax cannot add functionality. It only makes existing functionality easier to code.

A convenience syntax definition must include

  • one or more new XML attributes and/or elements
  • for each possible use of the new attributes and elements, a non-cyclical mapping from the code containing the element(s) or attribute(s) to code containing other convenience syntax or module syntax, such that the behavior of the original code can be completely described in terms of module syntax.
  • (optional) "initial" code to be executed before each application begins that sets up variables, etc. needed by the mapped code.

The existence and definition of the mapping above means that an author can write VoiceXML applications using the (presumably simpler) convenience syntax, while being assured that the code will execute *as if* the code had been replaced by the (presumably more complex but well-defined) module syntax. This allows authors to code simple cases in the convenience syntax, and make use of other VoiceXML syntax elements and attributes only as needed.
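As a purely hypothetical illustration, a profile might define a convenience element <yesno> that maps to a <field> using the builtin boolean grammar. The element name and mapping below are invented for illustration only:

Convenience syntax:

<yesno name="confirm">Are you sure?</yesno>

maps to the module syntax:

<field name="confirm" type="boolean">
  <prompt>Are you sure?</prompt>
</field>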

Appendix E Convenience Syntax in VoiceXML 2.x shows how the VoiceXML 2.1 <menu> and pre-defined catch handlers could be coded using other VoiceXML 2.x notation (i.e., as convenience syntax).

[Examples TBD.]

8 Environment

8.1 Resource Fetching

8.1.1 Fetching

A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as media files, grammars, scripts, and XML data. Each fetch of the content associated with a URI is governed by the following attributes:

Table 96: Fetch Attributes
fetchtimeout The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation . If not specified, a value derived from the innermost fetchtimeout property is used.
fetchhint Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used.
maxage Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616] ). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used.
maxstale Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616] ). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used.

When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.

The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.

When transitioning from one dialog to another, through a <subdialog>, <goto>, <submit>, <link>, or <choice> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, an intermediate cache, or from an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or first dialog if none is specified) is then initialized and execution of the dialog begins.

Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.

Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see 4.5.2.2 Application Root and [RFC2396] ). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.
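These rules can be illustrated with the following non-normative sketch (the URIs and variable names are hypothetical):

<!-- Fragment-only reference: no fetch, no document initialization -->
<goto next="#confirm_dialog"/>

<!-- Named document (or query data): the document is obtained and initialized -->
<goto next="checkout.vxml#confirm_dialog"/>

<!-- <submit> always results in a fetch, even with a fragment-only URI -->
<submit next="#confirm_dialog" namelist="total"/>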

Elements that fetch VoiceXML documents also support the following additional attribute:

Table 97: Additional Fetch Attribute
fetchaudio The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch.

The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
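For example, a transition that may be slow could specify hold audio, as in this non-normative sketch with hypothetical URIs:

<goto next="http://www.example.com/report.vxml"
      fetchtimeout="20s"
      fetchaudio="http://www.example.com/hold_music.wav"/>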

8.1.2 Caching

The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage and maxstale attributes.

The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ( [RFC2616] ). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:

  • If the resource is not present in the cache, fetch it from the server using get.
  • If the resource is in the cache,
    • If a maxage value is provided,
      • If age of the cached resource <= maxage,
        • If the resource has expired,
          • Perform maxstale check.
        • Otherwise, use the cached copy.
      • Otherwise, fetch it from the server using get.
    • Otherwise,
      • If the resource has expired,
        • Perform maxstale check.
      • Otherwise, use the cached copy.

The "maxstale check" is:

  • If maxstale is provided,
    • If cached copy has exceeded its expiration time by no more than maxstale seconds, then use the cached copy.
    • Otherwise, fetch it from the server using get.
  • Otherwise, fetch it from the server using get.

Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.

The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).

While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.
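The cache-control attributes can be applied to individual fetches as in the following non-normative sketch (hypothetical URIs):

<!-- Accept a cached copy no older than 60 seconds; if it has expired,
     tolerate at most 10 seconds of staleness -->
<audio src="greeting.wav" maxage="60" maxstale="10"/>

<!-- maxage of 0 forces the interpreter context back to the server
     (typically as a "get if modified" on a cached copy) -->
<goto next="menu.vxml" maxage="0"/>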

8.2 Properties

Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.

The following types of properties are defined: speech recognition ( 8.2.1 Speech Recognition Properties ), DTMF recognition ( 8.2.2 DTMF Recognition Properties ), prompt and collect ( 8.2.3 Prompt and Collect Properties ), media ( 8.2.4 Media Properties ), fetching ( 8.2.5 Fetch Properties ) and miscellaneous ( 8.2.6 Miscellaneous Properties ) properties.

Editorial note  

Open issue: should the specification provide specific default values rather than platform-specific?

Open issue: Should we add a 'type' column for all properties?

8.2.1 Speech Recognition Properties

The following generic speech recognition properties are defined.

Table 98: Speech Recognition Properties
Name Description Default
confidencelevel The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see 8.4 Value Designations ). 0.5
sensitivity Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see 8.4 Value Designations ). 0.5
speedvsaccuracy A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see 8.4 Value Designations ). 0.5
completetimeout The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar. A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see 8.4 Value Designations ). See 8.3 Speech and DTMF Input Timing Properties . Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below. platform-dependent
incompletetimeout The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars.  In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event). The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken. A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately. The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See 8.3 Speech and DTMF Input Timing Properties Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout. The value is a Time Designation (see 8.4 Value Designations ). undefined?
maxspeechtimeout The maximum duration of user speech. If this time elapses before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see 8.4 Value Designations ). platform-dependent
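These properties are typically set with the <property> element, as in the following non-normative sketch:

<form id="pizza">
  <property name="confidencelevel" value="0.6"/>
  <property name="completetimeout" value="0.8s"/>
  <property name="incompletetimeout" value="1.5s"/>
  <field name="topping">
    <prompt>What topping would you like?</prompt>
  </field>
</form>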

8.2.5 Fetch Properties

The following properties pertain to the fetching of new documents and resources.

Note that maxage and maxstale properties may have no default value - see 8.1.2 Caching .

Table 102: Fetch Properties
Name Description Default
audiofetchhint This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. prefetch
audiomaxage Tells the platform the maximum acceptable age, in seconds, of cached audio resources. platform-specific
audiomaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. platform-specific
documentfetchhint Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch. safe
documentmaxage Tells the platform the maximum acceptable age, in seconds, of cached documents. platform-specific
documentmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. platform-specific
grammarfetchhint Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe. prefetch
grammarmaxage Tells the platform the maximum acceptable age, in seconds, of cached grammars. platform-specific
grammarmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. platform-specific
objectfetchhint Tells the platform whether the URI contents for <object> may be pre-fetched or not. The values are prefetch, or safe. prefetch
objectmaxage Tells the platform the maximum acceptable age, in seconds, of cached objects. platform-specific
objectmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. platform-specific
scriptfetchhint Tells the platform whether or not scripts may be pre-fetched. The value is either prefetch (the default), or safe. prefetch
scriptmaxage Tells the platform the maximum acceptable age, in seconds, of cached scripts. platform-specific
scriptmaxstale Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. platform-specific
fetchaudio The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. undefined
fetchaudiodelay The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see 8.4 Value Designations ). The default interval is platform-dependent, e.g. "2s".  The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off. platform-specific
fetchaudiominimum The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see 8.4 Value Designations ). The default is platform-dependent, e.g., "5s".  The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly. platform-specific
fetchtimeout The timeout for fetches. The value is a Time Designation (see 8.4 Value Designations ). platform-specific
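As a sketch of how these properties interact (the document names, property values, and audio file below are hypothetical):

```xml
<vxml version="3.0">
  <!-- Pre-fetch grammars eagerly; fetch documents only when needed. -->
  <property name="grammarfetchhint" value="prefetch"/>
  <property name="documentfetchhint" value="safe"/>
  <!-- Accept cached audio up to 60 seconds old; abandon any fetch after 10s. -->
  <property name="audiomaxage" value="60"/>
  <property name="fetchtimeout" value="10s"/>
  <!-- During long fetches: wait 2s of silence before starting fetchaudio,
       and once started, play it for at least 5s even if the fetch completes. -->
  <property name="fetchaudiodelay" value="2s"/>
  <property name="fetchaudiominimum" value="5s"/>
  <form id="lookup">
    <block>
      <!-- holdmusic.wav plays while results.vxml is being fetched. -->
      <goto next="results.vxml" fetchaudio="holdmusic.wav"/>
    </block>
  </form>
</vxml>
```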

8.2.6 Miscellaneous Properties

The following miscellaneous properties are defined.

Table 103: Miscellaneous Properties
Name Description Default
inputmodes This property determines which input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this is to turn off speech recognition in noisy environments. Another is to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active. dtmf voice
universals Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items - see "Activation of Grammars" (link TBD)) and which generate specific events. Note that relying on platform-provided grammars is not good practice for production-grade applications (see 6.11 Builtin Grammar Module ). Applications choosing to migrate from universal grammars to more robust developer-specified grammars should replace the universals <property> with one or more <link> (TODO, hyperlink) element(s). Because <link>s can also generate the same events as universal grammars, and because the <catch> handlers for the universal grammars persist outside the universals <property>, the migration should be seamless. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help". none
maxnbest This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. 1
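A sketch combining these properties (the grammar URI and prompts are hypothetical):

```xml
<form id="search">
  <!-- Enable only the platform's "help" and "cancel" universal grammars. -->
  <property name="universals" value="help cancel"/>
  <!-- Keep up to 3 recognition hypotheses in application.lastresult$. -->
  <property name="maxnbest" value="3"/>
  <field name="city">
    <grammar src="city.grxml" type="application/srgs+xml"/>
    <prompt>Which city?</prompt>
    <catch event="help">Say the name of a city, for example Boston.</catch>
  </field>
</form>
```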

8.3 Speech and DTMF Input Timing Properties

The various timing properties for speech and DTMF recognition work together to define the user experience. The ways in which these different timing parameters function are outlined in the timing diagrams below. In these diagrams, the wait for DTMF input or user speech begins at the time the last prompt finishes playing.

8.3.1 DTMF Grammars

DTMF grammars use timeout, interdigittimeout, termtimeout and termchar as described in 8.2.2 DTMF Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.
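A sketch of a DTMF-only field using these properties (the grammar URI is hypothetical):

```xml
<field name="account">
  <property name="inputmodes" value="dtmf"/>
  <!-- If no key is pressed at all, throw noinput after 5 seconds. -->
  <property name="timeout" value="5s"/>
  <!-- After each keypress, wait up to 3 seconds for the next one before
       finalizing the result. -->
  <property name="interdigittimeout" value="3s"/>
  <!-- The pound key terminates input immediately. -->
  <property name="termchar" value="#"/>
  <grammar src="account-digits.grxml" type="application/srgs+xml" mode="dtmf"/>
  <prompt>Enter your account number, followed by the pound key.</prompt>
</field>
```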

9 Integration with Other Markup Languages

This section presents some initial thoughts on how VoiceXML might be embedded within SCXML and how flow control languages such as SCXML and CCXML might be integrated into VoiceXML.

9.1 Embedding of VoiceXML within SCXML

The following bank application example demonstrates how an external VoiceXML application can be invoked by an SCXML script, and vice versa. The state machine and flow control are implemented in BankApp.scxml. The call starts in BankApp.scxml, which first invokes BankApp.vxml's form "getAccountNum" to collect the account number and then queries a database for the checking and saving balances. BankApp.scxml then invokes the form "playBalance". If this form finds that accountType is not defined, it invokes AccountType.scxml, which calls the BankApp.vxml form "getAccountType" to obtain the accountType. The "playBalance" form then plays the balance on the corresponding account and returns control to BankApp.scxml.

BankApp.scxml

<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" xmlns:my="http://scxml.example.org/" version="1.0" initial="getAccountNum" profile="ecmascript" >
 
  <state id="getAccountNum">
      <invoke targettype="vxml3" src="BankApp.vxml#getAccountNum" />
      <transition event="vxml3.gotAccountNum" target="getBalance"/>
  </state>
 
  <state id="getBalance">
      <datamodel>
           <data name="method" expr="'getBalance'"/>
           <data name="accountNum" expr="_data.accountNum"/>
      </datamodel>
      <send targettype="basichttp" target="BankDB.do" namelist="method accountNum" />
      <transition event="basichttp.gotBalance" target="playBalance"/>
  </state>
 
  <state id="playBalance">
      <datamodel>
           <data name="checking_balance" expr="_data.checking.balance" />
           <data name="saving_balance" expr="_data.saving.balance" />
      </datamodel>
      <invoke targettype="vxml3" src="BankApp.vxml#playBalance" namelist="checking_balance saving_balance" />
      <transition event="vxml3.playedBalance" target="exit" />
  </state>
 
  <final id="exit"/>
</scxml>

AccountType.scxml

<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" xmlns:my="http://scxml.example.org/" version="1.0" initial="getAccountType" profile="ecmascript" >
 
  <state id="getAccountType">
      <invoke targettype="vxml3" src="BankApp.vxml#getAccountType" />
  <transition event="vxml3.gotAccountType" target="exit"/>
  </state>
 
  <final id="exit"/>
</scxml>

BankApp.vxml

<?xml version="1.0" encoding="UTF-8"?>
 
<!-- TODO: need to add final namespace, schema, etc. for vxml element. -->
<vxml version="3.0">
 
      <form id="getAccountNum">
            <field name="accountNum">
                  <grammar src="accountNum.grxml" type="application/srgs+xml"/>
                  <prompt>
                        Please tell me your account number.
                  </prompt>
                  <filled>
                        <exit namelist="accountNum"/>
                  </filled>
            </field>
      </form>
 
      <form id="getAccountType">
            <field name="accountType">
                  <grammar src="accountType.grxml" type="application/srgs+xml"/>
                  <prompt>
                        Do you want the balance on checking or saving account?
                  </prompt>
                  <filled>
                        <exit namelist="accountType"/>
                  </filled>
            </field>
      </form>
     
      <form id="playBalance">
            <var name="checking_balance"/>
            <var name="saving_balance"/>
            <block>
            <if cond="accountType == undefined">
 
<!--Here we are trying to invoke the external scxml script. At the time this example is written,
    the syntax to do this has not yet been decided. -->
                  <goto next="AccountType.scxml#getAccountType"/>
            </if>
                 
            <if cond="accountType == 'checking'">
                  <prompt>
                    The checking account balance is <value expr="checking_balance"/>.
                  </prompt>
            <else/>
                  <prompt>
                    The saving account balance is <value expr="saving_balance"/>.
                  </prompt>
            </if>
                 
            </block>
      </form>
</vxml>

9.2 Integrating Flow Control Languages into VoiceXML

State Chart XML (SCXML) could be used as the controller for managing the dialog in VoiceXML 3.0 applications. A recursive MVC technique allows SCXML controllers to be placed at session, document and form levels. Examples of resulting compound documents (containing V3 and SCXML namespaced elements) appear below for illustration. A graceful degradation / fallback approach could be used to ensure backwards compatibility with V2 applications. Note that the examples below use a new v3:scxmlform element.

[ISSUE: It has been suggested that using the existing v3:form element instead of a new v3:scxmlform element would be simpler and more elegant. Although the working group currently knows of no particular reason why the existing v3:form couldn't be used instead of a new v3:scxmlform element, the group has not yet discussed this in detail or agreed that using v3:form in this way is desirable. The group plans to discuss this and is interested in receiving public feedback on this possibility.]

9.2.1 SCXML for Dialog Management

Example application scenario:

  • This is an airline travel itinerary modification application
  • First order of business is to retrieve an itinerary to be modified
  • Itinerary may be identified using either a record locator or the traveler's last name and other information

Below are two flavors of this application using SCXML as the form-level controller: a system-driven and a user-driven approach. They use a similar set of fields in the form but different dialog management styles. In this simple example the VUI may appear similar in both, but one flavor is system-driven and the other user-driven.

9.2.1.1 System-driven Dialog
  • Starts off by asking if the record locator is available
  • If the locator is available, it's requested
  • If the locator isn't available, the last name and some other pieces of information are requested to uniquely identify the itinerary
  • Once the itinerary is identified, we proceed with application functions

Consider the following sketch of a V3 form for this purpose:

<v3:scxmlform>
  <scxml:scxml initial="choice">
    <scxml:state id="choice">
      <scxml:invoke type="vxml3field" src="#choicefield"/>
      <scxml:transition event="filled.choice" cond="choicefield"
                    target="locator"/>
      <scxml:transition event="filled.choice" cond="!choicefield"
                    target="lastname"/>
    </scxml:state>
    <scxml:state id="locator">
      <scxml:invoke type="vxml3field" src="#locatorfield"/>
      <!-- Retrieve record, transition to app menu -->
    </scxml:state>
    <scxml:state id="lastname">
      <scxml:invoke type="vxml3field" src="#lastnamefield"/>
      <!-- Collect other information needed to retrieve record,
           then retrieve record and go to app menu -->
    </scxml:state>
    <!-- Remaining dialog control flow logic omitted -->
  </scxml:scxml>
  <v3:field name="choicefield">
    <v3:grammar src="boolean.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Welcome. Do you have the record locator for your itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.choice"/>
    </v3:filled>
  </v3:field>
  <v3:field name="locatorfield">
    <v3:grammar src="locator.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      What is the record locator for the itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.locator"/>
    </v3:filled>
  </v3:field>
  <v3:field name="lastnamefield">
    <v3:grammar src="lastname.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Please say or spell your last name.
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.lastname"/>
    </v3:filled>
  </v3:field>
  <!-- Other form items, such as the subsequent application menu omitted
-->
</v3:scxmlform>
9.2.1.2 User-driven Dialog
  • Starts off by asking what information the user would like to supply to identify the itinerary
  • If the user indicates the record locator will be provided, it's retrieved
  • If the user indicates the last name will be provided, it's retrieved (some other pieces of information may be retrieved to uniquely identify the itinerary)
  • Once the itinerary is identified, we proceed with application functions

Consider the following sketch of a V3 form for this purpose:

<v3:scxmlform>
  <scxml:scxml initial="choice">
    <scxml:state id="choice">
      <scxml:invoke type="vxml3field" src="#choicefield"/>
      <scxml:transition event="filled.choice" cond="choicefield == 'locator'"
                    target="locator"/>
      <scxml:transition event="filled.choice" cond="choicefield == 'lastname'"
                    target="lastname"/>
    </scxml:state>
    <scxml:state id="locator">
      <scxml:invoke type="vxml3field" src="#locatorfield"/>
      <!-- Retrieve record, transition to app menu -->
    </scxml:state>
    <scxml:state id="lastname">
      <scxml:invoke type="vxml3field" src="#lastnamefield"/>
      <!-- Collect other information needed to retrieve record,
           then retrieve record and go to app menu -->
    </scxml:state>
    <!-- Remaining dialog control flow logic omitted -->
  </scxml:scxml>
  <v3:field name="choicefield">
    <v3:grammar src="choice.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Welcome. How would you like to look up your itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.choice"/>
    </v3:filled>
  </v3:field>
  <v3:field name="locatorfield">
    <v3:grammar src="locator.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      What is the record locator for the itinerary?
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.locator"/>
    </v3:filled>
  </v3:field>
  <v3:field name="lastnamefield">
    <v3:grammar src="lastname.grxml" type="application/srgs+xml"/>
    <v3:prompt>
      Please say or spell your last name.
    </v3:prompt>
    <v3:filled>
      <v3:throw event="filled.lastname"/>
    </v3:filled>
  </v3:field>
  <!-- Other form items, such as the subsequent application menu omitted
-->
</v3:scxmlform>

A Acknowledgements

This version of VoiceXML was written with the participation of members of the W3C Voice Browser Working Group. The work of the following members has significantly facilitated the development of this specification:

  • Emily Bateman, Comverse
  • Skip Cave, Intervoice
  • Andrew Fuller, VoxPilot
  • Jeff Hoepfinger, Sandcherry
  • Jim Larson, Intervoice
  • Lakshmi Krishnamurthy, Genesys
  • Satya Palivela, Intervoice
  • Joseph Wong, Genesys

The W3C Voice Browser Working Group would like to thank the W3C team, especially Kazuyuki Ashimura and Matt Womer, for their invaluable administrative and technical support.

B References

B.1 Normative References

DFP
The Voice Browser DFP Framework W3C Informative Note, February 2006. (See http://www.w3.org/Voice/2006/DFP.)
MMI
Multimodal Architecture and Interfaces W3C Working Draft, December 2009. (See http://www.w3.org/TR/2009/WD-mmi-arch-20091201/.)
DOM3Events
Document Object Model (DOM) Level 3 Events Specification Schepers, Höhrmann, Le Hégaret and Pixley. W3C Working Draft, September 2009. (See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/.)
RFC2119
Key words for use in RFCs to Indicate Requirement Levels IETF RFC 2119, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)
ECMASCRIPT
ECMAScript Language Specification , Standard ECMA-262, December 1999. (See http://www.ecma-international.org/publications/standards/Ecma-262.htm.)
VOICEXML20
Voice Extensible Markup Language (VoiceXML) Version 2.0 McGlashan et al. W3C Recommendation, March 2004. (See http://www.w3.org/TR/voicexml20/.)
VOICEXML21
Voice Extensible Markup Language (VoiceXML) Version 2.1 Oshry et al. W3C Recommendation, May 2007. (See http://www.w3.org/TR/voicexml21/.)
SSML
Speech Synthesis Markup Language Version 1.0 Burnett, Walker and Hunt. W3C Recommendation, September 2004. (See http://www.w3.org/TR/speech-synthesis/.)
SRGS
Speech Recognition Grammar Specification Version 1.0 Hunt and McGlashan. W3C Recommendation, March 2004. (See http://www.w3.org/TR/speech-grammar/.)
RFC2616
Hypertext Transfer Protocol -- HTTP/1.1 IETF RFC 2616, 1999. (See http://www.ietf.org/rfc/rfc2616.txt.)
RFC2396
Uniform Resource Identifiers (URI): Generic Syntax IETF RFC 2396, 1998. (See http://www.ietf.org/rfc/rfc2396.txt.)
BCP47
Tags for Identifying Languages and Matching of Language Tags A. Phillips and M. Davis, Editors. IETF, September 2009. (See http://www.rfc-editor.org/bcp/bcp47.txt.)

C Glossary of Terms

active grammar
A speech or DTMF grammar that is currently active. This is based on the currently executing element, and the scope elements of the currently defined grammars.
application
A collection of VoiceXML documents that are tagged with the same application name attribute.
ASR
Automatic speech recognition.
author
The creator of a VoiceXML document.
catch element
A <catch> block or one of its abbreviated forms. Certain default catch elements are defined by the VoiceXML interpreter .
control item
A form item whose purpose is either to contain a block of procedural logic (<block>) or to allow initial prompts for a mixed initiative dialog (<initial>).
CSS
W3C Cascading Style Sheets specification. See [CSS2]
dialog
An interaction with the user specified in a VoiceXML document . Types of dialogs include forms and menus .
DTMF (Dual Tone Multi-Frequency)
Touch-tone or push-button dialing. Pushing a button on a telephone keypad generates a sound that is a combination of two tones, one high frequency and the other low frequency.
ECMAScript
A standard version of JavaScript backed by the European Computer Manufacturers Association. See [ECMASCRIPT]
event
A notification "thrown" by the implementation platform , VoiceXML interpreter context , VoiceXML interpreter , or VoiceXML code. Events include exceptional conditions (semantic errors), normal errors (user did not say something recognizable), normal events (user wants to exit), and user defined events.
executable content
Procedural logic that occurs in <block>, <filled>, and event handlers .
form
A dialog that interacts with the user in a highly flexible fashion with the computer and the user sharing the initiative.
FIA (Form Interpretation Algorithm)
An algorithm implemented in a VoiceXML interpreter which drives the interaction between the user and a VoiceXML form or menu. See vxml2: Section 2.1.6, vxml2: Appendix C.
form item
An element of <form> that can be visited during form execution: <initial>, <block>, <field>, <record>, <object>, <subdialog>, and <transfer>.
form item variable
A variable, either implicitly or explicitly defined, associated with each form item in a form . If the form item variable is undefined, the form interpretation algorithm will visit the form item and use it to interact with the user.
implementation platform
A computer with the requisite software and/or hardware to support the types of interaction defined by VoiceXML.
input item
A form item whose purpose is to input an input item variable. Input items include <field>, <record>, <object>, <subdialog>, and <transfer>.
language identifier
A language identifier labels information content as being of a particular human language variant. Following the XML specification for language identification [XML] , a legal language identifier is identified by BCP 47 [BCP47] .
link
A set of grammars that when matched by something the user says or keys in, either transitions to a new dialog or document or throws an event in the current form item.
menu
A dialog presenting the user with a set of choices and taking action on the selected one.
mixed initiative
A computer-human interaction in which either the computer or the human can take initiative and decide what to do next.
JSGF
Java Speech Grammar Format. A proposed standard for representing speech grammars. See [JSGF]
object
A platform-specific capability with an interface available via VoiceXML.
request
A collection of data including: a URI specifying a document server for the data, a set of name-value pairs of data to be processed (optional), and a method of submission for processing (optional).
script
A fragment of logic written in a client-side scripting language, especially ECMAScript , which is a scripting language that must be supported by any VoiceXML interpreter .
session
A connection between a user and an implementation platform , e.g. a telephone call to a voice response system. One session may involve the interpretation of more than one VoiceXML document .
SRGS (Speech Recognition Grammar Specification)
A standard format for context-free speech recognition grammars developed by the W3C Voice Browser Working Group. Both ABNF and XML formats are defined [SRGS] .
SSML (Speech Synthesis Markup Language)
A standard format for speech synthesis developed by the W3C Voice Browser Working Group [SSML] .
subdialog
A VoiceXML dialog (or document) invoked from the current dialog in a manner analogous to function calls.
tapered prompts
A set of prompts used to vary a message given to the human. Prompts may be tapered to be more terse with use (field prompting), or more explicit (help prompts).
throw
An element that fires an event .
TTS
text-to-speech; speech synthesis.
user
A person whose interaction with an implementation platform is controlled by a VoiceXML interpreter .
URI
Uniform Resource Identifier.
URL
Uniform Resource Locator.
VoiceXML document
An XML document conforming to the VoiceXML specification.
VoiceXML interpreter
A computer program that interprets a VoiceXML document to control an implementation platform for the purpose of conducting an interaction with a user.
VoiceXML interpreter context
A computer program that uses a VoiceXML interpreter to interpret a VoiceXML Document and that may also interact with the implementation platform independently of the VoiceXML interpreter .
W3C
World Wide Web Consortium http://www.w3.org/

D VoiceXML 3.0 XML Schema

D.8 Schema for Legacy Profile

<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="TBD"
                   targetNamespace="TBD" blockDefault="#all">
    <xsd:annotation>
        <xsd:documentation>
              This is the XML Schema driver for the Legacy Profile of the Vxml 3.0 specification.
              Please use this namespace for the Legacy Profile:
              "TBD:URL to schema"
        </xsd:documentation>
        <xsd:documentation source="vxml3-copyright.xsd"/>
    </xsd:annotation>
 
    <xsd:annotation>
        <xsd:documentation>
            This is the Schema Driver file for Legacy Profile of Vxml 3.0 Specification
            This schema
                + sets the namespace for Legacy Profile of Vxml 3.0 Specification
                + imports external schemas (xml.xsd)
                + imports schema modules 
 
                  Legacy Profile includes the following Modules
 
                   * Vxml Root module 
                   * Form module 
                   * Field module 
                   * Prompt module 
                   * Grammar module 
                   * Data Access and Manipulation Module
          </xsd:documentation>
    </xsd:annotation>
    <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
        schemaLocation="http://www.w3.org/2001/xml.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This import brings in the XML namespace attributes
                The XML attributes are used by various modules.
            </xsd:documentation>
        </xsd:annotation>
    </xsd:import>
 
    <xsd:include schemaLocation="vxml-datatypes.xsd">
        <xsd:annotation>
            <xsd:documentation>
                    This import brings in the common datatypes for Vxml.
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml-attribs.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This import brings in the common attributes for Vxml.
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-vxmlroot.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Vxml Root module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-form.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Form module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-field.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Field module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-prompt.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Prompt module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-grammar.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Grammar module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
    <xsd:include schemaLocation="vxml3-module-dataacces.xsd">
        <xsd:annotation>
            <xsd:documentation>
                This imports the Data Access and Manipulation module for Vxml 3.0
            </xsd:documentation>
        </xsd:annotation>
    </xsd:include>
</xsd:schema>
Editorial note  
The schema is incomplete. It merely imports the schemas for various modules, but doesn't contain parent/child relationships between modules or constraints on them. These all need to be specified in the future.

E Convenience Syntax in VoiceXML 2.x

VoiceXML 2 defines shorthand notation for several fundamental capabilities. For example, some <catch> elements can be represented in a shortened form:

<noinput>I didn’t hear anything. </noinput>

is equivalent to:

<catch event="noinput"> I didn’t hear anything. </catch>

This notation could be transformed via standard text substitution tools.

E.2 Examples

The following examples demonstrate the different V2 syntactic mechanisms that provide identical functionality.
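For instance, the event-handler shorthands defined in VoiceXML 2.0 each expand to a <catch> element with the corresponding event name (the prompt texts below are illustrative):

```xml
<help>Say the name of a department.</help>
<catch event="help">Say the name of a department.</catch>

<nomatch>Sorry, I did not understand.</nomatch>
<catch event="nomatch">Sorry, I did not understand.</catch>

<error>An error occurred.</error>
<catch event="error">An error occurred.</catch>
```

Each pair is functionally identical; the shorthand form is purely syntactic sugar.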

F Major changes since the last Working Draft

  • Added Timer resource
  • Added Real Time Controls module
  • Added Transition Controller module
  • Revised Legacy profile description to match current thinking
  • Removed SIV Resource (section 5.4) since it is now covered along with the recognition resource in section 5.3
  • Updated section 4.4 (Event Model) to match our current thinking about DOM events as the underlying model for all flow control
  • Cleaned up text in sections 1, 2, and 5