Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document specifies VoiceXML 3.0, a modular XML language for creating interactive media dialogs that feature synthesized speech, recognition of spoken and DTMF key input, telephony, mixed initiative conversations, and recording and presentation of a variety of media formats including digitized audio, and digitized video.
Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 19 December 2008 First Public Working Draft of "Voice Extensible Markup Language (VoiceXML) 3.0".
This document is very much a work in progress. Many sections are incomplete, only stubbed out, or missing entirely. To get early feedback, the group focused on defining enough functionality, modules, and profiles to demonstrate the general framework. To complete the specification, the group expects to introduce additional functionality (for example speaker identification and verification, external eventing) and describe the existing functionality at the level of detail given for the Prompt and Field modules. We explicitly request feedback on the framework, particularly any concerns about its implementability or suitability for expected applications. By the middle of 2009 the group expects to have all existing functionality defined in detail, the new functionality stubbed out, and the VoiceXML 2.1 profile largely defined. By late-2009 the group expects to have all functionality defined and both profiles defined in detail.
Applications written as 2.1 documents can be used under a 3.0 processor using the 2.1 profile. As an example, the Implementation Report tests for 2.1 (which includes the IR tests for 2.0) will be supported on a 3.0 processor. Exceptions will be clarifications and changes needed to improve interoperability.
This document is a W3C Working Draft. It has been produced as part of the Voice Browser Activity. The authors of this document are participants in the Voice Browser Working Group (W3C members only). For more information see the Voice Browser FAQ. The Working Group expects to advance this Working Draft to Recommendation status.
Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Terminology
2 Overview
2.1 Structure of VoiceXML 3.0
2.2 Structure of this document
2.3 How to read this document
3 Data Flow Presentation (DFP) Framework
3.1 Data
3.2 Flow
3.3 Presentation
4 Core Concepts
4.1 Semantics
4.1.1 Resources
4.1.2 Resource Controllers (RCs)
4.2 Syntax
4.3 Event Model
4.3.1 Event Interfaces
4.3.1.1 Event
4.3.1.2 EventTarget
4.3.1.3 EventListener
4.3.2 Event Flow
4.3.2.1 Event Listener Registration
4.3.2.2 Event Listener Activation
4.3.3 Event Categories
4.4 Document Initialization and Execution
4.4.1 Initialization
4.4.2 Execution
4.4.2.1 Subdialogs
4.4.2.2 Application Root
4.4.2.3 Summary of Syntax/Semantics Interaction
5 Resources
5.1 Datamodel Resource
5.1.1 Data Model Resource API
5.2 Prompt Queue Resource
5.2.1 State Chart Representation
5.2.2 SCXML Representation
5.2.3 Defined Events
5.2.4 Device Events
5.2.5 Open Issue
5.3 Recognition Resources
5.3.1 Definition
5.3.2 Defined Events
5.3.3 Device Events
5.3.4 State Chart Representation
5.3.5 SCXML Representation
6 Modules
6.1 Grammar Module
6.1.1 Syntax
6.1.1.1 Attributes
6.1.1.2 Content Model
6.1.2 Semantics
6.1.2.1 Definition
6.1.2.2 Defined Events
6.1.2.3 External Events
6.1.2.4 State Chart Representation
6.1.3 Events
6.1.4 Examples
6.2 Inline SRGS Grammar Module
6.2.1 Syntax
6.2.2 Semantics
6.2.2.1 Definition
6.2.2.2 Defined Events
6.2.2.3 External Events
6.2.2.4 State Chart Representation
6.2.2.5 SCXML Representation
6.2.3 Events
6.2.4 Examples
6.3 External Grammar Module
6.3.1 Syntax
6.3.1.1 Attributes
6.3.1.2 Content Model
6.3.2 Semantics
6.3.2.1 Definition
6.3.2.2 Defined Events
6.3.2.3 External Events
6.3.2.4 State Chart Representation
6.3.2.5 SCXML Representation
6.3.3 Events
6.3.4 Examples
6.4 Prompt Module
6.4.1 Syntax
6.4.1.1 Attributes
6.4.1.2 Content Model
6.4.2 Semantics
6.4.2.1 Definition
6.4.2.2 Defined Events
6.4.2.3 External Events
6.4.2.4 State Chart Representation
6.4.2.5 SCXML Representation
6.4.3 Events
6.4.4 Examples
6.5 Builtin SSML Module
6.5.1 Syntax
6.5.2 Semantics
6.5.3 Examples
6.6 Media Module
6.6.1 Syntax
6.6.1.1 Attributes
6.6.1.2 Content Model
6.6.1.2.1 Tips (informative)
6.6.2 Semantics
6.6.3 Examples
6.7 Parseq Module
6.7.1 Syntax
6.7.2 Semantics
6.7.3 Examples
6.8 Foreach Module
6.8.1 Syntax
6.8.1.1 Attributes
6.8.1.2 Content Model
6.8.2 Semantics
6.8.3 Examples
6.9 Form Module
6.9.1 Syntax
6.9.2 Semantics
6.9.2.1 Form RC
6.9.2.1.1 Definition
6.9.2.1.2 Defined Events
6.9.2.1.3 External Events
6.9.2.1.4 State Chart Representation
6.9.2.1.5 SCXML Representation
6.10 Field Module
6.10.1 Syntax
6.10.2 Semantics
6.10.2.1 Field RC
6.10.2.1.1 Definition
6.10.2.1.2 Defined Events
6.10.2.1.3 External Events
6.10.2.1.4 State Chart Representation
6.10.2.1.5 SCXML Representation
6.10.2.2 PlayandRecognize RC
6.10.2.2.1 Definition
6.10.2.2.2 Defined Events
6.10.2.2.3 External Events
6.10.2.2.4 State Chart Representation
6.10.2.2.5 SCXML Representation
7 Profiles
7.1 VoiceXML 2.1 Profile
7.2 Media Server Profile
8 Environment
8.1 Resource Fetching
8.1.1 Fetching
8.1.2 Caching
8.1.2.1 Controlling the Caching Policy
8.1.3 Prefetching
8.1.4 Protocols
8.2 Properties
8.2.1 Speech Recognition Properties
8.2.2 DTMF Recognition Properties
8.2.3 Prompt and Collect Properties
8.2.4 Media Properties
8.2.5 Fetch Properties
8.2.6 Miscellaneous Properties
8.3 Speech and DTMF Input Timing Properties
8.3.1 DTMF Grammars
8.3.1.1 timeout, No Input Provided
8.3.1.2 interdigittimeout, Grammar is Not Ready to Terminate
8.3.1.3 interdigittimeout, Grammar is Ready to Terminate
8.3.1.4 termchar and interdigittimeout, Grammar Can Terminate
8.3.1.5 termchar Empty When Grammar Must Terminate
8.3.1.6 termchar Non-Empty and termtimeout When Grammar Must Terminate
8.3.1.7 termchar Non-Empty and termtimeout When Grammar Must Terminate
8.3.1.8 Invalid DTMF Input
8.3.2 Speech Grammars
8.3.2.1 timeout When No Speech Provided
8.3.2.2 completetimeout With Speech Grammar Recognized
8.3.2.3 incompletetimeout with Speech Grammar Unrecognized
8.4 Value Designations
8.4.1 Integers
8.4.2 Real Numbers
8.4.3 Times
A Acknowledgements
B References
B.1 Normative References
B.2 Informative References
C Glossary of Terms
In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate required levels for compliant VoiceXML 3.0 implementations.
Terms used in this specification are defined in Appendix C Glossary of Terms.
How does one build a successor to VoiceXML 2.0/2.1? Requests for improvements to VoiceXML fell into two main categories: extensibility and new functionality.
To accommodate both, the Voice Browser Working Group first developed the detailed semantic descriptions of VoiceXML that versions 2.0 and 2.1 lacked. From there it was possible to describe semantics for new functionality and to restructure the language syntactically to improve extensibility.
Another benefit of detailed semantic descriptions is improved portability within VoiceXML. However, there are many factors that contribute to portability that are outside the scope of this document (e.g. speech recognition capabilities, telephony).
This document covers the following:
The remainder of this document is structured as follows:
3 Data Flow Presentation (DFP) Framework presents the Data-Flow-Presentation Framework, its importance for the development of VoiceXML 3.0, and how VoiceXML 3.0 fits into the model.
4 Core Concepts explains the core concepts underlying the new structure for VoiceXML, including resources, resource controllers, the relationship between syntax and semantics, DOM eventing, modules, and profiles.
5 Resources presents the resources defined for the language. These provide the key presentation-related functionality in the language.
6 Modules presents the modules defined for the language. Each module consists of a syntax piece (with its user-visible events), a semantics piece (with its behind-the-scenes events), and a description of how the two are connected.
7 Profiles presents two profiles. The first, the VoiceXML 2.1 profile, shows how a language similar to VoiceXML 2.1 can be created using the structure and functionality of VoiceXML 3.0. The second, the Media Server profile, is a simple compilation of all of the functionality available in VoiceXML 3.0.
The Appendices provide useful references and a glossary of terms used in the specification.
For everyone: Please first read 3 Data Flow Presentation (DFP) Framework. The data-flow-presentation distinction applies not only to VoiceXML 3.0, but to many of W3C's specifications. Understanding VoiceXML's role as a presentation language is crucial context for understanding the rest of the specification.
For application authors: we recommend that you begin with syntax and only gradually explore details of the semantics as you need to understand behavioral specifics.
For VoiceXML platform developers: we recommend that you begin with the functionality and framework and only focus on syntax later.
Unlike VoiceXML 2.0/2.1, the focus in VoiceXML 3.0 is almost exclusively on the user interface portions of the language. By choice, very little work has gone into the development of data storage and manipulation or control flow capabilities. In short, VoiceXML 3.0 has been designed from the ground up as a *presentation* language, according to the definition presented in the Data Flow Presentation ([DFP]) Framework.
The Data Flow Presentation (DFP) Framework is an instance of the Model-View-Controller paradigm, where computation and control flow are kept distinct from application data and from the way in which the application communicates with the outside world. This partitioning of an application allows for any one layer to be replaced independently of the other two. In addition, it is possible to simultaneously make use of more than one Data (Model) language, Flow (Controller), and/or Presentation (View) language.
The Data layer is responsible for maintaining all information in a format that is easily accessible and easily editable.
Although data that is independent of the Presentation medium (such as flight reservation data stored in the back-end database) would be stored outside of the VoiceXML application, there is still a need to keep some presentation-specific data, e.g. the status of the dialog in collecting certain information, which prompts have just been played, and how many of various error conditions have occurred so far.
Within VoiceXML 3.0 the Data layer is realized through a pluggable data language and a data access or manipulation language. Access to and use of the data is aligned with options available in SCXML for simpler interaction with the Flow layer (see the next section). This specification defines two specific data languages, XML and ECMAScript, and two data access and manipulation languages, E4X/DOM and XPath. Others may be defined by implementers.
The Flow layer is responsible for all application control flow, including business logic, dialog management, and anything else that is not strictly data or presentation. VoiceXML 3.0 provides primitives that contain the control flow needed to implement them, but all combinations between and among the elements at the syntax level are done via calls to external control flow processors. Two that are likely to be used with VoiceXML are CCXML and SCXML. Note that flow control components written outside of VoiceXML may be communicating not only with a VoiceXML processor but with an HTML browser, a video game controller, or any of a variety of other input and output components.
The Presentation layer is responsible for all interaction with the outside world, i.e., human beings and external software components. VoiceXML 3.0 *is* the Presentation layer. Designed originally for human-computer interaction, VoiceXML "presents" a dialog by accepting audio and DTMF input and producing audio and video output.
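As a rough, non-normative illustration of VoiceXML in its presentation role, a minimal dialog that plays a prompt and collects a spoken answer might look like the following. The element names and attributes follow VoiceXML 2.x conventions since the 3.0 syntax is not yet finalized in this draft, and the grammar URI is hypothetical.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: element names follow VoiceXML 2.x;
     the corresponding 3.0 profile syntax is still being defined. -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="destination">
    <field name="city">
      <prompt>Which city are you flying to?</prompt>
      <grammar src="cities.grxml" mode="voice"/>
      <filled>
        <prompt>You said <value expr="city"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>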
This document specifies the VoiceXML 3.0 language as a collection of modules. Each module is described at two levels:
It is important to note that the semantic framework described here is a logical one. The resources, resource controllers, and the events they generate are intended only to describe the semantics of VoiceXML 3.0. Implementations are not required to use SCXML to implement VoiceXML 3.0, nor must they create objects corresponding to resources, resource controllers, and the SCXML events they raise. These logical components are useful for describing how different pieces of syntax use similar resources, and for future extensions to the language that may use these resources or hook into specific places in the semantic framework, but only the exposed behavior is necessary for a conformant VoiceXML 3.0 interpreter.
These logical SCXML events must be distinguished from the author-visible DOM events that are a mandatory part of the VoiceXML 3.0 language. Implementations MUST raise these DOM events and process them in the manner described in Section 4.3 Event Model. The interaction between actual DOM events and logical SCXML events is described in Section 4.4 Document Initialization and Execution, below.
The semantic model is a conceptual representation of the underlying behavior of VoiceXML (form interpretation, prompt selection, etc.). Each VoiceXML 3.0 module contains a conceptual representation of its underlying behavior expressed in terms of resources and resource controllers. While resources and resource controllers are not exposed directly in the markup, they are used to define the semantics of VoiceXML 3.0 markup elements.
For example, Figure 1 presents a high-level semantic description of the Prompt Queue, which consists of the PromptController, the Prompt Queue resource, and the SSML/media player. For a detailed description of the semantics of the Prompt Queue, see the state chart representation in Section 5.2.1 State Chart Representation and the SCXML representation in Section 5.2.2 SCXML Representation. Section 5.2.3 Defined Events defines each event.
(Additional examples TBD)
The VoiceXML 3.0 semantic model is illustrated in Figure 1.
Editorial note: Section 4.1. Replace Figure 1 by a picture illustrating the three levels (resource controllers, resources, devices) with three examples corresponding to the examples of section 4.1.2 -- JimL
Editorial note: More architecture diagrams will be added in later versions.
It is important to note that this model places no requirement on a VoiceXML interpreter to implement behavior exactly as described in the model. Rather, the requirement is that the observable behavior must be the same as if it were implemented as described; optimizations or a different architecture are permitted behind the implementation of the markup interpretation.
Resources are the building blocks of the semantic model. Each resource is a self-contained object in the semantic model that is capable of providing a service. Resources are singletons, global in scope, and persist for the whole session (e.g. even across subdialogs). Multiple resources may be active simultaneously.
Controllers communicate with resources by sending and receiving events. Resources do not communicate with one another directly.
Different modules can use the same resources, and language profiles can require specific resources. Some profiles may support only a limited set of resources, and some modules may require new resources.
Examples of resources required for the VoiceXML 2.1 profile include a Prompt Queue/Player, a DTMF recognizer, an ASR recognizer, a recording service, a transfer service, and a hierarchically scoped (ECMAScript) data model. Other modules and profiles of VoiceXML 3.0 may require that existing resources be extended, or that new ones be created. Examples of new resources may include a whole-call recording service, an XML data model, and a SIV service.
The semantics of a VoiceXML markup element may be captured by describing how it interacts with a resource; for example, the semantic representation of the <value> element may be that it represents a single data resource lookup (part of the conceptual API the data model resource offers) that is expected to execute the expression in its expr attribute and to return either the result or an error.
The conceptual objects responsible for coordinating input and output across multiple resources are resource controllers (RCs). Each resource controller may interact with resources and other resource controllers to model the semantics of one or more parts of the markup. Each VoiceXML 3.0 markup element can be represented as zero or more resource controllers. If more than one resource controller is associated with a markup element, one of them is designated as the primary RC.
Examples of resource controllers include:
Note that this means that the Form Interpretation Algorithm is realized through the interactions of multiple resource controllers.
Unlike resources, resource controllers are neither session-scoped nor global: there may be multiple conceptual instances of a controller instantiated (waiting for an event, holding state) at the same time. However, conceptually only one controller may be actively doing work (handling an event) at a time.
The event model for VoiceXML 3.0 builds upon the DOM Level 3 Events [DOM3Events] specification. DOM Level 3 Events offers a robust set of interfaces for managing listener registration, dispatching, propagation, and handling of events, as well as a description of how events flow through an XML tree.
The DOM Level 3 event model offers VoiceXML developers a rich set of interfaces that allow them to easily add behavior to their applications. In addition, conforming to the standard DOM event model enables authors to integrate their voice applications into next-generation multimodal or multi-namespace frameworks such as MMI and CDF with minimal effort.
Within the VoiceXML 3.0 semantic model, the DOM Level 3 Events APIs are available to all Resource Controllers that have markup elements associated with them. Indeed, this section covers the eventing APIs as available to VoiceXML 3.0 markup elements. The following section describes how the semantic model ties in with the DOM eventing model.
All VoiceXML 3.0 markup elements implement interfaces that support the following:
The VoiceXML 3.0 Event interface extends the DOM Level 3 Event interface to support voice-specific event information. In particular, the VoiceXML 3.0 Event interface supports a count integer that stores the number of times a resource emits a particular event type. The semantic model manages the count field by incrementing its value and resetting it as described in the section that follows.
Note: RH: should we expose the count to authors? If so, should we have a special variable like event.count or something similar?
VoiceXML 3.0 markup elements implement the DOM Level 3 EventTarget interface. This interface allows registration and removal of event listeners as well as dispatching of events.
The VoiceXML 3.0 markup elements implement the DOM Level 3 EventListener interface. This interface allows the activation of handlers associated with a particular event. When a listener is activated, the event handler execution is done in the semantic model as described in the section that follows.
[To be updated by Michael Bodell due April 1 2008]
Events propagate through markup elements as per the DOM event flow. Event listeners may be registered on any VoiceXML markup element.
When processing a VoiceXML 2.0 profile, event listeners are not allowed to be registered for the capture phase, as this contradicts the as-if-by-copy event semantics of VoiceXML 2.0. If a listener is registered with the capture phase set to true in a VoiceXML 2.0 document, an error.event.illegalphase event will be dispatched onto the root document and the listener registration will be ignored (does that sound reasonable to people?).
The DOM Level 3 Event specification supports the notion of partial ordering using the event listener group; all events within a group are ordered. As such, in VoiceXML 3.0, event listeners are registered as they are encountered in the document. Furthermore, all event listeners registered on an element belong to the same default group. Both of these provisions ensure that event handlers will execute in document order.
An event listener is triggered if:
Once an event listener is triggered, execution is handled by the semantic model as described in the section below. Event propagation blocks until it is notified by the semantic model to proceed.
The VoiceXML 3.0 specification extends the DOM 3 Event specification to support partial name matching on events. VoiceXML 3.0 creates categories of events (the list of categories needs to be specified in the VoiceXML 3.0 specification) and allows authors and the platform to register listeners for either a specific event type or for all events within a particular category or subcategory. For example, VoiceXML 3.0 may create a connection category such as:
{"http://www.example.org/2007/v3","connection"}
The spec may also declare a subcategory of connection or a specific event type that belongs to this category:
{"http://www.example.org/2007/v3","connection.disconnect"} {"http://www.example.org/2007/v3","connection.disconnect.hangup"}
Following this declaration, the VoiceXML 3.0 Event specification uses partial name matching to associate events propagating through the DOM to listeners registered on the tree. The VoiceXML 3.0 Event specification follows the prefix matching used in VoiceXML 2.0 for associating events with their categories.
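For instance, a listener registered for the "connection.disconnect" category would also be triggered by the more specific "connection.disconnect.hangup" event type, in the same way that a VoiceXML 2.0 <catch> handler matches events by prefix. A hedged markup sketch, using VoiceXML 2.x-style handler syntax for illustration:

<!-- Illustrative only: a handler registered for the "connection.disconnect"
     category also receives connection.disconnect.hangup events via prefix matching. -->
<catch event="connection.disconnect">
  <prompt>The call has ended.</prompt>
</catch>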
Note: It might be useful to introduce the "*" notation to specify a catch for all events irrespective of their type and/or category.
The initialization ordering described here is a logical one, specifying which objects and information are available at each stage. Implementations are allowed to use a different ordering (in particular, they are allowed to interleave the construction of the DOM with the creation of semantic objects) as long as they behave as if they were following the order specified here. Similarly, we refer to a 'semantic constructor' as a cover term for whatever mechanism is used to create the Resource Controllers for a given node. No particular implementation is implied or required.
Before a VoiceXML 3.0 application is first loaded, all Resources are created. Whenever a document is loaded within that application, its DOM (level 3) is created. Then the initialization process creates the Resource Controllers by invoking the semantic constructor for the root <vxml> node of the DOM. The root <vxml> node constructor is responsible for invoking the constructors for all nodes in the document that have them. When it does this, it will call the semantic constructor routine passing it
Editorial note: Open Issue: we must specify the operation of the root node constructor in more detail as part of the V3 specification. Other people can define modules, but we must specify how they are assembled into a full semantic representation of the application. If there is an application root document specified, the root node constructor will have to construct its RCs as well, by calling its root node constructor.
Note that the initial construction process creates the RCs but does not necessarily fully configure them. Further initialization, including in particular the creation of variables and variable scopes, will happen only when the RCs are activated at runtime (e.g. by visiting a Form.)
Once the RCs are constructed, they are independent of the DOM, except for the interactions specified below. However, while they are running the RCs often make use of what appears to be syntactic information. For example, the concept of 'next item' relies heavily on document order, while <goto> can take a specific syntactic label as its target. We provide for this by assuming that RCs can maintain a shadow copy of relevant syntactic information, where "shadow copy" is intended to allow a variety of implementations. In particular, platforms may make an actual copy of the information or may maintain pointers back into the DOM. The construction process may create multiple RCs for a given node. In that case, one of the RCs will be marked as the primary RC. It is the one that will be invoked when the flow of control reaches that (shadow) node.
After initialization, the semantic control flow does a <goto> to the initial Resource Controller. Once an RC is running, it invokes Resources and other RCs by sending them events. The DOM is not involved in this process.

At various points in the processing, however, an RC may decide to raise an author-visible event. It does this by creating an event targeted at a specific DOM node and sending it back to the DOM. When the DOM receives the event, it performs the standard bubble/capture cycle with the target specified in the event. In the course of the bubble/capture cycle, various event handlers may fire. Their execution is a semantic action and occurs back in the semantic 'side' of the environment. The DOM sends messages back to the appropriate semantic objects to cause this to happen. Note that this means that the DOM must store some sort of link to the appropriate RCs.

The event handlers may update the data model, execute script, or raise other DOM events. When the handler finishes processing on the semantic side, it sends a notification back to the DOM so that it can resume the bubble/capture phase. (N.B. This notification is NOT a DOM event.) When the DOM finishes the bubble/capture processing of the event, it sends a notification back to the RC that raised the event so that it can continue processing.
Editorial note: Open Issue: Is this notification a standard semantic event? Note that RC processing must pause during the bubble/capture phase to avoid concurrency problems.
A subdialog has a completely separate context from the invoking application. Thus it has a separate DOM and a separate set of RCs. However it shares the same set of Resources since they are global. When a subdialog is entered, the Datamodel Resource will have to create a new scope for the subdialog and hide the calling document's scopes. When the subdialog is exited, the Datamodel resource will destroy the subdialog scope(s) and restore the calling document's scope(s).
To handle event propagation from the leaf application to the application root document, we create a Document Manager to handle all communication between the documents. This means that the DOMs of the two documents remain separate. When an event is not handled in the leaf document, the Document Manager will propagate it to the application root, where it will be targeted at the <vxml> node. Requests to fetch properties or to activate grammars will be handled by the Document Manager in a similar fashion. To handle platform- and/or language-level defaults, we will create a "super-root" document above the application root. The Document Manager will pass it events and requests that are not handled in the root document. If neither the root nor the super-root document handles an event, the Document Manager will ensure that the event is thrown away.
There seem to be four kinds of interactions between RCs and the DOM at runtime:
Editorial note: Open Issue: DOM Modification. There are two possibilities: 1) we can refuse to allow the DOM to be modified (or ignore the modifications if it is); 2) we can reconstruct the relevant resource controllers when the DOM is modified. In the latter case, the straightforward approach would be: a) find the least node that is an ancestor of all the changes and that has a constructor; b) call its constructor as during initialization, using the current state of the DOM and RCs as context.
This section describes semantic models for common VoiceXML resources. Resources have a life cycle of creation and destruction. Specific resources may specify detailed requirements on these phases. All resources must be created prior to their use by a VoiceXML interpreter.
Editorial note: Standard lifecycle events are expected to be defined in later versions: create event (from idle to created); destroy event (from created to idle).
Resources are defined in terms of a state model and the events which they process within defined states. Events may be divided into those defined by the resource itself and those defined by other conceptual entities, which the resource receives or sends within these states. These conceptual entities include resource controllers and a 'device' which provides an implementation of the services defined by the resource.
The semantic model is specified in both UML state chart diagrams and SCXML representations. In case of ambiguity, the SCXML representation takes precedence over the UML diagrams. Note that SCXML is used here to define the states and events for resources; this definitional usage should not be confused with the use of SCXML to specify application flow (see 3.2 Flow). Furthermore, these resource events are conceptual, not DOM events: they are used to define relationships with other conceptual entities and are not exposed at the markup level. The relationship between conceptual events and DOM events is described in XXX.
The following resources are defined: data model (5.1 Datamodel Resource), prompt queue (5.2 Prompt Queue Resource) and DTMF and ASR recognition (5.3 Recognition Resources).
[Later versions will define the following resources: recorder, SIV. Later versions may define the following resources: session recorder, ...]
Editorial note: Later versions of this document will clarify that different datamodels may be instanced, such as ECMAScript, XML, etc. Conformance requirements will be stated at a later stage.
The datamodel is a repository for both user- and system-defined data and properties. To simplify variable lookup, we define the datamodel with a synchronous function-call API, rather than an asynchronous one based on events. The data model API does not assume any particular underlying representation of the data or any specific access language, thus allowing implementations to plug in different concrete data model languages.
There is a single global data model that is created when the system is first initialized. Access to data is controlled by means of scopes, which are stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "Global" is created. Thereafter scopes are explicitly created and destroyed by the data model's clients.
Editorial note: Resource and Resource Controller descriptions to be updated with API calls rather than events.
Function | Arguments | Return Value | Sequencing | Description |
---|---|---|---|---|
CreateScope | name (optional) | Success or Failure | | Creates a new scope object and pushes it on top of the scope stack. If no name is provided the scope is anonymous and may be accessed only when it is on the top of the scope stack. A Failure status is returned if a scope already exists with the specified name. |
DeleteScope | name (optional) | Success or Failure | | Removes a scope from the scope stack. If no name is provided, the topmost scope is removed. Otherwise the scope with the provided name is removed. A Failure status is returned if the stack is empty or no scope with the specified name exists. |
CreateVariable | variableName, value (optional), scopeName (optional) | Success or Failure | | Creates a variable. If scopeName is not specified, the variable is created in the topmost scope on the scope stack. If no value is provided, the variable is created with the default value specified by the underlying datamodel. A Failure status is returned if a variable of the same name already exists in the specified scope. |
DeleteVariable | variableName, scopeName (optional) | Success or Failure | | Deletes the variable with the specified name from the specified scope. If no scopeName is provided, the variable is deleted from the topmost scope on the stack. A Failure status is returned if no variable with the specified name exists in the scope. |
UpdateVariable | variableName, newValue, scopeName (optional) | Success or Failure | | Assigns a new value to the specified variable. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. A Failure status is returned if the specified variable or scope cannot be found. |
ReadVariable | variableName, scopeName (optional) | value | | Returns the value of the specified variable. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. An error is raised if the specified variable or scope cannot be found. |
EvaluateExpression | expr, scopeName (optional) | value | | Evaluates the specified expression and returns its value. If scopeName is not specified, the expression is evaluated in the topmost scope on the stack. An error is raised if the specified scope cannot be found. |
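As a non-normative sketch of how this conceptual API might be exercised, the following VoiceXML 2.x-style fragment is annotated with the data model call each element could plausibly map onto. The mapping is an assumption made for illustration only, not a defined binding.

<!-- Illustrative mapping onto the conceptual Data Model Resource API; not normative. -->
<form id="booking">                                  <!-- CreateScope("dialog") on entry -->
  <var name="city"/>                                 <!-- CreateVariable("city") in the top scope -->
  <assign name="city" expr="'Boston'"/>              <!-- UpdateVariable("city", "'Boston'") -->
  <block>
    <prompt>Flying to <value expr="city"/>.</prompt> <!-- EvaluateExpression("city") -->
  </block>
</form>                                              <!-- DeleteScope() on exit -->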
Issue ():
Do we need event listeners on the data model, e.g., to notify when the value of a variable changes?
Resolution:
None recorded.
Here is a UML representation of the prompt queue. This state machine assumes that "queue" and "play" are separate commands and that a separate "play" will always be issued to trigger the play. When the "play" is issued, the system plays any queued prompts, up to and including the first fetch audio in the queue. Then it halts, even if there are additional prompts or fetch audio in the queue, and waits for another "play" command.
Editorial note: Open issue: Can queued prompt commands, either audio or TTS, be left un-fetched or un-rendered until a play command is issued to the prompt resource? This may result in delays or gaps in the production of the actual audio, as the rendering or fetching may not produce playable audio fast enough to avoid inter-prompt delays.
The prompt structure assumed here is fairly abstract. It consists of a specification of the audio along with optional parameters controlling playback (for example, speed or volume.) The audio may be presented in-line, as SSML or some other markup language, or as a pointer to a file or streaming audio source. Logically, URLs are dereferenced at the time the prompt is queued, but implementations are not required to fetch the actual media until the prompt in question is sent to the player device. Note that the player device is assumed to be able to handle both recorded prompts and TTS, and to be able to interpret SSML. Platforms are free to optimize their implementations as long as they conform to the state machine specified here. In particular, platforms may prefetch audio or begin TTS processing in the background before the prompt is sent to the player device. For applications that make use of VCR controls (speed up, skip forward, etc.), actual performance may depend on whether the platform has implemented such optimizations. For example, a request to skip forward on a platform that does not prefetch prompts may result in a long delay. Such performance issues are outside the scope of this specification.
This diagram assumes that SSML mark information is delivered in the Player.Done event, and that the player returns a Player.Done event when it is sent a 'halt' event (otherwise mark information would get lost on barge-in and hangup, etc).
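For example, a queued prompt whose audio is supplied inline as SSML, with a mark whose name and time would be reported back in the Player.Done event, might look like the following (illustrative only):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Your flight departs at <say-as interpret-as="time">8:30 AM</say-as>.
  <mark name="afterDepartureTime"/>
  Please stay on the line for gate information.
</speak>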
Note that the "FetchAudio" state is shown stubbed out for reasons of space, and is expanded in a separate diagram below the main one.
Figure X: Prompt Queue Model
Figure Y: Fetch audio Model
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data name="queue"/>
    <data name="markName"/>
    <data name="markTime"/>
    <data name="bargeInType"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <transition event="QueuePrompt">
      <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/prompt"/>
    </transition>
    <transition event="QueueFetchAudio">
      <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt">
        <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
          <else>
            <assign loc="$node[@bargeInType]" val="unbargeable"/>
          </else>
        </if>
      </foreach>
      <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/audio"/>
    </transition>
    <transition event="setParameter">
      <send target="player" event="setParameter" namelist="_eventData.paramName, _eventData.newValue"/>
    </transition>
    <transition event="Cancel" target="Idle">
      <send target="player" event="halt"/>
      <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
      <delete loc="datamodel/data[@name='queue']/prompt"/>
    </transition>
    <transition event="CancelFetchAudio">
      <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt">
        <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
        </if>
      </foreach>
    </transition>
    <state id="Idle">
      <onentry>
        <assign loc="/datamodel/data[@name='markName']" val=""/>
        <assign loc="/datamodel/data[@name='markTime']" val="-1"/>
        <assign loc="/datamodel/data[@name='bargeInType']" val=""/>
      </onentry>
      <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'false'" target="PlayingPrompt"/>
      <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state>
    <state id="PlayingPrompt">
      <datamodel>
        <data name="currentPrompt"/>
      </datamodel>
      <onentry>
        <assign loc="/datamodel/data[@name='currentPrompt']/prompt" val="/datamodel/data[@name='queue']/prompt[1]"/>
        <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
        <if cond="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType] != /datamodel/data[@name='bargeInType']">
          <send event="BargeInChange" namelist="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
          <assign loc="/datamodel/data[@name='bargeInType']" expr="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
        </if>
      </onentry>
      <invoke targettype="player" srcexpr="/datamodel/data[@name='currentPrompt']/prompt"/>
      <finalize>
        <if cond="_eventData/MarkTime neq '-1'">
          <assign loc="/datamodel/data[@name='markName']" val="_eventData/markName.text()"/>
          <assign loc="/datamodel/data[@name='markTime']" val="_eventData/markTime.text()"/>
        </if>
      </finalize>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[last()] le '1'" target="Idle">
        <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
      </transition>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] neq 'true'" target="PlayingPrompt"/>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state> <!-- end PlayingPrompt -->
    <state id="FetchAudio">
      <initial id="WaitFetchAudio"/>
      <transition event="player.Done" target="FetchAudioFinal"/>
      <state id="WaitFetchAudio">
        <onentry>
          <send target="self" event="fetchAudioDelay" delay="/datamodel/data[@name='queue']/prompt[1][@fetchaudiodelay]"/>
        </onentry>
        <transition event="fetchAudioDelay" target="StartFetchAudio"/>
        <transition event="cancelFetchAudio" target="FetchAudioFinal"/>
      </state>
      <state id="StartFetchAudio">
        <datamodel>
          <data name="fetchAudio"/>
        </datamodel>
        <onentry>
          <assign loc="/datamodel/data[@name='fetchAudio']" expr="/datamodel/data[@name='queue']/prompt[1]"/>
          <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
          <send target="self" event="fetchAudioMin" delay="/datamodel/data[@name='fetchAudio'][@fetchaudiominimum]"/>
          <send target="player" event="Play" namelist="/datamodel/data[@name='fetchAudio']"/>
          <if cond="/datamodel/data[@name='bargeInType'].text() ne 'fetchAudio'">
            <send event="BargeInChange" namelist="fetchAudio"/>
          </if>
        </onentry>
        <transition event="CancelFetchAudio" target="WaitFetchMinimum"/>
        <transition event="fetchAudioMin" target="WaitFetchCancel"/>
      </state>
      <state id="WaitFetchMinimum">
        <transition event="fetchAudioMin" target="FetchAudioFinal">
          <send target="player" event="halt"/>
        </transition>
      </state>
      <state id="WaitFetchCancel">
        <transition event="CancelFetchAudio" target="FetchAudioFinal">
          <send target="player" event="halt"/>
        </transition>
      </state>
      <state id="FetchAudioFinal" final="true"/> <!-- could put cleanup handling here -->
    </state> <!-- end FetchAudio -->
  </state> <!-- end Created -->
</scxml>
The prompt queue resource can be controlled by means of the following events:
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
queuePrompt | any | prompt (M), properties (O) | | adds prompt to queue, but does not cause it to be played |
queueFetchAudio | any | prompt (M) | | adds fetch audio to queue, removing any existing fetch audio from the queue. Does not cause it to be played. |
play | any | | | causes any queued prompts or fetch audio to be played |
changeParameter | any | paramName, newValue | | sets the value of paramName to newValue, which may be either an absolute or relative value. The new setting takes effect immediately, even if a prompt is already playing. |
cancelFetchAudio | any | | | deletes any queued fetch audio. Also cancels any fetch audio that is already playing, unless fetchAudioMin has been specified and not yet reached. |
cancel | any | | | immediately cancels any prompt or fetch audio that is playing and clears the queue. |
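A minimal sketch of how a resource controller might drive the prompt queue with these events, written in the same SCXML notation used for the resource definitions; the state, target, and data names are illustrative, not defined by this specification:

<!-- Illustrative controller fragment: queue two prompts, start playback,
     and wait for the queue to drain. -->
<onentry>
  <send target="promptQueue" event="queuePrompt" namelist="welcomePrompt"/>
  <send target="promptQueue" event="queuePrompt" namelist="menuPrompt"/>
  <send target="promptQueue" event="play"/>
</onentry>
<transition event="prompt.Done" target="Listening"/>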
The prompt queue resource returns the following events to its invoker:
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
prompt.Done | controller | markName (O), markTime (O) | | indicates the prompt queue has played to completion and is now empty |
bargeintypeChange | controller | one of: unbargeable, hotword, energy, fetchAudio | | sent at the start of prompt play and whenever a new prompt or fetch audio is played whose bargeinType differs from the preceding one. |
Issue ():
Do we need 'fetchAudio' as a distinct bargein type?
Resolution:
None recorded.
The prompt queue receives the following events from the underlying player:
Event | Payload | Sequencing | Description |
---|---|---|---|
player.Done | | | sent whenever a single prompt or piece of fetch audio finishes playing. |
and sends the following events to the underlying device:
Event | Payload | Sequencing | Description |
---|---|---|---|
play | prompt (M) | | sent to the platform to cause a single prompt to be played. |
setParameter | paramName (M), value (O) | | sent to the platform to change the value of a playback parameter such as speed or volume. The new value may be absolute or relative. The change takes effect immediately. |
Two types of recognition resources are defined: DTMF recognition for recognition of DTMF input; and ASR recognition for recognition of speech input. Both recognition resources are associated with a device which implements their respective recognition services. Each device represents one or more actual recognizer instances. In case of a device implemented with multiple recognizers - for example two different speech recognition engines - it is the responsibility of the interpreter implementation to ensure that they adhere to the semantic model defined in this section.
DTMF and ASR recognition resources are semantically similar. They share the same state and eventing model as well as recognition processing, timing and result handling. However, the resources differ in the following respects:
Otherwise, ASR and DTMF recognition resources share the same semantic model.
If a resource controller activates both DTMF and ASR recognition resources, then that resource controller is responsible for managing the resources so that only a single recognition result is produced per recognition cycle.
Within its created state, the recognition resource operates as follows: grammars are added to the resource and subsequently prepared on the device; recognition with these grammars can be activated and suspended; and recognition results are returned.
When the recognition resource is ready to recognize (at least one active grammar), one or more recognition cycles may occur in sequence.
Thus a recognition resource may enter multiple recognition cycles (as required for 'hotword' recognition), while requiring that a device, even if it has multiple instantiations, only produces one set of recognition results per recognition cycle.
The recognition resource is defined in terms of a data model and state model.
The data model is composed of the following elements:
The state model is composed of states corresponding to functional state: idle, preparing grammars, ready to recognize, recognizing, suspended recognition and waiting for results.
In the idle state, the resource awaits events from resource controllers to activate grammars for recognition on the device. The data model - activeGrammars, properties, controller and mode - is (re-)initialized upon entry to this state: activeGrammars is cleared, and properties and controller are set to null. If the resource receives an 'addGrammar' event, a new item is added to activeGrammars using the grammar, properties and listener data in the event payload. If the resource receives a 'prepare' event, it updates its data model with the event data: 'properties' is updated with the properties event data and 'controller' with the controller event data. Subsequent event notifications and responses are sent to the resource controller identified as the 'controller'. The recognition resource then moves into the preparing grammars state.
In the preparing grammars state, the resource behavior depends on whether activeGrammars is empty or not. If activeGrammars is empty (i.e. no active grammars are defined for this recognition resource), the resource sends the controller a 'notPrepared' event and returns to the idle state. If activeGrammars is non-empty, the resource sends a 'prepare' event to the device. The event payload includes 'grammars' and 'properties' parameters. The 'grammars' value is an ordered list where each list item is a grammar's content and its properties extracted from activeGrammars. The order of grammars in the 'grammars' parameter must follow the order in the activeGrammars data model. If the device sends a 'prepared' event, the resource sends a 'prepared' event to the controller and transitions into the ready to recognize state.
When the recognition resource is in a ready to recognize state, it may receive a 'stop' event. In this case, the resource sends a 'stop' event to the device, and returns to the idle state. If the resource receives a 'listen' event, it sends a 'listen' event to the device and moves into the recognizing state.
When the resource is in a recognizing state, it can toggle between this state and a suspended recognizing state by receiving further 'listen' and 'suspend' events. If the resource receives a 'suspend' event, then it moves into the suspended recognizing state and sends the device a 'suspend' event which causes the device to suspend recognition and delete any buffered input. No input is buffered while the device is in a suspended state. If the resource then receives a 'listen' event, it moves back into the recognizing state.
When in the recognizing state, the resource may receive an 'inputStarted' event from the device, indicating that user input has been detected. The resource then moves into a waiting for results state. The device may send an 'error' event (for example, if maximum time has been exceeded) causing it to return to the idle state and send the controller an 'error' event. Alternatively, the device may send a 'recoResults' event, which contains a results parameter, a data structure representing recognition results in VoiceXML 2.0 or EMMA format. The structure may contain zero or more recognition results. Each result must specify the grammar associated with the recognition (using the same grammar name as used in the payload of the 'prepare' event), its recognition confidence and its input mode. The resource sends its controller a 'recoResults' event with event data containing the device's results parameter together with a listener parameter whose value is the listener associated with the grammar of the first result with the highest confidence (if there are no results, then the listener parameter is not defined). The resource then returns to the ready to recognize state, awaiting either a 'stop' event to terminate recognition or a 'listen' event to start another recognition cycle using the same active grammars and recognition properties.
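The cycle described above can be sketched in the same SCXML notation used elsewhere in this section, with a controller activating a grammar and then handling the outcome. The target, state, and data names below are illustrative assumptions, not defined identifiers:

<!-- Illustrative controller fragment for a single recognition cycle. -->
<onentry>
  <send target="asrResource" event="addGrammar" namelist="cityGrammar cityListener grammarProps"/>
  <send target="asrResource" event="prepare" namelist="thisController recoProps"/>
</onentry>
<transition event="prepared">
  <send target="asrResource" event="listen"/>
</transition>
<transition event="recoResults" target="ProcessResult"/>
<transition event="error" target="HandleError"/>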
A recognition resource is defined by the events it receives:
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
addGrammar | any | grammar (M), listener (M), properties (O) | | creates a grammar item composed of the grammar, listener and properties, and adds it to activeGrammars |
prepare | any | controller (M), properties (M) | | prepares the device for recognition using activeGrammars and properties |
listen | any | | | initiates/resumes recognition |
suspend | any | | | suspends recognition |
stop | any | | | terminates recognition |
and the events it sends:
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
prepared | controller | one-of: prepared, notPrepared | | positive response to prepare (activeGrammars prepared) |
notPrepared | controller | one-of: prepared, notPrepared | | negative response to prepare (no activeGrammars defined) |
inputStarted | controller | | | notification that the onset of input has been detected |
inputFinished | controller | | | notification that the end of input has been detected |
partialResult | controller | results (M), listener (O) | | notification of a partial recognition result |
recoResult | controller | results (M), listener (O) | | notification of a complete recognition result, including the results structure and a listener |
error | controller | error status (M) | | notification that an error has occurred |
The resource receives from the recognition device the following events:
Event | Payload | Sequencing | Description |
---|---|---|---|
prepared | | | response to prepare indicating that activeGrammars have been successfully prepared |
inputStarted | | | notification that the onset of input has been detected |
inputFinished | | | notification that the end of input has been detected |
partialResults | results (M) | | notification of partial recognition results |
recoResults | results (M) | | notification of final recognition results |
error | error status (M) | | an error occurred |
and sends to the recognition device the following events:
Event | Payload | Sequencing | Description |
---|---|---|---|
prepare | grammars (M), properties (M) | | the recognition device is prepared with grammars and properties |
listen | | | recognition is to be initiated |
suspend | | | recognition is to be suspended |
stop | | | recognition is to be stopped |
In VoiceXML 3.0, the language is partitioned into independent modules which can be combined in various ways. In addition to the modules defined in this section, it is also possible for third parties to define their own modules (see Section XXX).
Each module is assigned a schema, which defines its syntax, plus one or more Resource Controllers (RCs), which define its semantics, plus a "constructor" that knows how to create them from the syntactic representation at initialization time. Only DOM nodes that have schemas and constructors (and hence RCs) assigned to them can be modules in VoiceXML 3.0. However, we may choose to define constructors and RCs for nodes that are not modules. Nodes that do not have constructors and RCs ultimately depend on some module for their interpretation. (Those modules are usually ancestor nodes, but we do not require this.) There can be multiple modules associated with the same VoiceXML element. They may set properties differently, add different child elements, etc. In many cases, some of the modules will be extensions of the others, but we don't require this.
Note there is not necessarily a one-to-one relationship between semantic RCs and syntactic markup elements. It may take several RCs to implement the functionality of a single markup element.
This module describes the syntactic and semantic features of a <grammar> element which defines grammars used in ASR and DTMF recognition. Grammars defined via this module are used by other modules.
The attributes and content model of <grammar> are specified in 6.1.1 Syntax. Its semantics are specified in 6.1.2 Semantics.
[See XXX for schema definitions].
The <grammar> element has the attributes specified in Table 10.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
mode | The only allowed values are "voice" and "dtmf" | Defines the mode of the grammar following the modes of the W3C Speech Recognition Grammar Specification [SRGS]. | No | The value of the document property "grammarmode" |
weight | Weights are simple positive floating point values without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or many digits. | Specifies the weight of the grammar. See vxml2: Section 3.1.1.3 | No | 1.0 |
fetchhint | One of the values "safe" or "prefetch" | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. | No | None |
fetchtimeout | Time Designation | The interval to wait for the content to be returned before throwing an error.badfetch event. | No | None |
maxage | An unsigned integer | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. | No | None |
maxstale | An unsigned integer | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. | No | None |
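For illustration, a <grammar> element using several of these attributes might be written as follows. The attribute values are chosen for the example, and the single child required by the content model below is elided:

<!-- Illustrative only: attribute values are examples. -->
<grammar mode="dtmf" weight="0.8" fetchhint="prefetch" fetchtimeout="5s" maxage="60">
  <!-- exactly one child per the content model below, e.g. an inline SRGS grammar
       or an external grammar reference -->
</grammar>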
Editorial note: The default value of the "grammarmode" document property (see XXXX) is "voice".
The content model of <grammar> consists of exactly one of:
The grammar RC is the primary RC for the <grammar> element.
The grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the grammar RC first initializes its child.
In the Ready state, when the grammar RC receives an 'execute' event it transitions to the Executing state.
In the Executing state,
If the child RC is an External Grammar, the grammar RC sends an 'execute' event to the child RC and waits for it to complete.
Then, the grammar RC sends an AddGrammar event to the DTMF Recognizer Resource if mode="dtmf" or to the ASR Recognizer Resource if mode="voice", with the following as event data: the child RC, the fetchhint, language, charset, and encoding parameter values, and the controller RC (e.g., link, field, or form) as the handler for recognition results.
Finally, the grammar RC sends the controller an executed event and transitions to the Ready state.
Editorial note | |
Initializing: Validate that behavior of sending a pointer to the child RC to the ASR resource. Is this acceptable, or do we need to extract the grammar data from the child RC and then send that data? The advantage of sending the RC pointer is that it makes clear what kind of grammar info it is -- inline SRGS or external reference. Execute issues:
Editor will write new section 4.5 "Other" and subsections 4.5.1 "property/attribute resolution" and 4.5.2 "language resolution". Depending on the text, we may need to update the semantics to refer to section 4.5.2 when describing how xml:lang is used. |
The Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
execute | controller | | Adds the grammar to the appropriate Recognition Resource |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
executed | controller | | response to execute event indicating that it has been successfully executed |
The external events sent and received by the Grammar RC are those defined in this table:
Event | Source | Target | Description |
addGrammar | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Adds grammar to list of currently active grammars |
Prefetch | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Requests that the grammar be fetched/compiled in advance, if possible |
The events in this table may be raised during initialization and execution of the <grammar> element.
Event | Description | State |
---|---|---|
error.semantic | indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution |
Note that additional errors may occur when the grammar is fetched or added by the ASR or DTMF resource. Please check there for details.
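As an informal illustration only, a <grammar> element using the attributes above might be written as follows; the grammar URI is hypothetical, and the <externalgrammar> child element is the one defined in 6.3 (its name is still under discussion):
<grammar mode="voice" weight="1.2" fetchhint="prefetch" fetchtimeout="10s">
  <externalgrammar src="http://www.example.com/cities.grxml#city"/>
</grammar>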
This module describes the syntactic and semantic features of inline SRGS grammars used in ASR and DTMF recognition.
Editorial note | |
Issue: Do we need to support inline ABNF SRGS?: |
The attributes and content model of Inline SRGS grammars are specified in 6.2.1 Syntax. Their semantics are specified in 6.2.2 Semantics.
[See XXX for schema definitions].
The syntax of the Inline SRGS Grammar Module is precisely all of the XML markup for a legal stand-alone XML form grammar as described in SRGS ([SRGS]), minus the XML Prolog. Note that both elements and attributes must be in the SRGS namespace (http://www.w3.org/2001/06/grammar).
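For illustration, a minimal fragment of this kind might look like the following sketch (rule and token names are hypothetical):
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" root="answer" xml:lang="en-US">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>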
The Inline SRGS grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
Editorial note | |
Should the contents of the grammar parameter be parsed rather than the raw document text? For example, should it be the DOM representation of the grammar, or just the XML Info set, or what? |
The Inline SRGS grammar RC's state model consists of the following states: Idle, Initializing, and Ready. Unlike most of the other modules, this module is primarily a data model for storing a grammar. The module itself has no execution semantics.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the syntactic contents of the grammar are saved into the grammar parameter. The RC sends the controller an 'initialized' event and transitions to the Ready state.
The Inline SRGS Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
This module describes the syntactic and semantic features of an <externalgrammar> element which defines external grammars used in ASR and DTMF recognition.
Editorial note | |
The name of this element is still under discussion. |
The attributes and content model of <externalgrammar> are specified in 6.3.1 Syntax. Its semantics are specified in 6.3.2 Semantics.
[See XXX for schema definitions].
The <externalgrammar> element has the attributes specified in Table 17.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
src | anyURI | The URI specifying the location of the grammar and optionally a rulename within that grammar. The URI is interpreted as a rule reference as defined in Section 2.2 of the Speech Recognition Grammar Specification [SRGS], but not all forms of rule reference are permitted from within VoiceXML. The rule reference capabilities are described in detail below this table. | No | None |
srcexpr | A data model expression | Equivalent to src, except that the URI is dynamically determined by evaluating the content as a data model expression. | No | ||||||||||||||||||||||
type | string | The preferred media type of the grammar. A resource indicated by the URI reference in the src attribute may be available in one or more media types. The author may specify the preferred media-type via the type attribute. When the content represented by a URI is available in many data formats, a VoiceXML platform may use the preferred media-type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media-type to order the preferences in the negotiation. The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'application/srgs+xml' document). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative. Three special cases may arise. The declared media-type may not be supported by the processor; in this case, an error.unsupported.format is thrown by the platform. The declared media-type may be supported but the actual media-type may not match; an error.badfetch is thrown by the platform. Finally, there may be no declared media-type; the behavior depends on the specific URI scheme and the capabilities of the grammar processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616], section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples: | No | None |
Editorial note | |
Error messages for "type" attribute need to be updated. |
See 6.3.1.2 Content Model for restrictions on occurrence of src and srcexpr attributes.
The value of the src attribute is a URI specifying the location of the grammar with an optional fragment for the rulename. Section 2.2 of the Speech Recognition Grammar Specification [SRGS] defines several forms of rule reference. The following are the forms that are permitted on a grammar element in VoiceXML:
The following are the forms of rule reference defined by [SRGS] that are not supported in VoiceXML 3.
The <externalgrammar> element has the following co-occurrence constraints:
Editorial note | |
Editor: please remove the "otherwise, an error.badfetch ..." from the above and all other co-occurrence text and write general text somewhere describing what happens when a co-occurrence constraint is violated. |
The External Grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The External Grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, when the External Grammar RC receives an 'execute' event it transitions to the Executing state.
In the Executing state, if the srcexpr variable is set, it is evaluated against the data model as a data model expression and the resulting value is placed into the src variable. If srcexpr cannot be evaluated, an error.semantic event is thrown; otherwise, the RC sends an 'executed' event to the controller RC and transitions into the Ready state.
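As an informal sketch (the element name is still under discussion, and the URIs, the rulename fragment, and the 'lang' variable in the data model expression are hypothetical), an external grammar reference might be written in either of the following ways:
<externalgrammar src="http://www.example.com/dates.grxml#month"/>
<externalgrammar srcexpr="'http://www.example.com/' + lang + '/dates.grxml'"/>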
The External Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
execute | controller | | Evaluates srcexpr and populates src variable |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
executed | controller | | response to execute event indicating that it has been successfully executed |
The events that may be raised during initialization and execution of the <externalgrammar> element are those defined in Table 21 below.
Event | Description | State |
---|---|---|
error.semantic | indicates that there was an error in the evaluation of the srcexpr attribute. | execution |
This module defines the syntactic and semantic features of a <prompt> element which controls media output. The content model of this element is empty: content is defined in other modules which extend this element's content model (for example 6.5 Builtin SSML Module, 6.6 Media Module and 6.7 Parseq Module).
The attributes and content model of <prompt> are specified in 6.4.1 Syntax. Its semantics are specified in 6.4.2 Semantics, including how the final prompt content is determined and how the prompt is queued for playback using the PromptQueue Resource (5.2 Prompt Queue Resource).
[See XXX for schema definitions].
The <prompt> element has the attributes specified in Table 22.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
bargein | boolean | Controls whether the prompt can be interrupted. | No | bargein property |
bargeintype | string | On prompts that can be interrupted, determines the type of bargein, either 'speech' or 'hotword'. | No | bargeintype property |
cond | data model expression | A data model expression that must evaluate to true after conversion to boolean in order for the prompt to be played. | No | true |
count | positive integer | A number indicating the repetition count, allowing a prompt to be activated or not depending on the current repetition count. | No | 1 |
timeout | Time Designation | The time to wait for user input. | No | timeout property |
xml:lang | string | The language identifier for the prompt. | No | document's "xml:lang" attribute |
xml:base | string | Declares the base URI from which relative URIs in the prompt are resolved. | No | document's "xml:base" attribute |
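For illustration only (a sketch; the prompt text is hypothetical, and the plain-text content assumes the content model has been extended by the Builtin SSML Module in 6.5), a <prompt> element using these attributes might be written as:
<prompt bargein="true" bargeintype="speech" count="2" timeout="5s">
  Please say the name of the city you are calling about.
</prompt>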
The prompt RC is the primary RC for the <prompt> element.
The prompt RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The prompt RC's state model consists of the following states: Idle, Initializing, Ready, and Executing. The initial state is the Idle state.
While in the Idle state, the prompt RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The prompt RC then transitions into the Initializing state.
In the Initializing state, the prompt RC initializes its children: this is modeled as a separate RC (see XXX). The children may return an error for initialization. If a child sends an error, then the prompt RC returns an error. When all children are initialized, the prompt RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, the prompt RC can receive a 'checkStatus' event to check whether this prompt is eligible for execution or not. The value of the cond parameter in its data model is checked against the data model resource: the status is true if the value of the cond parameter evaluates to true. The status, together with its count data, is sent in a 'checkedStatus' event to the controller RC. The controller RC then determines if the prompt is selected for execution ([vxml20: 4.1.6], see PromptSelectionRC, Section XXX). If the prompt RC receives an 'execute' event it transitions to the Executing state.
In the Executing state, the prompt RC sends an evaluate event to its children. Each child returns either an error, or content (which may include parameters) for playback. If a child sends an error, then the prompt RC returns an error. Once evaluation is complete, the RC sends a queuePrompt event to the Prompt Queue Resource with the <prompt> parameters (bargein, bargeintype, timeout) with event data consisting of the list of content returned by its children. The prompt RC then sends the controller an executed event and transitions to the Ready state.
Editorial note | |
SSML validation issue: what if evaluation results in a non-valid structure? |
The Prompt RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
checkStatus | controller | | causes evaluation of the cond parameter against the data model |
execute | controller | | causes the evaluation of its content and conversion to a format suitable for queueing on the PromptQueue Resource |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
checkedStatus | controller | status (M), count (M) | response to checkStatus event with count parameter and status indicating evaluation of cond parameter |
executed | controller | | response to execute event indicating that it has been successfully executed |
Table 25 shows the events sent and received by the prompt RC to resources and other RCs which define the events.
Event | Source | Target | Description |
evaluate | PromptRC | DataModel | used to evaluate the cond parameter (see XXX) |
queuePrompt | PromptRC | PromptQueue | adds prompt content and properties to the Prompt Queue (see XXX) |
The events in Table 26 may be raised during initialization and execution of the <prompt> element.
Event | Description | State |
---|---|---|
error.unsupported.language | indicates that an unsupported language was encountered. The unsupported language is indicated in the event message variable. | execution |
error.unsupported.element | indicates that an element within the <prompt> element is not supported | initialization |
error.badfetch | indicates that the prompt content is malformed ... | initialization, execution |
error.noresource | indicates that a Prompt Queue resource is not available for rendering the prompt content. | execution |
error.semantic | indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution |
Editorial note | |
The relationship between the user visible events defined in the above table, and semantic event model has yet to be defined. Can we really determine whether errors are raised in initialization (syntax) or execution (evaluation) states? How does this fit in with errors returned when prompts are played in PromptQueue player implementation? ACTION: Clarify which specific cases are affected by 'error.badfetch' ambiguity re. initialization versus execution states. Clarify that error.semantic doesn't apply to evaluation of src/expr with <audio> (e.g. fallback). Clarify that errors are recorded? (vxml21??) Should media control properties (e.g. clipBegin, speed, etc) of <media> be also available on <prompt>? We should clarify where the error.badfetch gets thrown. For instance, if we are loading a document with malformed prompt elements, the error.badfetch may get thrown back to the calling document. If we are throwing error.badfetch during execution, then it will be thrown back to the malformed document itself? |
This module describes the syntactic and semantic features of SSML elements built into VoiceXML.
This module is designed to extend the content model of the <prompt> element defined in 6.4 Prompt Module.
The attributes and content model of SSML elements are specified in 6.5.1 Syntax. Its semantics are specified in 6.5.2 Semantics, including how elements are evaluated to yield final content for playback.
[See XXX for schema definitions].
This module defines an SSML ([SSML]) Conforming Speech Synthesis Markup Language Fragment where:
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
fetchtimeout | See fetchtimeout definition | No | fetchtimeout property | |
fetchhint | See fetchhint definition | No | audiofetchhint property | |
maxage | See maxage definition | No | audiomaxage property | |
maxstale | See maxstale definition | No | audiomaxstale property | |
expr | A data model expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded (see Record Module) or evaluate to the URI of an audio resource to fetch. | No | undefined |
Exactly one of "src" or "expr" attributes must be specified; otherwise, an error.badfetch event is thrown.
Editorial note | |
SSML 1.1 required for fetching attributes like fetchtimeout? Or profile dependent? Support for 'say-as' extension to SSML 1.0? Support for <enumerate>? Note that profiles specify which media formats are required |
When the RC receives an evaluate event, its children are evaluated in order to return an SSML Conforming Stand-Alone Speech Synthesis Markup Language Document which can be processed by a Conforming Speech Synthesis Markup Language Processor.
Evaluation comprises:
Editorial note | |
We may want to refine the description that the output of evaluation is an SSML Document. One rationale is that we don't want to prohibit that SSML extensions are lost during evaluation. The output may be another Fragment rather than a Document. Clarify exact nature of <audio> expr value for skipping - undefined vs. null? Need to specify further error cases Do these elements have RCs? They are in the VoiceXML namespace but are just enhanced SSML elements. Need to clarify unsupported languages and external (e.g. MRCP) SSML processors. |
In this example
<prompt>
  <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
  </foreach>
</prompt>
evaluation returns a sequence of content for each item in <foreach> with <audio> and <value> elements.
Assume that the array consists of two items, whose item.audio properties evaluate to 'one.wav' and 'two.wav' respectively, and whose item.tts properties evaluate to 'one' and 'two' respectively. Evaluation of <foreach> is equivalent to the following
<prompt>
  <audio expr="'one.wav'"><value expr="'one'"/></audio>
  <break time="300ms"/>
  <audio expr="'two.wav'"><value expr="'two'"/></audio>
  <break time="300ms"/>
</prompt>
further evaluation of the <audio> and <value> elements results in
<prompt>
  <audio src="one.wav">one</audio>
  <break time="300ms"/>
  <audio src="two.wav">two</audio>
  <break time="300ms"/>
</prompt>
and finally the prompt content is converted into a stand-alone SSML document (assuming the <prompt>'s xml:lang attribute evaluates to 'en'):
<speak version="1.0" xml:lang="en" xmlns="http://www.w3.org/2001/10/synthesis">
  <audio src="one.wav">one</audio>
  <break time="300ms"/>
  <audio src="two.wav">two</audio>
  <break time="300ms"/>
</speak>
This content is queued and played using the PromptQueue: each audio URI, or fallback content, is played, followed by a 300 millisecond break.
The media module defines the syntax and semantics of the <media> element.
The module is designed to extend the content model of <prompt> in the prompt module (6.4 Prompt Module).
The <media> element can be seen as an enhanced and generalized version of the VoiceXML <audio> element. It is enhanced in that it provides additional attributes describing the type of media and conditional selection, as well as control over playback. It is a generalization of the <audio> element in that it permits media other than audio to be played; for example, media formats which contain audio and video tracks.
[See XXX for schema definitions].
The <media> element has the attributes specified in Table 28.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
src | anyURI | The URI specifying the location of the media source. | No | None |
srcexpr | A data model expression | A data model expression which evaluates to a URI indicating the location of the media resource. | No | undefined |
cond | A data model expression | A data model expression that must evaluate to true after conversion to boolean in order for the media to be played. | No | true |
type | string | The preferred media type of the output resource. A resource indicated by the URI reference in the src attribute may be available in one or more media types, and the author may specify the preferred media-type via the type attribute. The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'audio/x-wav' or 'video/3gpp' resource). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative. Three special cases may arise. | No | undefined |
clipBegin | Time Designation | offset from start of media to begin rendering. This offset is measured in normal media playback time from the beginning of the media. | No | 0s |
clipEnd | Time Designation | offset from start of media to end rendering. This offset is measured in normal media playback time from the beginning of the media. | No | None |
repeatDur | Time Designation | total duration for repeatedly rendering media. This duration is measured in normal media playback time from the beginning of the media. | No | None |
repeatCount | positive Real number | number of iterations of media to render. A fractional value describes a portion of the rendered media. | No | 1 |
soundLevel | signed ("+" or "-") CSS2 numbers immediately followed by "dB" | Decibel values are interpreted as a ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0) and are defined in terms of dB: soundLevel(dB) = 20 log10 (a1 / a0) A setting of a large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half the amplitude of its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice the amplitude of its current signal amplitude (subject to hardware limitations). The absolute sound level of media perceived is further subject to system volume settings, which cannot be controlled with this attribute. | No | +0.0dB |
speed | x% (where x is a positive real value) | the speed at which to play the referenced media, relative to the original speed. The speed is set to the requested percentage of the speed of the original media. For audio, a change in the speed will change the rate at which recorded samples are played back and this will affect the pitch. | No | 100% |
outputmodes | space separated list of media types | Determines the modes used for media output. See 8.2.4 Media Properties for further details. | No | outputmodes property |
See occurrence constraints for restrictions on occurrence of src and srcexpr attributes.
Calculations of rendered durations and interaction with other timing properties follow SMIL 2.1 Computing the active duration where
Note that not all SMIL 2.1 Timing features are supported.
Editorial note | |
Use SMIL 3.0 or SMIL 2.1 reference? should trimming and media attributes also be defined in <prompt>? do we need expr values for type, clipBegin, clipEnd, repeatDur, repeatCount, etc? (Perhaps add implied expr for every attribute?) when is a property evaluation error thrown? Add fetchtimeout, fetchhint, maxage and maxstale attributes Major attribute candidate: errormode (flexible error handling which controls whether errors are thrown or fallback is used). Other candidate attributes: id/idref (use case?) |
The <media> element content model consists of:
The <media> element has the following co-occurrence constraints:
Note that the type attribute does not affect inline content. The handling of inline XML content is in accordance with the namespace of the root element (such as SSML <speak>, SMIL <smil>, and so forth). CDATA, or mixed content with VoiceXML <foreach> or <value> elements, must be treated as an SSML Fragment and evaluated as described in 6.6.2 Semantics.
Editorial note | |
Permit other types of inline content apart from SSML? Are child <property> elements necessary? Alternative: extended <prompt> so that <property> children are allowed? |
Developers should be aware that there may be performance implications when using <media> depending on which attributes are specified, the media itself, its transport and processing.
Since operations like trimming, soundLevel and speed modifications are applied to media, this requires that the SSML processor begin generating output audio before these operations are applied. If the clipBegin attribute is specified, this may require generation of audio prior to clipBegin, depending on the implementation. This may lead to a gap between execution of the <media> element and start of playback.
If the media is fetched with the HTTP protocol and the clipBegin attribute is specified, then, unless the resource is cached locally, the part of the media resource before the clipBegin offset will still be fetched from the origin server. This may result in a gap between the execution of the <media> element and playback actually beginning.
Note also if <media> uses the RTSP protocol, and the VoiceXML platform supports this protocol, then the clipBegin attribute value may be mapped to the RTSP Range header field, thereby reducing the gap between element execution and the onset of playback.
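As a hypothetical illustration of these trade-offs (the resource URIs are invented), the following <media> elements each skip the first 30 seconds of a resource; with plain HTTP fetching the skipped portion may still be transferred, whereas an RTSP-capable platform might map the offset to a Range header:
<media type="audio/x-wav" clipBegin="30s" src="http://www.example.com/podcast.wav"/>
<media type="video/3gpp" clipBegin="30s" src="rtsp://streaming.example.com/podcast.3gp"/>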
When a media RC receives an evaluate event, the following operations are performed:
Editorial note | |
Semantics needs to address a mixed content model; e.g. CDATA and XML elements as children of the root. Do we require 'application/ssml+xml' type with SSML and CDATA content? Need to clarify where resource fetching takes place in the semantic model. Eg. in prompt initializing or executing state? or in prompt queue? This approach assumes the prompt queue applies media processing operations. Intended to fit with the VCR/RTC approach. What about streaming cases? Allow streams to be returned? Specify how errors are addressed. |
Playback of external audio media resource.
<media type="audio/x-wav" src="http://www.example.com/resource.wav"/>
Application of media operations to an audio resource. The soundLevel setting of +6.0dB approximately doubles the signal amplitude and the speed is reduced to 50%.
<media type="audio/x-wav" soundLevel="+6.0dB" speed="50%" src="http://www.example.com/resource.wav"/>
Playback of 3GPP media resource.
<media type="video/3gpp" src="http://www.example.com/resource.3gp"/>
Playback of 3GPP media resource with the speed doubled and playback ending after 5 seconds.
<media type="video/3gpp" clipEnd="5s" speed="200%" src="http://www.example.com/resource.3gp"/>
Playback of external SSML document.
<media type="application/ssml+xml" src="http://www.example.com/resource.ssml"/>
Inline CDATA content with a <value> element
<media> Ich bin ein Berliner, said <value expr="speaker"/> </media>
which is syntactically equivalent to
<media>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner, said <value expr="speaker"/>
  </speak>
</media>
Inline SSML content to which gain and clipping operations are applied.
<media soundLevel="+4.0dB" clipBegin="4s">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
  </speak>
</media>
Inline SSML with audio media fallback.
<media soundLevel="+4.0dB" clipBegin="4s">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
  </speak>
  <media type="audio/x-wav" src="ichbineinberliner.wav"/>
</media>
This module defines the syntax and semantics of <par> and <seq> elements. The <par> element specifies playback of media in parallel, while <seq> specifies playback in sequence.
The module is designed to extend the content model of the <prompt> element (6.4 Prompt Module).
This module is dependent upon the media module (6.6 Media Module).
With connections which support multiple media streams, it is possible to play back multiple media types simultaneously. For media container formats like 3GPP, audio and video media can be generated simultaneously from the same media resource.
There are established use cases for simultaneous playback of multiple media which are specified in separate resources:
The intention is to provide support for basic use cases where audio or TTS output from one resource can be complemented with output from another resource, as permitted by the connection and platform capabilities.
The <par> element is derived from SMIL <par> element, a time container for parallel output of media resources. Media elements (or containers) within a <par> element are played back in parallel.
Editorial note | |
SMIL reference should be added in references section SMIL references SMIL is Synchronized Multimedia Integration Language (SMIL). Reference to SMIL 1.0 (or later) Specification |
The <par> element has the attributes specified in Table 29.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
endsync | One of "first" or "last" | Indicates when the element is considered complete. 'first' indicates that the element is complete when any media (or container) child reports that it is complete; 'last' indicates it is complete when all media children are complete. | No | last |
The content model of <par> consists of:
The <seq> element is derived from the SMIL <seq> element, a time container for sequential output of media resources. Media elements within a <seq> element are played back in sequence.
No attributes are defined for <seq>.
The content model of <seq> consists of:
Editorial note | |
Issue: how should parallel playback interact with the PromptQueue resource? The simplest assumption would be that if this module is supported, then prompt queue needs to be able to handle parallel playback. For example when bargein event happens during the parallel execution, the synchronization between both prompt and for example video play should be handled. This information should be explained in the prompt queue resource section. |
This module requires a PromptQueue resource which supports playback of parallel and sequential media. The following defines its playback completion, termination and error handling.
Completion of playback of the <par> element is determined according to the value of its endsync attribute. For instance, assume a <par> element containing <media> (or <seq>) elements A and B, and that B finishes before A. If endsync has the value first, then completion is reported upon B's completion. If endsync has the value last, then completion is reported upon A's completion.
Completion of playback of the <seq> element occurs when the last <media> is complete.
If the <par> element playback is terminated, then playback of its <media> and <seq> children is terminated. Likewise, if the <seq> element playback is terminated, then playback of its (active) <media> elements is terminated.
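For example (a sketch; the resource names are hypothetical), in the following <par> element the looping background audio is terminated as soon as the announcement completes, because endsync is set to 'first':
<par endsync="first">
  <media type="audio/x-wav" src="background.wav" repeatCount="10"/>
  <media type="application/ssml+xml" src="announcement.ssml"/>
</par>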
If mark information is provided by <media> elements (for example with SSML), then the mark information associated with the last element played in sequence or parallel is exposed as described in XXX.
Editorial note | |
Open issue: Clarify interaction with VCR media control model(s). <reposition> approach would require that <par> and <seq> need to be able to restart from a specific position indicated by the markname/time of a <media> element contained within them. RTC approach would require that for <par>, media operations are applied in parallel. |
Error handling policy is inherited from the element of which the <par> and <seq> elements are children.
For instance if the policy is to ignore errors, then the following applies:
If the policy is to terminate playback and report the error, then any error causes immediate termination of any playback and the error is reported.
If execution of the <par> and <seq> elements requires media capabilities which are not supported by the platform or the connection, or there is an error fetching or playing any <media> element within <par> or <seq>, then error handling follows the defined policy.
Video avatar with audio commentary. Note the use of the outputmodes attribute of <media> to ensure that only video is played from the avatar resource.
<par>
  <media type="audio/x-wav" src="commentary.wav"/>
  <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
</par>
Video avatar with a sequence of audio and TTS commentary.
<par>
  <seq>
    <media type="audio/x-wav" src="intro.wav"/>
    <media type="application/ssml+xml" src="commentary.ssml"/>
  </seq>
  <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
</par>
This module describes the syntactic and semantic features of the <foreach> element.
This module is designed to extend the content model of an element in another module. For example, SSML elements in the 6.5 Builtin SSML Module, the <prompt> element defined in 6.4 Prompt Module, etc.
The attributes and content model of the element are specified in 6.8.1 Syntax. Its semantics are specified in 6.8.2 Semantics.
[See XXX for schema definitions].
The <foreach> element has the attributes specified in Table 30.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
array | A data model expression that must evaluate to an array; otherwise, an error.semantic event is thrown. Note that the <foreach> element operates on a shallow copy of the array specified by the array attribute. | Yes | ||
item | A data model variable that stores each array item upon each iteration of the loop. A new variable will be declared if it is not already defined within the parent's scope. | Yes |
Both "array" and "item" must be specified; otherwise, an error.badfetch event is thrown.
The iteration process starts from an index of 0 and increments by one to an index of array_name.length - 1, where array_name is the name of the shallow copied array operated on by the <foreach> element. For each index, a shallow copy or reference to the corresponding array element is assigned to the item variable (i.e. <foreach> assignment is equivalent to item = array_name[index] in ECMAScript); the assigned value could be undefined for a sparse array. Undefined array items are ignored.
VoiceXML 3.0 does not provide break functionality to interrupt a <foreach>.
Editorial note | |
Clarify that array items which evaluate to ECMAScript undefined are ignored? |
When the RC receives an evaluate event, the RC loops through the array to produce evaluated content for each item in the array.
Editorial note | |
These examples may be moved to the respective profile section later. |
The vxml21 profile defines the content model for the <foreach> element so that it may appear in executable content and within <prompt> elements.
Within executable content, except within a <prompt>, the <foreach> element may contain any elements of executable content; this introduces basic looping functionality by which executable content may be repeated for each element of an array.
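For instance, the following sketch shows <foreach> used as executable content under the vxml21 profile, assuming that profile includes the VoiceXML 2.1 <log> element and that a 'cities' array variable exists; each iteration simply logs one array item:
<foreach item="city" array="cities">
  <log expr="'visited: ' + city"/>
</foreach>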
When <foreach> appears within a <prompt> element as part of Builtin SSML content, it may contain only those elements valid within <enumerate> (i.e. the same elements allowed within <prompt> less <meta>, <metadata>, and <lexicon>); this allows for sophisticated concatenation of prompts.
In this example using Builtin SSML, each item in the array has an audio property with a URI value, and a tts property with SSML content. The element loops through the array, playing the audio URI or the SSML content as fallback, with a 300 millisecond break between each iteration.
<prompt>
  <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
  </foreach>
</prompt>
In the mediaserver profile, <foreach> may occur within <prompt> elements and has a content model of 0 or more <media> elements.
Play each media resource in the array.
<foreach item="item" array="array">
  <media type="audio/x-wav" srcexpr="item.audio"/>
</foreach>
Play each media resource in the array, with inline SSML content as fallback.
<foreach item="item" array="array">
  <media type="audio/x-wav" srcexpr="item.audio">
    <media type="application/ssml+xml">
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
        <value expr="item.tts"/>
        <break time="300ms"/>
      </speak>
    </media>
  </media>
</foreach>
Forms are the key component of VoiceXML documents. A form contains:
id | The name of the form. If specified, the form can be referenced within the document or from another document. For instance <form id="weather">, <goto next="#weather">. |
scope | The default scope of the form's grammars. If it is dialog then the form grammars are active only in the form. If the scope is document, then the form grammars are active during any dialog in the same document. If the scope is document and the document is an application root document, then the form grammars are active during any dialog in any document of this application. Note that the scope of individual form grammars takes precedence over the default scope; for example, given a form in a non-root document with the default scope "dialog" and a form grammar with the scope "document", that grammar is active in any dialog in the same document. |
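The following sketch (the attribute values, prompt text and grammar URI are hypothetical, and <externalgrammar> is the element defined in 6.3) illustrates a document-scoped form containing a single field:
<form id="weather" scope="document">
  <field name="city">
    <prompt>Which city?</prompt>
    <grammar mode="voice">
      <externalgrammar src="http://www.example.com/cities.grxml"/>
    </grammar>
  </field>
</form>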
The Form RC is the primary RC for the <form> element.
The Form RC interacts with resource controllers of other modules so as to provide the behavior of the VoiceXML 2.1/2.0 <form> tag. Input and control form items are modeled as resource controllers: for example, the <field> RC (6.10.2.1 Field RC) of the Field Module.
The behavior of the Form RC follows the VoiceXML FIA, although some aspects are not modeled directly in this RC: external transition handling is not part of the form RC; input items use separate RCs to manage coordination between media resources, while recognition results can be received directly by form, field or other RCs.
[This initial version does not address all aspects of FIA behavior; for example, event handling, error handling and external transitions are not covered.]
The form RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The form RC's state model consists of the following states: Idle, Initializing, Ready, SelectingItem, PreparingItem, PreparingFormGrammars, PreparingOtherGrammars, Executing, Active, ProcessingFormResult, Evaluating and Exit.
In the Idle state, the form RC can receive an 'initialize' event whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC creates a dialog scope in the Datamodel Resource and then initializes its children: this is modeled as a separate RC. When all children are initialized, the RC sends an 'initialized' event to its controller and transitions to the Ready state.
In the Ready state, the form RC sets its active status to false. It can receive one of two events: 'prepareGrammars' or 'execute'. A 'prepareGrammars' event indicates that another form is active, but this form's form-level grammars may be activated; an 'execute' event indicates that this form is active. If the RC receives a 'prepareGrammars' event, it transitions to the PreparingFormGrammars state. If the RC receives an 'execute' event, it sets its active data to true and transitions to the SelectingItem state.
In the SelectingItem state, the RC determines which form item to select as the active item. This is defined by a FormItemSelection RC which iterates over the children, sending each a 'checkStatus' event. If a child returns a true status (indicating that it is ready for execution), the activeItem is set to this child RC and the RC transitions to the PreparingItem state. If no child returns this status, then the RC is complete and transitions to the Exit state.
In the PreparingItem state, the activeItem is sent a 'prepare' event causing it to prepare itself; for example, the field RC prepares its prompts and grammars for execution. When the activeItem returns a 'prepared' event, the event data indicates whether the item is modal or not. If the item is modal, then the form RC transitions to the Executing state. If the item is not modal (other grammars can be activated), then the form RC transitions to the PreparingFormGrammars state.
In the PreparingFormGrammars state, the RC prepares form-level grammars. This is defined by a separate RC which iterates through and executes grammar children. When this is complete, the RC transitions to the Active state if the form is not active (according to its active data), and transitions to the PreparingOtherGrammars state if the form is active.
In the PreparingOtherGrammars state, the RC sends a 'prepareGrammars' event to its controller RC (which in turn sends the event to appropriate form, document and application level RCs with grammars). When it receives a 'prepared' event from its controller, the RC transitions to the Executing state.
In the Executing state, the form RC sends an 'execute' event to the active form item. If the form item is a field, then this causes prompts to be played and recognition to take place. The RC then transitions to the Active state awaiting a result.
In the Active state, the RC re-initializes the justFilled data to a new array and waits for a recognition result (as the active or a non-active form), or for a signal from its selected form item that it has received the recognition result. Recognition results are divided into two types: form item level results, received and processed by the form item; and form level results, which are received by the form RC which caused the grammar to be added. If a 'recoResult' event is received by the form RC, the RC transitions into the ProcessingFormResult state. If the active form item receives the recognition result (and has locally updated itself), then the form RC receives a 'formItemResult' event, adds the active item to the justFilled array, and transitions into the Evaluating state.
In the ProcessingFormResult state, the recognition result is processed by iterating through the form item children, obtaining their name and slotname, and then attempting to match the slotname to the results. If the match is successful, the child's name variable in the data model is updated with the value from the recognition result and the child is added to the justFilled data array. When this process is complete, the form RC transitions to the Evaluating state.
In the Evaluating state, the form RC iterates through its children and, if a child is a member of the justFilled array, sends an 'evaluate' event to the form item RC causing the appropriate <filled> RCs to be executed. If the child is a <filled> RC, then it is executed if appropriate. When evaluation is complete, the form RC transitions to the SelectingItem state so that the next form item can be selected for execution.
The Form RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | Update the data model |
prepareGrammars | controller | | Another form is active, but the current form's form-level grammars may be activated. |
execute | controller | | Current form is active |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | Notification that initialization is complete |
prepareGrammars | controller | | Sent to prepare grammars to appropriate form, document and application level RCs. |
execute | controller | | Notification of complete recognition result from the field RC. |
The following table shows the events sent and received by the form RC to resources and other RCs which define the events.
Event | Source | Target | Description |
checkStatus | FormRC | FormItem RC | Check if ready for execution |
createScope | FormRC | DataModel | Creates a scope. |
destroyScope | FormRC | DataModel | Delete a scope. |
evaluate | FormRC | FormItem RC | Process form item being filled. |
execute | FormRC | FormItem RCs | Start execution. |
prepare | FormRC | FormItem RC | Initiates preparation needed before execution. |
formItemResult | FormItemRC | FormRC | Results received by the form item. |
prepared | FormItemRC | FormRC | Indicates that preparation is complete. |
recoResult | PlayAndRecognize RC | FormRC | Results filled at the form level and not form item level. |
Editorial note | |
Open issue: This section plans to use the same approach as section 2.3.1 in VoiceXML 2.0, but without support for the type attribute nor the <option> tag. Builtin types will be handled through a separate module. |
The semantics of field elements are defined using the following resource controllers: Field (6.10.2.1 Field RC), PlayandRecognize (6.10.2.2 PlayandRecognize RC), ...
The Field Resource Controller is the primary RC for the field element.
The field RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The field RC's state model consists of the following states: Idle, Initializing, Ready, Preparing, Prepared, Executing and Evaluating.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC creates a variable in the Datamodel Resource: the variable name corresponds to the name in the RC's data model, and the variable value is set to the value of the RC's data model expr, if this is defined. The field RC then initializes its children: this is modeled as a separate RC (see XXX). When all children are initialized, the RC transitions to the Ready state.
In the Ready state, the field RC can receive a 'checkStatus' event to check whether it can be executed or not. The values of name and cond in its data model are checked: the status is true if name is undefined and the value of cond evaluates to true. The status is returned in a 'checkedStatus' event sent back to the controller RC. If the RC receives a 'prepare' event, it updates includePrompts in its data model using the event data, and transitions to the Preparing state.
In the Preparing state, the field prepares its prompts and grammars. Prompts are prepared only if the includePrompts data is true; otherwise, prompts within the field are not prepared (e.g. field prompts aren't queued following a <reprompt>). Preparation of prompts is modeled as a separate RC (see XXX), as is preparation of grammars (see YYY). These RCs are summarized below.
Prompts are prepared by iterating through the children array. In the iteration, each prompt RC child is sent a 'checkStatus' event. If the prompt child returns true (its cond parameter evaluates to true), then it is added to a 'correct count' list together with its count. Once the iteration is complete, the RC determines the highest count on the 'correct count' list: the highest count among those on the list less than or equal to the current count value. All children on the 'correct count' list whose count is not the highest count are removed. The RC then iterates through the 'correct count' list and sends an 'execute' event to each prompt RC, causing it to be queued on the PromptQueue Resource.
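For example (a sketch; the field name and prompt texts are hypothetical), given the prompts below and a current prompt counter of 3, the prompt with count="2" is selected: the prompts with counts 1 and 2 are the only ones whose count is less than or equal to 3, and 2 is the highest of those counts.
<field name="city">
  <prompt count="1">Which city?</prompt>
  <prompt count="2">Please say the name of a city, for example Stuttgart.</prompt>
  <prompt count="4">You can find a list of supported cities on our web site.</prompt>
</field>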
Grammars are prepared by recursing through the children array and sending each grammar RC child an 'execute' event. The grammar RC then, if appropriate, sends an 'addGrammar' event to the DTMF or ASR Recognizer Resource where the grammar itself, its properties and the field RC is sent as the handler for recognition results.
When prompts and grammars have been prepared, the prompt counter is incremented and the field RC sends a 'prepared' event to its controller with event data indicating its modal status, and then transitions into the Prepared state.
In the Prepared state, the field RC may receive an 'execute' event from its controller. The RC sends an 'execute' event to the PlayAndRecognize RC (6.10.2.2 PlayandRecognize RC), causing any queued prompts to be played and recognition to be initiated. In the event data, the controller is set to this RC, and other data is derived from data model properties. The RC transitions to the Executing state.
In the Executing state, the PlayAndRecognize RC must send recoResults (or error events: noinput, nomatch, error.semantic) to the field RC.
If the field RC receives the recoResults, then it updates its name variable in the Datamodel Resource. The field RC then sends a 'fieldResult' event to its controller indicating that a field result has been received and processed.
If the recoResult is received by the field RC's controller, then the field receives an 'evaluate' event which causes it to transition to the Evaluating state.
In the Evaluating state, the field RC iterates through its children, executing each filled RC: this is modeled by a separate RC (see XXX). When evaluation is complete, the RC sends an 'evaluated' event to its controller and transitions to the Ready state.
The Field RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | |
checkStatus | controller | ||
prepare | controller | includePrompts (M) | |
execute | controller | ||
evaluate | controller |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | ||
checkedStatus | controller | ||
prepared | controller | ||
fieldResult | controller | ||
evaluated | controller |
Table 37 shows the events sent and received by the field RC to resources and other RCs which define the events.
Event | Source | Target | Description |
create | FieldRC | DataModel | |
assign | FieldRC | DataModel | |
execute | FieldRC | PlayandRecognizeRC | |
recoResult | PlayandRecognizeRC | FieldRC |
The PlayandRecognize RC coordinates media input with Recognizer resources and media output with the PromptQueue Resource.
The following use cases are covered:
Editorial note | |
Open issue: should we remove the possibility for alternating speech and hotword bargein modes within the recognition cycle? |
The PlayandRecognize RC coordinates media input with recognition resources and media output with the PromptQueue Resource on behalf of a form item.
This RC activates prompt queue playback, activates recognition resources, manages bargein behavior and handles results from recognition resources.
The RC is defined in terms of a data model and a state model.
The data model is composed of the following parameters:
The RC's state model consists of the following states: idle, prepare recognition resources, start playing, playing prompts with bargein, playing prompts without bargein, start recognizing with timer, waiting for input, waiting for speech result and update results. The complexity of this model is partially a consequence of supporting the relationship between hotword bargein and recognition result processing.
While in the idle state, the RC may receive an 'execute' event, whose event data is used to update the data model. The event information includes: controller, inputmodes, inputtimeout, dtmfProps, asrProps and maxnbest. The RC then transitions to the prepare recognition resources state.
In the prepare recognition resources state, the RC sends 'prepare' events to the ASR and DTMF recognition resources. Both events specify this RC as the controller parameter, while the properties parameter differs. In this state, the RC can receive 'prepared' or 'notPrepared' events from either recognition resource. If neither resource returns a 'prepared' event, then activeGrammars is false (i.e. no active DTMF or speech grammar) and the RC sends an 'error.semantic' event to the controller and exits. If at least one resource returns a 'prepared' event, then the RC moves into the start playing state.
The start playing state begins by sending the PromptQueue resource a 'play' event. The PromptQueue responds with a 'playDone' event if there are no prompts in the prompt queue; as a result, this RC moves into the start recognizing with timer state. If there is at least one prompt in the queue, the PromptQueue sends this RC a 'playStarted' event whose data contains the bargein and bargeintype values for the first prompt, and the input timeout value for the last prompt in the queue. The data model is updated with this information.
Editorial note | |
Open issue: PromptQueue Resource doesn't currently have playStarted event. If we don't add playStarted event, then is there a better way to get the bargein, bargeintype, and timeout information from the prompts in the PromptQueue? |
Editorial note | |
Open Issue: The event "bargeinChange" as a one way notification could pose a problem, as it takes finite time for recognizer to suspend or resume. This might work better if PromptQueue Resource waited for an event "bargeinChangeAck" (or similar) from PlayandRecognize RC before starting the next play. PlayandRecognize RC will send the event "bargeinChangeAck" after it completed suspend or resume action on the recognizer. |
In the playing without bargein state, recognition is suspended if it has been previously activated (the recoActive parameter of the data model tracks this). Suspending recognition is conditional on the value of the 'inputmodes' data parameter; if 'dtmf' is in inputmodes, then DTMF recognition is suspended; if 'voice' is in inputmodes, then ASR recognition is suspended. In this state, the PromptQueue can report to this RC changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event with the value 'hotword' or 'speech' causes the data model parameter 'bargein' to be set to 'true' and the 'bargeintype' parameter to be updated with the event data value. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
In the playing with bargein state, recognition is activated if it has not been previously activated (determined by the recoActive parameter in the data model). Activating recognition is conditional on the value of the 'inputmodes' data parameter; if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. In this state, the PromptQueue can report changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event whose event data value is not 'unbargeable' causes the data model 'bargeintype' parameter to be updated with the event data ('hotword' or 'speech'); a 'bargeintypeChange' event whose event data value is 'unbargeable' causes the data model 'bargein' parameter to be set to false and the RC transitions to the playing without bargein state. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine whether the recognition result is positive or negative.
Editorial note | |
Further details on recognition processing are to be added in later versions. The recoResults data parameter is updated with the recognition results (truncated to maxnbest). A speech result is positive if and only if there is at least one result whose confidence level is equal to or greater than the recognition confidence level; otherwise the result is negative. DTMF results are always positive. The recoListener data parameter is defined as the listener associated with the best result if the result is positive. |
In the start recognizing with timer state, an input timer is activated for the value of the inputtimeout data parameter and, if recognition is not already active (determined by the recoActive data parameter), recognition is activated. Recognition activation is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. The RC then transitions into the waiting for input state.
In the waiting for input state, the RC waits for user input. If it receives a 'timerExpired' event, then the RC sends a 'stop' event to all recognition resources, sends a 'noinput' event to its controller, and exits. Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine whether the recognition result is positive or negative. If positive, the RC cancels the timer and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.
In the waiting for speech result state, the RC waits for a 'recoResult' event whose data is used to update the recoResults data parameter and to set the recoListener data parameter if the recognition result is positive. The RC then transitions to the update results state.
In the update results state, the RC sends 'assign' events to the data model resource so that the lastresult object in application scope is updated with the recognition results as well as markname and marktime information. If the recoListener data parameter is defined, then the RC sends a 'recoResult' event to the recognition listener RC; otherwise, it sends a 'nomatch' event to its controller. The RC then exits.
Editorial note | |
Open issue: Behavior if one reco resource sends 'inputStarted' but the other sends 'recoResults'? Race conditions between recognizers returning results? (This problem is inherent in the presence of two recognizers. For the sake of clear semantics, we could allow only one recognizer to respond with 'inputStarted' and 'recoResults', with the other recognizer always 'stopped'. A better choice might be to have only one recognizer that handles both DTMF and speech, since semantically the two recognizers are very similar.) |
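The state and event flow described above can be summarized as a state machine. The following SCXML-style sketch is non-normative and included only as an illustration: the state names and the 'play', 'playStarted', 'playDone', 'bargeintypeChange', 'inputStarted', 'recoResults', 'recoResult', and 'timerExpired' events are taken from this section, while the SCXML representation itself, the 'PromptQueue' send target, and the condition expression are assumptions of the sketch rather than part of the PlayandRecognize RC definition.

<!-- Non-normative sketch of the PlayandRecognize RC states described above. -->
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       initial="startPlaying">
  <state id="startPlaying">
    <onentry>
      <!-- Ask the PromptQueue resource to start playing queued prompts. -->
      <send event="play" target="PromptQueue"/>
    </onentry>
    <!-- Empty queue: go straight to recognition with the input timer. -->
    <transition event="playDone" target="startRecognizingWithTimer"/>
    <!-- Prompts queued: branch on the first prompt's bargein value. -->
    <transition event="playStarted" cond="_event.data.bargein"
                target="playingWithBargein"/>
    <transition event="playStarted" target="playingWithoutBargein"/>
  </state>
  <state id="playingWithoutBargein">
    <transition event="bargeintypeChange" target="playingWithBargein"/>
    <transition event="playDone" target="startRecognizingWithTimer"/>
  </state>
  <state id="playingWithBargein">
    <!-- With bargeintype 'speech', a final result is awaited elsewhere. -->
    <transition event="inputStarted" target="waitingForSpeechResult"/>
    <transition event="playDone" target="startRecognizingWithTimer"/>
  </state>
  <state id="startRecognizingWithTimer">
    <!-- Start the input timer, activate recognition, then wait for input. -->
    <transition target="waitingForInput"/>
  </state>
  <state id="waitingForInput">
    <transition event="timerExpired" target="noinputExit"/>
    <transition event="inputStarted" target="waitingForSpeechResult"/>
    <transition event="recoResults" target="updateResults"/>
  </state>
  <state id="waitingForSpeechResult">
    <transition event="recoResult" target="updateResults"/>
  </state>
  <state id="updateResults">
    <transition target="done"/>
  </state>
  <final id="noinputExit"/>
  <final id="done"/>
</scxml>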
The PlayandRecognize RC is defined to receive the following events:
Event | Source | Payload | Sequencing | Description |
execute | any | controller (M), inputmodes (O), inputtimeout (O), dtmfProps (M), recoProps (M), maxnbest (O) | | |
and the events it sends:
Event | Target | Payload | Sequencing | Description |
recoResult | any | results (M) | one-of: nomatch, noinput, error.*, recoResult | |
nomatch | controller | | one-of: nomatch, noinput, error.*, recoResult | |
noinput | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.semantic | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.badfetch.grammar | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.noresource | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.builtin | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.format | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.language | controller | | one-of: nomatch, noinput, error.*, recoResult | |
The events in Table 40 are sent by the PlayandRecognize RC to resources which define the events.
Event | Target | Payload | Sequencing | Description |
play | PromptQueue | | | |
halt | PromptQueue | | | |
prepare | Recognizer | | | |
listen | Recognizer | | | |
suspend | Recognizer | | | |
stop | Recognizer | | | |
The events in Table 41 are received by this RC. Their definition is provided by the sending component.
Event | Source | Payload | Sequencing | Description |
playStarted | PromptQueue | bargein (O), bargeintype (O), inputtimeout (O) | pq:play notification | |
playDone | PromptQueue | markname (O), marktime (O) | pq:play response | |
bargeinChange | PromptQueue | bargein (M) | | |
bargeintypeChange | PromptQueue | bargeintype (M) | | |
prepared | Recognizer | | prepare positive response | |
notPrepared | Recognizer | | prepare negative response | |
inputStarted | Recognizer | | | |
recoResult | Recognizer | results (M), listener (O) | | |
VoiceXML 3.0 modules can be combined to form one or more language profiles.
[Profiles are motivated on the basis of identified common use cases. Pull in general motivation from XHTML, SMIL, etc. ]
This specification defines the following profiles:
[Other profiles may be standardized at a later stage; for example, a profile which includes all VoiceXML 3.0 modules. Suggestions welcome.]
Developers can create their own profiles by modifying an existing profile or by combining modules to create a new profile.
[A profile is defined as follows: ]
Editorial note | |
The name of this profile may change. |
[Motivation: tutorial, PoC, transitional, and that vxml3 is a superset of VoiceXML 2.1.]
The VoiceXML 2.1 profile is included to demonstrate how profiles are defined in VoiceXML 3.0. Using existing elements from the [VOICEXML21] specification is helpful because the semantics of these elements are already well defined and well understood. Thus any changes in how they are presented result from the module and profile style of VoiceXML 3.0 and from making their precise, detailed semantics more explicit and formal.
The VoiceXML 2.1 profile also plays a transitional role, as VoiceXML 3.0 as a whole is built on top of VoiceXML 2.1. VoiceXML 3.0 is a superset of VoiceXML 2.1: it includes the traditional 2.1 functionality plus some new modules. The VoiceXML 2.1 profile is the set of modules that were always present in VoiceXML 2.1 but that were not expressed in that specification as individual modules. This also provides a clear path for the VoiceXML application developer: applications authored in version 2.1 of VoiceXML will continue to work, and developers will not need to learn new syntax or semantics when they develop in the VoiceXML 2.1 profile of VoiceXML 3.0.
The VoiceXML 2.1 profile also serves as a proof of concept that the new modular, profile-based method of describing the specification is not limiting. VoiceXML 3.0 in its entirety is neither limited nor constrained by the use of profiles, modules, and formalized semantic models: anything that was standardized in VoiceXML 2.1 can be standardized in this new format, and the VoiceXML 2.1 profile demonstrates that.
This profile uses the prompt module (6.4 Prompt Module) extended with the Builtin SSML module (6.5 Builtin SSML Module) and the foreach module (6.8 Foreach Module).
[Motivation: vxml as common interface to media server in telecom, NGN, IMS, etc. Key need is to expose media processing functionality, both simple and advanced, in a Application control and flow are typically handled outside VoiceXML: invoke 'play and collect/record/verify' functionality and return results. ]
[Not included: ]
[Issues: should ECMAScript/data capability be included? Efficient re-use of cached vxml scripts, with data fed in and results out, argues for this ...]
This profile uses the prompt module (6.4 Prompt Module) extended with the media module (6.6 Media Module), the foreach module (6.8 Foreach Module) and, optionally, the parseq module (6.7 Parseq Module).
A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as media files, grammars, scripts, and XML data. Each fetch of the content associated with a URI is governed by the following attributes:
fetchtimeout | The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation. If not specified, a value derived from the innermost fetchtimeout property is used. |
---|---|
fetchhint | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used. |
maxage | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used. |
maxstale | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used. |
When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.
The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.
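As a non-normative illustration, a document might combine these attributes on elements that fetch resources (the URIs below are invented for the example, and the attribute syntax shown is that of the VoiceXML 2.1 profile):

<!-- Wait at most 10 seconds for the document; accept a cached copy up to 60 seconds old. -->
<goto next="http://example.com/menu.vxml" fetchtimeout="10s" maxage="60"/>
<!-- Hint that this audio clip may be fetched before it is actually needed. -->
<audio src="http://example.com/welcome.wav" fetchhint="prefetch"/>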
When transitioning from one dialog to another, through either a <subdialog>, <goto>, <submit>, <link>, or <choice> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, an intermediate cache, or an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or the first dialog if none is specified) is then initialized and execution of the dialog begins.
Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.
Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see 4.4.2.2 Application Root and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.
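For instance (a non-normative sketch using VoiceXML 2.1 profile syntax; document and dialog names are invented):

<!-- Fragment-only reference: no fetch and no document initialization. -->
<goto next="#confirm_order"/>
<!-- Document reference: the document is obtained and initialized, even if it is the current one. -->
<goto next="order.vxml#confirm_order"/>
<!-- submit always results in a fetch, even for a fragment-only URI. -->
<submit next="#confirm_order" namelist="drink size"/>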
Elements that fetch VoiceXML documents also support the following additional attribute:
fetchaudio | The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. |
---|
The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
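A non-normative example (URIs invented for illustration):

<!-- Play hold music while the next document is being fetched. -->
<goto next="http://example.com/results.vxml"
      fetchaudio="http://example.com/hold_music.wav"/>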
The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage and maxstale attributes.
The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:
The "maxstale check" is:
Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.
The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).
While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.
VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document).
Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero.
Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.
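For example (non-normative; resource names invented):

<!-- Request an unconditionally fresh copy of a grammar that changes frequently. -->
<grammar src="http://example.com/stock_symbols.grxml" maxage="0"/>
<!-- Accept an expired copy of a large static audio file if it is no more than one hour stale. -->
<audio src="http://example.com/daily_briefing.wav" maxstale="3600"/>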
Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables.
The expiration status of a resource must be checked on each use of the resource and, if its fetchhint attribute is "prefetch", then it is prefetched. The check must follow the caching policy specified in 8.1.2 Caching.
Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.
The following types of properties are defined: speech recognition (8.2.1 Speech Recognition Properties), DTMF recognition (8.2.2 DTMF Recognition Properties), prompt and collect (8.2.3 Prompt and Collect Properties), media (8.2.4 Media Properties), fetching (8.2.5 Fetch Properties) and miscellaneous (8.2.6 Miscellaneous Properties) properties.
Editorial note | |
Open issue: should the specification provide specific default values rather than platform-specific? Open issue: Should we add a 'type' column for all properties? |
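In the VoiceXML 2.1 profile, properties are set with the <property> element, and a value set on an enclosing element applies to the elements it contains unless overridden at a lower level. The following sketch is non-normative; the form, field, and grammar names are invented for the example:

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-level defaults. -->
  <property name="confidencelevel" value="0.6"/>
  <property name="timeout" value="5s"/>
  <form id="order">
    <field name="drink">
      <!-- Field-level override: this field tolerates lower-confidence results. -->
      <property name="confidencelevel" value="0.4"/>
      <prompt>What would you like to drink?</prompt>
      <grammar src="drinks.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>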
The following generic speech recognition properties are defined.
Name | Description | Default |
---|---|---|
confidencelevel | The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
sensitivity | Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
speedvsaccuracy | A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
completetimeout | The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar. A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below. | platform-dependent |
incompletetimeout | The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event). The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken. A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately. The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See 8.3 Speech and DTMF Input Timing Properties Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout. The value is a Time Designation (see 8.4 Value Designations). | undefined? |
maxspeechtimeout | The maximum duration of user speech. If this time elapses before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see 8.4 Value Designations). | platform-dependent |
The following generic DTMF recognition properties are defined.
Name | Description | Default |
---|---|---|
interdigittimeout | The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | platform-dependent |
termtimeout | The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | 0s |
termchar | The terminating DTMF character for DTMF input recognition. See 8.3 Speech and DTMF Input Timing Properties. | # |
The following properties are defined to apply to the fundamental platform prompt and collect cycle.
Name | Description | Default |
---|---|---|
bargein | The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. | true |
bargeintype | Sets the type of bargein to be speech or hotword. | platform-specific |
timeout | The time after which a noinput event is thrown by the platform. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | platform-dependent |
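A non-normative example of these properties at form level (prompt text and grammar name invented):

<form id="main_menu">
  <!-- Callers may interrupt prompts, but only by a full hotword match. -->
  <property name="bargein" value="true"/>
  <property name="bargeintype" value="hotword"/>
  <!-- Throw noinput if the caller provides no input for 7 seconds. -->
  <property name="timeout" value="7s"/>
  <field name="department">
    <prompt>Say sales, support, or billing.</prompt>
    <grammar src="departments.grxml" type="application/srgs+xml"/>
  </field>
</form>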
The following properties are defined to apply to output media.
Name | Description | Default |
---|---|---|
outputmodes | Determines which modes may be used for media output. The value is a space separated list of media types (see media 'type' in TBD). This property is typically used with container file formats, such as "video/3gpp", which support storage of multiple media types. For example, to play both audio and video to the remote connection, the property would be set to "audio video". To play only the video, the property is set to "video". If the value contains a media type which is not supported by the platform, the connection or the value of the <media> element | The default value depends on the negotiated media between the local and remote devices. It is the space separated list of media types specified in the session.connection.media array elements' type property where the associated direction property is sendrecv or recvonly. |
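For example, a document that should never present video, regardless of what was negotiated for the connection, could set (non-normative):

<!-- Restrict media output to audio only. -->
<property name="outputmodes" value="audio"/>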
The following properties pertain to the fetching of new documents and resources.
Note that maxage and maxstale properties may have no default value - see 8.1.2 Caching.
Name | Description | Default |
---|---|---|
audiofetchhint | This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. | prefetch |
audiomaxage | Tells the platform the maximum acceptable age, in seconds, of cached audio resources. | platform-specific |
audiomaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. | platform-specific |
documentfetchhint | Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch. | safe |
documentmaxage | Tells the platform the maximum acceptable age, in seconds, of cached documents. | platform-specific |
documentmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. | platform-specific |
grammarfetchhint | Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe. | prefetch |
grammarmaxage | Tells the platform the maximum acceptable age, in seconds, of cached grammars. | platform-specific |
grammarmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. | platform-specific. |
objectfetchhint | Tells the platform whether the URI contents for <object> may be pre-fetched or not. The values are prefetch, or safe. | prefetch |
objectmaxage | Tells the platform the maximum acceptable age, in seconds, of cached objects. | platform-specific |
objectmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. | platform-specific |
scriptfetchhint | Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe. | prefetch |
scriptmaxage | Tells the platform the maximum acceptable age, in seconds, of cached scripts. | platform-specific |
scriptmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. | platform-specific. |
fetchaudio | The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. | undefined |
fetchaudiodelay | The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see 8.4 Value Designations). The default interval is platform-dependent, e.g. "2s". The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off. | platform-specific |
fetchaudiominimum | The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see 8.4 Value Designations). The default is platform-dependent, e.g., "5s". The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly. | platform-specific |
fetchtimeout | The timeout for fetches. The value is a Time Designation (see 8.4 Value Designations). | platform-specific |
The following miscellaneous properties are defined.
Name | Description | Default |
---|---|---|
inputmodes | This property determines which input modality to use. The input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active. | ??? |
universals | Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items) and which generate specific events. Production-grade applications often need to define their own universal command grammars, e.g., to increase application portability or to provide a distinctive interface. They specify new universal command grammars with <link> elements. They turn off the default grammars with this property. Default catch handlers are not affected by this property. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help". | none |
maxnbest | This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. | 1 |
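A non-normative example combining these properties (form and grammar names invented):

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Enable only the platform's help and cancel universal command grammars. -->
  <property name="universals" value="help cancel"/>
  <!-- Keep up to three candidate interpretations in application.lastresult$. -->
  <property name="maxnbest" value="3"/>
  <form id="pin">
    <!-- DTMF-only input for this form, conserving speech recognition resources. -->
    <property name="inputmodes" value="dtmf"/>
    <field name="code">
      <prompt>Please enter your four digit PIN.</prompt>
      <grammar src="pin.grxml" type="application/srgs+xml" mode="dtmf"/>
    </field>
  </form>
</vxml>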
The various timing properties for speech and DTMF recognition work together to define the user experience. The ways in which these different timing parameters function are outlined in the timing diagrams below. In these diagrams, the wait for DTMF input or user speech begins at the time the last prompt finishes playing.
DTMF grammars use timeout, interdigittimeout, termtimeout and termchar as described in 8.2.2 DTMF Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.
The timeout parameter determines when the noinput event is thrown because the user has failed to enter any DTMF. Once the first DTMF has been entered, this parameter has no further effect.
In the following diagram, the interdigittimeout determines when the nomatch event is thrown because a DTMF grammar is not yet recognized, and the user has failed to enter additional DTMF.
The example below shows the situation when a DTMF grammar could terminate, or extend by the addition of more DTMF input, and the user has elected not to provide any further input.
In the example below, the termchar is non-empty and is entered by the user before an interdigittimeout expires, to signify that the user's DTMF input is complete; the termchar is not included as part of the recognized value.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is expected. Since termchar is empty, there is no optional terminating character permitted, thus the recognition ends and the recognized value is returned.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. If the termchar is non-empty, then the user can enter an optional termchar DTMF. If the user fails to enter this optional DTMF within termtimeout, the recognition ends and the recognized value is returned. If the termtimeout is 0s (the default), then the recognized value is returned immediately after the last DTMF allowed by the grammar, without waiting for the optional termchar. Note: the termtimeout applies only when no additional input is allowed by the grammar; otherwise, the interdigittimeout applies.
In this example, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. Since the termchar is non-empty, the user enters the optional termchar within termtimeout causing the recognized value to be returned (excluding the termchar).
While waiting for the first or additional DTMF, three different timeouts may determine when the user's input is considered complete. If no DTMF has been entered, the timeout applies; if some DTMF has been entered but additional DTMF is valid, then the interdigittimeout applies; and if no additional DTMF is legal, then the termtimeout applies. At each point, the user may enter DTMF which is not permitted by the active grammar(s). This causes the collected DTMF string to be invalid. Additional digits will be collected until either the termchar is pressed or the interdigittimeout has elapsed. A nomatch event is then generated.
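The following non-normative sketch shows these DTMF timing properties applied to a single field (prompt text and grammar name invented):

<field name="account">
  <!-- Allow 3 seconds between digits; wait up to 2 seconds for an optional '#'. -->
  <property name="interdigittimeout" value="3s"/>
  <property name="termtimeout" value="2s"/>
  <property name="termchar" value="#"/>
  <prompt>Enter your account number, then press the pound key.</prompt>
  <grammar src="account_digits.grxml" type="application/srgs+xml" mode="dtmf"/>
</field>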
Speech grammars use timeout, completetimeout, and incompletetimeout as described in 8.2.3 Prompt and Collect Properties and 8.2.1 Speech Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.
In the example below, the timeout parameter determines when the noinput event is thrown because the user has failed to speak.
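For example, a field could shorten the noinput timeout and handle the resulting event (non-normative; prompt text and grammar name invented):

<field name="city">
  <!-- noinput is thrown if the caller has not spoken 4 seconds after the prompt ends. -->
  <property name="timeout" value="4s"/>
  <prompt>Which city are you calling from?</prompt>
  <grammar src="cities.grxml" type="application/srgs+xml"/>
  <catch event="noinput">
    <prompt>Sorry, I did not hear you.</prompt>
    <reprompt/>
  </catch>
</field>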
Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2].
Integers are specified in decimal notation only. Integers may be preceded by a "-" or "+" to indicate the sign.
An integer consists of one or more digits "0" to "9".
This version of VoiceXML was written with the participation of members of the W3C Voice Browser Working Group. The work of the following members has significantly facilitated the development of this specification:
The W3C Voice Browser Working Group would like to thank the W3C team, especially Kazuyuki Ashimura and Matt Womer, for their invaluable administrative and technical support.