Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document specifies VoiceXML 3.0, a modular XML language for creating interactive media dialogs that feature synthesized speech, recognition of spoken and DTMF key input, telephony, mixed initiative conversations, and recording and presentation of a variety of media formats including digitized audio, and digitized video.
Its major goal is to bring the advantages of Web-based development and content delivery to interactive voice response applications.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is the 19 December 2008 First Public Working Draft of "Voice Extensible Markup Language (VoiceXML) 3.0".
This document is very much a work in progress. Many sections are incomplete, only stubbed out, or missing entirely. To get early feedback, the group focused on defining enough functionality, modules, and profiles to demonstrate the general framework. To complete the specification, the group expects to introduce additional functionality (for example speaker identification and verification, external eventing) and describe the existing functionality at the level of detail given for the Prompt and Field modules. We explicitly request feedback on the framework, particularly any concerns about its implementability or suitability for expected applications. By the middle of 2009 the group expects to have all existing functionality defined in detail, the new functionality stubbed out, and the VoiceXML 2.1 profile largely defined. By late-2009 the group expects to have all functionality defined and both profiles defined in detail.
Applications written as 2.1 documents can be used under a 3.0 processor using the 2.1 profile. As an example, the Implementation Report tests for 2.1 (which includes the IR tests for 2.0) will be supported on a 3.0 processor. Exceptions will be clarifications and changes needed to improve interoperability.
This document is a W3C Working Draft. It has been produced as part of the Voice Browser Activity. The authors of this document are participants in the Voice Browser Working Group (W3C members only). For more information see the Voice Browser FAQ. The Working Group expects to advance this Working Draft to Recommendation status.
Comments are welcome on www-voice@w3.org (archive). See W3C mailing list and archive usage guidelines.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Terminology
2 Overview
2.1 Structure of VoiceXML 3.0
2.2 Structure of this document
2.3 How to read this document
3 Data Flow Presentation (DFP) Framework
3.1 Data
3.2 Flow
3.3 Presentation
4 Core Concepts
4.1 Semantics
4.1.1 Resources
4.1.2 Resource Controllers (RCs)
4.2 Syntax
4.3 Event Model
4.3.1 Event Interfaces
4.3.1.1 Event
4.3.1.2 EventTarget
4.3.1.3 EventListener
4.3.2 Event Flow
4.3.2.1 Event Listener Registration
4.3.2.2 Event Listener Activation
4.3.3 Event Categories
4.4 Document Initialization and Execution
4.4.1 Initialization
4.4.2 Execution
4.4.2.1 Subdialogs
4.4.2.2 Application Root
4.4.2.3 Summary of Syntax/Semantics Interaction
5 Resources
5.1 Datamodel Resource
5.1.1 Data Model Resource API
5.2 Prompt Queue Resource
5.2.1 State Chart Representation
5.2.2 SCXML Representation
5.2.3 Defined Events
5.2.4 Device Events
5.2.5 Open Issue
5.3 Recognition Resources
5.3.1 Definition
5.3.2 Defined Events
5.3.3 Device Events
5.3.4 State Chart Representation
5.3.5 SCXML Representation
6 Modules
6.1 Grammar Module
6.1.1 Syntax
6.1.1.1 Attributes
6.1.1.2 Content Model
6.1.2 Semantics
6.1.2.1 Definition
6.1.2.2 Defined Events
6.1.2.3 External Events
6.1.2.4 State Chart Representation
6.1.3 Events
6.1.4 Examples
6.2 Inline SRGS Grammar Module
6.2.1 Syntax
6.2.2 Semantics
6.2.2.1 Definition
6.2.2.2 Defined Events
6.2.2.3 External Events
6.2.2.4 State Chart Representation
6.2.2.5 SCXML Representation
6.2.3 Events
6.2.4 Examples
6.3 External Grammar Module
6.3.1 Syntax
6.3.1.1 Attributes
6.3.1.2 Content Model
6.3.2 Semantics
6.3.2.1 Definition
6.3.2.2 Defined Events
6.3.2.3 External Events
6.3.2.4 State Chart Representation
6.3.2.5 SCXML Representation
6.3.3 Events
6.3.4 Examples
6.4 Prompt Module
6.4.1 Syntax
6.4.1.1 Attributes
6.4.1.2 Content Model
6.4.2 Semantics
6.4.2.1 Definition
6.4.2.2 Defined Events
6.4.2.3 External Events
6.4.2.4 State Chart Representation
6.4.2.5 SCXML Representation
6.4.3 Events
6.4.4 Examples
6.5 Builtin SSML Module
6.5.1 Syntax
6.5.2 Semantics
6.5.3 Examples
6.6 Media Module
6.6.1 Syntax
6.6.1.1 Attributes
6.6.1.2 Content Model
6.6.1.2.1 Tips (informative)
6.6.2 Semantics
6.6.3 Examples
6.7 Parseq Module
6.7.1 Syntax
6.7.2 Semantics
6.7.3 Examples
6.8 Foreach Module
6.8.1 Syntax
6.8.1.1 Attributes
6.8.1.2 Content Model
6.8.2 Semantics
6.8.3 Examples
6.9 Form Module
6.9.1 Syntax
6.9.2 Semantics
6.9.2.1 Form RC
6.9.2.1.1 Definition
6.9.2.1.2 Defined Events
6.9.2.1.3 External Events
6.9.2.1.4 State Chart Representation
6.9.2.1.5 SCXML Representation
6.10 Field Module
6.10.1 Syntax
6.10.2 Semantics
6.10.2.1 Field RC
6.10.2.1.1 Definition
6.10.2.1.2 Defined Events
6.10.2.1.3 External Events
6.10.2.1.4 State Chart Representation
6.10.2.1.5 SCXML Representation
6.10.2.2 PlayandRecognize RC
6.10.2.2.1 Definition
6.10.2.2.2 Defined Events
6.10.2.2.3 External Events
6.10.2.2.4 State Chart Representation
6.10.2.2.5 SCXML Representation
7 Profiles
7.1 VoiceXML 2.1 Profile
7.2 Media Server Profile
8 Environment
8.1 Resource Fetching
8.1.1 Fetching
8.1.2 Caching
8.1.2.1 Controlling the Caching Policy
8.1.3 Prefetching
8.1.4 Protocols
8.2 Properties
8.2.1 Speech Recognition Properties
8.2.2 DTMF Recognition Properties
8.2.3 Prompt and Collect Properties
8.2.4 Media Properties
8.2.5 Fetch Properties
8.2.6 Miscellaneous Properties
8.3 Speech and DTMF Input Timing Properties
8.3.1 DTMF Grammars
8.3.1.1 timeout, No Input Provided
8.3.1.2 interdigittimeout, Grammar is Not Ready to Terminate
8.3.1.3 interdigittimeout, Grammar is Ready to Terminate
8.3.1.4 termchar and interdigittimeout, Grammar Can Terminate
8.3.1.5 termchar Empty When Grammar Must Terminate
8.3.1.6 termchar Non-Empty and termtimeout When Grammar Must Terminate
8.3.1.7 termchar Non-Empty and termtimeout When Grammar Must Terminate
8.3.1.8 Invalid DTMF Input
8.3.2 Speech Grammars
8.3.2.1 timeout When No Speech Provided
8.3.2.2 completetimeout With Speech Grammar Recognized
8.3.2.3 incompletetimeout with Speech Grammar Unrecognized
8.4 Value Designations
8.4.1 Integers
8.4.2 Real Numbers
8.4.3 Times
A Acknowledgements
B References
B.1 Normative References
B.2 Informative References
C Glossary of Terms
In this document, the key words "must", "must not", "required", "shall", "shall not", "should", "should not", "recommended", "may", and "optional" are to be interpreted as described in [RFC2119] and indicate required levels for compliant VoiceXML 3.0 implementations.
Terms used in this specification are defined in Appendix C Glossary of Terms.
How does one build a successor to VoiceXML 2.0/2.1? Requests for improvements to VoiceXML fell into two main categories: extensibility and new functionality.
To accommodate both, the Voice Browser Working Group first developed the detailed semantic descriptions of VoiceXML that versions 2.0 and 2.1 lacked. From there it was possible to describe semantics for new functionality and to restructure the language syntactically to improve extensibility.
Another benefit of detailed semantic descriptions is improved portability within VoiceXML. However, there are many factors that contribute to portability that are outside the scope of this document (e.g. speech recognition capabilities, telephony).
This document covers the following:
The remainder of this document is structured as follows:
3 Data Flow Presentation (DFP) Framework presents the Data-Flow-Presentation Framework, its importance for the development of VoiceXML 3.0, and how VoiceXML 3.0 fits into the model.
4 Core Concepts explains the core concepts underlying the new structure for VoiceXML, including resources, resource controllers, the relationship between syntax and semantics, DOM eventing, modules, and profiles.
5 Resources presents the resources defined for the language. These provide the key presentation-related functionality in the language.
6 Modules presents the modules defined for the language. Each module consists of a syntax piece (with its user-visible events), a semantics piece (with its behind-the-scenes events), and a description of how the two are connected.
7 Profiles presents two profiles. The first, the VoiceXML 2.1 profile, shows how a language similar to VoiceXML 2.1 can be created using the structure and functionality of VoiceXML 3.0. The second, the Media Server profile, is a simple compilation of all of the functionality available in VoiceXML 3.0.
The Appendices provide useful references and a glossary of terms used in the specification.
For everyone: Please first read 3 Data Flow Presentation (DFP) Framework. The data-flow-presentation distinction applies not only to VoiceXML 3.0, but to many of W3C's specifications. Understanding VoiceXML's role as a presentation language is crucial context for understanding the rest of the specification.
For application authors: we recommend that you begin with syntax and only gradually explore details of the semantics as you need to understand behavioral specifics.
For VoiceXML platform developers: we recommend that you begin with the functionality and framework and only focus on syntax later.
Unlike VoiceXML 2.0/2.1, the focus in VoiceXML 3.0 is almost exclusively on the user interface portions of the language. By choice, very little work has gone into the development of data storage and manipulation or control flow capabilities. In short, VoiceXML 3.0 has been designed from the ground up as a *presentation* language, according to the definition presented in the Data Flow Presentation ([DFP]) Framework.
The Data Flow Presentation (DFP) Framework is an instance of the Model-View-Controller paradigm, where computation and control flow are kept distinct from application data and from the way in which the application communicates with the outside world. This partitioning of an application allows for any one layer to be replaced independently of the other two. In addition, it is possible to simultaneously make use of more than one Data (Model) language, Flow (Controller), and/or Presentation (View) language.
The Data layer is responsible for maintaining all information in a format that is easily accessible and easily editable.
Although data that is independent of the Presentation medium (such as flight reservation data stored in the back-end database) would be stored outside of the VoiceXML application, there is still a need to keep some presentation-specific data, e.g. the status of the dialog in collecting certain information, which prompts have just been played, and how many of various error conditions have occurred so far.
Within VoiceXML 3.0 the Data layer is realized through a pluggable data language and a data access or manipulation language. Access to and use of the data is aligned with options available in SCXML for simpler interaction with the Flow layer (see the next section). This specification defines two specific data languages, XML and ECMAScript, and two data access and manipulation languages, E4X/DOM and XPath. Others may be defined by implementers.
The Flow layer is responsible for all application control flow, including business logic, dialog management, and anything else that is not strictly data or presentation. VoiceXML 3.0 provides primitives that contain the control flow needed to implement them, but all combinations between and among the elements at the syntax level are done via calls to external control flow processors. Two that are likely to be used with VoiceXML are CCXML and SCXML. Note that flow control components written outside of VoiceXML may be communicating not only with a VoiceXML processor but with an HTML browser, a video game controller, or any of a variety of other input and output components.
The Presentation layer is responsible for all interaction with the outside world, i.e., human beings and external software components. VoiceXML 3.0 *is* the Presentation layer. Designed originally for human-computer interaction, VoiceXML "presents" a dialog by accepting audio and DTMF input and producing audio and video output.
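As a rough, non-normative illustration of VoiceXML in its presentation role, a minimal dialog that plays a prompt and collects a spoken answer might look like the following. The element names and attributes follow VoiceXML 2.x conventions since the 3.0 syntax is not yet finalized in this draft, and the grammar URI is hypothetical.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: element names follow VoiceXML 2.x;
     the corresponding 3.0 profile syntax is still being defined. -->
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="destination">
    <field name="city">
      <prompt>Which city are you flying to?</prompt>
      <grammar src="cities.grxml" mode="voice"/>
      <filled>
        <prompt>You said <value expr="city"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>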
This document specifies the VoiceXML 3.0 language as a collection of modules. Each module is described at two levels:
It is important to note that the semantic framework described here is a logical one. The resources, resource controllers, and the events they generate are intended only to describe the semantics of VoiceXML 3.0. Implementations are not required to use SCXML to implement VoiceXML 3.0, nor must they create objects corresponding to resources, resource controllers, and the SCXML events they raise. These logical components are useful for describing how different pieces of syntax use similar resources, and for future extensions to the language that may use these resources or hook into specific places in the semantic framework, but only the exposed behavior is necessary for a conformant VoiceXML 3.0 interpreter.
These logical SCXML events must be distinguished from the author-visible DOM events that are a mandatory part of the VoiceXML 3.0 language. Implementations MUST raise these DOM events and process them in the manner described in Section 4.3 Event Model. The interaction between actual DOM events and logical SCXML events is described in Section 4.4 Document Initialization and Execution, below.
The semantic model is a conceptual representation of the underlying behavior of VoiceXML (form interpretation, prompt selection, etc.). Each VoiceXML 3.0 module contains a conceptual representation of its underlying behavior expressed in terms of resources and resource controllers. While resources and resource controllers are not exposed directly in the markup, they are used to define the semantics of VoiceXML 3.0 markup elements.
For example, Figure 1 presents a high-level semantic description of the Prompt Queue, which consists of the PromptController, the Prompt Queue resource, and the SSML/media player. For a detailed description of the semantics of the Prompt Queue, see the state chart representation in Section 5.2.1 State Chart Representation and the SCXML representation in Section 5.2.2 SCXML Representation. Section 5.2.3 Defined Events defines each event.
(Additional examples TBD)
The VoiceXML 3.0 semantic model is illustrated in Figure 1.
Editorial note: Section 4.1. Replace Figure 1 by a picture illustrating the three levels (resource controllers, resources, devices) with three examples corresponding to the examples of section 4.1.2 -- JimL
Editorial note: More architecture diagrams will be added in later versions.
It is important to note that this model places no requirement on a VoiceXML interpreter to implement behavior exactly as described in the model. Rather, the requirement is that the observable behavior must be the same as if it were implemented as described; optimizations or a different architecture are permitted behind the implementation of the markup interpretation.
Resources are the building blocks of the semantic model. Each resource is a self-contained object in the semantic model that is capable of providing a service. Resources are singletons, global in scope, and persist for the whole session (e.g. even across subdialogs). Multiple resources may be active simultaneously.
Controllers communicate with resources by sending and receiving events. Resources do not communicate with one another directly.
Different modules can use the same resources, and language profiles can require specific resources. Some profiles may support only a limited set of resources, and some modules may require new resources.
Examples of resources required for the VoiceXML 2.1 profile include a Prompt Queue/Player, a DTMF recognizer, an ASR recognizer, a recording service, a transfer service, and a hierarchically scoped (ECMAScript) data model. Other modules and profiles of VoiceXML 3.0 may require that existing resources be extended, or that new ones be created. Examples of new resources may include a whole-call recording service, an XML data model, and a SIV service.
The semantics of a VoiceXML markup element may be captured by describing how it interacts with a resource; for example, the semantic representation of the <value> element may be that it represents a single data resource lookup (part of the conceptual API the data model resource offers) that is expected to execute the expression in its expr attribute and to return either the result or an error.
The conceptual objects responsible for coordinating input and output across multiple resources are resource controllers (RCs). Each resource controller may interact with resources and other resource controllers to model the semantics of one or more parts of the markup. Each VoiceXML 3.0 markup element can be represented as zero or more resource controllers. If more than one resource controller is associated with a markup element, one of them is designated as the primary RC.
Examples of resource controllers include:
Note that this means that the Form Interpretation Algorithm is realized through the interactions of multiple resource controllers.
Unlike resources, resource controllers are neither session-scoped nor global: there may be multiple conceptual instances of a controller instantiated (waiting for an event, holding state) at the same time. However, conceptually only one controller may be actively doing work (handling an event) at a time.
The event model for VoiceXML 3.0 builds upon the DOM Level 3 Events [DOM3Events] specification. DOM Level 3 Events offers a robust set of interfaces for managing listener registration, dispatching, propagation, and handling of events, as well as a description of how events flow through an XML tree.
The DOM Level 3 event model offers VoiceXML developers a rich set of interfaces that allow them to easily add behavior to their applications. In addition, conforming to the standard DOM event model enables authors to integrate their voice applications into next-generation multimodal or multi-namespace frameworks such as MMI and CDF with minimal effort.
Within the VoiceXML 3.0 semantic model, the DOM Level 3 Events APIs are available to all Resource Controllers that have markup elements associated with them. Indeed, this section covers the eventing APIs as available to VoiceXML 3.0 markup elements. The following section describes how the semantic model ties in with the DOM eventing model.
All VoiceXML 3.0 markup elements implement interfaces that support the following:
The VoiceXML 3.0 Event interface extends the DOM Level 3 Event interface to support voice-specific event information. In particular, the VoiceXML 3.0 Event interface supports a count integer that stores the number of times a resource emits a particular event type. The semantic model manages the count field by incrementing its value and resetting it as described in the section that follows.
Note: RH: should we expose the count to authors? If so, should we have a special variable like event.count or something similar?
VoiceXML 3.0 markup elements implement the DOM Level 3 EventTarget interface. This interface allows registration and removal of event listeners as well as dispatching of events.
The VoiceXML 3.0 markup elements implement the DOM Level 3 EventListener interface. This interface allows the activation of handlers associated with a particular event. When a listener is activated, the event handler execution is done in the semantic model as described in the section that follows.
[To be updated by Michael Bodell due April 1 2008]
Events propagate through markup elements as per the DOM event flow. Event listeners may be registered on any VoiceXML markup element.
When processing a VoiceXML 2.0 profile, event listeners are not allowed to be registered for the capture phase, as this contradicts the as-if-by-copy event semantics of VoiceXML 2.0. If a listener is registered with the capture phase set to true in a VoiceXML 2.0 document, an error.event.illegalphase event will be dispatched onto the root document and the listener registration will be ignored (does that sound reasonable to people?).
The DOM Level 3 Event specification supports the notion of partial ordering using the event listener group; all events within a group are ordered. As such, in VoiceXML 3.0, event listeners are registered as they are encountered in the document. Furthermore, all event listeners registered on an element belong to the same default group. Both of these provisions ensure that event handlers will execute in document order.
An event listener is triggered if:
Once an event listener is triggered, execution is handled by the semantic model as described in the section below. Event propagation blocks until it is notified by the semantic model to proceed.
The VoiceXML 3.0 specification extends the DOM 3 Event specification to support partial name matching on events. VoiceXML 3.0 creates categories of events (the list of categories needs to be specified in the VoiceXML 3.0 specification) and allows authors and the platform to register listeners for either a specific event type or for all events within a particular category or subcategory. For example, VoiceXML 3.0 may create a connection category such as:
{"http://www.example.org/2007/v3","connection"}
The spec may also declare a subcategory of connection or a specific event type that belongs to this category:
{"http://www.example.org/2007/v3","connection.disconnect"} {"http://www.example.org/2007/v3","connection.disconnect.hangup"}
Following this declaration, the VoiceXML 3.0 Event specification uses partial name matching to associate events propagating through the DOM to listeners registered on the tree. The VoiceXML 3.0 Event specification follows the prefix matching used in VoiceXML 2.0 for associating events with their categories.
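For instance, a listener registered for the "connection.disconnect" category would also be triggered by the more specific "connection.disconnect.hangup" event type, in the same way that a VoiceXML 2.0 <catch> handler matches events by prefix. A hedged markup sketch, using VoiceXML 2.x-style handler syntax for illustration:

<!-- Illustrative only: a handler registered for the "connection.disconnect"
     category also receives connection.disconnect.hangup events via prefix matching. -->
<catch event="connection.disconnect">
  <prompt>The call has ended.</prompt>
</catch>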
Note: It might be useful to introduce the "*" notation to specify a catch for all events irrespective of their type and/or category.
The initialization ordering described here is a logical one, specifying which objects and information are available at each stage. Implementations are allowed to use a different ordering (in particular, they are allowed to interleave the construction of the DOM with the creation of semantic objects) as long as they behave as if they were following the order specified here. Similarly, we refer to a 'semantic constructor' as a cover term for whatever mechanism is used to create the Resource Controllers for a given node. No particular implementation is implied or required.
Before a VoiceXML 3.0 application is first loaded, all Resources are created. Whenever a document is loaded within that application, its DOM (level 3) is created. Then the initialization process creates the Resource Controllers by invoking the semantic constructor for the root <vxml> node of the DOM. The root <vxml> node constructor is responsible for invoking the constructors for all nodes in the document that have them. When it does this, it will call the semantic constructor routine passing it
Editorial note: Open Issue: we must specify the operation of the root node constructor in more detail as part of the V3 specification. Other people can define modules, but we must specify how they are assembled into a full semantic representation of the application. If there is an application root document specified, the root node constructor will have to construct its RCs as well, by calling its root node constructor.
Note that the initial construction process creates the RCs but does not necessarily fully configure them. Further initialization, including in particular the creation of variables and variable scopes, will happen only when the RCs are activated at runtime (e.g. by visiting a Form.)
Once the RCs are constructed, they are independent of the DOM, except for the interactions specified below. However, while they are running the RCs often make use of what appears to be syntactic information. For example, the concept of 'next item' relies heavily on document order, while <goto> can take a specific syntactic label as its target. We provide for this by assuming that RCs can maintain a shadow copy of relevant syntactic information, where "shadow copy" is intended to allow a variety of implementations. In particular, platforms may make an actual copy of the information or may maintain pointers back into the DOM. The construction process may create multiple RCs for a given node. In that case, one of the RCs will be marked as the primary RC. It is the one that will be invoked when the flow of control reaches that (shadow) node.
After initialization, the semantic control flow does a <goto> to the initial Resource Controller. Once an RC is running, it invokes Resources and other RCs by sending them events. The DOM is not involved in this process.

At various points in the processing, however, an RC may decide to raise an author-visible event. It does this by creating an event targeted at a specific DOM node and sending it back to the DOM. When the DOM receives the event, it performs the standard bubble/capture cycle with the target specified in the event. In the course of the bubble/capture cycle, various event handlers may fire. Their execution is a semantic action and occurs back in the semantic 'side' of the environment. The DOM sends messages back to the appropriate semantic objects to cause this to happen. Note that this means that the DOM must store some sort of link to the appropriate RCs.

The event handlers may update the data model, execute script, or raise other DOM events. When the handler finishes processing on the semantic side, it sends a notification back to the DOM so that it can resume the bubble/capture phase. (N.B. This notification is NOT a DOM event.) When the DOM finishes the bubble/capture processing of the event, it sends a notification back to the RC that raised the event so that it can continue processing.
Editorial note: Open Issue: Is this notification a standard semantic event? Note that RC processing must pause during the bubble/capture phase to avoid concurrency problems.
A subdialog has a completely separate context from the invoking application. Thus it has a separate DOM and a separate set of RCs. However it shares the same set of Resources since they are global. When a subdialog is entered, the Datamodel Resource will have to create a new scope for the subdialog and hide the calling document's scopes. When the subdialog is exited, the Datamodel resource will destroy the subdialog scope(s) and restore the calling document's scope(s).
To handle event propagation from the leaf application to the application root document, we create a Document Manager to handle all communication between the documents. This means that the DOMs of the two documents remain separate. When an event is not handled in the leaf document, the Document Manager will propagate it to the application root, where it will be targeted at the <vxml> node. Requests to fetch properties or to activate grammars will be handled by the Document Manager in a similar fashion. To handle platform- and/or language-level defaults, we will create a "super-root" document above the application root. The Document Manager will pass it events and requests that are not handled in the root document. If neither the root nor the super-root document handles an event, the Document Manager will ensure that the event is thrown away.
There seem to be four kinds of interactions between RCs and the DOM at runtime:
Editorial note: Open Issue: DOM Modification. There are two possibilities: 1) we can refuse to allow the DOM to be modified (or ignore the modifications if it is); 2) we can reconstruct the relevant resource controllers when the DOM is modified. In the latter case, the straightforward approach would be: a) find the least node that is an ancestor of all the changes and that has a constructor; b) call its constructor as during initialization, using the current state of the DOM and RCs as context.
This section describes semantic models for common VoiceXML resources. Resources have a life cycle of creation and destruction. Specific resources may specify detailed requirements on these phases. All resources must be created prior to their use by a VoiceXML interpreter.
Editorial note: Standard lifecycle events are expected to be defined in later versions: create event (from idle to created); destroy event (from created to idle).
Resources are defined in terms of a state model and the events which they process within defined states. Events may be divided into those defined by the resource itself and those defined by other conceptual entities, which the resource receives or sends within these states. These conceptual entities include resource controllers and a 'device' which provides an implementation of the services defined by the resource.
The semantic model is specified in both UML state chart diagrams and SCXML representations. In case of ambiguity, the SCXML representation takes precedence over the UML diagrams. Note that SCXML is used here to define the states and events for resources; this definitional usage should not be confused with the use of SCXML to specify application flow (see 3.2 Flow). Furthermore, these resource events are conceptual, not DOM events: they are used to define relationships with other conceptual entities and are not exposed at the markup level. The relationship between conceptual events and DOM events is described in XXX.
The following resources are defined: data model (5.1 Datamodel Resource), prompt queue (5.2 Prompt Queue Resource) and DTMF and ASR recognition (5.3 Recognition Resources).
[Later versions will define the following resources: recorder, SIV. Later versions may define the following resources: session recorder, ...]
Editorial note: Later versions of this document will clarify that different datamodels may be instanced, such as ECMAScript, XML, etc. Conformance requirements will be stated at a later stage.
The datamodel is a repository for both user- and system-defined data and properties. To simplify variable lookup, we define the datamodel with a synchronous function-call API, rather than an asynchronous one based on events. The data model API does not assume any particular underlying representation of the data or any specific access language, thus allowing implementations to plug in different concrete data model languages.
There is a single global data model that is created when the system is first initialized. Access to data is controlled by means of scopes, which are stored in a stack. Data is always accessed within a particular scope, which may be specified by name but defaults to being the top scope in the stack. At initialization time, a single scope named "Global" is created. Thereafter scopes are explicitly created and destroyed by the data model's clients.
Editorial note: Resource and Resource Controller descriptions to be updated with API calls rather than events.
Function | Arguments | Return Value | Sequencing | Description |
---|---|---|---|---|
CreateScope | name (optional) | Success or Failure | | Creates a new scope object and pushes it on top of the scope stack. If no name is provided the scope is anonymous and may be accessed only when it is on the top of the scope stack. A Failure status is returned if a scope already exists with the specified name. |
DeleteScope | name (optional) | Success or Failure | | Removes a scope from the scope stack. If no name is provided, the topmost scope is removed. Otherwise the scope with the provided name is removed. A Failure status is returned if the stack is empty or no scope with the specified name exists. |
CreateVariable | variableName, value (optional), scopeName (optional) | Success or Failure | | Creates a variable. If scopeName is not specified, the variable is created in the topmost scope on the scope stack. If no value is provided, the variable is created with the default value specified by the underlying datamodel. A Failure status is returned if a variable of the same name already exists in the specified scope. |
DeleteVariable | variableName, scopeName (optional) | Success or Failure | | Deletes the variable with the specified name from the specified scope. If no scopeName is provided, the variable is deleted from the topmost scope on the stack. A Failure status is returned if no variable with the specified name exists in the scope. |
UpdateVariable | variableName, newValue, scopeName (optional) | Success or Failure | | Assigns a new value to the specified variable. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. A Failure status is returned if the specified variable or scope cannot be found. |
ReadVariable | variableName, scopeName (optional) | value | | Returns the value of the specified variable. If scopeName is not specified, the variable is accessed in the topmost scope on the stack. An error is raised if the specified variable or scope cannot be found. |
EvaluateExpression | expr, scopeName (optional) | value | | Evaluates the specified expression and returns its value. If scopeName is not specified, the expression is evaluated in the topmost scope on the stack. An error is raised if the specified scope cannot be found. |
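As a non-normative sketch of how this conceptual API might be exercised, the following VoiceXML 2.x-style fragment is annotated with the data model call each element could plausibly map onto. The mapping is an assumption made for illustration only, not a defined binding.

<!-- Illustrative mapping onto the conceptual Data Model Resource API; not normative. -->
<form id="booking">                                  <!-- CreateScope("dialog") on entry -->
  <var name="city"/>                                 <!-- CreateVariable("city") in the top scope -->
  <assign name="city" expr="'Boston'"/>              <!-- UpdateVariable("city", "'Boston'") -->
  <block>
    <prompt>Flying to <value expr="city"/>.</prompt> <!-- EvaluateExpression("city") -->
  </block>
</form>                                              <!-- DeleteScope() on exit -->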
Issue ():
Do we need event listeners on the data model, e.g., to notify when the value of a variable changes?
Resolution:
None recorded.
Here is a UML representation of the prompt queue. This state machine assumes that "queue" and "play" are separate commands and that a separate "play" will always be issued to trigger the play. When the "play" is issued, the system plays any queued prompts, up to and including the first fetch audio in the queue. Then it halts, even if there are additional prompts or fetch audio in the queue, and waits for another "play" command.
Editorial note: Open issue: Can queued prompt commands, either audio or TTS, be left un-fetched or un-rendered until a play command is issued to the prompt resource? This may result in delays or gaps in the production of the actual audio, as the rendering or fetching may not produce playable audio fast enough to avoid inter-prompt delays.
The prompt structure assumed here is fairly abstract. It consists of a specification of the audio along with optional parameters controlling playback (for example, speed or volume.) The audio may be presented in-line, as SSML or some other markup language, or as a pointer to a file or streaming audio source. Logically, URLs are dereferenced at the time the prompt is queued, but implementations are not required to fetch the actual media until the prompt in question is sent to the player device. Note that the player device is assumed to be able to handle both recorded prompts and TTS, and to be able to interpret SSML. Platforms are free to optimize their implementations as long as they conform to the state machine specified here. In particular, platforms may prefetch audio or begin TTS processing in the background before the prompt is sent to the player device. For applications that make use of VCR controls (speed up, skip forward, etc.), actual performance may depend on whether the platform has implemented such optimizations. For example, a request to skip forward on a platform that does not prefetch prompts may result in a long delay. Such performance issues are outside the scope of this specification.
This diagram assumes that SSML mark information is delivered in the Player.Done event, and that the player returns a Player.Done event when it is sent a 'halt' event (otherwise mark information would get lost on barge-in and hangup, etc).
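For example, a queued prompt whose audio is supplied inline as SSML, with a mark whose name and time would be reported back in the Player.Done event, might look like the following (illustrative only):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Your flight departs at <say-as interpret-as="time">8:30 AM</say-as>.
  <mark name="afterDepartureTime"/>
  Please stay on the line for gate information.
</speak>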
Note that the "FetchAudio" state is shown stubbed out for reasons of space, and is expanded in a separate diagram below the main one.
Figure X: Prompt Queue Model
Figure Y: Fetch audio Model
<?xml version="1.0" encoding="UTF-8"?>
<scxml initialstate="Created">
  <datamodel>
    <data name="queue"/>
    <data name="markName"/>
    <data name="markTime"/>
    <data name="bargeInType"/>
  </datamodel>
  <state id="Created">
    <initial id="Idle"/>
    <transition event="QueuePrompt">
      <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/prompt"/>
    </transition>
    <transition event="QueueFetchAudio">
      <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt">
        <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
          <else>
            <assign loc="$node[@bargeInType]" val="unbargeable"/>
          </else>
        </if>
      </foreach>
      <insert pos="after" loc="datamodel/data[@name='queue']/prompt" val="_eventData/audio"/>
    </transition>
    <transition event="setParameter">
      <send target="player" event="setParameter" namelist="_eventData.paramName, _eventData.newValue"/>
    </transition>
    <transition event="Cancel" target="Idle">
      <send target="player" event="halt"/>
      <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
      <delete loc="datamodel/data[@name='queue']/prompt"/>
    </transition>
    <transition event="CancelFetchAudio">
      <foreach var="node" nodeset="datamodel/data[@name='queue']/prompt">
        <if cond="$node[@fetchAudio='true']">
          <delete loc="$node"/>
        </if>
      </foreach>
    </transition>
    <state id="Idle">
      <onentry>
        <assign loc="/datamodel/data[@name='markName']" val=""/>
        <assign loc="/datamodel/data[@name='markTime']" val="-1"/>
        <assign loc="/datamodel/data[@name='bargeInType']" val=""/>
      </onentry>
      <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'false'" target="PlayingPrompt"/>
      <transition event="Play" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state>
    <state id="PlayingPrompt">
      <datamodel>
        <data name="currentPrompt"/>
      </datamodel>
      <onentry>
        <assign loc="/datamodel/data[@name='currentPrompt']/prompt" val="/datamodel/data[@name='queue']/prompt[1]"/>
        <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
        <if cond="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType] != /datamodel/data[@name='bargeInType']">
          <send event="BargeInChange" namelist="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
          <assign loc="/datamodel/data[@name='bargeInType']" expr="/datamodel/data[@name='currentPrompt']/prompt[@bargeInType]"/>
        </if>
      </onentry>
      <invoke targettype="player" srcexpr="/datamodel/data[@name='currentPrompt']/prompt"/>
      <finalize>
        <if cond="_eventData/MarkTime neq '-1'">
          <assign loc="/datamodel/data[@name='markName']" val="_eventData/markName.text()"/>
          <assign loc="/datamodel/data[@name='markTime']" val="_eventData/markTime.text()"/>
        </if>
      </finalize>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[last()] le '1'" target="Idle">
        <send event="PlayDone" namelist="/datamodel/data[@name='markName'].text(), /datamodel/data[@name='markTime'].text()"/>
      </transition>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] neq 'true'" target="PlayingPrompt"/>
      <transition event="player.Done" cond="/datamodel/data[@name='queue']/prompt[1][@fetchAudio] eq 'true'" target="FetchAudio"/>
    </state> <!-- end PlayingPrompt -->
    <state id="FetchAudio">
      <initial id="WaitFetchAudio"/>
      <transition event="player.Done" target="FetchAudioFinal"/>
      <state id="WaitFetchAudio">
        <onentry>
          <send target="self" event="fetchAudioDelay" delay="/datamodel/data[@name='queue']/prompt[1][@fetchaudiodelay]"/>
        </onentry>
        <transition event="fetchAudioDelay" target="StartFetchAudio"/>
        <transition event="cancelFetchAudio" target="FetchAudioFinal"/>
      </state>
      <state id="StartFetchAudio">
        <datamodel>
          <data name="fetchAudio"/>
        </datamodel>
        <onentry>
          <assign loc="/datamodel/data[@name='fetchAudio']" expr="/datamodel/data[@name='queue']/prompt[1]"/>
          <delete loc="/datamodel/data[@name='queue']/prompt[1]"/>
          <send target="self" event="fetchAudioMin" delay="/datamodel/data[@name='fetchAudio'][@fetchaudiominimum]"/>
          <send target="player" event="Play" namelist="/datamodel/data[@name='fetchAudio']"/>
          <if cond="/datamodel/data[@name='bargeInType'].text() ne 'fetchAudio'">
            <send event="BargeInChange" namelist="fetchAudio"/>
          </if>
        </onentry>
        <transition event="CancelFetchAudio" target="WaitFetchMinimum"/>
        <transition event="fetchAudioMin" target="WaitFetchCancel"/>
      </state>
      <state id="WaitFetchMinimum">
        <transition event="fetchAudioMin" target="FetchAudioFinal">
          <send target="player" event="halt"/>
        </transition>
      </state>
      <state id="WaitFetchCancel">
        <transition event="CancelFetchAudio" target="FetchAudioFinal">
          <send target="player" event="halt"/>
        </transition>
      </state>
      <state id="FetchAudioFinal" final="true"/> <!-- could put cleanup handling here -->
    </state> <!-- end FetchAudio -->
  </state> <!-- end Created -->
</scxml>
The prompt queue resource can be controlled by means of the following events:
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
queuePrompt | any | prompt (M), properties (O) | | adds prompt to queue, but does not cause it to be played |
queueFetchAudio | any | prompt (M) | | adds fetch audio to queue, removing any existing fetch audio from the queue. Does not cause it to be played. |
play | any | | | causes any queued prompts or fetch audio to be played |
changeParameter | any | paramName, newValue | | sets the value of paramName to newValue, which may be either an absolute or relative value. The new setting takes effect immediately, even if a prompt is already playing. |
cancelFetchAudio | any | | | deletes any queued fetch audio. Also cancels any fetch audio that is already playing, unless fetchAudioMin has been specified and not yet reached. |
cancel | any | | | immediately cancels any prompt or fetch audio that is playing and clears the queue. |
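A minimal sketch of how a resource controller might drive the prompt queue with these events, written in the same SCXML notation used for the resource definitions; the state, target, and data names are illustrative, not defined by this specification:

<!-- Illustrative controller fragment: queue two prompts, start playback,
     and wait for the queue to drain. -->
<onentry>
  <send target="promptQueue" event="queuePrompt" namelist="welcomePrompt"/>
  <send target="promptQueue" event="queuePrompt" namelist="menuPrompt"/>
  <send target="promptQueue" event="play"/>
</onentry>
<transition event="prompt.Done" target="Listening"/>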
The prompt queue resource returns the following events to its invoker:
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
prompt.Done | controller | markName (O), markTime (O) | | indicates the prompt queue has played to completion and is now empty |
bargeintypeChange | controller | one of: unbargeable, hotword, energy, fetchAudio | | sent at the start of prompt play and whenever a new prompt or fetch audio is played whose bargeinType differs from the preceding one. |
Issue ():
Do we need 'fetchAudio' as a distinct bargein type?
Resolution:
None recorded.
The prompt queue receives the following events from the underlying player:
Event | Payload | Sequencing | Description |
---|---|---|---|
player.Done | | | sent whenever a single prompt or piece of fetch audio finishes playing. |
and sends the following events to the underlying device:
Event | Payload | Sequencing | Description |
---|---|---|---|
play | prompt (M) | | sent to the platform to cause a single prompt to be played. |
setParameter | paramName (M), value (O) | | sent to the platform to change the value of a playback parameter such as speed or volume. The new value may be absolute or relative. The change takes effect immediately. |
Two types of recognition resources are defined: DTMF recognition for recognition of DTMF input; and ASR recognition for recognition of speech input. Both recognition resources are associated with a device which implements their respective recognition services. Each device represents one or more actual recognizer instances. In case of a device implemented with multiple recognizers - for example two different speech recognition engines - it is the responsibility of the interpreter implementation to ensure that they adhere to the semantic model defined in this section.
DTMF and ASR recognition resources are semantically similar. They share the same state and eventing model as well as recognition processing, timing and result handling. However, the resources differ in the following respects:
Otherwise, ASR and DTMF recognition resources share the same semantic model.
If a resource controller activates both DTMF and ASR recognition resources, then that resource controller is responsible for managing the resources so that only a single recognition result is produced per recognition cycle.
Within its created state, the recognition resource operates as follows: grammars are added to the resource and subsequently prepared on the device; recognition with these grammars can be activated and suspended; and recognition results are returned.
When the recognition resource is ready to recognize (at least one active grammar), one or more recognition cycles may occur in sequence.
Thus a recognition resource may enter multiple recognition cycles (as required for 'hotword' recognition), while requiring that a device, even if it has multiple instantiations, only produces one set of recognition results per recognition cycle.
The recognition resource is defined in terms of a data model and state model.
The data model is composed of the following elements:
The state model is composed of states corresponding to functional state: idle, preparing grammars, ready to recognize, recognizing, suspended recognition and waiting for results.
In the idle state, the resource awaits events from resource controllers to activate grammars for recognition on the device. The data model - activeGrammars, properties, controller and mode - is (re-)initialized upon entry to this state: activeGrammars is cleared, and properties and controller are set to null. If the resource receives an 'addGrammar' event, a new item is added to activeGrammars using the grammar, properties and listener data in the event payload. If the resource receives a 'prepare' event, it updates its data model with the event data: 'properties' is updated with the properties event data and 'controller' with the controller event data. Subsequent event notifications and responses are sent to the resource controller identified as the 'controller'. The recognition resource then moves into the preparing grammars state.
In the preparing grammars state, the resource behavior depends on whether activeGrammars is empty or not. If activeGrammars is empty (i.e. no active grammars are defined for this recognition resource), the resource sends the controller a 'notPrepared' event and returns to the idle state. If activeGrammars is non-empty, the resource sends a 'prepare' event to the device. The event payload includes 'grammars' and 'properties' parameters. The 'grammars' value is an ordered list where each list item is a grammar's content and its properties extracted from activeGrammars. The order of grammars in the 'grammars' parameter must follow the order in the activeGrammars data model. If the device sends a 'prepared' event, the resource sends a 'prepared' event to the controller and transitions into the ready to recognize state.
When the recognition resource is in a ready to recognize state, it may receive a 'stop' event. In this case, the resource sends a 'stop' event to the device, and returns to the idle state. If the resource receives a 'listen' event, it sends a 'listen' event to the device and moves into the recognizing state.
When the resource is in a recognizing state, it can toggle between this state and a suspended recognizing state by receiving further 'listen' and 'suspend' events. If the resource receives a 'suspend' event, then it moves into the suspended recognizing state and sends the device a 'suspend' event which causes the device to suspend recognition and delete any buffered input. No input is buffered while the device is in a suspended state. If the resource then receives a 'listen' event, it moves back into the recognizing state.
When in the recognizing state, the resource may receive an 'inputStarted' event from the device, indicating that user input has been detected. The resource then moves into a waiting for results state. The device may send an 'error' event (for example, if maximum time has been exceeded) causing it to return to the idle state and send the controller an 'error' event. Alternatively, the device may send a 'recoResults' event, which contains a results parameter, a data structure representing recognition results in VoiceXML 2.0 or EMMA format. The structure may contain zero or more recognition results. Each result must specify the grammar associated with the recognition (using the same grammar name as used in the payload of the 'prepare' event), its recognition confidence and its input mode. The resource sends its controller a 'recoResults' event with event data containing the device's results parameter together with a listener parameter whose value is the listener associated with the grammar of the first result with the highest confidence (if there are no results, then the listener parameter is not defined). The resource then returns to the ready to recognize state, awaiting either a 'stop' event to terminate recognition or a 'listen' event to start another recognition cycle using the same active grammars and recognition properties.
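The cycle described above can be sketched in the same SCXML notation used elsewhere in this section, with a controller activating a grammar and then handling the outcome. The target, state, and data names below are illustrative assumptions, not defined identifiers:

<!-- Illustrative controller fragment for a single recognition cycle. -->
<onentry>
  <send target="asrResource" event="addGrammar" namelist="cityGrammar cityListener grammarProps"/>
  <send target="asrResource" event="prepare" namelist="thisController recoProps"/>
</onentry>
<transition event="prepared">
  <send target="asrResource" event="listen"/>
</transition>
<transition event="recoResults" target="ProcessResult"/>
<transition event="error" target="HandleError"/>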
A recognition resource is defined by the events it receives:
Event | Source | Payload | Sequencing | Description |
---|---|---|---|---|
addGrammar | any | grammar (M), listener (M), properties (O) | | creates a grammar item composed of the grammar, listener and properties, and adds it to activeGrammars |
prepare | any | controller (M), properties (M) | | prepares the device for recognition using activeGrammars and properties |
listen | any | | | initiates/resumes recognition |
suspend | any | | | suspends recognition |
stop | any | | | terminates recognition |
and the events it sends:
Event | Target | Payload | Sequencing | Description |
---|---|---|---|---|
prepared | controller | one-of: prepared, notPrepared | | positive response to prepare (activeGrammars prepared) |
notPrepared | controller | one-of: prepared, notPrepared | | negative response to prepare (no activeGrammars defined) |
inputStarted | controller | | | notification that the onset of input has been detected |
inputFinished | controller | | | notification that the end of input has been detected |
partialResult | controller | results (M), listener (O) | | notification of a partial recognition result |
recoResult | controller | results (M), listener (O) | | notification of a complete recognition result, including the results structure and a listener |
error | controller | error status (M) | | notification that an error has occurred |
The resource receives from the recognition device the following events:
Event | Payload | Sequencing | Description |
---|---|---|---|
prepared | | | response to prepare indicating that activeGrammars have been successfully prepared |
inputStarted | | | notification that the onset of input has been detected |
inputFinished | | | notification that the end of input has been detected |
partialResults | results (M) | | notification of partial recognition results |
recoResults | results (M) | | notification of final recognition results |
error | error status (M) | | an error occurred |
and sends to the recognition device the following events:
Event | Payload | Sequencing | Description |
---|---|---|---|
prepare | grammars (M), properties (M) | | the recognition device is prepared with grammars and properties |
listen | | | recognition is to be initiated |
suspend | | | recognition is to be suspended |
stop | | | recognition is to be stopped |
In VoiceXML 3.0, the language is partitioned into independent modules which can be combined in various ways. In addition to the modules defined in this section, it is also possible for third parties to define their own modules (see Section XXX).
Each module is assigned a schema, which defines its syntax, plus one or more Resource Controllers (RCs), which define its semantics, plus a "constructor" that knows how to create them from the syntactic representation at initialization time. Only DOM nodes that have schemas and constructors (and hence RCs) assigned to them can be modules in VoiceXML 3.0. However, we may choose to define constructors and RCs for nodes that are not modules. Nodes that do not have constructors and RCs ultimately depend on some module for their interpretation. (Those modules are usually ancestor nodes, but we do not require this.) There can be multiple modules associated with the same VoiceXML element. They may set properties differently, add different child elements, etc. In many cases, some of the modules will be extensions of the others, but we don't require this.
Note there is not necessarily a one-to-one relationship between semantic RCs and syntactic markup elements. It may take several RCs to implement the functionality of a single markup element.
This module describes the syntactic and semantic features of a <grammar> element which defines grammars used in ASR and DTMF recognition. Grammars defined via this module are used by other modules.
The attributes and content model of <grammar> are specified in 6.1.1 Syntax. Its semantics are specified in 6.1.2 Semantics.
[See XXX for schema definitions].
The <grammar> element has the attributes specified in Table 10.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
mode | The only allowed values are "voice" and "dtmf" | Defines the mode of the grammar following the modes of the W3C Speech Recognition Grammar Specification [SRGS]. | No | The value of the document property "grammarmode" |
weight | Weights are simple positive floating point values without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or many digits. | Specifies the weight of the grammar. See vxml2: Section 3.1.1.3 | No | 1.0 |
fetchhint | One of the values "safe" or "prefetch" | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. | No | None |
fetchtimeout | Time Designation | The interval to wait for the content to be returned before throwing an error.badfetch event. | No | None |
maxage | An unsigned integer | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. | No | None |
maxstale | An unsigned integer | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. | No | None |
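For illustration, a <grammar> element using several of these attributes might be written as follows. The attribute values are chosen for the example, and the single child required by the content model below is elided:

<!-- Illustrative only: attribute values are examples. -->
<grammar mode="dtmf" weight="0.8" fetchhint="prefetch" fetchtimeout="5s" maxage="60">
  <!-- exactly one child per the content model below, e.g. an inline SRGS grammar
       or an external grammar reference -->
</grammar>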
Editorial note: The default value of the "grammarmode" document property (see XXXX) is "voice".
The content model of <grammar> consists of exactly one of:
The grammar RC is the primary RC for the <grammar> element.
The grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the grammar RC first initializes its child.
In the Ready state, when the grammar RC receives an 'execute' event it transitions to the Executing state.
In the Executing state,
If the child RC is an External Grammar, the grammar RC sends an 'execute' event to the child RC and waits for it to complete.
Then, the grammar RC sends an AddGrammar event to the DTMF Recognizer Resource if mode="dtmf" or to the ASR Recognizer Resource if mode="voice", with the following as event data: the child RC, the fetchhint, language, charset, and encoding parameter values, and the controller RC (e.g., link, field, or form) as the handler for recognition results.
Finally, the grammar RC sends the controller an executed event and transitions to the Ready state.
Editorial note | |
Initializing: Validate that behavior of sending a pointer to the child RC to the ASR resource. Is this acceptable, or do we need to extract the grammar data from the child RC and then send that data? The advantage of sending the RC pointer is that it makes clear what kind of grammar info it is -- inline SRGS or external reference. Execute issues:
Editor will write new section 4.5 "Other" and subsections 4.5.1 "property/attribute resolution" and 4.5.2 "language resolution". Depending on the text, we may need to update the semantics to refer to section 4.5.2 when describing how xml:lang is used. |
The Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
execute | controller | | Adds the grammar to the appropriate Recognition Resource |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
executed | controller | | response to execute event indicating that it has been successfully executed |
The external events sent and received by the Grammar RC are those defined in this table:
Event | Source | Target | Description |
addGrammar | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Adds grammar to list of currently active grammars |
Prefetch | GrammarRC | DTMF Recognition Resource or ASR Recognition Resource | Requests that the grammar be fetched/compiled in advance, if possible |
The events in this table may be raised during initialization and execution of the <grammar> element.
Event | Description | State |
---|---|---|
error.semantic | indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution |
Note that additional errors may occur when the grammar is fetched or added by the ASR or DTMF resource. Please check there for details.
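As an informal illustration only, a <grammar> element using the attributes above might be written as follows; the grammar URI is hypothetical, and the <externalgrammar> child element is the one defined in 6.3 (its name is still under discussion):
<grammar mode="voice" weight="1.2" fetchhint="prefetch" fetchtimeout="10s">
  <externalgrammar src="http://www.example.com/cities.grxml#city"/>
</grammar>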
This module describes the syntactic and semantic features of inline SRGS grammars used in ASR and DTMF recognition.
Editorial note | |
Issue: Do we need to support inline ABNF SRGS?: |
The attributes and content model of Inline SRGS grammars are specified in 6.2.1 Syntax. Their semantics are specified in 6.2.2 Semantics.
[See XXX for schema definitions].
The syntax of the Inline SRGS Grammar Module is precisely all of the XML markup for a legal stand-alone XML form grammar as described in SRGS ([SRGS]), minus the XML Prolog. Note that both elements and attributes must be in the SRGS namespace (http://www.w3.org/2001/06/grammar).
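For illustration, a minimal fragment of this kind might look like the following sketch (rule and token names are hypothetical):
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" root="answer" xml:lang="en-US">
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>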
The Inline SRGS grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
Editorial note | |
Should the contents of the grammar parameter be parsed rather than the raw document text? For example, should it be the DOM representation of the grammar, or just the XML Info set, or what? |
The Inline SRGS grammar RC's state model consists of the following states: Idle, Initializing, and Ready. Unlike most of the other modules, this module is primarily a data model for storing a grammar. The module itself has no execution semantics.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the syntactic contents of the grammar are saved into the grammar parameter. The RC sends the controller an 'initialized' event and transitions to the Ready state.
The Inline SRGS Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
This module describes the syntactic and semantic features of an <externalgrammar> element which defines external grammars used in ASR and DTMF recognition.
Editorial note | |
The name of this element is still under discussion. |
The attributes and content model of <externalgrammar> are specified in 6.3.1 Syntax. Its semantics are specified in 6.3.2 Semantics.
[See XXX for schema definitions].
The <externalgrammar> element has the attributes specified in Table 17.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
src | anyURI | The URI specifying the location of the grammar and optionally a rulename within that grammar. The URI is interpreted as a rule reference as defined in Section 2.2 of the Speech Recognition Grammar Specification [SRGS], but not all forms of rule reference are permitted from within VoiceXML. The rule reference capabilities are described in detail below this table. | No | None |
srcexpr | A data model expression | Equivalent to src, except that the URI is dynamically determined by evaluating the content as a data model expression. | No | ||||||||||||||||||||||
type | string | The preferred media type of the grammar. A resource indicated by the URI reference in the src attribute may be available in one or more media types. The author may specify the preferred media-type via the type attribute. When the content represented by a URI is available in many data formats, a VoiceXML platform may use the preferred media-type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the preferred media-type to order the preferences in the negotiation. The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'application/srgs+xml' document). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative. Three special cases may arise. The declared media-type may not be supported by the processor; in this case, an error.unsupported.format is thrown by the platform. The declared media-type may be supported but the actual media-type may not match; an error.badfetch is thrown by the platform. Finally, there may be no declared media-type; the behavior depends on the specific URI scheme and the capabilities of the grammar processor. For instance, HTTP 1.1 allows document introspection (see [RFC2616], section 7.2.1), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples: | No | None |
Editorial note | |
Error messages for "type" attribute need to be updated. |
See 6.3.1.2 Content Model for restrictions on occurrence of src and srcexpr attributes.
The value of the src attribute is a URI specifying the location of the grammar with an optional fragment for the rulename. Section 2.2 of the Speech Recognition Grammar Specification [SRGS] defines several forms of rule reference. The following are the forms that are permitted on a grammar element in VoiceXML:
The following are the forms of rule reference defined by [SRGS] that are not supported in VoiceXML 3.
The <externalgrammar> element has the following co-occurrence constraints:
Editorial note | |
Editor: please remove the "otherwise, an error.badfetch ..." from the above and all other co-occurrence text and write general text somewhere describing what happens when a co-occurrence constraint is violated. |
The External Grammar RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The External Grammar RC's state model consists of the following states: Idle, Initializing, Ready, and Executing.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, when the External Grammar RC receives an 'execute' event it transitions to the Executing state.
In the Executing state, if the srcexpr variable is set, it is evaluated against the data model as a data model expression and the resulting value is placed into the src variable. If srcexpr cannot be evaluated, an error.semantic event is thrown; otherwise, the RC sends an 'executed' event to the controller RC and transitions into the Ready state.
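As an informal sketch (the element name is still under discussion, and the URIs, the rulename fragment, and the 'lang' variable in the data model expression are hypothetical), an external grammar reference might be written in either of the following ways:
<externalgrammar src="http://www.example.com/dates.grxml#month"/>
<externalgrammar srcexpr="'http://www.example.com/' + lang + '/dates.grxml'"/>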
The External Grammar RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
execute | controller | | Evaluates srcexpr and populates src variable |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
executed | controller | | response to execute event indicating that it has been successfully executed |
The events that may be raised during initialization and execution of the <externalgrammar> element are those defined in Table 21 below.
Event | Description | State |
---|---|---|
error.semantic | indicates that there was an error in the evaluation of the srcexpr attribute. | execution |
This module defines the syntactic and semantic features of a <prompt> element which controls media output. The content model of this element is empty: content is defined in other modules which extend this element's content model (for example 6.5 Builtin SSML Module, 6.6 Media Module and 6.7 Parseq Module).
The attributes and content model of <prompt> are specified in 6.4.1 Syntax. Its semantics are specified in 6.4.2 Semantics, including how the final prompt content is determined and how the prompt is queued for playback using the PromptQueue Resource (5.2 Prompt Queue Resource).
[See XXX for schema definitions].
The <prompt> element has the attributes specified in Table 22.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
bargein | boolean | Controls whether the prompt can be interrupted. | No | bargein property |
bargeintype | string | On prompts that can be interrupted, determines the type of bargein, either 'speech' or 'hotword'. | No | bargeintype property |
cond | data model expression | A data model expression that must evaluate to true after conversion to boolean in order for the prompt to be played. | No | true |
count | positive integer | A number indicating the repetition count, allowing a prompt to be activated or not depending on the current repetition count. | No | 1 |
timeout | Time Designation | The time to wait for user input. | No | timeout property |
xml:lang | string | The language identifier for the prompt. | No | document's "xml:lang" attribute |
xml:base | string | Declares the base URI from which relative URIs in the prompt are resolved. | No | document's "xml:base" attribute |
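For illustration only (a sketch; the prompt text is hypothetical, and the plain-text content assumes the content model has been extended by the Builtin SSML Module in 6.5), a <prompt> element using these attributes might be written as:
<prompt bargein="true" bargeintype="speech" count="2" timeout="5s">
  Please say the name of the city you are calling about.
</prompt>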
The prompt RC is the primary RC for the <prompt> element.
The prompt RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The prompt RC's state model consists of the following states: Idle, Initializing, Ready, and Executing. The initial state is the Idle state.
While in the Idle state, the prompt RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The prompt RC then transitions into the Initializing state.
In the Initializing state, the prompt RC initializes its children: this is modeled as a separate RC (see XXX). The children may return an error for initialization. If a child sends an error, then the prompt RC returns an error. When all children are initialized, the prompt RC sends the controller an 'initialized' event and transitions to the Ready state.
In the Ready state, the prompt RC can receive a 'checkStatus' event to check whether this prompt is eligible for execution or not. The value of the cond parameter in its data model is checked against the data model resource: the status is true if the value of the cond parameter evaluates to true. The status, together with its count data, is sent in a 'checkedStatus' event to the controller RC. The controller RC then determines if the prompt is selected for execution ([vxml20: 4.1.6], see PromptSelectionRC, Section XXX). If the prompt RC receives an 'execute' event it transitions to the Executing state.
In the Executing state, the prompt RC sends an evaluate event to its children. Each child returns either an error, or content (which may include parameters) for playback. If a child sends an error, then the prompt RC returns an error. Once evaluation is complete, the RC sends a queuePrompt event to the Prompt Queue Resource with the <prompt> parameters (bargein, bargeintype, timeout) with event data consisting of the list of content returned by its children. The prompt RC then sends the controller an executed event and transitions to the Ready state.
Editorial note | |
SSML validation issue: what if evaluation results in a non-valid structure? |
The Prompt RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | causes the element and its children to be initialized |
checkStatus | controller | | causes evaluation of the cond parameter against the data model |
execute | controller | | causes the evaluation of its content and conversion to a format suitable for queueing on the PromptQueue Resource |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | response to initialize event indicating that it has been successfully initialized |
checkedStatus | controller | status (M), count (M) | response to checkStatus event with count parameter and status indicating evaluation of cond parameter |
executed | controller | | response to execute event indicating that it has been successfully executed |
Table 25 shows the events sent and received by the prompt RC to resources and other RCs which define the events.
Event | Source | Target | Description |
evaluate | PromptRC | DataModel | used to evaluate the cond parameter (see XXX) |
queuePrompt | PromptRC | PromptQueue | adds prompt content and properties to the Prompt Queue (see XXX) |
The events in Table 26 may be raised during initialization and execution of the <prompt> element.
Event | Description | State |
---|---|---|
error.unsupported.language | indicates that an unsupported language was encountered. The unsupported language is indicated in the event message variable. | execution |
error.unsupported.element | indicates that an element within the <prompt> element is not supported | initialization |
error.badfetch | indicates that the prompt content is malformed ... | initialization, execution |
error.noresource | indicates that a Prompt Queue resource is not available for rendering the prompt content. | execution |
error.semantic | indicates an error with data model expressions: undefined reference, invalid expression resolution, etc. | execution |
Editorial note | |
The relationship between the user visible events defined in the above table, and semantic event model has yet to be defined. Can we really determine whether errors are raised in initialization (syntax) or execution (evaluation) states? How does this fit in with errors returned when prompts are played in PromptQueue player implementation? ACTION: Clarify which specific cases are affected by 'error.badfetch' ambiguity re. initialization versus execution states. Clarify that error.semantic doesn't apply to evaluation of src/expr with <audio> (e.g. fallback). Clarify that errors are recorded? (vxml21??) Should media control properties (e.g. clipBegin, speed, etc) of <media> be also available on <prompt>? We should clarify where the error.badfetch gets thrown. For instance, if we are loading a document with malformed prompt elements, the error.badfetch may get thrown back to the calling document. If we are throwing error.badfetch during execution, then it will be thrown back to the malformed document itself? |
This module describes the syntactic and semantic features of SSML elements built into VoiceXML.
This module is designed to extend the content model of the <prompt> element defined in 6.4 Prompt Module.
The attributes and content model of SSML elements are specified in 6.5.1 Syntax. Its semantics are specified in 6.5.2 Semantics, including how elements are evaluated to yield final content for playback.
[See XXX for schema definitions].
This module defines an SSML ([SSML]) Conforming Speech Synthesis Markup Language Fragment where:
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
fetchtimeout | See fetchtimeout definition | No | fetchtimeout property | |
fetchhint | See fetchhint definition | No | audiofetchhint property | |
maxage | See maxage definition | No | audiomaxage property | |
maxstale | See maxstale definition | No | audiomaxstale property | |
expr | A data model expression which determines the source of the audio to be played. The expression may be either a reference to audio previously recorded (see Record Module) or evaluate to the URI of an audio resource to fetch. | No | undefined |
Exactly one of "src" or "expr" attributes must be specified; otherwise, an error.badfetch event is thrown.
Editorial note | |
SSML 1.1 required for fetching attributes like fetchtimeout? Or profile dependent? Support for 'say-as' extension to SSML 1.0? Support for <enumerate>? Note that profiles specify which media formats are required |
When the RC receives an evaluate event, its children are evaluated in order to return an SSML Conforming Stand-Alone Speech Synthesis Markup Language Document which can be processed by a Conforming Speech Synthesis Markup Language Processor.
Evaluation comprises:
Editorial note | |
We may want to refine the description that the output of evaluation is an SSML Document. One rationale is that we don't want to prohibit that SSML extensions are lost during evaluation. The output may be another Fragment rather than a Document. Clarify exact nature of <audio> expr value for skipping - undefined vs. null? Need to specify further error cases Do these elements have RCs? They are in the VoiceXML namespace but are just enhanced SSML elements. Need to clarify unsupported languages and external (e.g. MRCP) SSML processors. |
In this example
<prompt>
  <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
  </foreach>
</prompt>
evaluation returns a sequence of content for each item in <foreach> with <audio> and <value> elements.
Assume that the array consists of two items, whose item.audio properties evaluate to 'one.wav' and 'two.wav' respectively, and whose item.tts properties evaluate to 'one' and 'two' respectively. Evaluation of <foreach> is equivalent to the following
<prompt>
  <audio expr="'one.wav'"><value expr="'one'"/></audio>
  <break time="300ms"/>
  <audio expr="'two.wav'"><value expr="'two'"/></audio>
  <break time="300ms"/>
</prompt>
further evaluation of the <audio> and <value> elements results in
<prompt>
  <audio src="one.wav">one</audio>
  <break time="300ms"/>
  <audio src="two.wav">two</audio>
  <break time="300ms"/>
</prompt>
and finally the prompt content is converted into a stand-alone SSML document (assuming the <prompt>'s xml:lang attribute evaluates to 'en'):
<speak version="1.0" xml:lang="en" xmlns="http://www.w3.org/2001/10/synthesis">
  <audio src="one.wav">one</audio>
  <break time="300ms"/>
  <audio src="two.wav">two</audio>
  <break time="300ms"/>
</speak>
This content is queued and played using the PromptQueue: each audio URI, or fallback content, is played, followed by a 300 millisecond break.
The media module defines the syntax and semantics of the <media> element.
The module is designed to extend the content model of <prompt> in the prompt module (6.4 Prompt Module).
The <media> element can be seen as an enhanced and generalized version of the VoiceXML <audio> element. It is enhanced in that it provides additional attributes describing the type of media and conditional selection, as well as control over playback. It is a generalization of the <audio> element in that it permits media other than audio to be played; for example, media formats which contain audio and video tracks.
[See XXX for schema definitions].
The <media> element has the attributes specified in Table 28.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
src | anyURI | The URI specifying the location of the media source. | No | None |
srcexpr | A data model expression | A data model expression which evaluates to a URI indicating the location of the media resource. | No | undefined |
cond | A data model expression | A data model expression that must evaluate to true after conversion to boolean in order for the media to be played. | No | true |
type | string | The preferred media type of the output resource. A resource indicated by the URI reference in the src attribute may be available in one or more media types, and the author may specify the preferred media-type via the type attribute. The resource representation delivered by dereferencing the URI reference may be considered in terms of two types. The declared media-type is the asserted value for the resource and the actual media-type is the true format of its content. The actual media-type should be the same as the declared media-type, but this is not always the case (e.g. a misconfigured HTTP server might return 'text/plain' for an 'audio/x-wav' or 'video/3gpp' resource). A specific URI scheme may require that the resource owner always, sometimes, or never return a media-type. The declared media-type is the value returned by the resource owner or, if none is returned, the preferred media type. There may be no declared media-type if the resource owner does not return a value and no preferred type is specified. Whenever specified, the declared media-type is authoritative. Three special cases may arise. | No | undefined |
clipBegin | Time Designation | offset from start of media to begin rendering. This offset is measured in normal media playback time from the beginning of the media. | No | 0s |
clipEnd | Time Designation | offset from start of media to end rendering. This offset is measured in normal media playback time from the beginning of the media. | No | None |
repeatDur | Time Designation | total duration for repeatedly rendering media. This duration is measured in normal media playback time from the beginning of the media. | No | None |
repeatCount | positive Real number | number of iterations of media to render. A fractional value describes a portion of the rendered media. | No | 1 |
soundLevel | signed ("+" or "-") CSS2 numbers immediately followed by "dB" | Decibel values are interpreted as a ratio of the squares of the new signal amplitude (a1) and the current amplitude (a0) and are defined in terms of dB: soundLevel(dB) = 20 log10 (a1 / a0) A setting of a large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half the amplitude of its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice the amplitude of its current signal amplitude (subject to hardware limitations). The absolute sound level of media perceived is further subject to system volume settings, which cannot be controlled with this attribute. | No | +0.0dB |
speed | x% (where x is a positive real value) | the speed at which to play the referenced media, relative to the original speed. The speed is set to the requested percentage of the speed of the original media. For audio, a change in the speed will change the rate at which recorded samples are played back and this will affect the pitch. | No | 100% |
outputmodes | space separated list of media types | Determines the modes used for media output. See 8.2.4 Media Properties for further details. | No | outputmodes property |
See occurrence constraints for restrictions on occurrence of src and srcexpr attributes.
Calculations of rendered durations and interaction with other timing properties follow SMIL 2.1 Computing the active duration where
Note that not all SMIL 2.1 Timing features are supported.
Editorial note | |
Use SMIL 3.0 or SMIL 2.1 reference? should trimming and media attributes also be defined in <prompt>? do we need expr values for type, clipBegin, clipEnd, repeatDur, repeatCount, etc? (Perhaps add implied expr for every attribute?) when is a property evaluation error thrown? Add fetchtimeout, fetchhint, maxage and maxstale attributes Major attribute candidate: errormode (flexible error handling which controls whether errors are thrown or fallback is used). Other candidate attributes: id/idref (use case?) |
The <media> element content model consists of:
The <media> element has the following co-occurrence constraints:
Note that the type attribute does not affect inline content. The handling of inline XML content is in accordance with the namespace of the root element (such as SSML <speak>, SMIL <smil>, and so forth). CDATA, or mixed content with VoiceXML <foreach> or <value> elements, must be treated as an SSML Fragment and evaluated as described in 6.6.2 Semantics.
Editorial note | |
Permit other types of inline content apart from SSML? Are child <property> elements necessary? Alternative: extended <prompt> so that <property> children are allowed? |
Developers should be aware that there may be performance implications when using <media> depending on which attributes are specified, the media itself, its transport and processing.
Since operations like trimming, soundLevel and speed modifications are applied to media, this requires that the SSML processor begin generating output audio before these operations are applied. If the clipBegin attribute is specified, this may require generation of audio prior to clipBegin, depending on the implementation. This may lead to a gap between execution of the <media> element and start of playback.
If the media is fetched with the HTTP protocol and the clipBegin attribute is specified, then, unless the resource is cached locally, the part of the media resource before the clipBegin offset will still be fetched from the origin server. This may result in a gap between the execution of the <media> element and playback actually beginning.
Note also if <media> uses the RTSP protocol, and the VoiceXML platform supports this protocol, then the clipBegin attribute value may be mapped to the RTSP Range header field, thereby reducing the gap between element execution and the onset of playback.
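As a hypothetical illustration of these trade-offs (the resource URIs are invented), the following <media> elements each skip the first 30 seconds of a resource; with plain HTTP fetching the skipped portion may still be transferred, whereas an RTSP-capable platform might map the offset to a Range header:
<media type="audio/x-wav" clipBegin="30s" src="http://www.example.com/podcast.wav"/>
<media type="video/3gpp" clipBegin="30s" src="rtsp://streaming.example.com/podcast.3gp"/>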
When a media RC receives an evaluate event, the following operations are performed:
Editorial note | |
Semantics needs to address a mixed content model; e.g. CDATA and XML elements as children of the root. Do we require 'application/ssml+xml' type with SSML and CDATA content? Need to clarify where resource fetching takes place in the semantic model. Eg. in prompt initializing or executing state? or in prompt queue? This approach assumes the prompt queue applies media processing operations. Intended to fit with the VCR/RTC approach. What about streaming cases? Allow streams to be returned? Specify how errors are addressed. |
Playback of external audio media resource.
<media type="audio/x-wav" src="http://www.example.com/resource.wav"/>
Application of media operations to an audio resource. The soundLevel setting of +6.0dB approximately doubles the signal amplitude and the speed is reduced to 50%.
<media type="audio/x-wav" soundLevel="+6.0dB" speed="50%" src="http://www.example.com/resource.wav"/>
Playback of 3GPP media resource.
<media type="video/3gpp" src="http://www.example.com/resource.3gp"/>
Playback of 3GPP media resource with the speed doubled and playback ending after 5 seconds.
<media type="video/3gpp" clipEnd="5s" speed="200%" src="http://www.example.com/resource.3gp"/>
Playback of external SSML document.
<media type="application/ssml+xml" src="http://www.example.com/resource.ssml"/>
Inline CDATA content with a <value> element
<media> Ich bin ein Berliner, said <value expr="speaker"/> </media>
which is syntactically equivalent to
<media>
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner, said <value expr="speaker"/>
  </speak>
</media>
Inline SSML content to which gain and clipping operations are applied.
<media soundLevel="+4.0dB" clipBegin="4s">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
  </speak>
</media>
Inline SSML with audio media fallback.
<media soundLevel="+4.0dB" clipBegin="4s">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    Ich bin ein Berliner.
  </speak>
  <media type="audio/x-wav" src="ichbineinberliner.wav"/>
</media>
This module defines the syntax and semantics of <par> and <seq> elements. The <par> element specifies playback of media in parallel, while <seq> specifies playback in sequence.
The module is designed to extend the content model of the <prompt> element (6.4 Prompt Module).
This module is dependent upon the media module (6.6 Media Module).
With connections which support multiple media streams, it is possible to play back multiple media types simultaneously. For media container formats like 3GPP, audio and video media can be generated simultaneously from the same media resource.
There are established use cases for simultaneous playback of multiple media which are specified in separate resources:
The intention is to provide support for basic use cases where audio or TTS output from one resource can be complemented with output from another resource, as permitted by the connection and platform capabilities.
The <par> element is derived from SMIL <par> element, a time container for parallel output of media resources. Media elements (or containers) within a <par> element are played back in parallel.
Editorial note | |
SMIL reference should be added in references section SMIL references SMIL is Synchronized Multimedia Integration Language (SMIL). Reference to SMIL 1.0 (or later) Specification |
The <par> element has the attributes specified in Table 29.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
endsync | One of "first" or "last" | Indicates when the element is considered complete. 'first' indicates that the element is complete when any media (or container) child reports that it is complete; 'last' indicates it is complete when all media children are complete. | No | last |
The content model of <par> consists of:
The <seq> element is derived from the SMIL <seq> element, a time container for sequential output of media resources. Media elements within a <seq> element are played back in sequence.
No attributes are defined for <seq>.
The content model of <seq> consists of:
Editorial note | |
Issue: how should parallel playback interact with the PromptQueue resource? The simplest assumption would be that if this module is supported, then prompt queue needs to be able to handle parallel playback. For example when bargein event happens during the parallel execution, the synchronization between both prompt and for example video play should be handled. This information should be explained in the prompt queue resource section. |
This module requires a PromptQueue resource which supports playback of parallel and sequential media. The following defines its playback completion, termination and error handling.
Completion of playback of the <par> element is determined according to the value of its endsync attribute. For instance, assume a <par> element containing <media> (or <seq>) elements A and B, and that B finishes before A. If endsync has the value first, then completion is reported upon B's completion. If endsync has the value last, then completion is reported upon A's completion.
Completion of playback of the <seq> element occurs when the last <media> is complete.
If the <par> element playback is terminated, then playback of its <media> and <seq> children is terminated. Likewise, if the <seq> element playback is terminated, then playback of its (active) <media> elements is terminated.
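For example (a sketch; the resource names are hypothetical), in the following <par> element the looping background audio is terminated as soon as the announcement completes, because endsync is set to 'first':
<par endsync="first">
  <media type="audio/x-wav" src="background.wav" repeatCount="10"/>
  <media type="application/ssml+xml" src="announcement.ssml"/>
</par>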
If mark information is provided by <media> elements (for example with SSML), then the mark information associated with the last element played in sequence or parallel is exposed as described in XXX.
Editorial note | |
Open issue: Clarify interaction with VCR media control model(s). <reposition> approach would require that <par> and <seq> need to be able to restart from a specific position indicated by the markname/time of a <media> element contained within them. RTC approach would require that for <par>, media operations are applied in parallel. |
Error handling policy is inherited from the element of which the <par> and <seq> elements are children.
For instance if the policy is to ignore errors, then the following applies:
If the policy is to terminate playback and report the error, then any error causes immediate termination of any playback and the error is reported.
If execution of the <par> and <seq> elements requires media capabilities which are not supported by the platform or the connection, or there is an error fetching or playing any <media> element within <par> or <seq>, then error handling follows the defined policy.
Video avatar with audio commentary. Note the use of the outputmodes attribute of <media> to ensure that only video is played from the avatar resource.
<par>
  <media type="audio/x-wav" src="commentary.wav"/>
  <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
</par>
Video avatar with a sequence of audio and TTS commentary.
<par>
  <seq>
    <media type="audio/x-wav" src="intro.wav"/>
    <media type="application/ssml+xml" src="commentary.ssml"/>
  </seq>
  <media type="video/3gpp" src="avatar.3gp" outputmodes="video"/>
</par>
This module describes the syntactic and semantic features of the <foreach> element.
This module is designed to extend the content model of an element in another module. For example, SSML elements in the 6.5 Builtin SSML Module, the <prompt> element defined in 6.4 Prompt Module, etc.
The attributes and content model of the element are specified in 6.8.1 Syntax. Its semantics are specified in 6.8.2 Semantics.
[See XXX for schema definitions].
The <foreach> element has the attributes specified in Table 30.
Name | Type | Description | Required | Default Value |
---|---|---|---|---|
array | A data model expression that must evaluate to an array; otherwise, an error.semantic event is thrown. Note that the <foreach> element operates on a shallow copy of the array specified by the array attribute. | Yes | ||
item | A data model variable that stores each array item upon each iteration of the loop. A new variable will be declared if it is not already defined within the parent's scope. | Yes |
Both "array" and "item" must be specified; otherwise, an error.badfetch event is thrown.
The iteration process starts from an index of 0 and increments by one to an index of array_name.length - 1, where array_name is the name of the shallow copied array operated on by the <foreach> element. For each index, a shallow copy or reference to the corresponding array element is assigned to the item variable (i.e. <foreach> assignment is equivalent to item = array_name[index] in ECMAScript); the assigned value could be undefined for a sparse array. Undefined array items are ignored.
VoiceXML 3.0 does not provide break functionality to interrupt a <foreach>.
Editorial note | |
Clarify that array items which evaluate to ECMAScript undefined are ignored? |
When the RC receives an evaluate event, the RC loops through the array to produce evaluated content for each item in the array.
Editorial note | |
These examples may be moved to the respective profile section later. |
The vxml21 profile defines the content model for the <foreach> element so that it may appear in executable content and within <prompt> elements.
Within executable content, except within a <prompt>, the <foreach> element may contain any elements of executable content; this introduces basic looping functionality by which executable content may be repeated for each element of an array.
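For instance, the following sketch shows <foreach> used as executable content under the vxml21 profile, assuming that profile includes the VoiceXML 2.1 <log> element and that a 'cities' array variable exists; each iteration simply logs one array item:
<foreach item="city" array="cities">
  <log expr="'visited: ' + city"/>
</foreach>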
When <foreach> appears within a <prompt> element as part of Builtin SSML content, it may contain only those elements valid within <enumerate> (i.e. the same elements allowed within <prompt> less <meta>, <metadata>, and <lexicon>); this allows for sophisticated concatenation of prompts.
In this example using Builtin SSML, each item in the array has an audio property with a URI value, and a tts property with SSML content. The element loops through the array, playing the audio URI or the SSML content as fallback, with a 300 millisecond break between each iteration.
<prompt>
  <foreach item="item" array="array">
    <audio expr="item.audio"><value expr="item.tts"/></audio>
    <break time="300ms"/>
  </foreach>
</prompt>
In the mediaserver profile, <foreach> may occur within <prompt> elements and has a content model of 0 or more <media> elements.
Play each media resource in the array.
<foreach item="item" array="array">
  <media type="audio/x-wav" srcexpr="item.audio"/>
</foreach>
Play each media resource in the array, with inline SSML content as fallback.
<foreach item="item" array="array">
  <media type="audio/x-wav" srcexpr="item.audio">
    <media type="application/ssml+xml">
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
        <value expr="item.tts"/>
        <break time="300ms"/>
      </speak>
    </media>
  </media>
</foreach>
Forms are the key component of VoiceXML documents. A form contains:
id | The name of the form. If specified, the form can be referenced within the document or from another document. For instance <form id="weather">, <goto next="#weather">. |
scope | The default scope of the form's grammars. If it is dialog then the form grammars are active only in the form. If the scope is document, then the form grammars are active during any dialog in the same document. If the scope is document and the document is an application root document, then the form grammars are active during any dialog in any document of this application. Note that the scope of individual form grammars takes precedence over the default scope; for example, given a form in a non-root document with the default scope "dialog" and a form grammar with the scope "document", that grammar is active in any dialog in the same document. |
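The following sketch (the attribute values, prompt text and grammar URI are hypothetical, and <externalgrammar> is the element defined in 6.3) illustrates a document-scoped form containing a single field:
<form id="weather" scope="document">
  <field name="city">
    <prompt>Which city?</prompt>
    <grammar mode="voice">
      <externalgrammar src="http://www.example.com/cities.grxml"/>
    </grammar>
  </field>
</form>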
The Form RC is the primary RC for the <form> element.
The Form RC interacts with resource controllers of other modules so as to provide the behavior of the VoiceXML 2.1/2.0 <form> tag. Input and control form items are modeled as resource controllers: for example, the <field> RC (6.10.2.1 Field RC) of the Field Module.
The behavior of the Form RC follows the VoiceXML FIA, although some aspects are not modeled directly in this RC: external transition handling is not part of the form RC; input items use separate RCs to manage coordination between media resources, while recognition results can be received directly by form, field or other RCs.
[This initial version does not address all aspects of FIA behavior; for example, event handling, error handling and external transitions are not covered.]
The form RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The form RC's state model consists of the following states: Idle, Initializing, Ready, SelectingItem, PreparingItem, PreparingFormGrammars, PreparingOtherGrammars, Executing, Active, ProcessingFormResult, Evaluating and Exit.
In the Idle state, the form RC can receive an 'initialize' event whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC creates a dialog scope in the Datamodel Resource and then initializes its children: this is modeled as a separate RC. When all children are initialized, the RC sends an 'initialized' event to its controller and transitions to the Ready state.
In the Ready state, the form RC sets its active status to false. It can receive one of two events: 'prepareGrammars' or 'execute'. A 'prepareGrammars' event indicates that another form is active, but this form's form-level grammars may be activated; an 'execute' event indicates that this form is active. If the RC receives a 'prepareGrammars' event, it transitions to the PreparingFormGrammars state. If the RC receives an 'execute' event, it sets its active data to true and transitions to the SelectingItem state.
In the SelectingItem state, the RC determines which form item to select as the active item. This is defined by a FormItemSelection RC which iterates over the children, sending each a 'checkStatus' event. If a child returns a true status (indicating that it is ready for execution), the activeItem is set to this child RC and the RC transitions to the PreparingItem state. If no child returns this status, then the RC is complete and transitions to the Exit state.
In the PreparingItem state, the activeItem is sent a 'prepare' event causing it to prepare itself; for example, the field RC prepares its prompts and grammars for execution. When the activeItem returns a 'prepared' event, the event data indicates whether the item is modal or not. If the item is modal, then the form RC transitions to the Executing state. If the item is not modal (other grammars can be activated), then the form RC transitions to the PreparingFormGrammars state.
In the PreparingFormGrammars state, the RC prepares form-level grammars. This is defined by a separate RC which iterates through and executes grammar children. When this is complete, the RC transitions to the Active state if the form is not active (according to its active data), and transitions to the PreparingOtherGrammars state if the form is active.
In the PreparingOtherGrammars state, the RC sends a 'prepareGrammars' event to its controller RC (which in turn sends the event to appropriate form, document and application level RCs with grammars). When it receives a 'prepared' event from its controller, the RC transitions to the Executing state.
In the Executing state, the form RC sends an 'execute' event to the active form item. If the form item is a field, then this causes prompts to be played and recognition to take place. The RC then transitions to the Active state awaiting a result.
In the Active state, the RC re-initializes the justFilled data to a new array and waits for a recognition result (as the active or a non-active form), or for a signal from its selected form item that it has received the recognition result. Recognition results are divided into two types: form item level results, received and processed by the form item; and form level results, which are received by the form RC which caused the grammar to be added. If a 'recoResult' event is received by the form RC, the RC transitions into the ProcessingFormResult state. If the active form item receives the recognition result (and has locally updated itself), then the form RC receives a 'formItemResult' event, adds the active item to the justFilled array, and transitions into the Evaluating state.
In the ProcessingFormResult state, the recognition result is processed by iterating through the form item children, obtaining their name and slotname, and then attempting to match the slotname to the results. If the match is successful, the child's name variable in the data model is updated with the value from the recognition result and the child is added to the justFilled data array. When this process is complete, the form RC transitions to the Evaluating state.
In the Evaluating state, the form RC iterates through its children and, if a child is a member of the justFilled array, sends an 'evaluate' event to the form item RC causing the appropriate <filled> RCs to be executed. If the child is a <filled> RC, then it is executed if appropriate. When evaluation is complete, the form RC transitions to the SelectingItem state so that the next form item can be selected for execution.
The Form RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | Update the data model |
prepareGrammars | controller | | Another form is active, but the current form's form-level grammars may be activated. |
execute | controller | | Current form is active |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | | Notification that initialization is complete |
prepareGrammars | controller | | Sent to prepare grammars to appropriate form, document and application level RCs. |
execute | controller | | Notification of complete recognition result from the field RC. |
The following table shows the events sent and received by the form RC to resources and other RCs which define the events.
Event | Source | Target | Description |
checkStatus | FormRC | FormItem RC | Check if ready for execution |
createScope | FormRC | DataModel | Creates a scope. |
destroyScope | FormRC | DataModel | Delete a scope. |
evaluate | FormRC | FormItem RC | Process form item being filled. |
execute | FormRC | FormItem RCs | Start execution. |
prepare | FormRC | FormItem RC | Initiates preparation needed before execution. |
formItemResult | FormItemRC | FormRC | Results received by the form item. |
prepared | FormItemRC | FormRC | Indicates that preparation is complete. |
recoResult | PlayAndRecognize RC | FormRC | Results filled at the form level and not form item level. |
Editorial note | |
Open issue: This section plans to use the same approach as section 2.3.1 in VoiceXML 2.0, but without support for the type attribute nor the <option> tag. Builtin types will be handled through a separate module. |
The semantics of field elements are defined using the following resource controllers: Field (6.10.2.1 Field RC), PlayandRecognize (6.10.2.2 PlayandRecognize RC), ...
The Field Resource Controller is the primary RC for the field element.
The field RC is defined in terms of a data model and state model.
The data model is composed of the following parameters:
The field RC's state model consists of the following states: Idle, Initializing, Ready, Preparing, Prepared, Executing and Evaluating.
While in the Idle state, the RC may receive an 'initialize' event, whose 'controller' event data is used to update the data model. The RC then transitions into the Initializing state.
In the Initializing state, the RC creates a variable in the Datamodel Resource: the variable name corresponds to the name in the RC's data model, and the variable value is set to the value of the RC's data model expr, if this is defined. The field RC then initializes its children: this is modeled as a separate RC (see XXX). When all children are initialized, the RC transitions to the Ready state.
In the Ready state, the field RC can receive a 'checkStatus' event to check whether it can be executed or not. The values of name and cond in its data model are checked: the status is true if name is undefined and the value of cond evaluates to true. The status is returned in a 'checkedStatus' event sent back to the controller RC. If the RC receives a 'prepare' event, it updates includePrompts in its data model using the event data, and transitions to the Preparing state.
In the Preparing state, the field prepares its prompts and grammars. Prompts are prepared only if the includePrompts data is true; otherwise, prompts within the field are not prepared (e.g. field prompts aren't queued following a <reprompt>). Preparation of prompts is modeled as a separate RC (see XXX), as is preparation of grammars (see YYY). These RCs are summarized below.
Prompts are prepared by iterating through the children array. In the iteration, each prompt RC child is sent a 'checkStatus' event. If the prompt child returns true (its cond parameter evaluates to true), then it is added to a 'correct count' list together with its count. Once the iteration is complete, the RC determines the highest count on the 'correct count' list: the highest count among those on the list less than or equal to the current count value. All children on the 'correct count' list whose count is not the highest count are removed. The RC then iterates through the 'correct count' list and sends an 'execute' event to each prompt RC, causing it to be queued on the PromptQueue Resource.
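For example (a sketch; the field name and prompt texts are hypothetical), given the prompts below and a current prompt counter of 3, the prompt with count="2" is selected: the prompts with counts 1 and 2 are the only ones whose count is less than or equal to 3, and 2 is the highest of those counts.
<field name="city">
  <prompt count="1">Which city?</prompt>
  <prompt count="2">Please say the name of a city, for example Stuttgart.</prompt>
  <prompt count="4">You can find a list of supported cities on our web site.</prompt>
</field>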
Grammars are prepared by recursing through the children array and sending each grammar RC child an 'execute' event. The grammar RC then, if appropriate, sends an 'addGrammar' event to the DTMF or ASR Recognizer Resource where the grammar itself, its properties and the field RC is sent as the handler for recognition results.
When prompts and grammars have been prepared, the prompt counter is incremented and the field RC sends a 'prepared' event to its controller with event data indicating its modal status, and then transitions into the Prepared state.
In the Prepared state, the field RC may receive an 'execute' event from its controller. The RC sends an 'execute' event to the PlayAndRecognize RC (6.10.2.2 PlayandRecognize RC), causing any queued prompts to be played and recognition to be initiated. In the event data, the controller is set to this RC, and other data is derived from data model properties. The RC transitions to the Executing state.
In the Executing state, the PlayAndRecognize RC must send recoResults (or error events: noinput, nomatch, error.semantic) to the field RC.
If the field RC receives the recoResults, then it updates its name variable in the Datamodel Resource. The field RC then sends a 'fieldResult' event to its controller indicating that a field result has been received and processed.
If the recoResult is received by the field RC's controller, then the field receives an 'evaluate' event which causes it to transition to the Evaluating state.
In the Evaluating state, the field RC iterates through its children, executing each filled RC: this is modeled by a separate RC (see XXX). When evaluation is complete, the RC sends an 'evaluated' event to its controller and transitions to the Ready state.
The Field RC is defined to receive the following events:
Event | Source | Payload | Description |
initialize | any | controller(M) | |
checkStatus | controller | ||
prepare | controller | includePrompts (M) | |
execute | controller | ||
evaluate | controller |
and the events it sends:
Event | Target | Payload | Description |
initialized | controller | ||
checkedStatus | controller | ||
prepared | controller | ||
fieldResult | controller | ||
evaluated | controller |
Table 37 shows the events sent and received by the field RC to resources and other RCs which define the events.
Event | Source | Target | Description |
create | FieldRC | DataModel | |
assign | FieldRC | DataModel | |
execute | FieldRC | PlayandRecognizeRC | |
recoResult | PlayandRecognizeRC | FieldRC |
The PlayandRecognize RC coordinates media input with Recognizer resources and media output with the PromptQueue Resource.
The following use cases are covered:
Editorial note | |
Open issue: should we remove the possibility for alternating speech and hotword bargein modes within the recognition cycle? |
The PlayandRecognize RC coordinates media input with recognition resources and media output with the PromptQueue Resource on behalf of a form item.
This RC activates prompt queue playback, activates recognition resources, manages bargein behavior and handles results from recognition resources.
The RC is defined in terms of a data model and a state model.
The data model is composed of the following parameters:
The RC's state model consists of the following states: idle, prepare recognition resources, start playing, playing prompts with bargein, playing prompts without bargein, start recognizing with timer, waiting for input, waiting for speech result and update results. The complexity of this model is partially a consequence of supporting the relationship between hotword bargein and recognition result processing.
While in the idle state, the RC may receive an 'execute' event, whose event data is used to update the data model. The event information includes: controller, inputmodes, inputtimeout, dtmfProps, asrProps and maxnbest. The RC then transitions to the prepare recognition resources state.
In the prepare recognition resources state, the RC sends 'prepare' events to the ASR and DTMF recognition resources. Both events specify this RC as the controller parameter, while the properties parameter differs. In this state, the RC can receive 'prepared' or 'notPrepared' events from either recognition resource. If neither resource returns a 'prepared' event, then activeGrammars is false (i.e. no active DTMF or speech grammar) and the RC sends an 'error.semantic' event to the controller and exits. If at least one resource returns a 'prepared' event, then the RC moves into the start playing state.
The start playing state begins by sending the PromptQueue resource a 'play' event. The PromptQueue responds with a 'playDone' event if there are no prompts in the prompt queue; as a result, this RC moves into the start recognizing with timer state. If there is at least one prompt in the queue, the PromptQueue sends this RC a 'playStarted' event whose data contains the bargein and bargeintype values for the first prompt, and the input timeout value for the last prompt in the queue. The data model is updated with this information.
Editorial note | |
Open issue: PromptQueue Resource doesn't currently have playStarted event. If we don't add playStarted event, then is there a better way to get the bargein, bargeintype, and timeout information from the prompts in the PromptQueue? |
Editorial note | |
Open Issue: The event "bargeinChange" as a one way notification could pose a problem, as it takes finite time for recognizer to suspend or resume. This might work better if PromptQueue Resource waited for an event "bargeinChangeAck" (or similar) from PlayandRecognize RC before starting the next play. PlayandRecognize RC will send the event "bargeinChangeAck" after it completed suspend or resume action on the recognizer. |
In the playing without bargein state, recognition is suspended if it has been previously activated (the recoActive parameter of the data model tracks this). Suspending recognition is conditional on the value of the 'inputmodes' data parameter; if 'dtmf' is in inputmodes, then DTMF recognition is suspended; if 'voice' is in inputmodes, then ASR recognition is suspended. In this state, the PromptQueue can report to this RC changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event with the value 'hotword' or 'speech' causes the data model parameter 'bargein' to be set to 'true' and the 'bargeintype' parameter to be updated with the event data value. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
In the playing with bargein state, recognition is activated if it has not been previously activated (determined by the recoActive parameter in the data model). Activating recognition is conditional on the value of the 'inputmodes' data parameter; if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. In this state, the PromptQueue can report changes in bargein and bargeintype as prompts are played: a 'bargeintypeChange' event whose event data value is not 'unbargeable' causes the data model 'bargeintype' parameter to be updated with the event data ('hotword' or 'speech'); a 'bargeintypeChange' event whose event data value is 'unbargeable' causes the data model 'bargein' parameter to be set to false and the RC transitions to the playing without bargein state. If the PromptQueue resource sends a 'playDone' event, then the data model markname and marktime parameters are updated and the RC transitions to the start recognizing with timer state.
Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine whether the recognition result is positive or negative.
Editorial note | |
Further details on recognition processing are to be added in later versions. The recoResults data parameter is updated with the recognition results (truncated to maxnbest). A speech result is positive if and only if there is at least one result whose confidence level is equal to or greater than the recognition confidence level; otherwise the result is negative. DTMF results are always positive. The recoListener data parameter is defined as the listener associated with the best result if the result is positive. |
In the start recognizing with timer state, an input timer is activated for the value of the inputtimeout data parameter and, if recognition is not already active (determined by the recoActive data parameter), recognition is activated. Recognition activation is conditional on the value of the 'inputmodes' data parameter: if 'dtmf' is in inputmodes, then DTMF recognition is activated; if 'voice' is in inputmodes, then ASR recognition is activated. The RC then transitions into the waiting for input state.
In the waiting for input state, the RC waits for user input. If it receives a 'timerExpired' event, then the RC sends a 'stop' event to all recognition resources, sends a 'noinput' event to its controller, and exits. Recognition handling in this state depends upon the bargeintype data parameter. If the bargeintype is 'speech' and a recognizer sends an 'inputStarted' event, then the RC transitions to the waiting for speech result state. If the bargeintype is 'hotword', then recognition results are processed within this state. In particular, if a recognition resource sends a 'recoResults' event, then its event data is processed to determine whether the recognition result is positive or negative. If positive, the RC cancels the timer and transitions to the update results state. If negative, the RC sends a 'listen' event to the recognition resource which sent the 'recoResults' event.
In the waiting for speech result state, the RC waits for a 'recoResult' event whose data is used to update the recoResults data parameter and to set the recoListener data parameter if the recognition result is positive. The RC then transitions to the update results state.
In the update results state, the RC sends 'assign' events to the data model resource so that the lastresult object in application scope is updated with the recognition results as well as markname and marktime information. If the recoListener data parameter is defined, then the RC sends a 'recoResult' event to the recognition listener RC; otherwise, it sends a 'nomatch' event to its controller. The RC then exits.
Editorial note | |
Open issue: Behavior if one reco resource sends 'inputStarted' but the other sends 'recoResults'? Race conditions between recognizers returning results? (This problem is inherent in the presence of two recognizers. For the sake of clear semantics, we could allow only one recognizer to respond with 'inputStarted' and 'recoResults', with the other recognizer always 'stopped'. A better choice might be to have only one recognizer that handles both DTMF and speech, since semantically the two recognizers are very similar.) |
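The state and event flow described above can be summarized as a state machine. The following SCXML-style sketch is non-normative and included only as an illustration: the state names and the 'play', 'playStarted', 'playDone', 'bargeintypeChange', 'inputStarted', 'recoResults', 'recoResult', and 'timerExpired' events are taken from this section, while the SCXML representation itself, the 'PromptQueue' send target, and the condition expression are assumptions of the sketch rather than part of the PlayandRecognize RC definition.

<!-- Non-normative sketch of the PlayandRecognize RC states described above. -->
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       initial="startPlaying">
  <state id="startPlaying">
    <onentry>
      <!-- Ask the PromptQueue resource to start playing queued prompts. -->
      <send event="play" target="PromptQueue"/>
    </onentry>
    <!-- Empty queue: go straight to recognition with the input timer. -->
    <transition event="playDone" target="startRecognizingWithTimer"/>
    <!-- Prompts queued: branch on the first prompt's bargein value. -->
    <transition event="playStarted" cond="_event.data.bargein"
                target="playingWithBargein"/>
    <transition event="playStarted" target="playingWithoutBargein"/>
  </state>
  <state id="playingWithoutBargein">
    <transition event="bargeintypeChange" target="playingWithBargein"/>
    <transition event="playDone" target="startRecognizingWithTimer"/>
  </state>
  <state id="playingWithBargein">
    <!-- With bargeintype 'speech', a final result is awaited elsewhere. -->
    <transition event="inputStarted" target="waitingForSpeechResult"/>
    <transition event="playDone" target="startRecognizingWithTimer"/>
  </state>
  <state id="startRecognizingWithTimer">
    <!-- Start the input timer, activate recognition, then wait for input. -->
    <transition target="waitingForInput"/>
  </state>
  <state id="waitingForInput">
    <transition event="timerExpired" target="noinputExit"/>
    <transition event="inputStarted" target="waitingForSpeechResult"/>
    <transition event="recoResults" target="updateResults"/>
  </state>
  <state id="waitingForSpeechResult">
    <transition event="recoResult" target="updateResults"/>
  </state>
  <state id="updateResults">
    <transition target="done"/>
  </state>
  <final id="noinputExit"/>
  <final id="done"/>
</scxml>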
The PlayandRecognize RC is defined to receive the following events:
Event | Source | Payload | Sequencing | Description |
execute | any | controller (M), inputmodes (O), inputtimeout (O), dtmfProps (M), recoProps (M), maxnbest (O) | | |
and the events it sends:
Event | Target | Payload | Sequencing | Description |
recoResult | any | results (M) | one-of: nomatch, noinput, error.*, recoResult | |
nomatch | controller | | one-of: nomatch, noinput, error.*, recoResult | |
noinput | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.semantic | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.badfetch.grammar | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.noresource | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.builtin | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.format | controller | | one-of: nomatch, noinput, error.*, recoResult | |
error.unsupported.language | controller | | one-of: nomatch, noinput, error.*, recoResult | |
The events in Table 40 are sent by the PlayandRecognize RC to resources which define the events.
Event | Target | Payload | Sequencing | Description |
play | PromptQueue | | | |
halt | PromptQueue | | | |
prepare | Recognizer | | | |
listen | Recognizer | | | |
suspend | Recognizer | | | |
stop | Recognizer | | | |
The events in Table 41 are received by this RC. Their definition is provided by the sending component.
Event | Source | Payload | Sequencing | Description |
playStarted | PromptQueue | bargein (O), bargeintype (O), inputtimeout (O) | pq:play notification | |
playDone | PromptQueue | markname (O), marktime (O) | pq:play response | |
bargeinChange | PromptQueue | bargein (M) | | |
bargeintypeChange | PromptQueue | bargeintype (M) | | |
prepared | Recognizer | | prepare positive response | |
notPrepared | Recognizer | | prepare negative response | |
inputStarted | Recognizer | | | |
recoResult | Recognizer | results (M), listener (O) | | |
VoiceXML 3.0 modules can be combined to form one or more language profiles.
[Profiles are motivated on the basis of identified common use cases. Pull in general motivation from XHTML, SMIL, etc. ]
This specification defines the following profiles:
[Other profiles may be standardized at a later stage; for example, a profile which includes all VoiceXML 3.0 modules. Suggestions welcome.]
Developers can create their own profiles by modifying an existing profile or by combining modules to create a new profile.
[A profile is defined as follows: ]
Editorial note | |
The name of this profile may change. |
[Motivation: tutorial, PoC, transitional, and that vxml3 is a superset of VoiceXML 2.1.]
The VoiceXML 2.1 profile is included to demonstrate how profiles are defined in VoiceXML 3.0. Using existing elements from the [VOICEXML21] specification is helpful because the semantics of these elements are already well defined and well understood. Thus any changes in how they are presented result from the module and profile style of VoiceXML 3.0 and from making their precise, detailed semantics more explicit and formal.
The VoiceXML 2.1 profile also plays a transitional role, as VoiceXML 3.0 as a whole is built on top of VoiceXML 2.1. VoiceXML 3.0 is a superset of VoiceXML 2.1: it includes the traditional 2.1 functionality plus some new modules. The VoiceXML 2.1 profile is the set of modules that were always present in VoiceXML 2.1 but that were not expressed in that specification as individual modules. This also provides a clear path for the VoiceXML application developer: applications authored in version 2.1 of VoiceXML will continue to work, and developers will not need to learn new syntax or semantics when they develop in the VoiceXML 2.1 profile of VoiceXML 3.0.
The VoiceXML 2.1 profile also serves as a proof of concept that the new modular, profile-based method of describing the specification is not limiting. VoiceXML 3.0 in its entirety is neither limited nor constrained by the use of profiles, modules, and formalized semantic models: anything that was standardized in VoiceXML 2.1 can be standardized in this new format, and the VoiceXML 2.1 profile demonstrates that.
This profile uses the prompt module (6.4 Prompt Module) extended with the Builtin SSML module (6.5 Builtin SSML Module) and the foreach module (6.8 Foreach Module).
[Motivation: vxml as common interface to media server in telecom, NGN, IMS, etc. Key need is to expose media processing functionality, both simple and advanced, in a Application control and flow are typically handled outside VoiceXML: invoke 'play and collect/record/verify' functionality and return results. ]
[Not included: ]
[Issues: should ECMAScript/data capability be included? Efficient re-use of cached vxml scripts, with data fed in and results out, argues for this ...]
This profile uses the prompt module (6.4 Prompt Module) extended with the media module (6.6 Media Module), the foreach module (6.8 Foreach Module) and, optionally, the parseq module (6.7 Parseq Module).
A VoiceXML interpreter context needs to fetch VoiceXML documents, and other resources, such as media files, grammars, scripts, and XML data. Each fetch of the content associated with a URI is governed by the following attributes:
fetchtimeout | The interval to wait for the content to be returned before throwing an error.badfetch event. The value is a Time Designation. If not specified, a value derived from the innermost fetchtimeout property is used. |
---|---|
fetchhint | Defines when the interpreter context should retrieve content from the server. prefetch indicates a file may be downloaded when the page is loaded, whereas safe indicates a file that should only be downloaded when actually needed. If not specified, a value derived from the innermost relevant fetchhint property is used. |
maxage | Indicates that the document is willing to use content whose age is no greater than the specified time in seconds (cf. 'max-age' in HTTP 1.1 [RFC2616]). The document is not willing to use stale content, unless maxstale is also provided. If not specified, a value derived from the innermost relevant maxage property, if present, is used. |
maxstale | Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [RFC2616]). If maxstale is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified number of seconds. If not specified, a value derived from the innermost relevant maxstale property, if present, is used. |
When content is fetched from a URI, the fetchtimeout attribute determines how long to wait for the content (starting from the time when the resource is needed), and the fetchhint attribute determines when the content is fetched. The caching policy for a VoiceXML interpreter context utilizes the maxage and maxstale attributes and is explained in more detail below.
The fetchhint attribute, in combination with the various fetchhint properties, is merely a hint to the interpreter context about when it may schedule the fetch of a resource. Telling the interpreter context that it may prefetch a resource does not require that the resource be prefetched; it only suggests that the resource may be prefetched. However, the interpreter context is always required to honor the safe fetchhint.
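As a non-normative illustration, a document might combine these attributes on elements that fetch resources (the URIs below are invented for the example, and the attribute syntax shown is that of the VoiceXML 2.1 profile):

<!-- Wait at most 10 seconds for the document; accept a cached copy up to 60 seconds old. -->
<goto next="http://example.com/menu.vxml" fetchtimeout="10s" maxage="60"/>
<!-- Hint that this audio clip may be fetched before it is actually needed. -->
<audio src="http://example.com/welcome.wav" fetchhint="prefetch"/>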
When transitioning from one dialog to another, through either a <subdialog>, <goto>, <submit>, <link>, or <choice> element, there are additional rules that affect interpreter behavior. If the referenced URI names a document (e.g. "doc#dialog"), or if query data is provided (through POST or GET), then a new document is obtained (either from a local cache, an intermediate cache, or an origin Web server). When it is obtained, the document goes through its initialization phase (i.e., obtaining and initializing a new application root document if needed, initializing document variables, and executing document scripts). The requested dialog (or the first dialog if none is specified) is then initialized and execution of the dialog begins.
Generally, if a URI reference contains only a fragment (e.g., "#my_dialog"), then no document is fetched, and no initialization of that document is performed. However, <submit> always results in a fetch, and if a fragment is accompanied by a namelist attribute there will also be a fetch.
Another exception is when a URI reference in a leaf document references the application root document. In this case, the root document is transitioned to without fetching and without initialization even if the URI reference contains an absolute or relative URI (see 4.4.2.2 Application Root and [RFC2396]). However, if the URI reference to the root document contains a query string or a namelist attribute, the root document is fetched.
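For instance (a non-normative sketch using VoiceXML 2.1 profile syntax; document and dialog names are invented):

<!-- Fragment-only reference: no fetch and no document initialization. -->
<goto next="#confirm_order"/>
<!-- Document reference: the document is obtained and initialized, even if it is the current one. -->
<goto next="order.vxml#confirm_order"/>
<!-- submit always results in a fetch, even for a fragment-only URI. -->
<submit next="#confirm_order" namelist="drink size"/>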
Elements that fetch VoiceXML documents also support the following additional attribute:
fetchaudio | The URI of the audio clip to play while the fetch is being done. If not specified, the fetchaudio property is used, and if that property is not set, no audio is played during the fetch. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. |
---|
The fetchaudio attribute is useful for enhancing a user experience when there may be noticeable delays while the next document is retrieved. This can be used to play background music, or a series of announcements. When the document is retrieved, the audio file is interrupted if it is still playing. If an error occurs retrieving fetchaudio from its URI, no badfetch event is thrown and no audio is played during the fetch.
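A non-normative example (URIs invented for illustration):

<!-- Play hold music while the next document is being fetched. -->
<goto next="http://example.com/results.vxml"
      fetchaudio="http://example.com/hold_music.wav"/>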
The VoiceXML interpreter context, like [HTML] visual browsers, can use caching to improve performance in fetching documents and other resources; audio recordings (which can be quite large) are as common to VoiceXML documents as images are to HTML pages. In a visual browser it is common to include end user controls to update or refresh content that is perceived to be stale. This is not the case for the VoiceXML interpreter context, since it lacks equivalent end user controls. Thus enforcement of cache refresh is at the discretion of the document through appropriate use of the maxage and maxstale attributes.
The caching policy used by the VoiceXML interpreter context must adhere to the cache correctness rules of HTTP 1.1 ([RFC2616]). In particular, the Expires and Cache-Control headers must be honored. The following algorithm summarizes these rules and represents the interpreter context behavior when requesting a resource:
The "maxstale check" is:
Note: it is an optimization to perform a "get if modified" on a document still present in the cache when the policy requires a fetch from the server.
The maxage and maxstale properties are allowed to have no default value whatsoever. If the value is not provided by the document author, and the platform does not provide a default value, then the value is undefined and the 'Otherwise' clause of the algorithm applies. All other properties must provide a default value (either as given by the specification or by the platform).
While the maxage and maxstale attributes are drawn from and directly supported by HTTP 1.1, some resources may be addressed by URIs that name protocols other than HTTP. If the protocol does not support the notion of resource age, the interpreter context shall compute a resource's age from the time it was received. If the protocol does not support the notion of resource staleness, the interpreter context shall consider the resource to have expired immediately upon receipt.
VoiceXML allows the author to override the default caching behavior for each use of each resource (except for any document referenced by the <vxml> element's application attribute: there is no markup mechanism to control the caching policy for an application root document).
Each resource-related element may specify maxage and maxstale attributes. Setting maxage to a non-zero value can be used to get a fresh copy of a resource that may not have yet expired in the cache. A fresh copy can be unconditionally requested by setting maxage to zero.
Using maxstale enables the author to state that an expired copy of a resource, that is not too stale (according to the rules of HTTP 1.1), may be used. This can improve performance by eliminating a fetch that would otherwise be required to get a fresh copy. It is especially useful for authors who may not have direct server-side control of the expiration dates of large static files.
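For example (non-normative; resource names invented):

<!-- Request an unconditionally fresh copy of a grammar that changes frequently. -->
<grammar src="http://example.com/stock_symbols.grxml" maxage="0"/>
<!-- Accept an expired copy of a large static audio file if it is no more than one hour stale. -->
<audio src="http://example.com/daily_briefing.wav" maxstale="3600"/>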
Prefetching is an optional feature that an interpreter context may implement to obtain a resource before it is needed. A resource that may be prefetched is identified by an element whose fetchhint attribute equals "prefetch". When an interpreter context does prefetch a resource, it must ensure that the resource fetched is precisely the one needed. In particular, if the URI is computed with an expr attribute, the interpreter context must not move the fetch up before any assignments to the expression's variables. Likewise, the fetch for a <submit> must not be moved prior to any assignments of the namelist variables.
The expiration status of a resource must be checked on each use of the resource and, if its fetchhint attribute is "prefetch", then it is prefetched. The check must follow the caching policy specified in 8.1.2 Caching.
Properties are used to set values that affect platform behavior, such as the recognition process, timeouts, caching policy, etc.
The following types of properties are defined: speech recognition (8.2.1 Speech Recognition Properties), DTMF recognition (8.2.2 DTMF Recognition Properties), prompt and collect (8.2.3 Prompt and Collect Properties), media (8.2.4 Media Properties), fetching (8.2.5 Fetch Properties) and miscellaneous (8.2.6 Miscellaneous Properties) properties.
Editorial note | |
Open issue: should the specification provide specific default values rather than platform-specific? Open issue: Should we add a 'type' column for all properties? |
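In the VoiceXML 2.1 profile, properties are set with the <property> element, and a value set on an enclosing element applies to the elements it contains unless overridden at a lower level. The following sketch is non-normative; the form, field, and grammar names are invented for the example:

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-level defaults. -->
  <property name="confidencelevel" value="0.6"/>
  <property name="timeout" value="5s"/>
  <form id="order">
    <field name="drink">
      <!-- Field-level override: this field tolerates lower-confidence results. -->
      <property name="confidencelevel" value="0.4"/>
      <prompt>What would you like to drink?</prompt>
      <grammar src="drinks.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>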
The following generic speech recognition properties are defined.
Name | Description | Default |
---|---|---|
confidencelevel | The speech recognition confidence level, a float value in the range of 0.0 to 1.0. Results are rejected (a nomatch event is thrown) when application.lastresult$.confidence is below this threshold. A value of 0.0 means minimum confidence is needed for a recognition, and a value of 1.0 requires maximum confidence. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
sensitivity | Set the sensitivity level. A value of 1.0 means that it is highly sensitive to quiet input. A value of 0.0 means it is least sensitive to noise. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
speedvsaccuracy | A hint specifying the desired balance between speed vs. accuracy. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. The value is a Real Number Designation (see 8.4 Value Designations). | 0.5 |
completetimeout | The length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or throwing a nomatch event). The complete timeout is used when the speech is a complete match of an active grammar. By contrast, the incomplete timeout is used when the speech is an incomplete match to an active grammar. A long complete timeout value delays the result completion and therefore makes the computer's response slow. A short complete timeout may lead to an utterance being broken up inappropriately. Reasonable complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. Although platforms must parse the completetimeout property, platforms are not required to support the behavior of completetimeout. Platforms choosing not to support the behavior of completetimeout must so document and adjust the behavior of the incompletetimeout property as described below. | platform-dependent |
incompletetimeout | The required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is rejected (with a nomatch event). The incomplete timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the complete timeout is used when the speech is a complete match to an active grammar and no further words can be spoken. A long incomplete timeout value delays the result completion and therefore makes the computer's response slow. A short incomplete timeout may lead to an utterance being broken up inappropriately. The incomplete timeout is usually longer than the complete timeout to allow users to pause mid-utterance (for example, to breathe). See 8.3 Speech and DTMF Input Timing Properties Platforms choosing not to support the completetimeout property (described above) must use the maximum of the completetimeout and incompletetimeout values as the value for the incompletetimeout. The value is a Time Designation (see 8.4 Value Designations). | undefined? |
maxspeechtimeout | The maximum duration of user speech. If this time elapses before the user stops speaking, the event "maxspeechtimeout" is thrown. The value is a Time Designation (see 8.4 Value Designations). | platform-dependent |
The following generic DTMF recognition properties are defined.
Name | Description | Default |
---|---|---|
interdigittimeout | The inter-digit timeout value to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | platform-dependent |
termtimeout | The terminating timeout to use when recognizing DTMF input. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | 0s |
termchar | The terminating DTMF character for DTMF input recognition. See 8.3 Speech and DTMF Input Timing Properties. | # |
The following properties are defined to apply to the fundamental platform prompt and collect cycle.
Name | Description | Default |
---|---|---|
bargein | The bargein attribute to use for prompts. Setting this to true allows bargein by default. Setting it to false disallows bargein. | true |
bargeintype | Sets the type of bargein to be speech or hotword. | platform-specific |
timeout | The time after which a noinput event is thrown by the platform. The value is a Time Designation (see 8.4 Value Designations). See 8.3 Speech and DTMF Input Timing Properties. | platform-dependent |
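A non-normative example of these properties at form level (prompt text and grammar name invented):

<form id="main_menu">
  <!-- Callers may interrupt prompts, but only by a full hotword match. -->
  <property name="bargein" value="true"/>
  <property name="bargeintype" value="hotword"/>
  <!-- Throw noinput if the caller provides no input for 7 seconds. -->
  <property name="timeout" value="7s"/>
  <field name="department">
    <prompt>Say sales, support, or billing.</prompt>
    <grammar src="departments.grxml" type="application/srgs+xml"/>
  </field>
</form>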
The following properties are defined to apply to output media.
Name | Description | Default |
---|---|---|
outputmodes | Determines which modes may be used for media output. The value is a space separated list of media types (see media 'type' in TBD). This property is typically used with container file formats, such as "video/3gpp", which support storage of multiple media types. For example, to play both audio and video to the remote connection, the property would be set to "audio video". To play only the video, the property is set to "video". If the value contains a media type which is not supported by the platform, the connection or the value of the <media> element | The default value depends on the negotiated media between the local and remote devices. It is the space separated list of media types specified in the session.connection.media array elements' type property where the associated direction property is sendrecv or recvonly. |
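For example, a document that should never present video, regardless of what was negotiated for the connection, could set (non-normative):

<!-- Restrict media output to audio only. -->
<property name="outputmodes" value="audio"/>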
The following properties pertain to the fetching of new documents and resources.
Note that maxage and maxstale properties may have no default value - see 8.1.2 Caching.
Name | Description | Default |
---|---|---|
audiofetchhint | This tells the platform whether or not it can attempt to optimize dialog interpretation by pre-fetching audio. The value is either safe to say that audio is only fetched when it is needed, never before; or prefetch to permit, but not require the platform to pre-fetch the audio. | prefetch |
audiomaxage | Tells the platform the maximum acceptable age, in seconds, of cached audio resources. | platform-specific |
audiomaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached audio resources. | platform-specific |
documentfetchhint | Tells the platform whether or not documents may be pre-fetched. The value is either safe (the default), or prefetch. | safe |
documentmaxage | Tells the platform the maximum acceptable age, in seconds, of cached documents. | platform-specific |
documentmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached documents. | platform-specific |
grammarfetchhint | Tells the platform whether or not grammars may be pre-fetched. The value is either prefetch (the default), or safe. | prefetch |
grammarmaxage | Tells the platform the maximum acceptable age, in seconds, of cached grammars. | platform-specific |
grammarmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached grammars. | platform-specific. |
objectfetchhint | Tells the platform whether the URI contents for <object> may be pre-fetched or not. The values are prefetch, or safe. | prefetch |
objectmaxage | Tells the platform the maximum acceptable age, in seconds, of cached objects. | platform-specific |
objectmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached objects. | platform-specific |
scriptfetchhint | Tells whether scripts may be pre-fetched or not. The values are prefetch (the default), or safe. | prefetch |
scriptmaxage | Tells the platform the maximum acceptable age, in seconds, of cached scripts. | platform-specific |
scriptmaxstale | Tells the platform the maximum acceptable staleness, in seconds, of expired cached scripts. | platform-specific. |
fetchaudio | The URI of the audio to play while waiting for a document to be fetched. The default is not to play any audio during fetch delays. There are no fetchaudio properties for audio, grammars, objects, and scripts. The fetching of the audio clip is governed by the audiofetchhint, audiomaxage, audiomaxstale, and fetchtimeout properties in effect at the time of the fetch. The playing of the audio clip is governed by the fetchaudiodelay, and fetchaudiominimum properties in effect at the time of the fetch. | undefined |
fetchaudiodelay | The time interval to wait at the start of a fetch delay before playing the fetchaudio source. The value is a Time Designation (see 8.4 Value Designations). The default interval is platform-dependent, e.g. "2s". The idea is that when a fetch delay is short, it may be better to have a few seconds of silence instead of a bit of fetchaudio that is immediately cut off. | platform-specific |
fetchaudiominimum | The minimum time interval to play a fetchaudio source, once started, even if the fetch result arrives in the meantime. The value is a Time Designation (see 8.4 Value Designations). The default is platform-dependent, e.g., "5s". The idea is that once the user does begin to hear fetchaudio, it should not be stopped too quickly. | platform-specific |
fetchtimeout | The timeout for fetches. The value is a Time Designation (see 8.4 Value Designations). | platform-specific |
The following miscellaneous properties are defined.
Name | Description | Default |
---|---|---|
inputmodes | This property determines which input modality to use. The input modes to enable: dtmf and voice. On platforms that support both modes, inputmodes defaults to "dtmf voice". To disable speech recognition, set inputmodes to "dtmf". To disable DTMF, set it to "voice". One use for this would be to turn off speech recognition in noisy environments. Another would be to conserve speech recognition resources by turning them off where the input is always expected to be DTMF. This property does not control the activation of grammars. For instance, voice-only grammars may be active when the inputmode is restricted to DTMF. Those grammars would not be matched, however, because the voice input modality is not active. | ??? |
universals | Platforms may optionally provide platform-specific universal command grammars, such as "help", "cancel", or "exit" grammars, that are always active (except in the case of modal input items) and which generate specific events. Production-grade applications often need to define their own universal command grammars, e.g., to increase application portability or to provide a distinctive interface. They specify new universal command grammars with <link> elements. They turn off the default grammars with this property. Default catch handlers are not affected by this property. The value "none" is the default, and means that all platform default universal command grammars are disabled. The value "all" turns them all on. Individual grammars are enabled by listing their names separated by spaces; for example, "cancel exit help". | none |
maxnbest | This property controls the maximum size of the "application.lastresult$" array; the array is constrained to be no larger than the value specified by 'maxnbest'. This property has a minimum value of 1. | 1 |
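A non-normative example combining these properties (form and grammar names invented):

<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Enable only the platform's help and cancel universal command grammars. -->
  <property name="universals" value="help cancel"/>
  <!-- Keep up to three candidate interpretations in application.lastresult$. -->
  <property name="maxnbest" value="3"/>
  <form id="pin">
    <!-- DTMF-only input for this form, conserving speech recognition resources. -->
    <property name="inputmodes" value="dtmf"/>
    <field name="code">
      <prompt>Please enter your four digit PIN.</prompt>
      <grammar src="pin.grxml" type="application/srgs+xml" mode="dtmf"/>
    </field>
  </form>
</vxml>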
The various timing properties for speech and DTMF recognition work together to define the user experience. The ways in which these different timing parameters function are outlined in the timing diagrams below. In these diagrams, the wait for DTMF input or user speech begins at the time the last prompt finishes playing.
DTMF grammars use timeout, interdigittimeout, termtimeout and termchar as described in 8.2.2 DTMF Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.
The timeout parameter determines when the noinput event is thrown because the user has failed to enter any DTMF. Once the first DTMF has been entered, this parameter has no further effect.
In the following diagram, the interdigittimeout determines when the nomatch event is thrown because a DTMF grammar is not yet recognized, and the user has failed to enter additional DTMF.
The example below shows the situation when a DTMF grammar could terminate, or extend by the addition of more DTMF input, and the user has elected not to provide any further input.
In the example below, the termchar is non-empty and is entered by the user before an interdigittimeout expires, to signify that the user's DTMF input is complete; the termchar is not included as part of the recognized value.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is expected. Since termchar is empty, there is no optional terminating character permitted, thus the recognition ends and the recognized value is returned.
In the example below, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. If the termchar is non-empty, then the user can enter an optional termchar DTMF. If the user fails to enter this optional DTMF within termtimeout, the recognition ends and the recognized value is returned. If the termtimeout is 0s (the default), then the recognized value is returned immediately after the last DTMF allowed by the grammar, without waiting for the optional termchar. Note: the termtimeout applies only when no additional input is allowed by the grammar; otherwise, the interdigittimeout applies.
In this example, the entry of the last DTMF has brought the grammar to a termination point at which no additional DTMF is allowed by the grammar. Since the termchar is non-empty, the user enters the optional termchar within termtimeout causing the recognized value to be returned (excluding the termchar).
While waiting for the first or additional DTMF, three different timeouts may determine when the user's input is considered complete. If no DTMF has been entered, the timeout applies; if some DTMF has been entered but additional DTMF is valid, then the interdigittimeout applies; and if no additional DTMF is legal, then the termtimeout applies. At each point, the user may enter DTMF which is not permitted by the active grammar(s). This causes the collected DTMF string to be invalid. Additional digits will be collected until either the termchar is pressed or the interdigittimeout has elapsed. A nomatch event is then generated.
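The following non-normative sketch shows these DTMF timing properties applied to a single field (prompt text and grammar name invented):

<field name="account">
  <!-- Allow 3 seconds between digits; wait up to 2 seconds for an optional '#'. -->
  <property name="interdigittimeout" value="3s"/>
  <property name="termtimeout" value="2s"/>
  <property name="termchar" value="#"/>
  <prompt>Enter your account number, then press the pound key.</prompt>
  <grammar src="account_digits.grxml" type="application/srgs+xml" mode="dtmf"/>
</field>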
Speech grammars use timeout, completetimeout, and incompletetimeout as described in 8.2.3 Prompt and Collect Properties and 8.2.1 Speech Recognition Properties to tailor the user experience. The effects of these are shown in the following timing diagrams.
In the example below, the timeout parameter determines when the noinput event is thrown because the user has failed to speak.
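For example, a field could shorten the noinput timeout and handle the resulting event (non-normative; prompt text and grammar name invented):

<field name="city">
  <!-- noinput is thrown if the caller has not spoken 4 seconds after the prompt ends. -->
  <property name="timeout" value="4s"/>
  <prompt>Which city are you calling from?</prompt>
  <grammar src="cities.grxml" type="application/srgs+xml"/>
  <catch event="noinput">
    <prompt>Sorry, I did not hear you.</prompt>
    <reprompt/>
  </catch>
</field>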
Several VoiceXML parameter values follow the conventions used in the W3C's Cascading Style Sheet Recommendation [CSS2].
Integers are specified in decimal notation only. Integers may be preceded by a "-" or "+" to indicate the sign.
An integer consists of one or more digits "0" to "9".
This version of VoiceXML was written with the participation of members of the W3C Voice Browser Working Group. The work of the following members has significantly facilitated the development of this specification:
The W3C Voice Browser Working Group would like to thank the W3C team, especially Kazuyuki Ashimura and Matt Womer, for their invaluable administrative and technical support.