The W3C Voice Browser working group aims to develop specifications to enable access to the Web using spoken interaction. This document is part of a set of requirements studies for voice browsers, and provides a model architecture for processing speech within voice browsers.
This document describes a model architecture for speech processing in voice browsers as an aid to understanding requirements. Related requirements drafts are linked from the introduction. The requirements are being released as working drafts, but they are not intended to become proposed recommendations.
This specification is a Working Draft of the Voice Browser working group for review by W3C members and other interested parties. This is the first public version of this document. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".
Publication as a Working Draft does not imply endorsement by the W3C membership, nor by members of the Voice Browser Working Group.
This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group. This document is for public review. Comments should be sent to the public mailing list <email@example.com> (archive) by 14th January 2000.
A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.
To assist in clarifying the scope of charters of each of the several subgroups of the W3C Voice Browser Working Group, a representative or model architecture for a typical voice browser application has been developed. This architecture illustrates one possible arrangement of the main components of a typical system, and should not be construed as a recommendation. Other proposed architectures for spoken language systems are currently available, and may also be compatible with voice browsers, for example the DARPA Communicator architecture.
Connections between components have been shown explicitly in the interest of clearly indicating the flow of information among the processes (and thereby the interaction of the W3C subgroups). Each of the existing subgroups (Universal Access, Speech Synthesis, Grammar Representation, Natural Language, and Dialog) is represented in this architecture. New subgroups are being formed and may contribute additional elements to this architecture in future drafts.
The design is intended to be agnostic with respect to client, proxy, or server implementation of the various components, although in practice some components will naturally fall into client or server roles in relation to other components (indeed, some components can be both clients to some components and servers to other components). An open-agent architecture, where component connections are implicit, could be used for actual implementation of such a system where components can migrate to client and/or server roles as necessary to fulfill their duties. The model architecture is designed to accommodate synchronized multi-modal input and multi-media output.
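The open-agent style mentioned above, in which connections between components are implicit, can be illustrated with a small sketch. The `Broker` and `Component` classes and the capability names below are hypothetical, not part of any specification; the point is that a component addresses a *capability* rather than a fixed peer, so the same component can serve some requests while acting as a client for others.

```python
class Broker:
    """Hypothetical broker: routes requests by capability name, so
    components need not know whether a peer is local, a proxy, or remote."""
    def __init__(self):
        self.registry = {}

    def register(self, capability, component):
        self.registry[capability] = component

    def dispatch(self, capability, message):
        return self.registry[capability].handle(message)


class Component:
    """A component registers the capability it serves, and may itself
    issue requests -- i.e., it can be both server and client."""
    def __init__(self, broker, capability):
        self.broker = broker
        broker.register(capability, self)

    def handle(self, message):
        raise NotImplementedError

    def request(self, capability, message):
        return self.broker.dispatch(capability, message)


# Illustrative use: a dialog component that serves "dialog" requests
# while acting as a client of a synthesis component.
broker = Broker()

class Synthesizer(Component):
    def handle(self, message):
        return "audio:" + message

class DialogManager(Component):
    def handle(self, message):
        return self.request("synthesize", "hello")

Synthesizer(broker, "synthesize")
DialogManager(broker, "dialog")
```

Under this arrangement a component could migrate between client and server roles simply by changing what it registers and requests, without any peer being rewired.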
The model architecture is shown in Figure 1. Solid (green) boxes indicate system components, peripheral solid (yellow) boxes indicate points of usage for markup language, and dotted peripheral boxes indicate information flows.
Figure 1. System Architecture
Two types of clients are illustrated: telephony and data networking. The fundamental telephony client is, of course, the telephone, either wireline or wireless. The handset telephone requires a PSTN (Public Switched Telephone Network) interface, which can be tip/ring, T1, or higher level, and may include hybrid echo cancellation to remove line echoes so that ASR barge-in can operate over audio output. A speakerphone will also require an acoustic echo canceller to remove room echoes. The data network interface will require only acoustic echo cancellation if used with an open microphone, since there is no line echo on data networks. The IP interface is shown for illustration only; other data transport mechanisms can be used as well.
Once data has passed the client interface, it can be processed in a similar manner. One minor difference may be speech endpointing. For speech input from the telephony interface, endpointing will most likely be performed either in the telephony interface or at the front end of the ASR processor. For speech via the IP interface, endpointing can be performed at the client as well as at the ASR front end. The choice of where endpointing occurs is coupled with the choice of where echo cancellation occurs.
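Wherever endpointing is placed, the basic operation is the same: locate the start and end of speech in the incoming audio. A minimal energy-based sketch, with illustrative threshold and hangover values not taken from the draft, might look like this:

```python
def frame_energy(frame):
    """Mean-square energy of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)

def endpoint(frames, threshold=0.01, hangover=5):
    """Illustrative energy-based endpointer (not from the draft).

    Returns (start, end) frame indices of the detected speech region,
    or None if no frame exceeds the energy threshold. The hangover
    keeps a few trailing frames so final consonants are not clipped.
    """
    voiced = [i for i, f in enumerate(frames) if frame_energy(f) > threshold]
    if not voiced:
        return None
    start = voiced[0]
    end = min(voiced[-1] + hangover, len(frames) - 1)
    return start, end
```

Real endpointers typically add adaptive noise-floor estimation, which is one reason the placement decision interacts with echo cancellation: residual echo raises the apparent noise floor.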
It is currently unclear how non-speech data will be handled at the telephony interface. Such data can include pointing-device input from a "smart phone," address books and other client-resident file data, and eventually even data such as video. These smart telephone devices are now on the drawing boards of many suppliers. Some of this traffic can be handled by WAP/WML, but there are still open issues with regard to multi-modality. Voice markup language specifications should therefore provide means for extending the language features.
Data from the ASR/DTMF (etc.) recognizer must be in a format compatible with the NL (Natural Language) interpreter. Typically this would be text, but might include non-textual components for pointing device input, in which case pointing coordinates can be associated with text and/or semantic tags. If the recognizer has detected valid input while output is still being presented, the recognizer can signal the presentation component to stop output. Barge-in may not be desirable for certain types of multi-media output, and should primarily be considered important for interrupting speech output. In some cases it may also be undesirable to interrupt speech output, such as in the processing of commands to change speaking volume or rate.
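The barge-in conditions described above (interrupt speech output only, and only when interruption is permitted) can be made concrete in a short sketch. The class and hook names here are illustrative assumptions, not part of the architecture itself:

```python
class PresentationManager:
    """Hypothetical presentation component; attribute names are illustrative."""
    def __init__(self):
        self.output_type = None       # e.g. "speech", "video", or None
        self.barge_in_enabled = True  # disabled e.g. during volume/rate commands
        self.stopped = False

    def stop_output(self):
        self.stopped = True
        self.output_type = None


def on_valid_input(presenter):
    """Recognizer-side hook, called when valid input is detected while
    output is still being presented. Interrupts only interruptible speech."""
    if presenter.output_type == "speech" and presenter.barge_in_enabled:
        presenter.stop_output()
        return True    # output was stopped
    return False       # let multi-media or protected speech output continue
```

The two guards correspond to the two caveats in the text: multi-media output is not interrupted, and speech output is protected while commands such as volume or rate changes are being processed.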
The recognizer can produce multiple outputs with associated confidence scores, and the NL interpreter can likewise produce multiple interpretations. Interpreted NL output is coordinated with other modes of input, which may require interpretation in the current NL context or may alter or augment the interpretation of the NL input. The multi-media integration module is responsible for producing possibly multiple coordinated joint interpretations of the multi-modal input and presenting these to the dialog manager. Context information can also be shared with the dialog manager to further refine the interpretation, including resolution of anaphora and implied expressions. The dialog manager is responsible for the final selection of the best interpretation.
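One way to picture this flow is as a ranked list of scored interpretations reaching the dialog manager, which rescores them against dialog context before choosing. The data structure and the context-bonus heuristic below are illustrative assumptions, not anything the draft prescribes:

```python
from dataclasses import dataclass

@dataclass
class Interpretation:
    frame: dict    # e.g. {"intent": "arrive", "city": "Boston"} (illustrative)
    score: float   # combined recognizer/NL confidence

def select_best(interpretations, context):
    """Sketch of the dialog manager's final selection: rescore each joint
    interpretation against the current dialog context, then take the best."""
    def rescore(interp):
        # Favor interpretations matching the intent the dialog expects next;
        # context could equally resolve anaphora or implied expressions.
        matches = interp.frame.get("intent") == context.get("expected_intent")
        return interp.score + (0.1 if matches else 0.0)
    return max(interpretations, key=rescore)
```

The key point the sketch captures is that ambiguity survives all the way to the dialog manager: earlier stages hand over alternatives rather than committing prematurely.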
The dialog manager is also responsible for responding to the input statement. This responsibility can include resolving ambiguity, issuing instructions and/or queries to the task manager, collecting output from the task manager, forming a natural language expression or visual presentation of the task manager's output, and coordinating recognizer context.
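These responsibilities can be strung together into one hypothetical turn of processing. All names here are illustrative: `task_manager` stands for the query interface to the task manager, and `render` stands for handing a formed response to the presentation side.

```python
def handle_turn(interpretation, task_manager, render):
    """Sketch of one dialog-manager turn (names are illustrative)."""
    if interpretation is None or interpretation.get("ambiguous"):
        # Resolve ambiguity by asking a clarifying question instead of acting.
        return render("Did you mean the first or the second option?")
    result = task_manager(interpretation)          # issue query, collect output
    return render("Found %d matching flights." % result)  # form NL response
```

A fuller dialog manager would also update the recognizer's active grammar/context for the next turn, which this sketch omits.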
The task manager is primarily an Application Program Interface (API), but it can also include pragmatic and application-specific reasoning. The task manager can be an agent or proxy, can possess state, and can communicate with other agents or proxies for services. The primary application interface for the task manager is expected to be web servers, but other APIs can be used as well.
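A task manager acting as a stateful proxy might be sketched as follows. The class is hypothetical, and the `backend` callable is a stub standing in for a web server or other service; it is not a real HTTP client.

```python
class TaskManager:
    """Illustrative task-manager proxy (not from the draft): holds
    per-session state and forwards dialog-manager requests to a backend."""
    def __init__(self, backend):
        self.backend = backend   # callable standing in for a web-server fetch
        self.history = []        # state possessed by the proxy

    def query(self, request):
        self.history.append(request)
        raw = self.backend(request)
        # Pragmatic, application-specific reasoning could post-process
        # or combine results from several agents/proxies here.
        return raw
```

Keeping state in the proxy is what lets the task manager consult prior requests, or delegate parts of a request to other agents, without the dialog manager needing to know.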
Finally, the presentation manager, or output media "renderer," has responsibility for formatting multi-media output in a coordinated manner. The presentation manager should be aware of the client device capabilities.