Copyright ©2000 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
Multimodal browsers allow users to interact via a combination of modalities, for instance, speech recognition and synthesis, displays, keypads and pointing devices. The Voice Browser working group is interested in adding multimodal capabilities to voice browsers. This document sets out a prioritized list of requirements for multimodal dialog interaction, which any proposed markup language (or extension thereof) should address.
This specification is a Working Draft of the Voice Browser Working Group for review by W3C members and other interested parties. This is the first public version of this document. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Working Drafts as other than "work in progress".
Publication as a Working Draft does not imply endorsement by the W3C membership, nor by the members of the Voice Browser Working Group.
This document has been produced as part of the W3C Voice Browser Activity, but should not be taken as evidence of consensus in the Voice Browser Working Group. The goals of the Voice Browser Working Group (members only) are discussed in the Voice Browser Working Group charter (members only). This document is for public review. Comments should be sent to the public mailing list <www-voice@w3.org> (archive).
A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.
NOTE: Italicized green comments are merely that - comments. They are for use during discussions but will be removed as appropriate.
The document addresses multimodal dialog interaction. Multimodal, as defined in this document, means one or more speech modes:
together with one or more of the following modes:
The focus is on multimodal dialog where there is a small screen and keypad (e.g. a cell phone) or a small screen, keypad and pointing device (e.g. a palm computer with cellular connection to the Web). This document is agnostic about where the browser(s) and speech and language engines are running - e.g. they could be running on the device itself, on a server or a combination of the two.
The document addresses applications where both speech input and speech output can be available. Note that this includes applications where speech input and/or speech output may be deselected due to environment/accessibility needs.
The document does not specifically address universal access, i.e. the issue of rendering the same pages of markup to devices with different capabilities (e.g. PC, phone or PDA). Rather, the document addresses a markup language that allows an author to write an application that uses spoken dialog interaction together with other modalities (e.g. a visual interface).
The activities of the Multimodal Requirements Subgroup will be coordinated with the activities of other subgroups within the W3C Voice Browser Working Group and with other related W3C working groups. Where possible, the specification will reuse standard visual, multimedia and aural markup languages; see reuse of standard markup requirement (4.1).
The markup language will be scalable across devices with a range of capabilities, in order to meet the needs of consumer and device-control applications. This includes devices capable of supporting:
The server must be able to get access to client capabilities and the user's personal preferences, see reuse of standard markup requirement (4.1).
The markup language should be easy for designers to understand and author without special tools or knowledge of vendor technology or protocols (multimodal dialog design knowledge is still essential).
A characteristic of speech input is that it can be very efficient - for example, in a device with a small display and keypad, speech can bypass multiple layers of menus. A characteristic of speech output is its serial nature, which can make it a long-winded way of presenting information that could be quickly browsed on a display.
The markup will allow an author to use the different characteristics of the modalities in the most appropriate way for the application.
The markup language will allow speech output to have different content from that of simultaneous output in other media. This requirement is related to the simultaneous output requirements (3.3 and 3.4).
In a speech plus GUI system, the author will be able to choose different text for simultaneous verbal and visual outputs. For example, a list of options may be presented on screen and simultaneous speech output does not necessarily repeat them (which is long-winded) but can summarize them or present an instruction or warning.
The markup language will allow, in a given dialog state, the set of actions that can be performed using speech input to differ from the simultaneous actions that can be performed with other input modalities. This requirement is related to the simultaneous input requirements (2.3 and 2.4).
Consider a speech plus GUI system, where speech and touch screen input are available simultaneously. The application can be authored such that, in a given dialog state, there are more actions available via speech than via the touch screen. For example, the screen displays a list of flights and the user can bypass the options available on the display and say "show me later flights".
The markup will be designed such that an author can write applications where the synchronization of the various modalities is seamless from the user's point of view. That is, an action in one modality results in a synchronous change in another. For example:
See minimally required synchronization points (4.7.1) and finer grained synchronization points (4.7.2).
See also multimodal input requirements (2.2, 2.3, 2.4) and multimodal output requirements (3.2, 3.3, 3.4).
The markup language will provide the ability to mark the language of a document.
The markup language will support rendering of multi-lingual documents - i.e. where there is a mixed-language document. For example, English and French speech output and/or input can appear in the same document - a spoken system response can be "John read the book entitled 'Viva La France'."
This is really a general requirement for voice dialog, rather than a multimodal requirement. We may move this to the dialog document.
The markup language can specify which spoken user input is interpreted by the voice browser.
The markup language specifies that speech and user input from other modalities are to be interpreted by the browser. There is no requirement that the input modalities are simultaneously active. In a particular dialog state, there is only one input mode available, but over the whole interaction more than one input mode is used. Inputs from different modalities are interpreted separately. For example, a browser can interpret speech input in one dialog state and keyboard input in another.
The granularity is defined by things like input events. Synchronization does not occur at any finer granularity. When the user takes some action, only one mode of input will be available at that time. See requirement 4.7.1 - minimally required synchronization points.
Examples:
The markup language specifies that speech and user input from other modalities are to be interpreted by the browser and that input modalities are simultaneously active. There is no requirement that interpretation of the input modalities is coordinated (i.e. interpreted together). In a particular dialog state, there is more than one input mode available but only input from one of the modalities is interpreted (e.g. the first input - see 2.13 Resolve conflicting input requirement). For example, a voice browser in a desktop environment could accept either keyboard input or spoken input in the same dialog state.
The granularity is defined by things like input events. Synchronization does not occur at any finer granularity. When the user takes some action, it can be in one of several input modes - only one mode of input will be accepted by the browser. See requirement 4.7.1 - minimally required synchronization points.
Examples:
The markup language specifies that speech and user input from other modalities are allowed at the same time and that interpretation of the inputs is coordinated. In a particular dialog state, there is more than one input mode available and input from multiple modalities is interpreted (e.g. within a given time window). When the user takes some action, it can be composed of inputs from several modalities - for example, a voice browser in a desktop environment could accept keyboard input and spoken input together in the same dialog state.
Examples:
See also 2.11 Composite Meaning requirement, 2.13 Resolve conflicting input requirement.
The markup language will support the following input modes, in addition to speech:
DTMF will be supported using the dialog markup specified by the W3C Voice Browser Working Group's dialog requirements.
Character and pointing input will be supported using other markup languages together with scripting (e.g. html with Javascript).
See reuse standard markup requirement (4.1).
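As a minimal illustration of how pointing input might be captured with existing markup plus scripting, the HTML fragment below records a selection made with a pointing device in a form field; the form name, field name and flight data are invented for illustration only.

    <!-- Illustrative HTML: a pointing selection is captured with a standard
         onclick handler and stored in a hidden form field for later use. -->
    <form name="flights" action="/book" method="post">
      <input type="hidden" name="choice" value=""/>
      <ul>
        <li onclick="document.flights.choice.value='BA117'">BA117 London - New York 09:40</li>
        <li onclick="document.flights.choice.value='BA175'">BA175 London - New York 14:25</li>
      </ul>
      <input type="submit" value="Book"/>
    </form>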
The markup language will support other input modes, including:
The model will be abstract enough that any new or exotic input medium (e.g. gesture captured by video) can fit into it.
The markup language should support semantic tokens that are generated by UI components other than speech. These tokens can be considered in a similar way to action tags and speech grammars. For example, in a pizza application, if a topping can be selected from an option list on the screen, the author can declare that the semantic token 'topping' can be generated by a GUI component.
The markup language should support a modality-independent method of representing the meaning of user input. This should be annotated with a record of the modality type. This is related to the XForms requirement (4.3) and to the work on Natural Language within the W3C Voice activity.
The markup language supports the same semantic representation of input from different modalities. For example, in a pizza application, if a topping can be selected from an option list on the screen or by speaking, the same semantic token, e.g. 'topping', can be used to represent the input.
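A hypothetical sketch of this idea follows; none of the element or attribute names below are defined by this document. The intent is only to show a visual option list and a speech grammar that fill the same semantic token 'topping'.

    <!-- Hypothetical sketch only: the same token is produced whether the
         topping is chosen from the on-screen list or spoken. -->
    <field token="topping">
      <visual>
        <select name="topping">
          <option value="mushroom">Mushroom</option>
          <option value="pepperoni">Pepperoni</option>
        </select>
      </visual>
      <speech grammar="toppings.gram"/>
    </field>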
The markup language coordinates the grammars for modalities other than speech with speech grammars to avoid duplication of effort in authoring multimodal grammars.
It must be possible to combine multimodal inputs to form a composite meaning. This is related to the coordinated simultaneous multimodal input requirement (2.4). For example, the user points at Bristol on a map and says "Give me directions from London to here". The formal representations of the meaning of each input need to be combined to produce a composite meaning: "Give me directions from London to Bristol". See also Semantics of input generated by UI components other than speech (2.8) and Modality independent semantic representation (2.9).
The markup language supports specification of timing information to determine whether input from multiple modalities should combine to form an integrated semantic representation. See coordinated multimodal input requirement (2.4). This could, for example, take the form of a time window which is specified in the markup, where input events from different modalities that occur within this window are combined into one semantic entity.
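The following hypothetical sketch shows one way such a time window might be expressed; the 'combine-within' attribute and the element names are invented here purely to illustrate the requirement.

    <!-- Hypothetical sketch only: speech and pointing events arriving within
         two seconds of each other are merged into one semantic entity,
         e.g. "give me directions from London to Bristol". -->
    <input combine-within="2s">
      <speech grammar="directions.gram"/>
      <pointing target="map" token="destination"/>
    </input>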
The markup language will support the detection of conflicting input from several modalities. For example, in a speech + GUI interface, there may be simultaneous but conflicting speech and mouse inputs; the markup language should allow the conflict to be detected so that an appropriate action can be taken. Consider a music application: the user says "play Madonna" while entering "Elvis" in an artist text box on screen; an application might resolve this by asking "Did you mean Madonna or Elvis?". This is related to the uncoordinated simultaneous multimodal input requirement (2.3) and the coordinated simultaneous multimodal input requirement (2.4).
The markup language should allow features of the display to indicate a context for voice interaction. For example:
Interpretation of the input must provide enough information to the natural language system to be able to resolve speech input that refers to items in the visual context. For example: the screen is displaying a list of possible flights that match a user's requirements and the user says "I'll take the third one".
All input events will be time-stamped, in addition to the time stamping covered by the Dialog Requirements. This includes, for example, time-stamping speech, key press and pointing events. For finer grained synchronization, time stamping at the start and the end of each word within speech may be needed.
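A sketch of the kind of time-stamped event record this requirement implies is shown below; the format and attribute names are illustrative, not normative.

    <event type="speech" start="12:03:04.210" end="12:03:05.890">
      give me directions from London to here
    </event>
    <event type="pointing" target="map" time="12:03:05.120" x="212" y="88"/>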
The markup language can specify the content rendered as spoken output by the voice browser.
The markup language specifies that content is rendered in speech and other media types. There is no requirement that the output media are rendered simultaneously. For example, a browser can output speech in one dialog state and graphics in another.
The granularity is defined by things like input events. Synchronization does not occur at any finer granularity. When the user takes some action - either spoken or by pointing, for example - a response is rendered in one of the output media - either visual or voice, for example. See requirement 4.7.1 - minimally required synchronization points.
Examples:
The markup language specifies that content is rendered in speech and other media at the same time (i.e. in the same dialog state). There is no requirement that the rendering of the output media is coordinated (i.e. synchronized) any further. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard.
The granularity of the synchronization for this requirement is coarser than for the coordinated simultaneous output requirement (3.4). The granularity is defined by things like input events. When the user takes some action - either spoken or by pointing, for example - something happens with the visual and the voice channels but there is no further synchronization at a finer granularity than that. I.e., a browser can output speech and graphics in one dialog state, but the two outputs are not synchronized in any other way. See requirement 4.7.1 - minimally required synchronization points.
Examples:
The markup language specifies that content is to be simultaneously rendered in speech and other media and that output rendering is further coordinated (i.e. synchronized). The granularity is defined by things that happen within the response to a given user input - see 4.7.2 Finer grained synchronization points. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard.
Examples:
See also Synchronization of Multimedia with voice input requirement (3.5).
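The SMIL fragment below suggests the kind of coordinated output this requirement targets: a spoken prompt plays while a map is displayed, and a highlight image appears two seconds into the prompt. The file names are illustrative.

    <smil>
      <body>
        <par>
          <audio src="directions-prompt.wav"/>
          <img src="route-map.png" dur="10s"/>
          <img src="highlight-bristol.png" begin="2s" dur="8s"/>
        </par>
      </body>
    </smil>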
The markup language specifies that media output and voice input are synchronized. The granularity is defined by: things that happen within the response to a given user input, e.g. play a video and 30 seconds after it has started activate a speech grammar; things that happen within a speech input, e.g. detect the start of a spoken input and 5 seconds later play a video. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard. See Coordinated simultaneous multimedia output requirement (3.4); 4.7.2 Finer grained synchronization points.
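A sketch of this kind of coordination, combining SMIL-style timing with a hypothetical <listen> element (no such element is defined by SMIL or by this document), might look as follows.

    <par>
      <video src="tour.mpg"/>
      <!-- Hypothetical: the speech grammar becomes active 30 seconds after
           the video starts playing. -->
      <listen grammar="tour-questions.gram" begin="30s"/>
    </par>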
The markup language will have clear temporal semantics so that it can be integrated into the SMIL multimedia framework. Multimedia frameworks are characterized by precise temporal synchronization of output and input. For example, the SMIL notation is based on timing primitives that allow the composition of complex behaviors. See Synchronization of Multimedia with voice input requirement (3.5) and the coordinated simultaneous multimodal output requirement (3.4).
The markup language will support visual output of text, using other markup languages such as HTML or WML (see reuse of standard markup requirement, 4.1). For example, the following may be presented as text on the display:
Example 1:
Example 2:
The markup language supports output defined in other W3C Voice Browser Working Group specifications - for example, recorded audio (Speech Synthesis Requirements). See reuse of standard markup requirement (4.1).
The markup language supports output of media objects supported by SMIL (animation, audio, img, video, text, textstream), using other markup languages (see reuse of standard markup requirement, 4.1).
The markup language supports output of the following media, using other markup languages (see reuse of standard markup requirement, 4.1).
The markup language will be extensible to support new output media types (e.g. 3D graphics).
The markup language should support a media-independent method of representing the meaning of output. For example, the output could be represented in a frame format and rendered in speech or on the display by the browser. This is related to the XForms requirement (4.3).
Visual output will be renderable on displays of different sizes. This should be achieved using standard visual markup languages (e.g. HTML, CHTML, WML) where appropriate; see reuse of standard markup requirement (4.1).
This requirement applies to two kinds of visual markup:
The markup language supports the identification of the display window. This is to support applications where there is more than one window.
All output events will be time-stamped, in addition to the time stamping covered by the Dialog Requirements. This includes time-stamping the start and the end of a speech event. For finer grained synchronization, time stamping at the start and the end of each word within speech may be needed.
Where possible, the specification must reuse standard visual, multimedia and aural markup languages, including:
The specification should avoid unnecessary differences with these markup languages.
In addition, the markup will be compatible with the W3C's work on Client Capabilities and Personal Preferences (CC/PP).
The results of the work should mesh with the modular architecture proposed for XHTML, where different markup modules are expected to cohabit and inter-operate gracefully within an overall XHTML container.
As part of this goal the design should be capable of incorporating multiple visual and aural markup languages.
The markup language should be compatible with the W3C's work on XForms.
Related to requirements: media independent representation of output (3.12) and media independent representation of input (2.11).
The markup language will allow identification of the modalities available. This will allow an author to identify that a given modality is/is not present and as a result switch to a different dialog. E.g. there is a visible construct that an author can query. This can be used to provide for accessibility requirements and for environmental factors (e.g. noise). The availability of input and output modalities can be controlled by the user or by the system. The extent to which the functionality is retained when modalities are not available is the responsibility of the author.
The following is a list of use cases regarding a multimodal document that specifies speech and GUI input and output. The document could be designed such that:
Note that this is a requirement on the system and not on the markup language. For example, when there is temporarily high background noise, the application may disable speech input and output but enable them again when the noise lessens. This is a requirement for an event handling mechanism.
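A purely illustrative sketch of the kind of author-queryable construct mentioned above follows; the element names and the test expression are invented and carry no normative weight.

    <!-- Hypothetical: choose a voice dialog or a visual-only fallback
         depending on whether speech input is currently available. -->
    <if test="available('speech-input')">
      <prompt>Say the name of the city you are flying to.</prompt>
    <else/>
      <display>Please type the destination city.</display>
    </if>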
The markup language should support loosely coupled documents, where separate markup streams for each modality are synchronized at well-defined points. For example, separate voice and visual markup streams could be synchronized at the following points: visiting a form, following a link.
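The sketch below illustrates the idea of loosely coupled streams; the HTML side uses standard markup, while the voice side and the 'sync' attribute that ties the two streams together at the form boundary are hypothetical.

    <!-- visual.html : the visual stream (standard HTML) -->
    <form id="flight-query" action="/search" method="get">
      Destination: <input name="destination"/>
    </form>

    <!-- voice stream (hypothetical markup): synchronized with the visual
         stream only when the "flight-query" form is entered. -->
    <form sync="visual.html#flight-query">
      <prompt>Which city are you flying to?</prompt>
      <speech grammar="cities.gram" token="destination"/>
    </form>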
The markup language should support tightly coupled documents. Tightly coupled documents have document elements for each interaction modality interspersed in the same document. That is, a tightly coupled document contains sub-documents from different interaction modalities (e.g. HTML and voice markup) and has been authored to achieve explicit synchrony across the interaction streams.
Tightly coupled documents should be viewed as an optimization of the loosely-coupled approach, and should be defined by describing a reversible transformation from a tightly-coupled document to multiple loosely-coupled documents. For example, a tightly coupled document that includes HTML and voice markup sub-documents should be transformable to a pair of documents, where one is HTML only and the other is voice markup only - see transformation requirement (4.6.3).
The markup language should be designed such that tightly coupled documents are transformable into documents for specific interaction modalities by standard tree transformations, as expressible in XSLT. Conversely, tightly coupled documents should be viewed as a simple transformation applied to the individual sub-documents, with the transformation playing the role of tightly coupling the sub-documents into a single document.
This requirement will ensure content re-use, keep implementation of multimodal browsers manageable and provide for accessibility requirements.
It is important to note that not all of the interaction information from the tightly coupled document may be preserved. If, for example, you have a speech + GUI design, when you take out the GUI, the application is not necessarily equivalently usable. It is up to the author to decide whether the speech document has all the information that the speech plus GUI document has. Depending on how the author created the multimodal document, the transformation could be entirely lossy, could degrade gracefully by preserving some information from the GUI, or could preserve all information from the GUI. If the author's intent is that the application should be usable in the presence or absence of either modality, it is the author's responsibility to design the application to achieve this.
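As an illustration of the kind of standard tree transformation this requirement has in mind, the XSLT sketch below (assuming, purely for illustration, that the voice sub-document lives in its own namespace) copies a tightly coupled document and drops the voice elements, leaving an HTML-only document.

    <?xml version="1.0"?>
    <!-- Illustrative only: strips elements in a hypothetical voice namespace. -->
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:v="http://example.org/hypothetical-voice-markup">

      <!-- Copy everything by default. -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- Drop any element from the voice namespace. -->
      <xsl:template match="v:*"/>

    </xsl:stylesheet>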
The markup language should minimally enable synchronization across different modalities at well known interaction points in today's browsers, for example, entering and exiting specific interaction widgets:
For example:
See multimedia output requirements (3.2, 3.3 and 3.4) and multimodal input requirements (2.2, 2.3 and 2.4).
The markup language should support finer-grained synchronization. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard.
For example:
Synchronization points include:
See 3.4 coordinated simultaneous multimodal output requirement.
The notion of synchronization points (or navigation signposts) is important; they should also be tied into a discussion of what canonical browser functions like "back", "undo" and "forward" mean, and what they mean for the global state of the multimodal browser. The notion of 'back' is unclear in a voice context.
The markup language must support a generic component interface to allow for the use of external components on the client and/or server side. The interface provides a mechanism for transferring data between the markup language's variables and the component. Examples of such data are: semantic representations of user input (such as attribute-value pairs); the URL of markup for different modalities (e.g. the URL of an HTML page). The markup language also supports the Interaction with External Components requirement of the W3C Voice Browser Dialog Requirements (Requirement 2.10).
Examples of external components are components for interaction modalities other than speech (e.g. an HTML browser) and server scripts. Server scripts can be used to interact with remote services, devices or databases.
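A hypothetical sketch of such an interface is shown below; the element and attribute names are invented, and the URL is a placeholder, purely to illustrate passing data between the markup language's variables and an external server script.

    <!-- Hypothetical: send the value of the "destination" variable to a
         server script and store the result in the "flightList" variable. -->
    <component src="http://example.com/cgi-bin/flights" type="server-script">
      <send name="destination"/>
      <receive name="flightList"/>
    </component>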
The following people participated in the multimodal subgroup of the Voice Browser working group and contributed to this document.