W3C

Multimodal Requirements
for Voice Markup Languages

W3C Working Draft 10 July 2000

This version:
http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710
Latest version:
http://www.w3.org/TR/multimodal-reqs
Editors:
Marianne Hickey, Hewlett Packard

Abstract

Multimodal browsers allow users to interact via a combination of modalities, for instance, speech recognition and synthesis, displays, keypads and pointing devices. The Voice Browser working group is interested in adding multimodal capabilities to voice browsers. This document sets out a prioritized list of requirements for multimodal dialog interaction, which any proposed markup language (or extension thereof) should address.

Status of this document

This specification is a Working Draft of the Voice Browser working group for review by W3C members and other interested parties. This is the first public version of this document. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".

Publication as a Working Draft does not imply endorsement by the W3C membership, nor by members of the Voice Browser working groups.

This document has been produced as part of the W3C Voice Browser Activity, but should not be taken as evidence of consensus in the Voice Browser Working Group. The goals of the Voice Browser Working Group (members only) are discussed in the Voice Browser Working Group charter (members only). This document is for public review. Comments should be sent to the public mailing list <www-voice@w3.org> (archive).

A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.

NOTE: Italicized green comments are merely that - comments. They are for use during discussions but will be removed as appropriate.

Scope

The document addresses multimodal dialog interaction. Multimodal, as defined in this document, means one or more speech modes (speech input and/or speech output) used together with one or more other modes, such as a visual display, keypad, keyboard or pointing device.

The focus is on multimodal dialog where there is a small screen and keypad (e.g. a cell phone) or a small screen, keypad and pointing device (e.g. a palm computer with cellular connection to the Web). This document is agnostic about where the browser(s) and speech and language engines are running - e.g. they could be running on the device itself, on a server or a combination of the two.

The document addresses applications where both speech input and speech output can be available. Note that this includes applications where speech input and/or speech output may be deselected due to environment/accessibility needs.

The document does not specifically address universal access, i.e. the issue of rendering the same pages of markup to devices with different capabilities (e.g. PC, phone or PDA). Rather, the document addresses a markup language that allows an author to write an application that uses spoken dialog interaction together with other modalities (e.g. a visual interface).

Interaction with Other Groups

The activities of the Multimodal Requirements Subgroup will be coordinated with the activities of other sub-groups within the W3C Voice Browser Working Group and other related W3C working groups. Where possible, the specification will reuse standard visual, multimedia and aural markup languages; see the reuse of standard markup requirement (4.1).

1. General Requirements

1.1 Scalable across end user devices (must address)

The markup language will be scalable across devices with a range of capabilities, in order to sufficiently meet the needs of consumer and device control applications. This includes devices capable of supporting:

  1. audio I/O plus keypad input - e.g. a plain phone with speech plus DTMF, or an MP3 player with speech input and output and a cellular connection to the Web;
  2. audio, keypad and small screen - e.g. WAP phones, smart phones with displays;
  3. audio, soft keyboard, small screen and pointing - e.g. palm-top personal organizers with cellular connection to the Web.
  4. audio, keyboard, full screen and pointing - e.g. desktop PC, information kiosk.

The server must be able to access client capabilities and the user's personal preferences; see the reuse of standard markup requirement (4.1).

1.2 Easy to implement (must address)

The markup language should be easy for designers to understand and author without special tools or knowledge of vendor technology or protocols (multimodal dialog design knowledge is still essential).

1.3 Complementary use of modalities

A characteristic of speech input is that it can be very efficient - for example, in a device with a small display and keypad, speech can bypass multiple layers of menus. A characteristic of speech output is its serial nature, which can make it a long-winded way of presenting information that could be quickly browsed on a display.

The markup will allow an author to use the different characteristics of the modalities in the most appropriate way for the application.

1.3.1 Output media (must address)

The markup language will allow speech output to have different content from that of simultaneous output from other media. This requirement is related to the simultaneous output requirements (3.3 and 3.4).

In a speech plus GUI system, the author will be able to choose different text for simultaneous verbal and visual outputs. For example, a list of options may be presented on screen and simultaneous speech output does not necessarily repeat them (which is long-winded) but can summarize them or present an instruction or warning.
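
As an informal illustration only (the element names <output>, <display> and <speech> used here are hypothetical and are not taken from any existing specification), a document fragment might associate different content with the two media as follows:

  <output>
    <display>
      <ul>
        <li>09:00 London to Boston</li>
        <li>11:30 London to Boston</li>
        <li>14:15 London to Boston</li>
      </ul>
    </display>
    <speech>
      Three flights match your request. Please select one from the list.
    </speech>
  </output>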

1.3.2 Input modalities (must address)

The markup language will allow, in a given dialog state, the set of actions that can be performed using speech input to be different from the set of simultaneous actions that can be performed with other input modalities. This requirement is related to the simultaneous input requirements (2.3 and 2.4).

Consider a speech plus GUI system, where speech and touch screen input is available simultaneously. The application can be authored such that, in a given dialog state, there are more actions available via speech than via the touch screen. For example, the screen displays a list of flights and the user can bypass the options available on the display and say "show me later flights".
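
Again purely as an illustrative sketch (the <state>, <visual>, <speech-input> and <phrase> elements are hypothetical), the spoken vocabulary in a dialog state could be authored as a superset of the on-screen choices:

  <state id="choose-flight">
    <visual>
      <a href="#BA212">09:00 BA212</a>
      <a href="#BA216">11:30 BA216</a>
    </visual>
    <speech-input>
      <!-- the spoken vocabulary covers the displayed options plus an extra phrase -->
      <phrase value="BA212">nine o'clock</phrase>
      <phrase value="BA216">eleven thirty</phrase>
      <phrase value="LATER">show me later flights</phrase>
    </speech-input>
  </state>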

1.4 Seamless synchronization of the various modalities (should address)

The markup will be designed such that an author can write applications where the synchronization of the various modalities is seamless from the user's point of view. That is, an action in one modality results in a synchronous change in another. For example:

  1. an end-user selects something using voice and the visual display changes to match;
  2. an end-user specifies focus with a mouse and enters the data with voice - the application knows which field the user is speaking into and therefore what input to expect;

See minimally required synchronization points (4.7.1) and finer grained synchronization points (4.7.2).

See also multimodal input requirements (2.2, 2.3, 2.4) and multimodal output requirements (3.2, 3.3, 3.4).

1.5 Multilingual & international rendering

1.5.1 One language per document (must address)

The markup language will provide the ability to mark the language of a document.

1.5.2 Multiple languages in the same document (nice to address)

The markup language will support rendering of multi-lingual documents - i.e. where there is a mixed-language document. For example, English and French speech output and/or input can appear in the same document - a spoken system response can be "John read the book entitled 'Viva La France'."

This is really a general requirement for voice dialog, rather than a multimodal requirement. We may move this to the dialog document.

2. Input modality requirements

2.1 Audio Modality Input (must address)

The markup language can specify which spoken user input is interpreted by the voice browser.

2.2 Sequential multi-modal Input (must address)

The markup language specifies that speech and user input from other modalities is to be interpreted by the browser. There is no requirement that the input modalities are simultaneously active. In a particular dialog state, there is only one input mode available but in the whole interaction more than one input mode is used. Inputs from different modalities are interpreted separately. For example, a browser can interpret speech input in one dialog state and keyboard input in another.

The granularity is defined by things like input events. Synchronization does not occur at any finer granularity. When the user takes some action, only one mode of input will be available at that time. See requirement 4.7.1 - minimally required synchronization points.

Examples:

  1. In a bank application accessed via a phone, the browser renders the speech "Speak your name", the user must respond in speech and says "Jack Jones", the browser renders the speech "Using the keypad, enter your pin number", the user must enter the number via the keypad.
  2. In an insurance application accessed via a PDA, the browser renders the speech "Please say your postcode", the user must reply in speech and says "BS34 8QZ", the browser renders the speech "I'm having trouble understanding you, please enter your postcode using the soft keyboard." The user must respond using the soft keyboard (i.e. not in speech).
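
Example 1 could be sketched in a VoiceXML-style dialog markup (shown only to illustrate the requirement; the dialog markup itself is defined by the W3C dialog requirements, the element names follow VoiceXML 1.0, and the grammar file name is a placeholder):

  <form id="login">
    <field name="customer_name">
      <prompt>Speak your name.</prompt>
      <grammar src="names.gram"/>
    </field>
    <field name="pin" type="digits">
      <!-- the built-in 'digits' type lets the PIN be entered on the keypad -->
      <prompt>Using the keypad, enter your pin number.</prompt>
    </field>
  </form>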

2.3 Uncoordinated, Simultaneous, Multi-modal Input (must address)

The markup language specifies that speech and user input from other modalities is to be interpreted by the browser and that input modalities are simultaneously active. There is no requirement that interpretation of the input modalities is coordinated (i.e. interpreted together). In a particular dialog state, there is more than one input mode available but only input from one of the modalities is interpreted (e.g. the first input - see requirement 2.13, support for conflicting input from different modalities). For example, a voice browser in a desktop environment could accept either keyboard input or spoken input in the same dialog state.

The granularity is defined by things like input events. Synchronization does not occur at any finer granularity. When the user takes some action, it can be in one of several input modes - only one mode of input will be accepted by the browser. See requirement 4.7.1 - minimally required synchronization points.

Examples:

  1. In a bank application accessed via a phone, the browser renders the speech "Enter your name", the user says "Jack Jones" or enters his name via the keypad, the browser renders the speech "Enter your account number", the user enters the number via the keypad or speaks the account number.
  2. In a music application accessed via a PDA, the user asks to hear clips of new releases, either using speech or by selecting a button on screen. The browser renders a list of titles on screen. The user selects by pointing to the title with the pen or by speaking the title of the track.

2.4 Coordinated, Simultaneous Multi-modal Input (nice to address)

The markup language specifies that speech and user input from other modalities is allowed at the same time and that interpretation of the inputs is coordinated. In a particular dialog state, there is more than one input mode available and input from multiple modalities is interpreted (e.g. within a given time window). When the user takes some action it can be composed of inputs from several modalities - for example, a voice browser in a desktop environment could accept keyboard input and spoken input together in the same dialog state.

Examples:

  1. In a telephony environment, the user can enter 200 on the keypad and say "transfer to checking account", and the interpretations are coordinated so that they are understood as "transfer 200 to checking account".
  2. In a route finding application, the user points at Bristol on a map and says "Give me directions from London to here".

See also 2.11 Composite Meaning requirement, 2.13 Resolve conflicting input requirement.

2.5 Input modes supported (must address)

The markup language will support the following input modes, in addition to speech: DTMF; character input (e.g. keypad, keyboard or soft keyboard); and pointing (e.g. mouse, pen or touch screen).

DTMF will be supported using the dialog markup specified by the W3C Voice Browser Working Group's dialog requirements.

Character and pointing input will be supported using other markup languages together with scripting (e.g. HTML with JavaScript).

See reuse standard markup requirement (4.1).
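
For instance, character and pointing input for a music search could be carried by an ordinary HTML fragment; the handler name submitArtist is a placeholder for whatever scripting mechanism passes the value to the multimodal controller:

  <form name="artist_search">
    <input type="text" name="artist"/>
    <input type="button" value="Search" onclick="submitArtist(this.form)"/>
  </form>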

2.6 Input modes supported (nice to address)

The markup language will support other input modes, including:

2.7 Extensible to new input media types (nice to address)

The model will be abstract enough that any new or exotic input medium (e.g. gesture captured by video) can fit into it.

2.8 Semantics of input generated by UI components other than speech (nice to address)

The markup language should support semantic tokens that are generated by UI components other than speech. These tokens can be considered in a similar way to action tags and speech grammars. For example, in a pizza application, if a topping can be selected from an option list on the screen, the author can declare that the semantic token 'topping' can be generated by a GUI component.
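
A sketch of how this might look, assuming a hypothetical 'semantic' attribute added to a standard HTML control to declare the token it generates:

  <select name="topping" semantic="topping">
    <option value="ham">Ham</option>
    <option value="mushroom">Mushroom</option>
    <option value="pepperoni">Pepperoni</option>
  </select>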

2.9 Modality-independent representation of the meaning of user input (nice to address)

The markup language should support a modality-independent method of representing the meaning of user input. This should be annotated with a record of the modality type. This is related to the XForms requirement (4.3) and to the work on Natural Language within the W3C Voice activity.

The markup language supports the same semantic representation of input from different modalities. For example, in a pizza application, if a topping can be selected from an option list on the screen or by speaking, the same semantic token, e.g. 'topping', can be used to represent the input.
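
For example, whichever modality is used, the interpretation could be delivered in the same (hypothetical) form, annotated with the modality that produced it:

  <!-- the user spoke "mushroom" -->
  <result>
    <token name="topping" value="mushroom" modality="speech"/>
  </result>

  <!-- the same selection made from the on-screen list -->
  <result>
    <token name="topping" value="mushroom" modality="gui"/>
  </result>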

2.10 Coordinate speech grammar with grammar for other input modalities (future revision)

The markup language coordinates the grammars for modalities other than speech with speech grammars to avoid duplication of effort in authoring multimodal grammars.

2.11 Composite meaning (nice to address)

It must be possible to combine multimodal inputs to form a composite meaning. This is related to the coordinated, simultaneous multi-modal input requirement (2.4). For example, the user points at Bristol on a map and says "Give me directions from London to here". The formal representation of the meaning of each input needs to be combined to get a composite meaning - "Give me directions from London to Bristol". See also semantics of input generated by UI components other than speech (2.8) and modality-independent semantic representation (2.9).
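
One possible (entirely hypothetical) representation of the route-finding example, in which the pointing input fills the slot left open by the spoken input:

  <interpretation>
    <input mode="speech" start="12:01:03.2" end="12:01:05.1">
      Give me directions from London to here
    </input>
    <input mode="pointing" time="12:01:04.6" target="map" value="Bristol"/>
    <frame name="directions">
      <slot name="origin">London</slot>
      <slot name="destination">Bristol</slot>
    </frame>
  </interpretation>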

2.12 Time window for coordinated multimodal input (nice to address)

The markup language supports specification of timing information to determine whether input from multiple modalities should combine to form an integrated semantic representation. See coordinated multimodal input requirement (2.4). This could, for example, take the form of a time window which is specified in the markup, where input events from different modalities that occur within this window are combined into one semantic entity.
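
For instance, a markup attribute (the element and attribute names here are illustrative only) might state that inputs arriving within 1.5 seconds of each other are to be integrated into one semantic entity:

  <inputs integrate-within="1500ms">
    <grammar src="directions.gram"/>
    <pointer target="map"/>
  </inputs>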

2.13 Support for conflicting input from different modalities (must address)

The markup language will support the detection of conflicting input from several modalities. For example, in a speech + GUI interface, there may be simultaneous but conflicting speech and mouse inputs; the markup language should allow the conflict to be detected so that an appropriate action can be taken. Consider a music application: the user says "play Madonna" while entering "Elvis" in an artist text box on screen; an application might resolve this by asking "Did you mean Madonna or Elvis?". This is related to the uncoordinated, simultaneous multi-modal input requirement (2.3) and the coordinated, simultaneous multi-modal input requirement (2.4).

2.14 Context for recognizer (nice to address)

The markup language should allow features of the display to indicate a context for voice interaction. For example:

2.15 Resolve spoken reference to display (future revision)

Interpretation of the input must provide enough information to the natural language system to be able to resolve speech input that refers to items in the visual context. For example: the screen is displaying a list of possible flights that match a user's requirements and the user says "I'll take the third one".

2.16 Time stamping (should address)

All input events will be time-stamped, in addition to the time stamping covered by the Dialog Requirements. This includes, for example, time-stamping speech, key press and pointing events. For finer grained synchronization, time stamping at the start and the end of each word within speech may be needed.

3. Output media requirements

3.1 Audio Media Output (must address)

The markup language can specify the content rendered as spoken output by the voice browser.

3.2 Sequential multimedia output (must address)

The markup language specifies that content is rendered in speech and other media types. There is no requirement that the output media are rendered simultaneously. For example, a browser can output speech in one dialog state and graphics in another.

The granularity is defined by things like input events. Synchronization does not occur at any finer granularity. When the user takes some action - either spoken or by pointing, for example - a response is rendered in one of the output media - either visual or voice, for example. See requirement 4.7.1 - minimally required synchronization points.

Examples:

  1. In a speech plus WML banking application, accessed via a WAP phone, the user asks "What's my balance". The browser renders the account balance on the display only. The user clicks OK, the browser renders the response as speech only - "Would you like another service?"...
  2. In a music application accessed via a PDA, the user asks to hear clips of new releases. The browser renders a list of titles on screen, together with the text instruction to select a title to hear the track. The user selects a track by speaking the number. The browser plays the selected track - the screen does not change.

3.3 Uncoordinated, Simultaneous, Multi-media Output (must address)

The markup language specifies that content is rendered in speech and other media at the same time (i.e. in the same dialog state). There is no requirement that the rendering of output media is coordinated (i.e. synchronized) any further. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard.

The granularity of the synchronization for this requirement is coarser than for the coordinated simultaneous output requirement (3.4). The granularity is defined by things like input events. When the user takes some action - either spoken or by pointing, for example - something happens with the visual and the voice channels but there is no further synchronization at a finer granularity than that. I.e., a browser can output speech and graphics in one dialog state, but the two outputs are not synchronized in any other way. See requirement 4.7.1 - minimally required synchronization points.

Examples:

  1. In a cinema-ticket application accessed via a WAP phone, the user asks what films are showing. The browser renders the list of films on the screen and renders an instruction in speech - "Here are today's films. Select one to hear a full description".
  2. A browser in a smart phone environment plays a prompt "Which service do you require?", while displaying a list of options such as "Do you want to: (a) transfer money; (b) get account info; (c) quit."
  3. In a music application accessed via a PDA, the user asks to hear clips of new releases. The browser renders a list of titles on screen, and renders an instruction in speech "Here are the five recommended new releases. Select one to hear a clip". The user selects one by speaking the title. The browser renders the audio clip and, at the same time, displays the price and information about the band. When the track has finished, the user selects a button on screen to return to the list of tracks.

3.4 Coordinated, Simultaneous Multi-media Output (nice to address)

The markup language specifies that content is to be simultaneously rendered in speech and other media and that output rendering is further coordinated (i.e. synchronized). The granularity is defined by things that happen within the response to a given user input - see 4.7.2 Finer grained synchronization points. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard.

Examples:

  1. In a news application, accessed via a PDA, a browser highlights each paragraph of text (e.g. headline) as it renders the corresponding speech.
  2. In a learn-to-read application accessed via a PC, the lips of an animated character are synchronized with speech output, the words are highlighted on screen as they are spoken and pictures are displayed as the corresponding words are spoken (e.g. a cat is displayed as the word cat is spoken).
  3. In a music application accessed via a PDA, the user asks to hear clips of new releases. The browser renders a list of titles on screen, highlights the first and starts playing it. When the first track has finished, the browser highlights the second title on screen and starts playing the second track, and so on.
  4. Display an image 5 seconds after a spoken prompt has started.
  5. Display an image for 5 seconds then render a speech prompt.
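
Examples 4 and 5 can already be expressed with SMIL 1.0 timing; a minimal sketch follows (file names are placeholders and layout/region information is omitted):

  <!-- example 4: image appears 5 seconds after the prompt starts -->
  <par>
    <audio src="prompt.wav"/>
    <img src="picture.png" begin="5s"/>
  </par>

  <!-- example 5: image for 5 seconds, then the prompt -->
  <seq>
    <img src="picture.png" dur="5s"/>
    <audio src="prompt.wav"/>
  </seq>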

See also Synchronization of Multimedia with voice input requirement (3.5).

3.5 Synchronization of multimedia with voice input (nice to address)

The markup language specifies that media output and voice input are synchronized. The granularity is defined by: things that happen within the response to a given user input, e.g. play a video and 30 seconds after it has started activate a speech grammar; things that happen within a speech input, e.g. detect the start of a spoken input and 5 seconds later play a video. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard. See Coordinated simultaneous multimedia output requirement (3.4); 4.7.2 Finer grained synchronization points.
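
As a sketch of this requirement, a SMIL-style parallel block could be extended with a voice-input element; the v:listen element and its namespace URI below are hypothetical, and only the delayed activation relative to the video is the point being illustrated:

  <par>
    <video src="introduction.mpg"/>
    <!-- hypothetical: activate the speech grammar 30 seconds into the video -->
    <v:listen xmlns:v="http://example.com/voice" src="commands.gram" begin="30s"/>
  </par>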

3.6 Temporal semantics for synchronization of voice input and output with multimedia (nice to address)

The markup language will have clear temporal semantics so that it can be integrated into the SMIL multimedia framework. Multimedia frameworks are characterized by precise temporal synchronization of output and input. For example, the SMIL notation is based on timing primitives that allow the composition of complex behaviors. See the synchronization of multimedia with voice input requirement (3.5) and the coordinated, simultaneous multi-media output requirement (3.4).

3.7 Visual output of text (must address)

The markup language will support visual output of text, using other markup languages such as HTML or WML (see reuse of standard markup requirement, 4.1). For example, the following may be presented as text on the display:

Example 1:

Example 2:

3.8 Media supported by other Voice Browsing Requirements (must address)

The markup language supports output defined in other W3C Voice Browser Working Group specifications - for example, recorded audio (Speech Synthesis Requirements). See reuse of standard markup requirement (4.1).

3.9 Media objects supported by SMIL (should address)

The markup language supports output of media objects supported by SMIL (animation, audio, img, video, text, textstream), using other markup languages (see reuse of standard markup requirement, 4.1).

3.10 Other output media (nice to address)

The markup language supports output of the following media, using other markup languages (see reuse of standard markup requirement, 4.1).

3.11 Extensible to new media (nice to address)

The markup language will be extensible to support new output media types (e.g. 3D graphics).

3.12 Media-independent representation of the meaning of output (future revision)

The markup language should support a media-independent method of representing the meaning of output. E.g. the output could be represented in a frame format and rendered in speech or on the display by the browser. This is related to the XForms requirement (4.3).

3.13 Display size (should address)

Visual output will be renderable on displays of different sizes. This should be achieved by using standard visual markup languages (e.g. HTML, CHTML, WML) where appropriate; see the reuse of standard markup requirement (4.1).

This requirement applies to two kinds of visual markup:

3.14 Output to more than one window (future revision)

The markup language supports the identification of the display window. This is to support applications where there is more than one window.

3.15 Time stamping (should address)

All output events will be time-stamped, in addition to the time stamping covered by the Dialog Requirements. This includes time-stamping the start and the end of a speech event. For finer grained synchronization, time stamping at the start and the end of each word within speech may be needed.

4. Architecture, Integration and Synchronization points

4.1 Reuse standard markup languages (must address)

Where possible, the specification must reuse standard visual, multimedia and aural markup languages, including:

The specification should avoid unnecessary differences with these markup languages.

In addition, the markup will be compatible with the W3C's work on Client Capabilities and Personal Preferences (CC/PP).

4.2 Mesh with modular architecture proposed for XHTML (nice to address)

The results of the work should mesh with the modular architecture proposed for XHTML, where different markup modules are expected to cohabit and inter-operate gracefully within an overall XHTML container.

As part of this goal the design should be capable of incorporating multiple visual and aural markup languages.

4.3 Compatibility with W3C work on XForms (nice to address)

The markup language should be compatible with the W3C's work on XForms.

  1. Have an explicit data model for the back end (i.e. the data) and map it to the front end.
  2. Separate the data model from the presentation. The presentation depends on the device modality.
  3. Application data and logic should be modality independent.

Related to requirements: media-independent representation of the meaning of output (3.12) and modality-independent representation of the meaning of user input (2.9).

4.4 Detect that a given modality is available (must address)

The markup language will allow identification of the modalities available. This will allow an author to identify that a given modality is/is not present and as a result switch to a different dialog. E.g. there is a visible construct that an author can query. This can be used to provide for accessibility requirements and for environmental factors (e.g. noise). The availability of input and output modalities can be controlled by the user or by the system. The extent to which the functionality is retained when modalities are not available is the responsibility of the author.

The following is a list of use cases regarding a multimodal document that specifies speech and GUI input and output. The document could be designed such that:

  1. when the speech input error count is high, the user can make equivalent selections via the GUI;
  2. where a user has a speech impairment, speech input can be deselected and the user controls the application via the GUI;
  3. when the user cannot hear a verbal prompt due to a noisy environment (detected, for example, by no response), an equivalent prompt is displayed on the screen;
  4. where a user has a hearing impairment the speech output is deselected and equivalent prompts are displayed.
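
A sketch of the kind of queryable construct intended, using VoiceXML-style conditional elements; the 'modalities.speechinput' property and the form names are hypothetical:

  <if cond="modalities.speechinput == 'available'">
    <goto next="#speech_and_gui"/>
  <else/>
    <goto next="#gui_only"/>
  </if>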

4.5 Means to act on a notification that a modality has become available/unavailable (must address)

Note that this is a requirement on the system and not on the markup language. For example, when there is temporarily high background noise, the application may disable speech input and output but enable them again when the noise lessens. This is a requirement for an event handling mechanism.

4.6 Transformable documents

4.6.1 Loosely coupled documents (nice to address)

The mark-up language should support loosely coupled documents, where separate markup streams for each modality are synchronized at well-defined points. For example, separate voice and visual markup streams could be synchronized at the following points: visiting a form, following a link.

4.6.2 Tightly coupled documents (nice to address)

The mark-up language should support tightly coupled documents. Tightly coupled documents have document elements for each interaction modality interspersed in the same document. I.e. a tightly coupled document contains sub-documents from different interaction modalities (e.g. HTML and voice markup) and has been authored to achieve explicit synchrony across the interaction streams.

Tightly coupled documents should be viewed as an optimization of the loosely-coupled approach, and should be defined by describing a reversible transformation from a tightly-coupled document to multiple loosely-coupled documents. For example, a tightly coupled document that includes HTML and voice markup sub-documents should be transformable to a pair of documents, where one is HTML only and the other is voice markup only - see transformation requirement (4.6.3).

4.6.3 Transformation between tightly and loosely coupled documents by standard tree transformations as expressible in XSLT (nice to address)

The markup language should be designed such that tightly coupled documents are transformable to documents for specific interaction modalities by standard tree transformations as expressible in XSLT. Conversely, tightly coupled documents should be viewed as a simple transformation applied to the individual sub-documents, with the transformation playing the role of tightly coupling the sub-documents into a single document.

This requirement will ensure content re-use, keep implementation of multimodal browsers manageable and provide for accessibility requirements.

It is important to note that all the interaction information from the tightly coupled document may not be preserved. If, for example, you have a speech + GUI design, when you take out the GUI, the application is not necessarily equivalently usable. It is up to the author to decide whether the speech document has all the information that the speech plus GUI document has. Depending on how the author created the multimodal document, the transformation could be entirely lossy, could degrade gracefully by preserving some information from the GUI or could preserve all information from the GUI. If the author's intent is that the application should be usable in the presence or absence of either modality, it is the author's responsibility to design the application to achieve this.
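
For illustration, the visual-only view of a tightly coupled document could be produced by a simple XSLT identity transformation that drops every element in a dedicated voice markup namespace (the namespace URI below is a placeholder, and the assumption is that voice sub-documents are the only content in that namespace):

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:v="http://example.com/voice">
    <!-- copy everything through unchanged -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>
    <!-- drop the voice-only sub-documents -->
    <xsl:template match="v:*"/>
  </xsl:stylesheet>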

4.7 Synchronization points

4.7.1 Minimally required synchronization points (must address)

The markup language should minimally enable synchronization across different modalities at well known interaction points in today's browsers, for example, entering and exiting specific interaction widgets:

For example:

See multimedia output requirements (3.2, 3.3 and 3.4) and multimodal input requirements (2.2, 2.3 and 2.4).

4.7.2 Finer-grained synchronization points (nice to address)

The markup language should support finer-grained synchronization. Where appropriate, synchronization of speech with other output media should be supported with SMIL or a related standard.

For example:

Synchronization points include:

See 3.4 coordinated simultaneous multimodal output requirement.

4.7.3 Co-ordinate synchronization points with the DOM event model (future study)

  1. Synchronization points should be coordinated with the DOM event model. I.e. one possible starting point for a list of such synchronization points would be the event types defined by the DOM, appropriately modified to be modality independent.
  2. Event types defined for multimodal browsing should be integrated into the DOM; as part of this effort, the Voice WG might provide requirements as input to the next level of the DOM specification.

4.7.4 Browser functions and synchronization points (future study)

The notion of synchronization points (or navigation sign posts) is important; it should also be tied into a discussion of what canonical browser functions like "back", "undo" and "forward" mean, and what they mean for the global state of the multimodal browser. The notion of 'back' is unclear in a voice context.

4.8 Interaction with External Components (must have)

The markup language must support a generic component interface to allow for the use of external components on the client and/or server side. The interface provides a mechanism for transferring data between the markup language's variables and the component. Examples of such data are: semantic representations of user input (such as attribute-value pairs); the URL of markup for different modalities (e.g. the URL of an HTML page). The markup language also supports the interaction with external components defined by the W3C Voice Browser Dialog Requirements (requirement 2.10).

Examples of external components are components for interaction modalities other than speech (e.g. an HTML browser) and server scripts. Server scripts can be used to interact with remote services, devices or databases.
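
A sketch of such an interface, borrowing the <object>/<param> style used in VoiceXML 1.0 and HTML; the component identifier, parameter names and variable references are placeholders:

  <object name="flight_map" classid="method://mapserver/show">
    <!-- pass a dialog variable to the external component -->
    <param name="city" expr="document.destination"/>
    <!-- pass the URL of markup for another modality -->
    <param name="html_url" value="http://example.com/map.html"/>
  </object>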

Acknowledgements

The following people participated in the multimodal subgroup of the Voice Browser working group and contributed to this document.