This document describes fundamental requirements for the specifications under development in the W3C Multimodal Interaction Activity. These requirements were derived from use case studies as discussed in Appendix A. They have been developed for use by the Multimodal Interaction Working Group (W3C Members only), but may also be relevant to other W3C working groups and related external standard activities.
The requirements cover general issues, inputs, outputs, architecture, integration, synchronization points, runtimes and deployments, but this document does not address application or deployment conformance rules.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
W3C's Multimodal Interaction Activity is developing specifications for extending the Web to support multiple modes of interaction. This document describes fundamental requirements for multimodal interaction.
This document has been produced as part of the W3C Multimodal Interaction Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Multimodal Interaction Working Group (W3C Members only). This is a Royalty Free Working Group, as described in W3C's Current Patent Practice NOTE. Working Group participants are required to provide patent disclosures.
Please send comments about this document to the public mailing list: firstname.lastname@example.org (public archives). To subscribe, send an email to <email@example.com> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe).
A list of current W3C Recommendations and other technical documents including Working Drafts and Notes can be found at http://www.w3.org/TR/.
Multimodal interactions extend the Web user interface to allow multiple modes of interaction, offering users the choice of using their voice, or an input device such as a key pad, keyboard, mouse or stylus. For output, users will be able to listen to spoken prompts and audio, and to view information on graphical displays. This capability for the user to specify the mode or device for a particular interaction in a particular situation is expected to significantly improve the user interface, its accessibility and reliability, especially for mobile applications. The W3C Multimodal Interaction Working Group (WG) is developing markup specifications for authoring applications synchronized across multiple modalities or devices with a wide range of capabilities.
This document is an internal working draft prepared as part of the discussions on requirements for multimodal interaction specifications.
The work on the present requirements document started from the Multimodal Requirements for Voice Markup Languages public working draft (version 1.0) published by the W3C Voice Activity [MM Req Voice]. The outline of the present document remains very similar.
The present requirements scope the nature of the work and specifications that will be developed by the W3C Multimodal Interaction Working Group (as specified by the charter [MMI Charter]). These intended works may be referred to below as "specification(s)".
The requirements in this document do not express conformance rules on application, platform runtime implementation or deployment.
In this document, the following conventions have been followed when phrasing the requirements:
It is not required that a particular specification produced by the W3C MMI working group address all the requirements in this document. The requirements may be addressed by different specifications, so that all the "MUST specify" requirements are only satisfied by combining the different specifications produced by the W3C Multimodal Interaction Working Group. In such a case, however, it should be possible to clearly indicate which specification addresses which requirements.
To lay the groundwork for the technical requirements, we first discuss an intended frame of reference for a multimodal system, introducing various concepts and terms that will be referred to in the normative sections below. For the reader's convenience, we have collected the concepts and terms introduced in this frame of reference in the glossary.
We are interested in defining the requirements for the design of multimodal systems -- systems that support a user communicating with an application by using different modalities such as voice (in a human language), gesture, handwriting, typing, audio-visual speech, etc. The user may be considered to be operating in a delivery context: a term used to specify the set of attributes that characterizes the capabilities of the access mechanism in terms of device profile, user profile (e.g. identity, preferences and usage patterns) and situation. The user interacts with the application in the context of a session, using one or more modalities (which may be realized through one or more devices). Within a session, the user may suspend and resume interaction with the application within the same modality or switch modalities. A session is associated with a context, which records the interactions with the user.
In multimodal systems, an event is a representation of some asynchronous occurrence of interest to the multimodal system. Examples include mouse clicks, hanging up the phone, speech recognition results or errors. Events may be associated with information about the user interaction, e.g. the location where the mouse was clicked. A typical event source is a user; such events are called input events. An external input event is one not generated by a user, e.g. a GPS signal. The multimodal system may also produce external output events for external systems (e.g. a logging system). In order to preserve temporal ordering, events may be time stamped. Typically, events are formalized as generated by event sources, and associated with event handlers, which subscribe to the event and are notified of its occurrence. This is exemplified by the XML Events model.
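The XML Events model mentioned above can be sketched with a minimal listener declaration. This fragment is purely illustrative (the observer and handler names are hypothetical); it subscribes a handler to "click" input events:

```xml
<!-- A listener subscribes the handler at #handleConfirm to "click"
     events raised on the element with id "confirmButton" -->
<listener xmlns="http://www.w3.org/2001/xml-events"
          event="click"
          observer="confirmButton"
          handler="#handleConfirm"/>
```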
The user typically provides input in one or more modalities, and receives output in one or more modalities. Input may be classified as sequential, simultaneous or composite. Sequential input is input received on a single modality, though that modality can change over time. Simultaneous input is input received on multiple modalities, and treated separately by downstream processes (such as interpretation). Composite input is input received on multiple modalities at the same time and treated as a single, integrated "composite" input by downstream processes. Inputs are combined using the coordination capability of the multimodal system, typically driven by input constraints or decided by the interaction manager.
Input is typically subject to input processing. For instance, speech input may be passed to a speech recognition engine (including, for instance, semantic interpretation) in order to extract meaningful information (e.g. a semantic representation) for downstream processing. Note that simultaneous and composite input may be conflicting, in that the interpretations of the input may not be consistent (e.g. the user says "yes" but clicks on "no").
Two fundamentally different uses of multimodality may be identified: supplementary multimodality and complementary multimodality. An application makes supplementary use of multimodality if it allows every interaction (input or output) to be carried through to completion in each modality as if it were the only available modality. Such an application enables the user to select at any time the modality that is best suited to the nature of the interaction and the user's situation. Conversely, an application makes complementary use of multimodality if interactions in one modality are used to complement interactions in another. (For instance, the application may visually display several options in a form and aurally prompt the user "Choose the city to fly to".) Complementary use may help a particular class of users (e.g. those with dyslexia). Note that in an application supporting complementary use of different modalities, each interaction may not be accessible separately in each modality. Therefore it may not be possible for the user to determine which modality to use. Instead, the document author may prescribe the modality (or modalities) to be used in a particular interaction.
The synchronization behavior of an application describes the way in which any input in one modality is reflected in the output in another modality, as well as the way input is combined across modalities (coordination capability). The synchronization granularity specifies the level at which the application coordinates interactions. The application is said to exhibit event-level synchronization if user inputs in one modality are captured at the level of individual DOM events and immediately reflected in the other modality. The application exhibits field-level synchronization if inputs in one modality are reflected in the other after the user changes focus (e.g. moves from input field to input field) or completes the interaction (e.g. completes a selection in a menu). The application exhibits form-level synchronization if inputs in one modality are reflected in the other only after a particular point in the presentation is reached (e.g. after a certain number of fields have been completed in the form).
The output generated by a multimodal system can take various forms, e.g. audio (including spoken prompts and playback, e.g. using natural language generation, or text-to-speech (TTS) which synthesizes audio), visual (e.g. XHTML or SVG markup rendered on displays), lip synch (multimedia output in which there is a visual rendition of a face whose lip movements are synchronized with the audio), etc. Of relevance here is the W3C Recommendation SMIL 2.0, which enables simple authoring of interactive audiovisual applications and supports media synchronization.
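As an illustration of such synchronized output, a minimal SMIL 2.0 sketch (the media URIs are hypothetical) that plays a spoken prompt while displaying an image:

```xml
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <body>
    <par>
      <!-- audio prompt and visual rendering run in parallel -->
      <audio src="prompt.wav"/>
      <img src="map.svg"/>
    </par>
  </body>
</smil>
```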
Interaction (input, output) between the user and the application may often be conceptualized as a series of dialogs, managed by an interaction manager. A dialog is an interaction between the user and the application which involves turn taking. In each turn, the interaction manager (working on behalf of the application) collects input from the user, processes it (using the session context and possibly external knowledge sources) to determine the intent of the user, computes a response and updates the presentation for the user. An interaction manager relies on strategies to determine focus and intent as well as to disambiguate, correct and confirm sub-dialogs. We typically distinguish directed dialogs (user-driven or application-driven) from mixed-initiative or free-flow dialogs.
The interaction manager may use (1) inputs from the user, (2) the session context, (3) external knowledge sources, and (4) disambiguation, correction, and confirmation sub-dialogs to determine the user's focus and intent. Based on the user's focus and intent, the interaction manager also (1) maintains the context and state of the application, (2) manages the composition of inputs and synchronization across modalities, (3) interfaces with business logic, and (4) produces output for presentation to the user. In some architectures, the interaction manager may have distributed components, utilizing an event-based mechanism for coordination.
Finally, in this document, we use the term configuration or execution model to refer to the runtime structure of the various system components and their interconnection, in a particular manifestation of a multimodal system.
It is the intent of the WG to define specifications that apply to a variety of multimodal capabilities and deployment conditions.
(MMI-G1): The multimodal specifications MUST support authoring multimodal applications for a wide range of multimodal capabilities (MUST specify).
The specifications should support different combinations of input and output modalities, synchronization granularity, configurations and devices. Some aspects of this requirement are elaborated in detail below. For instance, the range of synchronization granularity is addressed by requirement MMI-A6.
It is advantageous that the specifications allow the application developer to author a single version of the application, instead of multiple versions targeted at combinations of multimodal capabilities.
(MMI-G2): The multimodal specifications SHOULD support authoring multimodal applications once for deployment on different devices with different multimodal capabilities (NICE to specify).
The multimodal capabilities may differ based on available modalities, presentation and interaction capability for each modality (modality-specific delivery context), synchronization granularity, available devices and their configurations, etc. These are captured in the delivery context associated with the multimodal system.
(MMI-G3): The multimodal specifications MUST support supplementary use of modalities (MUST specify).
Supplementary use of modalities in multimodal applications significantly improves accessibility of the applications. The user may select the modality best suited to the nature of the interaction and the context of use.
When supported by the runtime or prescribed by the author, it may be possible for the user to combine modalities as discussed for example in requirement MMI-I7 about composite input.
(MMI-G4): The multimodal specifications MUST support complementary use of modalities (MUST specify).
Authors of multimodal applications that rely on complementary multimodality should pay special attention to the accessibility of the application, for example by ensuring accessibility in each modality or by providing supplementary alternatives.
(MMI-G5): The multimodal specifications will be designed such that an author can write applications where the synchronization of the various modalities is seamless from the user's point of view (MUST specify).
To elaborate, an interaction event or an external event in one modality results in a change in another, based on the synchronization granularity supported by the application. See section 4.5 for a discussion of synchronization granularities.
Seamlessness can encompass multiple aspects:
Limited latency in the synchronization behavior with respect to what is expected by the user for the particular application and multimodal capabilities.
Predictable, non-confusing multimodal behavior
Expanding on the considerations made in section 1.1, it is important to support authoring for any granularity of synchronization covered in (MMI-A6):
(MMI-G6): The multimodal specifications MUST support authoring seamless synchronization of various modalities for any synchronization granularity and coordination capabilities (MUST specify).
Coordination is defined as the capability to combine multimodal inputs into composite inputs based on an interpretation algorithm that decides what makes sense to combine based on the context. Composite inputs are further discussed in section 2.4. It is a notion different from synchronization granularity described in section 4.5.
The following requirement is proposed in order to address the combinatorial explosion of synchronization granularities that the application developer must author for.
(MMI-G7): The multimodal specifications SHOULD support authoring seamless synchronization of various modalities once for deployment across a whole range of synchronization granularities or coordination capabilities (NICE to specify).
This requirement addresses the capability for the application developer to write the application once for a particular synchronization granularity or coordination capability and to have the application able to adapt its synchronization behavior when other levels are available.
Multimodal applications are no different from other web applications in this respect. It is important that the specifications not be limited to specific human languages.
(MMI-G8): The multimodal specifications MUST support authoring multimodal applications in any human language (MUST specify).
In particular, it must be possible to apply conventional methods for localization and internationalization of applications.
(MMI-G9): The multimodal specification MUST not preclude the capability to move a multimodal application from one human language to another, without having to rewrite the whole application (MUST specify).
For example, it should be possible to encapsulate language-specific items separately from the language-independent description.
It is important that multimodal applications remain easy to author and deploy in order to allow wide adoption by the web community.
(MMI-G10): The multimodal specifications produced by the MMI working group MUST be easy to implement and use (MUST specify).
This is a generic requirement that requires designers to consider from the outset issues of: ease-of-authoring by application developers; ease-of-implementation by platform developers and ease-of-use by the user. Thus it affects authoring, platform implementation and deployment.
The following requirement qualifies this further to guarantee that the specifications will be widely deployable with existing technologies (e.g. standards, network and client capabilities, etc.).
(MMI-G11): The multimodal specifications produced by the MMI working group MUST depend only on technologies that are widely available during the lifetime of the working group (MUST specify).
For W3C specifications, wide availability is understood as having reached at least the stage of candidate recommendation.
Related considerations are made in section 4.1.
Multimodal applications will provide mechanisms to develop and deploy accessible applications as discussed in section 1.2.
In addition, it is important that, as for all other web applications, the following requirement be satisfied:
(MMI-G12): The multimodal specifications produced by the MMI working group MUST not preclude conforming to the W3C accessibility guidelines (MUST specify).
This is especially important for applications that make complementary use of modalities.
Early deployments of multimodal applications show that security and privacy issues can be very critical for multimodal deployments. While addressing these issues is not directly within the scope of the W3C Multimodal Interaction Working Group, it is important that these issues be considered.
(MMI-G13): The multimodal specifications SHOULD be aligned with the W3C work and specifications for security and privacy (SHOULD specify).
The following security and privacy issues have been identified for multimodal and multi-device interactions.
Other considerations and issues may exist and should be compiled.
Notions of profile and delivery context have been widely introduced to characterize the capabilities of devices and the preferences of users.
From a multimodal point of view, different types of profiles are relevant:
These profiles are combined into the notion of delivery context introduced by the W3C Device Independence Activity [DI Activity]. The delivery context captures the set of attributes that characterize the capabilities of the access mechanism (device or devices) (device profile), the dynamic preferences of the user (as they relate to interaction through this device) and configurations. Delivery context may dynamically change as the application progresses, as the user situation changes (situationalization) or as the number and configurations of the devices change.
CC/PP is an example of a formalism for describing and exchanging the delivery context [CC/PP].
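For instance, a delivery context might be conveyed as a CC/PP profile fragment along the following lines. The component vocabulary (the `ex:` names) is purely illustrative, not a normative schema:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ccpp="http://www.w3.org/2002/11/08-ccpp-schema#"
         xmlns:ex="http://example.org/vocab#">
  <rdf:Description rdf:about="http://example.org/profile#Terminal">
    <ccpp:component>
      <!-- hardware capabilities relevant to modality selection -->
      <rdf:Description rdf:about="http://example.org/profile#Hardware">
        <ex:displayWidth>176</ex:displayWidth>
        <ex:displayHeight>208</ex:displayHeight>
        <ex:audioInput>Yes</ex:audioInput>
      </rdf:Description>
    </ccpp:component>
  </rdf:Description>
</rdf:RDF>
```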
Users of multimodal interactions will expect to be able to rely on these profiles to optimize the way that multimodal applications are presented to them.
(MMI-G14): The multimodal specifications MUST enable optimization and adaptation of multimodal applications based on delivery context or dynamic changes of delivery context (MUST specify).
Dynamic changes of delivery context encompass situations where the available devices, modalities and configurations, or the usage preferences, change dynamically. These changes can be involuntary or initiated by the user, the application developer or the service provider.
(MMI-G15): The multimodal specifications MUST enable authors to specify how delivery context and changes of delivery context affect the multimodal interface of a particular application (MUST specify).
The description of such impacts on a multimodal application could be specified by the author but modified by the user, platform vendor or service provider. In particular, the author can describe how the application can be affected by or adapted to the delivery context, but the user and service providers should be able to modify the delivery context. Other use cases should also be considered.
It is expected that the author of a multimodal application should always be able to specify the expected flow of navigation (i.e. sequence of interaction) through the application or the algorithm to determine such a flow (e.g. in mixed-initiative cases). This leads to the following requirement:
(MMI-G16): The multimodal specifications MUST enable the author of an application to describe the navigation flow through the application or indicate the algorithms to determine the navigation flow (MUST specify).
Numerous modalities or input types require some form of processing before the nature of the input is identified. For instance, speech input requires speech detection and speech recognition which requires specific data files (e.g. grammars, language models etc). Similarly handwritten input requires recognition.
(MMI-I1): The multimodal specifications MUST provide a mechanism to specify and attach modality related information when authoring a multimodal application. (MUST specify).
This implies that authors should be able to include modality-related information, such as the media types, processing requirements or fallback mechanisms that a user agent will need for the particular modality. Mechanisms should be available to make this information available to the user agent.
For example, audio input may be recognized (speech recognition), recorded, or processed by speaker recognition or natural language processing, using specific data files (e.g. grammars, language models). The author must be able to completely define such processing steps.
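For speech input, such a processing chain is typically configured by referencing the data files from the markup. A VoiceXML 2.0-style sketch (the grammar URI is hypothetical):

```xml
<field name="city">
  <!-- the recognizer is constrained by an SRGS grammar -->
  <grammar src="cities.grxml" type="application/srgs+xml"/>
  <prompt>Which city would you like to fly to?</prompt>
</field>
```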
(MMI-I2): The multimodal specifications developed by the MMI working group MUST support sequential multimodal input (MUST specify).
It implies that
(MMI-I3): The multimodal specifications developed by the MMI working group MUST support simultaneous multimodal input (MUST specify).
(MMI-I4): The multimodal specifications MUST enable the author to specify the granularity of input synchronization (MUST specify).
It should be remarked, however, that the actual granularity of input synchronization may be decided by the user, by the runtime or by the network (delivery context) or some combination thereof.
(MMI-I5): The multimodal specifications MUST enable the author to specify how the multimodal application evolves when the granularity of input synchronization is modified by external factors (MUST specify).
This requirement enables the application developer to specify how the performance of the application can degrade gracefully with changes in the input mechanism. For instance, it should be possible to access an application designed for event-level or field-level synchronization between voice (on the server side) and GUI (on the terminal) on a network that permits only session-level synchronization (that is, permits only sequential multimodality).
(MMI-I6): The multimodal specifications SHOULD enable a default input synchronization behavior and provide "overwrite" mechanisms (SHOULD specify).
Therefore, it should be possible to author multimodal applications while assuming a default synchronization behavior, for example supplementary event-level synchronization granularity.
(MMI-I7): The multimodal specifications developed by the MMI working group MUST support composite multimodal input (MUST specify).
(MMI-I8): The multimodal specifications SHOULD allow the author to specify how input combination is achieved, possibly taking into account the coordination capabilities available in the given delivery context (NICE to specify).
This can be achieved with explicit scripts that describe the interpretation and composition algorithms. On the other hand, it may also be left to the interaction manager to apply an interpretation strategy that includes composition, for example by determining the most sensible interpretation given the session context and therefore determining what input combination (if any) to select. This is addressed by the following requirement.
(MMI-I9): The multimodal specifications SHOULD enable the author to specify the mechanism used to decide when coordinated inputs are to be combined and how they are combined (NICE to specify).
Possible ways to address this include:
(MMI-I10): The multimodal specifications MUST support the description of input to be obtained from:
(MMI-I11): The multimodal specifications SHOULD support other input modes, including:
(NICE to specify).
(MMI-I12): The multimodal specifications MUST describe how extensibility is to be achieved and how new devices or modalities can be added (MUST specify).
(MMI-I13): The multimodal specifications MUST support the representation of the meaning of a user input (MUST specify).
(MMI-I16): The multimodal specifications MUST enable the coordination of input constraints across modalities (MUST specify).
Input constraints specify, for example through grammars, how inputs can be combined via rules or interaction management strategies. For example, the markup language may coordinate grammars for modalities other than speech with speech grammars to avoid duplication of effort in authoring multimodal grammars.
Possible ways to address this could include:
These methods will be considered during the specification work.
When using multiple modalities or user agents, a user may introduce conflicting input, consciously or inadvertently. For example, in a voice and GUI multimodal application, the user may say "yes" while simultaneously clicking on "no" in the user interface. We require that the specifications support the detection of such conflicts.
(MMI-I17): The multimodal specifications MUST support the detection of conflicting input from several modalities (MUST specify).
It is naturally expected that the author will specify how to handle the conflict through an explicit script or piece of code. It is also possible that an interaction management strategy will be able to detect the possible conflict and provide a strategy or sub-dialog to resolve it.
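Purely as an illustration (this is not a proposed interchange format), conflicting simultaneous inputs might surface to the interaction manager as two interpretations that disagree on the same slot:

```xml
<!-- the voice and GUI interpretations disagree on "confirm" -->
<inputs>
  <input mode="voice" timestamp="2002-07-26T14:05:02.10Z">
    <confirm>yes</confirm>
  </input>
  <input mode="gui" timestamp="2002-07-26T14:05:02.25Z">
    <confirm>no</confirm>
  </input>
</inputs>
```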
The interaction manager should be able to place different input events on the timeline, in order to determine the intent of the user.
(MMI-I18): The multimodal specifications MUST provide mechanisms to position the input events relative to each other in time (MUST specify).
(MMI-I19): The multimodal specifications SHOULD provide mechanisms to allow for temporal grouping of input events (SHOULD specify).
These requirements may be satisfied by mechanisms to order the input events or, when needed, by relative time stamping. For some configurations, this may involve clock synchronization.
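As a hypothetical sketch (not a proposed format), relative time stamps let the interaction manager group a spoken command with the pen gestures it refers to:

```xml
<!-- start/end are seconds relative to a shared clock -->
<input-group>
  <input mode="voice" start="0.00" end="1.20">move this there</input>
  <input mode="pen" start="0.40" end="0.55" x="120" y="45"/>
  <input mode="pen" start="0.95" end="1.10" x="310" y="88"/>
</input-group>
```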
(MMI-O1): The multimodal specifications developed by the MMI working group MUST support sequential media output (MUST specify).
As SMIL supports the sequencing of media, the specification is expected to rely on a similar mechanism. This is addressed in more detail in other requirements.
It implies that
(MMI-O2): The multimodal specifications MUST provide the ability to synchronize different output media with different granularities (MUST specify).
This covers simultaneous outputs. The granularity of output synchronization, as provided by SMIL, may range from no synchronization at all between the media (other than playing them in parallel) to tight synchronization mechanisms.
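SMIL 2.0 timing can express both ends of this range; an illustrative fragment (media URIs hypothetical):

```xml
<!-- loose: media simply play in parallel -->
<par>
  <audio src="prompt.wav"/>
  <video src="talking-head.mpg"/>
</par>

<!-- tighter: the highlight begins 2 seconds into the prompt -->
<par>
  <audio id="prompt" src="prompt.wav"/>
  <img src="highlight.svg" begin="prompt.begin+2s"/>
</par>
```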
(MMI-O3): The multimodal specifications MUST enable the author to specify the granularity of output synchronization (MUST specify).
However, it should be possible for the granularity of output media synchronization to be decided by the user (delivery context), runtime or network.
(MMI-O4): The multimodal markup MUST enable the author to specify how the multimodal application degrades when the granularity of output synchronization is modified by external factors (MUST specify).
(MMI-O5): The multimodal specifications SHOULD rely on a default output synchronization behavior for a particular granularity and should provide "overwrite" mechanisms (SHOULD specify).
(MMI-O6): The multimodal specifications MUST support as output media:
(MMI-O7): The multimodal specifications SHOULD support additional media outputs like:
(NICE to specify).
(MMI-O8): The multimodal specifications MUST describe how extensibility is to be achieved and how new output media can be added (MUST specify).
(MMI-O9): The multimodal specifications MUST support the specification of which output media should be processed and how this should be done. The specifications MUST provide a mechanism that describes how this can be achieved or extended for different modalities (MUST specify).
Examples of output processing may include: adaptation or styling of presentation for particular modalities, speech synthesis of text output into audio output, natural language generation, etc.
(MMI-A1): Where the functionality is appropriate, and clean integration is possible, the multimodal specifications MUST enable the use and integration of existing standard language specifications including visual, aural, voice and multimedia standards (MUST specify).
In general, it is understood that in order to satisfy MMI-G11, dependencies of the multimodal specifications on other specifications must be carefully evaluated if these are not yet W3C recommendations or not yet widely adopted.
SMIL 2.0 provides multimedia synchronization mechanisms. Therefore, MMI-A1 implies:
(MMI-A1a): The multimodal specifications MUST enable the synchronization of input and output media through SMIL 2.0 as a control mechanism (MUST specify).
The following requirement results from MMI-A1.
(MMI-A2): The multimodal specifications MUST be expressible in terms of XHTML modularization (MUST specify).
(MMI-A3): The multimodal specification MUST allow the separation of data model, presentation layer and application logic in the following ways:
This will enable the multimodal specifications to be compatible with XForms in environments which support XForms. This would comply with MMI-A1.
From an authoring point of view, it is important to have mechanisms (events, protocols, handlers) to detect or prescribe the modalities that are or should be available: i.e. to check the delivery context and to adapt to the delivery context. This is covered by MMI-G14 and MMI-G15.
(MMI-A4): There MUST be events associated to changes of delivery context and mechanisms to specify how to handle these events by adapting the multimodal application (MUST specify).
(MMI-A5): There SHOULD be mechanisms available to define the delivery context or behavior that is expected or recommended by the author (SHOULD specify).
(MMI-A6): The multimodal specifications MUST support the synchronization granularities at the following levels of synchronization:
The following requirement results from MMI-A1.
(MMI-A7a): Event-level synchronization MUST follow the DOM event model (MUST specify).
(MMI-A7b): Event-level synchronization SHOULD follow XML events (SHOULD specify).
Such events are not limited to events generated by user interactions as discussed in MMI-A16.
It is important that the application developer be able to fully define the synchronization granularity.
(MMI-A8): The multimodal specifications MUST enable the author to specify the granularity of synchronization (MUST specify).
(MMI-A9): It MUST be possible for the granularity of synchronization to be decided by the user, runtime or network (through the delivery context) (MUST specify).
(MMI-A10): The multimodal specifications MUST enable the author to specify how the multimodal application degrades when the granularity of synchronization is modified by external factors (MUST specify).
(MMI-A11): The multimodal specifications SHOULD rely on a default synchronization behavior for input and output, and SHOULD provide "override" mechanisms (SHOULD specify).
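As an informal illustration of MMI-A8 through MMI-A11, the sketch below (hypothetical names, not a normative mechanism) models a default synchronization granularity that the author may override, and that degrades when the delivery context only offers coarser levels:

```python
# Sketch under assumed names: a default granularity with an author
# override (MMI-A11) that degrades to what the context allows (MMI-A9/A10).

DEFAULT_GRANULARITY = "event"   # synchronize on every user input event

class SyncPolicy:
    def __init__(self, override=None):
        # The author may override the default granularity.
        self.granularity = override or DEFAULT_GRANULARITY

    def degrade(self, available):
        """Fall back to the finest granularity, at or coarser than the
        current one, that the delivery context supports."""
        order = ["event", "field", "form", "page", "session"]
        for g in order[order.index(self.granularity):]:
            if g in available:
                self.granularity = g
                return g
        raise ValueError("no common synchronization granularity")

policy = SyncPolicy()                  # defaults to event-level
policy.degrade({"form", "session"})    # network only supports coarse sync
```

The ordered list of levels is an assumption for illustration; the point is only that a default exists, the author can override it, and external factors can force a specified degradation.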
Nothing requires that input and output, even in the same modality, be provided on the same device or user agent. Input and output can be independent, and the granularity of interfaces afforded by the specification should apply independently to the mechanisms of input and output within a given modality when necessary.
(MMI-A12): The specification MUST support separate interfaces for input and output, even within the same modality (MUST specify).
(MMI-A13): The multimodal specifications MUST support synchronization of different modalities or devices distributed across the network, providing the user with the capability to interact through different devices (MUST specify).
In particular, this includes multi-device applications where different devices or user agents are used to interact with the same application; these may involve presentation in the same modality but on different devices.
Distribution of input and output processing refers to cases where the processing algorithms applied on input and output may be performed by distributed components.
(MMI-A14): The multimodal specifications MUST support the distribution of input and output processing (MUST specify).
(MMI-A15): The multimodal specifications MUST support the expression of some level of control over the distributed processing of input and output processing (MUST specify).
This requirement is related to MMI-I1 and MMI-O9.
(MMI-A16): The multimodal specifications MUST enable the author to specify how multimodal applications handle external input events and generate external output events used by other processes (MUST specify).
Examples of input events include camera, sensor or GPS events. Examples of output events include any form of notification or trigger generated by the user interaction.
This is expected to be automatically satisfied if events are treated as XML events.
Requirements MMI-I8 and MMI-I9 generalize as follows.
(MMI-A17): The multimodal specifications MUST provide mechanisms to position input and output events relative to each other in time (MUST specify).
(MMI-A18): The multimodal specifications SHOULD provide mechanisms to allow for temporal grouping of input and output events (SHOULD specify).
These requirements may be satisfied by mechanisms that order the events or, when needed, by relative time stamping. For some configurations, this may involve clock synchronization.
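A minimal sketch of how such ordering and temporal grouping (MMI-A17/MMI-A18) might be realized with relative time stamps; the event tuples and window size are illustrative assumptions:

```python
# Sketch: hypothetical event tuples (timestamp_ms, modality, payload)
# ordered in time and grouped when close enough to form one group.

events = [
    (120, "voice", "put that"),
    (450, "pen", (14, 88)),        # tap coordinates
    (2100, "voice", "zoom in"),
]

def group_by_window(events, window_ms=1000):
    """Order events in time and group those closer than window_ms,
    a precondition for treating them as one composite input."""
    groups, current = [], []
    for ts, modality, payload in sorted(events):
        if current and ts - current[-1][0] > window_ms:
            groups.append(current)
            current = []
        current.append((ts, modality, payload))
    if current:
        groups.append(current)
    return groups

groups = group_by_window(events)
# → [[(120, 'voice', 'put that'), (450, 'pen', (14, 88))], [(2100, 'voice', 'zoom in')]]
```

In a distributed configuration, the timestamps compared here would themselves require the clock synchronization mentioned above.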
It is expected that users will interact with multimodal applications through different deployment configurations (i.e. architectures): the different modules responsible for media rendering, input capture, processing, synchronization, interpretation, etc., may be partitioned or combined on a single device or distributed across several devices or servers. As previously discussed, these configurations may change dynamically.
The specification of such configurations is beyond the scope of the W3C Multimodal Interaction Working Group. However:
(MMI-C1): The multimodal specifications MUST support the deployment of multimodal applications authored according to the W3C MMI specifications, with all the relevant deployment configurations where functions are partitioned or combined on a single engine or distributed across several devices or servers (MUST specify).
The possibility to interact with multiple devices leads naturally to multi-user access to applications.
(MMI-C2): The multimodal specifications SHOULD support multi-user deployments (nice to specify).
Multimodal interactions are especially important for mobile deployments. Therefore, the W3C Multimodal Interaction Working Group will pay attention to the constraints associated with mobile deployments, and especially cell phones.
(MMI-R1): The multimodal specifications MUST be compatible with deployments based on user agents / renderers that run on mobile platforms (MUST specify).
Mobile platforms, like smart phones, are typically constrained in terms of processing power and memory available. It is expected that the multimodal specifications will take such constraints into account and be designed so that multimodal deployments are possible on smart phones.
In addition, it is important to pay attention to the challenges introduced by mobile networks, such as limited bandwidth and delays:
(MMI-R2): The multimodal specifications MUST support deployments over mobile networks, considering the bandwidth limitations and delays that they may introduce (MUST specify).
This may enable deployment techniques or specifications from other standards activities to provision the necessary quality of service.
The following requirements apply to the objectives for the specification work on EMMA as defined in the glossary. EMMA is intended to support the necessary exchanges of information between the multimodal modules mentioned in section 5.1.
(MMI-E1): The multimodal specifications MUST support the generation, representation and exchange of input events and results of input or output processing (MUST specify).
(MMI-E2): The multimodal specifications MUST support the generation, representation and exchange of interpretations and combinations of input events and results of input or output processing (MUST specify).
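The following sketch illustrates the kind of exchange MMI-E1 and MMI-E2 call for, serializing an interpreted input as a simplified EMMA-like XML fragment; the element and attribute names here are illustrative assumptions, not the normative EMMA syntax:

```python
# Sketch: represent an interpreted input as a simplified EMMA-like
# annotation. Element/attribute names are illustrative only.

import xml.etree.ElementTree as ET

def annotate(modality, confidence, slots):
    """Build an XML fragment carrying the interpretation of one input."""
    root = ET.Element("emma")
    interp = ET.SubElement(root, "interpretation",
                           {"medium": modality, "confidence": str(confidence)})
    for name, value in slots.items():
        ET.SubElement(interp, name).text = value
    return ET.tostring(root, encoding="unicode")

xml = annotate("voice", 0.87,
               {"origin": "Boston", "destination": "Denver"})
```

A fragment like this is what an input-processing component could hand to the interaction manager, or exchange inside a synchronization event (see MMI-S5).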
(MMI-S1): The multimodal specifications MUST enable authoring of the generation of asynchronous events and their handlers (MUST specify).
(MMI-S2): The multimodal specifications MUST enable authoring of the generation of synchronous events and their handlers (MUST specify).
(MMI-S3): The multimodal specifications MUST support event handlers local to the event generator (MUST specify).
(MMI-S4): The multimodal specifications MUST support event handlers remote to the event generator (MUST specify).
(MMI-S5): The multimodal specifications MUST support the exchange of EMMA fragments as part of the synchronization events content (MUST specify).
(MMI-S6): The multimodal specifications MUST support the specification of event handlers for externally generated events (MUST specify).
(MMI-S7): The multimodal specifications MUST support the specification of event handlers for externally generated events that result from the interaction of the user (MUST specify).
(MMI-S8): The multimodal specifications MUST support handlers that manipulate or update the presentation associated with a particular modality (MUST specify).
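As an informal illustration of MMI-S1 through MMI-S8, the sketch below (a hypothetical registration API) attaches an event handler that updates the presentation of the visual modality in response to a voice input event:

```python
# Sketch: hypothetical handler registry; an event in one modality
# triggers a handler that updates another modality's presentation.

class Presentation:
    def __init__(self):
        self.visual = "Welcome"

handlers = {}

def on(event_type):
    """Register a handler for a synchronization event type."""
    def register(fn):
        handlers[event_type] = fn
        return fn
    return register

@on("voice_input")
def update_display(presentation, text):
    # Voice input is reflected in the visual modality (MMI-S8).
    presentation.visual = f"You said: {text}"

p = Presentation()
handlers["voice_input"](p, "zoom in")
```

In a distributed configuration, the same handler could equally run remotely from the event generator (MMI-S4); only the dispatch mechanism would differ.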
In distributed configurations, it is important that synchronization exchanges take place with minimum delays. In practical deployments this implies that the highest available quality of service should be allocated to such exchanges.
(MMI-S9): The multimodal specifications MUST enable the identification of multimodal synchronization exchanges (MUST specify).
This would enable the underlying network, if it is aware of such needs, to allocate the highest quality of service to synchronization exchanges. This network behavior is beyond the scope of the multimodal specifications.
(MMI-S10): The multimodal specifications MUST support confirmation of event handling (MUST specify).
(MMI-S11): The multimodal specifications MUST support event generation or event handling pending confirmation of a particular event handling (MUST specify).
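One possible reading of MMI-S10 and MMI-S11, sketched with a hypothetical dispatcher: an event may be held back until the handling of a prior event has been confirmed:

```python
# Sketch (assumed protocol and names): dependent events are deferred
# until a prior event's handling is acknowledged (MMI-S10/MMI-S11).

class PendingDispatcher:
    def __init__(self):
        self.awaiting_ack = {}      # event id -> events deferred behind it
        self.delivered = []

    def send(self, event_id, payload, after=None):
        """Deliver immediately, or defer until 'after' is acknowledged."""
        if after in self.awaiting_ack:
            self.awaiting_ack[after].append((event_id, payload))
        else:
            self.delivered.append((event_id, payload))
            self.awaiting_ack[event_id] = []

    def acknowledge(self, event_id):
        """Confirmation of handling releases any deferred events."""
        for eid, payload in self.awaiting_ack.pop(event_id, []):
            self.send(eid, payload)

d = PendingDispatcher()
d.send("call", {"callee": "Wendy"})
d.send("leave_message", {"text": "hi"}, after="call")   # held back
d.acknowledge("call")                                   # now delivered
```

The event names mirror the "Call"/"Leave Message" events of the use-case table in Appendix A; the dispatcher itself is purely illustrative.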
(MMI-S12a): The multimodal specifications MUST be compatible with existing standards including DOM events and DOM specifications (MUST specify).
(MMI-S12b): The multimodal specifications SHOULD be compatible with existing standards including XML events specifications (SHOULD specify).
(MMI-S13): The multimodal specifications MUST allow lightweight multimodal synchronization exchanges compatible with wireless networks and mobile terminals (MUST specify).
This last requirement is derived from MMI-R1 and MMI-R2.
[CC/PP]: W3C CC/PP Working Group, URI: http://www.w3c.org/Mobile/CCPP/.
[DI activity]: W3C Device Independent Activity, URI: http://www.w3c.org/2001/di/.
[MMI charter]: W3C Multimodal Interaction Working group Charter, URI: http://www.w3c.org/2002/01/multimodal-charter.html.
[MMI WG]: W3C Multimodal Interaction Working Group, URI: http://www.w3c.org/2002/mmi/.
[MM Req Voice]: Multimodal Requirements for Voice Markup Languages, W3C Working Draft, URI: http://www.w3c.org/TR/multimodal-reqs.
This section is informative.
This document was jointly prepared by the members of the W3C Multimodal Interaction Working Group.
Special acknowledgments to Jim Larson (Intel) and Emily Candell (Comverse) for their significant editorial contributions.
Analysis of use cases provides insight into the requirements for applications likely to require a multimodal infrastructure.
The use cases described below were selected for analysis in order to highlight different requirements resulting from application variations in areas such as device requirements, event handling, network dependencies and methods of user interaction.
Use Case Device Classification
Thin client
A device with little processing power and capabilities that can be used to capture user input (microphone, touch display, stylus, etc.) as well as non-user input such as GPS. The device may have a very limited capability to interpret the input, for example small-vocabulary speech recognition or a character recognizer. The bulk of the processing occurs on the server, including natural language processing and interaction management.
An example of such a device may be a mobile phone with DSR capabilities and a visual browser (there could actually be thinner clients than this).
Fat client
A device with powerful processing capabilities, such that most of the processing can occur locally. Such a device is capable of input capture and interpretation. For example, the device can have a medium vocabulary speech recognizer, a handwriting recognizer, natural language processing and interaction management capabilities. The data itself may still be stored on the server.
An example of such a device may be a recent production PDA or an in-car system.
Medium client
A device capable of input capture and some degree of interpretation. The processing is distributed in a client/server or a multi-device architecture. For example, a medium client will have the voice recognition capabilities to handle small vocabulary command and control tasks but connects to a voice server for more advanced dialog tasks.
Use Case Summaries
Form Filling for air travel reservation
Description: The means for a user to reserve a flight using a wireless personal mobile device and a combination of input and output modalities. The dialog between the user and the application is directed through the use of a form-filling paradigm.
Device Classification: Thin and medium clients
Device Details: Touch-enabled display (i.e., supports pen input), voice input, local ASR and Distributed Speech Recognition Framework, local handwriting recognition, voice output, TTS, GPS, wireless connectivity, roaming between various networks.
Execution Model: Client-side execution
The user wants to make a flight reservation with his mobile device while he is on the way to work. The user initiates the service by making a phone call to a multimodal service (telephone metaphor) or by selecting an application (portal environment metaphor). The details are not described here.
As the user moves between networks with very different characteristics, the user is offered the flexibility to interact using the preferred and most appropriate modes for the situation. For example, while sitting in a train, the use of stylus and handwriting can achieve higher accuracy than speech (due to surrounding noise) and protect privacy. When the user is walking, the more appropriate input and output modalities would be voice with some visual output. Finally, at the office the user can use pen and voice in a synergistic way.
The dialog between the user and the application is driven by a form-filling paradigm where the user provides input to fields such as "Travel Origin:", "Travel Destination:", "Leaving on date", "Returning on date". As the user selects each field in the application to enter information, the corresponding input constraints are activated to drive the recognition and interpretation of the user input. The capability of providing composite multimodal input is also examined, where input from multiple modalities is combined for the interpretation of the user's intent.
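The form-filling behavior described above, where selecting a field activates that field's input constraints, can be sketched as follows (the field names, grammars and class are illustrative assumptions):

```python
# Sketch of the form-filling paradigm: focusing a field activates the
# input constraints used to recognize and interpret the user's input.

FIELD_GRAMMARS = {
    "travel_origin": {"boston", "denver", "chicago"},
    "leaving_on_date": {"today", "tomorrow", "monday"},
}

class FormFillingDialog:
    def __init__(self):
        self.active_field = None
        self.values = {}

    def on_focus(self, field):
        """Activate the constraints for the selected field."""
        self.active_field = field

    def on_input(self, utterance):
        """Accept input only if the active field's grammar matches it."""
        grammar = FIELD_GRAMMARS[self.active_field]
        if utterance in grammar:
            self.values[self.active_field] = utterance
            return True
        return False

dialog = FormFillingDialog()
dialog.on_focus("travel_origin")
dialog.on_input("boston")        # accepted: matches the active grammar
```

Composite multimodal input would extend on_input to fuse, say, a spoken utterance with a pen gesture before filling the field; that fusion step is omitted here.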
Description: This application provides a mechanism for a user to request and receive driving directions via speech and graphical input and output.
Device Classification: Medium client
Device Details: On-board system (in a car) with a graphical display, map database, touch screen, voice and touch input, speech output, local ASR and TTS processing, and GPS.
Execution Model: Client-side execution
The user wants to go to a specific address from his current location, and while driving wants to take a detour to a local restaurant (the user knows neither the restaurant's address nor its name). The user initiates the service via a button on his steering wheel and interacts with the system via the touch screen and speech.
Description: The means for users to call someone by saying their name.
Device Classification: Thin and fat devices
Device Details: Telephone
Execution Model: The study covers several possibilities:
These choices determine the kinds of events that are needed to coordinate the device and network based services.
Janet presses a button on her multimodal phone and says one of the following commands:
The application initially looks for a match in Janet's personal contact list and if no match is found then proceeds to look in other directories. Directed dialog and tapered help are used to narrow down the search, using aural and visual prompts. Janet is able to respond by pressing buttons, or tapping with a stylus, or by using her voice.
Once a selection has been made, rules defined by Wendy are used to determine how the call should be handled. Janet may see a picture of Wendy along with a personalized message (aural and visual) that Wendy has left for her. Call handling may depend on the time of day, the location and status of both parties, and the relationship between them. An "ex" might be told to never call again, while Janet might be told that Wendy will be free in half an hour after Wendy's meeting has finished. The call may be automatically directed to Wendy's home, office or mobile phone, or Janet may be invited to leave a message.
The use-case analysis exercise helped to identify the types of events a multimodal system would likely need to support.
Based on the use case analysis, the following events classifications were defined:
The events from the use cases described above have been consolidated in the following table.
| # | Event Type | Asynchronous vs. Synchronous | Local vs. Remote Generation | Local vs. Remote Handling | Input Interpretation | External vs. User | Notification vs. Action | Comments |
| 1. | Data Reply Event | Synchronous | Remote | Local | No | External | Notification | Event containing results from a previous data request |
| 2. | HTTP Request | Asynchronous | Local | Remote | No | External | N/A | A request sent via the HTTP protocol |
| 3. | GPS_DATA_in | Synchronous | Remote | Local | No | External | Notification | Event containing GPS location data |
| 4. | Touch Screen Event | Asynchronous | Local | Local | Yes | User | Action | Event that contains coordinates corresponding to a location on a touch screen |
| 5. | Start_Listening Event | Asynchronous | Local / Remote | Local / Remote | No | User | Action | Event to invoke the speech recognizer |
| 6. | Return Reco Results | Synchronous | Local / Remote | Local | Yes | External | Notification | Event containing the results of a recognition |
| 7. | Alert | Asynchronous | Remote | Local | No | External | Notification | Event containing unsolicited data which may be of use to an application |
| 8. | Register User Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging that the user has registered with the service |
| 9. | Call | Asynchronous | Local | Remote | No | User | Action | Request to place an outgoing call |
| 10. | Call Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging a request to place an outgoing call |
| 11. | Leave Message | Asynchronous | Local | Remote | No | User | Action | Request to leave a message |
| 12. | Message Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging a request to leave a message |
| 13. | Send Mail | Asynchronous | Local | Remote | No | User | Action | Request to send a message |
| 14. | Mail Ack | Synchronous | Remote | Local | No | External | Notification | Event acknowledging a request to send a message |
| 15. | | | Remote | | No | External | Notification | Occurs on connection |
| 16. | | | | | | | Notification | The user selects a new set of modalities by pressing a button or making menu selections (synchronous event). If the device can detect changes in the network or location via GPS or beacons, then the event is asynchronous. |
| 17. | On_Focus (field_name) | Synchronous | Local | Remote | No | User | Action | Event sends the selected field to the multimodal synchronization server for the purpose of loading the appropriate input constraints for the field. |
| 18. | Handwriting_Reco () | Synchronous | Local | Local | Yes | User | Action | Event to invoke the handwriting recognizer (HWR) after pen input in a field. In the current scenario, we consider that HWR is handled locally, but this may be expanded later to include remote processing. |
| 19. | | | | | | | | Result of recognition of field input is sent to the server |
| 20. | Send_Ink (ink_data, time_stamp) | Synchronous | Local | Remote | Yes | User | Action | Ink collected for a pen gesture is sent to the multimodal server for integration. As before, this event associates time stamp information with the ink data for synchronization. The result of the pen gesture can be transmitted as a sequence of (x, y) coordinates relative to the device display, or the ink collection could be interpreted first locally into basic shapes (e.g., circles, lines) and have those transmitted to the server. |
| 21. | Send_Gesture (gesture_data, time_stamp) | | | | | | | The server can provide a deeper semantic interpretation than the basic shapes that are recognized on the device. |
Combination of video and audio to process input (joint face/lips/movement recognition and speech recognition) and generate output (audio-visual media)
complementary use of modalities
A use of modalities where the interactions available to the user differ per modality.
Composite input is input received on multiple modalities at the same time and treated as a single, integrated compound input by downstream processes.
See execution model.
Contradictory inputs provided by the user in different modalities or on different devices. For example, they may indicate different exclusive selections.
A session context consists of the history of the interaction between the user and the multimodal system, including the input received from the user, the output presented to the user, the current data model and the sequence of data model changes.
Capability of a multimodal system to combine multimodal inputs into composite inputs, based on an interpretation algorithm that decides what makes sense to combine given the context.
CC/PP [Composite Capability/Preference Profiles]
A W3C working group which is developing an RDF-based framework for the management of device profile information. For more details about the group activity please visit http://www.w3.org/Mobile/CCPP/
The text-to-speech engine concatenates short digital-audio segments and performs intersegment smoothing to produce a continuous sound.
Data files passed as arguments to input or output processing algorithms.
Synchronization behavior supported by default by a multimodal application.
A set of attributes that characterizes the capabilities of the access mechanism in terms of device profile, user profile (e.g. identify, preferences and usage patterns) and situation. Delivery context may have static and dynamic components.
A piece of hardware used to access and interact with an application.
A particular subset of the delivery context that describes the device characteristics including for example device form factor, available modalities, level of synchronization and coordination.
DI [Device Independence]
The W3C Device Independence Activity is working to ensure seamless Web access with all kinds of devices, and worldwide standards for the benefit of Web users and content providers alike. For more details please refer to http://www.w3.org/2001/di/
Stored or recognized handwriting input.
A dialog in which one party (the user or the computer) follows a pre-selected path, independent of the responses of the other. (cf. mixed initiative dialog)
System components may live at various points of the network, including the local client.
DOM [Document Object Model]
A standard interface to the contents of a web page. Please visit http://www.w3.org/DOM/ for more details.
Extensible MultiModal Annotation Markup Language. Formerly known as NLSML (Natural Language Semantics Markup Language). This markup language is intended for use by systems to represent semantic interpretations for a variety of inputs, including but not necessarily limited to speech and natural language text input.
An event is a representation of some asynchronous occurrence of interest to the multimodal system. Examples include mouse clicks, hanging up the phone, and speech recognition errors. Events may be associated with data, e.g. the location where the mouse was clicked.
A software object intended to interpret and respond to a given class of events.
An agent (human or software) capable of generating events.
Runtime configuration of the various system components in a particular manifestation of a multimodal system.
External input events are events that do not originate from direct user input. External output events are events that originate in the multimodal system and are handled by other processes.
GPS [Global Positioning System]
A worldwide radio-navigation system formed from a constellation of 24 satellites and their ground stations. GPS uses these "man-made stars" as reference points to calculate positions accurate to a matter of meters.
A computational mechanism that defines a finite or infinite set of legal strings, usually with some structure.
Use of the pen for input, which is converted into text or symbols. Involves handwriting recognition.
Portions of the profile and session context persisted for the same user across sessions.
HTML [HyperText Markup Language]
A simple markup language used to create hypertext documents that are portable from one platform to another. To find more information about the HTML specification and the working group activity, please visit http://www.w3c.org/MarkUp/
HTTP [Hypertext Transfer Protocol]
To get details about the HTTP working group and the HTTP specification please visit http://www.w3c.org/Protocols/.
Any spoken language (e.g. French, Japanese, English, etc.).
See digital ink.
Event, set of events or macro-event generated by a user interaction in a particular modality on a particular device.
Specifies how inputs can be combined via rules or interaction management strategies. For example, the markup language may coordinate grammars for modalities other than speech with speech grammars, to avoid duplication of effort in authoring multimodal grammars.
Algorithm applied to a particular input in order to transform it or extract information from it (e.g. filtering, speech recognition, speaker recognition, NL parsing). The algorithm may rely on data files as arguments (e.g. grammar, acoustic model, NL models).
An interaction manager generates or updates the presentation by processing user inputs, session context and possibly other external knowledge sources to determine the intent of the user. An interaction manager relies on strategies to determine focus and intent as well as to disambiguate, correct and confirm sub-dialogs. We typically distinguish directed dialogs (e.g. user-driven or application-driven) and mixed initiative or free flow dialogs.
Output media in which at least a face has lip movements synchronized with output speech audio.
XML vocabularies that provide markup-level access to various system components
Synchronization between output media as specified by SMIL: http://www.w3.org/AudioVideo/
A description that can be rendered into physical effects that can be perceived and interacted with by the user, in one or multiple modalities and on one or multiple devices.
Musical Instrument Digital Interface, an audio format.
mixed initiative dialog
A style of dialog where both parties (the computer and the user) can control what is talked about and when. A party may on its own change the course of the interaction (e.g., by asking questions, providing more or less information than what was requested, or making digressions). Mixed initiative dialog is contrasted with directed dialog, where only one party controls the conversation. (cf. directed dialog)
MMI: [Multimodal Interaction]
A W3C Working Group which is developing markup specifications that extend the Web user interface to allow multiple modes of interaction. For more details on the MMI Working Group and the MMI Activity, please visit http://www.w3c.org/2002/mmi/
The type of communication channel used for interaction. It also covers the way an idea is expressed or perceived, or the manner in which an action is performed.
Change of modality to perform a particular interaction. It can be decided by the user or imposed by the application or runtime (e.g. when a phone call drops).
A working group established under the joint direction of the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC), whose goal is to create standards for digital video and audio compression. More precisely, MPEG defines the syntax of audio and video formats requiring low data rates, as well as the operations to be undertaken by decoders.
MP3 [MPEG Audio Layer-3]
An Internet music format. For MP3 related technologies please refer to http://www.mp3-tech.org/
A multimodal system supports communication with the user through different modalities such as voice, gesture, and typing. (cf. modality)
A "must specify" requirement must be satisfied by the multimodal specification(s), starting from their very first version.
natural language (NL)
Term used for human language, as opposed to artificial languages (such as computer programming languages or those based on mathematical logic). A processor capable of handling NL must typically be able to deal with a flexible set of sentences.
natural language generation (NLG)
A technique for generating natural language sentences based on some higher-level information. Generation by template is an example of a simple language generation technique. "The flight from <departure-city> to <arrival-city> leaves at <departure-time>" is an example of a template where the slots indicated by <…> have to be filled in with the appropriate information by a higher-level process.
natural language processing
Natural language understanding, generation, translation and other transformations on human language.
natural language understanding (NLU)
The process of interpreting natural language phrases to specify their meaning, typically as a formula in formal logic.
nice to specify
A "nice to specify" requirement will be taken into account when designing the specification. If a technical solution is available, the specifications will try to satisfy the requirement or support the feature, provided that it does not excessively delay the work plan.
The act of communicating an event (see subscribe).
override mechanism for synchronization
Information that specifies how the synchronization should behave when not following its default behavior. (cf. default synchronization)
Expressing information to be conveyed in a user-friendly form, possibly using multiple output media streams.
Algorithm to apply in order to transform or generate an output (e.g. TTS, NLG)
The meaning or interpretation of a word, phrase, or sentence, as opposed to its syntactic form. In natural language and dialog technology the term semantics is typically used to indicate a representation of a phrase or a sentence whose elements can be related to entities of the application (e.g. departure airport and arrival time for a flight application), or dialog acts (e.g. request for help, repeat, etc.).
The process of interpreting the semantic part of a grammar. The result of the interpretation is a semantic representation. This process is often referred to as Semantic Tagging.
The semantic result of parsing a written sentence or a spoken utterance. The semantic interpretation can be expressed as attribute-value pairs or more complex structures. W3C is working on the definition of a Semantic Representation formalism.
A sequential input is one received on a single modality; the modality may change over time. (cf. simultaneous or composite input)
A sequential multimodal application is one in which the user may interact with the application in only one modality at a time, switching between modalities as needed.
The time interval during which an application and its context is associated with a user and persisted. Within a session, users may suspend and resume interaction with an application within the same modality or device, or switch modality or device.
session level synchronization granularity
Synchronization granularity at which a multimodal application supports suspend and resume behavior across modalities.
The specifications (multimodal markup language and other) will aim to address and satisfy the requirement or support the feature during the lifetime of the working group. Early specifications will take this into account to allow easy and interoperable updates.
Simultaneous inputs denote inputs that can come from different modalities but are not combined into composite inputs. With simultaneous multimodal inputs, the inputs from several modalities are interpreted one after the other in the order in which they were received, instead of being combined before interpretation.
External information that can affect the usage or expected behavior of multimodal applications, including for example ongoing activities (e.g. walking versus driving), environment (e.g. noisy), privacy (e.g. alone versus in public), etc.
SMIL [Synchronized Multimedia Integration Language]
A W3C Recommendation, SMIL 2.0 enables simple authoring of interactive audiovisual applications. See http://www.w3.org/TR/smil20/ for details.
The ability of a computer to understand the spoken word for the purpose of receiving command and data input from the speaker.
A software/hardware component that performs recognition from a digital-audio stream. Speech recognition engines are supplied by vendors who specialize in the software.
The act of informing an event source that you want to be notified of some class of events.
supplementary use of modalities
Describes multimodal applications in which every interaction (input or output) can be carried through in each modality as if it were the only available modality.
suspend and resume
Suspend and resume behavior; an application suspended in one modality can be resumed in the same or another modality
The way that an input in one modality is reflected in the output in another modality or device, as well as the way that it may be combined across modalities (coordination capability).
synchronization granularity or level
The text-to-speech engine synthesizes the glottal pulse from human vocal cords and applies various filters to simulate throat length, mouth cavity, lip shape, and tongue position.
Technologies for converting textual (ASCII) information into synthetic speech output. Used in voice-processing applications requiring production of broad, unrelated, and unpredictable vocabularies, such as products in a catalog or names and addresses. This technology is appropriate when system design constraints prevent the more efficient use of speech concatenation alone.
Annotation of an event that characterizes the relative (with respect to an agreed-upon reference) or absolute time of occurrence of the event.
Set of inputs collected from the user before updating the output.
Uniform Resource Identifier - http://www.w3.org/Addressing/
A particular subset of the delivery context that describes the user including for example the identity, personal information, personal preferences and usage preferences.
An XML Events module that provides XML languages with the ability to uniformly integrate event listeners and associated event handlers with DOM Level 2 event interfaces. The result is to provide an interoperable way of associating behaviors with document-level markup. For XML Event specification please visit http://www.w3.org/TR/2001/WD-xml-events-20011026/Overview.html#s_intro
Extensible Stylesheet Language
Extensible Stylesheet Language Transformations