CSS Extensions for Multimodal Interaction

Dave Raggett


Max Froumentin



This note describes experimental ideas for extending CSS style sheets to add multimodal capabilities to XHTML documents without the need for any changes to the markup, and in a manner that is fully backwards compatible with existing browsers. The approach is modality independent and can be used with key strokes, speech and pen-based input, together with aural and visual prompts. This paper describes original work by the authors, and should not be taken to represent the views of the W3C, nor those of any W3C working group.


Keywords: User Interface, Browsers, Multimodal, Stylesheets, CSS, XHTML


1. Introduction

The vast majority of Web browsers today rely on the graphical user interface as invented by Doug Engelbart in the 1960s and subsequently refined by Xerox PARC and Apple Computer. We are used to using a pointer device to click on links and scroll the window, and using the keyboard to fill in text fields. Developments in speech and handwriting recognition are providing opportunities for new modes of interaction with Web pages. Electronic pens allow for gestures, drawings and visual notations like mathematics and music, as well as for handwriting. Speech allows for hands and eyes free operation, as well as being more convenient than the limited keypads available on small devices like cellphones. The true potential for natural language—the medium of choice for human to human communication—is only just beginning to be tapped.

This paper describes ideas for extending Web browsers to support multiple modes of interaction, and explores some ways in which style sheets could be extended to describe interaction independent of whether the user chooses to respond with speech, handwriting or keystrokes. The idea that users should be free to make such choices is proposed as a key principle:

To enable widespread adoption, and to repeat the success of the GUI Web, new authoring languages need to be easy to learn for existing Web designers skilled in HTML, and just as importantly, these languages should be backwards compatible with existing browsers. This allows users to view the pages with older browsers that are widely deployed, whilst enabling users with newer browsers to gain the benefits of multimodal interfaces. If the languages aren't backwards compatible, few designers will feel justified in developing new content while there are only a relatively small number of people with access to the new browsers. The lack of such content could then inhibit the deployment of such browsers.

A related issue concerns how much effort is needed to create an effective user interface. A language offering only low level abstractions may provide plenty of flexibility, but can make it hard for most designers to work with. A high level language can suffer from the opposite problem, being easy to work with, but restricted in flexibility. The experience with the Web suggests the value in providing an easy to learn declarative language for the most common cases, together with the means to add flexibility via scripting, to cope with the less common cases.

Previous proposals such as SALT[1] and X+V[2] have approached the challenge of multimodal interaction by mixing XHTML[3] markup with additional markup for speech. This has two consequences. The first is a risk of incompatibility with existing browsers, and the second is a reduction in flexibility when it comes to allowing users to freely select the mode of interaction. As an example of the first problem, SALT prompt text may be unintentionally visible as part of the body of a document. This occurs because browsers are designed to ignore start and end tags they don't recognize, as required by the HTML specification. As an example of the second problem, a SALT enabled browser may reprompt the user even though a value was provided via the keyboard.

These problems may be avoided by separating the user interface from the application markup. This paper explores the view that application user interfaces can be treated as a matter of styling, and that variations in user preferences and device capabilities can be treated as a matter of applying different style sheets. This idea is perhaps better known as skin-able user interfaces.

A well known example is "Winamp", a Windows application for playing music. Users can download a range of skins and try them out in turn. The skin not only allows for wide variations in the visual appearance of the application, it also changes the details of how you interact with the application. This idea is very much relevant to the needs of Web authoring languages given the wide variations in device capabilities, such as the display size and supported modes of interaction. W3C's Cascading Style Sheets (CSS) language[4] is a well established means for styling Web content, but currently assumes that the user interface is part of the application and not part of the style sheet. This limits the potential of CSS for defining skin-able user interfaces, which need to cover the choice of user interface controls and the details of the interaction with the user, as well as the visual appearance. By treating interaction as part of styling, it is easy to change the user interface simply by switching the style sheet.


2. Multimodal Interaction

This section of the paper gives a brief introduction to what it means to provide a multimodal interface to Web pages. This will be used to motivate the choice of features used for the CSS extensions proposed in section 3.

The user interface for Web browsers may be characterized in terms of the following categories:

Browser controls
These include the Back, Forward, Stop, Reload and Home buttons, as well as the means to pick a Website from the browser favorites or to enter a URI explicitly. Browsers generally allow users to set preferences such as the default fonts. A further function is support for scrolling through documents larger than the document window.
Navigation
The means to follow hypertext links, whether these are textual or image based, plus the means to move the input focus from one input control to the next. This is critically important for text fields, but is also needed when using the keyboard without a pointing device.
Selection
Menus, radio buttons and checkboxes all involve making a selection of some kind.
Text fields
Single or multi-line text entry fields.
Special purpose controls
These are application specific and often created using external formats such as Macromedia Flash.

Browser controls can be bound to a fixed set of spoken commands. For navigation, the commands will be application specific. The simplest approach is to make the spoken command the same as the visible label for the link or control, since this makes them easier to learn—the idea of "say what you see". The same applies to selection. For text fields, there is a choice between an application specific grammar, and free text entry using statistical dictation models for speech recognition. The accuracy of using speech for free text entry is generally less than that for application specific grammars, but nonetheless, is expected to become increasingly practical. One advantage of speech is the opportunity to fill out multiple fields or to give multiple commands with a single utterance. This is especially valuable when using network based recognition due to the increased latency compared with embedded (i.e. local) recognition.

For devices with an electronic pen or stylus, the browser controls, navigation, and selection can be operated in much the same way as when using a mouse pointer. You just need to tap on the button or link. The platform may also support a set of pen gestures where a particular movement of the pen is interpreted as a command. In principle, applications could define their own gestures, but this is difficult in practice as there are no established standards for defining such gestures. For handwriting recognition, the user may be free to write directly on the text field, or be required to write one character at a time in a special area. In either case, the speed of text entry tends to be slower for experienced users than when using a conventional full sized keyboard. The application may also be able to collect ink traces for processing on a server. This gives the designer the freedom to enable the use of ink for scribbled notes, drawings, and specialized notations. W3C is developing an XML format for ink traces called "InkML"[8] with this in mind.

When interacting with Web pages in a conventional manner, the user is free to choose which link to click on, and in which order to fill out fields etc. This is known as "user directed" interaction. When carrying out a task like placing an order, the application can direct the user through a sequence of pages, e.g. for selecting products, confirming the selection, collecting payment and delivery details, and placing the order. This is known as "application directed" interaction. This is appropriate for tasks that must be performed in a particular order, or when the user needs guidance, e.g. for an unfamiliar or complex task.

Application directed interaction is also useful when the user fails to respond as expected within a reasonable time, or when there are uncertainties in speech or handwriting recognition. In such situations, the application can be designed to provide progressively stronger guidance with what are known as "tapered prompts". This involves transiting between a sequence of dialog states. Such states can also be used to ask the user for confirmation, or to select from a short list of recognition hypotheses, or to provide feedback on progress, and to set the user's expectation for what the application is currently doing. In principle, transiting between dialog states could be implemented by asking the server for the next page (e.g. as a kind of form submission). But this can be expensive in terms of increased latency and network usage—factors that are important for mobile applications. This makes it worth providing the means to support dialog transitions without requiring such page loads.

Prompts guide the user to respond within the expectations described by application grammars. Sometimes the prompt and grammar are static, but in other situations, it will be necessary to refer back to something the user previously input. This creates a need for dynamically generated prompts and grammars that are computed as functions of application data. These could be computed server-side or client-side depending on whether dialog transitions involve page loads or not.

This section concludes with a look at issues concerning coupling across different modes of input that need to be addressed by multimodal authoring languages. The first case is where there are two text fields and the user selects the second using the keyboard or pointing device, whilst still saying the text for the first field. This may be considered by the user as akin to type ahead. Users are likely to be upset if their input for the first field is discarded or placed into the wrong field. This suggests that the notion of input focus should be taken as an indication of intent and interpreted in coordination with the speech modality. If the second field is associated with a spoken prompt, this should be deferred until the user has finished talking, just as would normally be the case when two people are in a conversation. The same holds for activating the speech grammar for the second field. It would considerably simplify the application designer's task if the implementation were to manipulate the event queue to hide the details of how this is achieved. This is assumed to be the case for the proposal described in this paper.

The second case is where the user can provide input using various combinations of two modes. An example is an application involving a map where the user can zoom or pan the map, or ask for information about a specific location or region. In principle, this could be implemented in terms of a form with two fields, one for speech and one for ink trace data. Depending on what the user says, the application may or may not expect ink data; equally, depending on what the user draws, the application may be satisfied with ink data alone or may expect accompanying speech. Such paired fields should share the input focus since simultaneous speech and pen input needs to be allowed. In principle, such multimodal map controls could be implemented using W3C's Scalable Vector Graphics format[11] with the addition of a means to collect ink traces. The flexibility involved in such composite controls is likely to involve a limited amount of scripting, but otherwise falls within the scope of the ideas described in this paper.


3. CSS Extensions

This section of the paper describes proposed extensions to W3C's Cascading Style Sheets language[4]. The extensions cover prompts, grammars, declarative event handlers and named dialog states. The proposal is based upon the idea of text as an abstract modality. Speaking, writing or typing are just different ways to enter text. Likewise, the application can present text using either the display or synthetic speech. This reflects the principle of modality independence as described in the introduction. For an effective user interface, the Web designer needs to control how text is obtained or presented in different modes. It isn't enough to simply leave this to the browser. You might think that HTML Forms are generic across modes, but they rely on the assumption that input is reliable and unambiguous, which, while true for key strokes and pointer input, is not the case for speech and handwriting.

In principle, style sheets alter the style of an application but not its substance. Thus you should still be able to do everything when style sheets are disabled, although perhaps not using the same modes or in exactly the same way. This constrains the range of actions that can be applied as part of style sheets. Please note that this paper is not intended as a complete specification due to practical limitations on the length of conference papers. The authors expect to be able to demonstrate a working implementation of the ideas at the conference.

3.1 The 'prompt' property

   prompt: none | auto | url(<address>) | expr(<expression>) | <string>

The prompt property is used to specify prompts that guide users as to how to interact with the application. Prompts are triggered by events as determined by the CSS selector, and may be presented in a variety of ways depending on the CSS media type, the device features and user preferences, for example, via speech synthesis, tool tip or status bar message.

The typical use of the prompt property is with a string, e.g.

body { prompt: "Welcome to ring-tones galore!" }
#search:focus { prompt: "please write or say the name of a band or artist" }

In the first example, a prompt is defined that will be played in response to the onload event when the document is first loaded. When the document is unloaded, any prompts that are playing will be stopped in response to the onunload event. In the second example, a text input field with an XML ID value of "search" is associated with a prompt to be played when the field is given the input focus. Prompts are handled via a prompt queue. If several prompts are triggered by the same event, then they should be queued in document order.

Designers need to take care to keep prompts consistent with the text and graphics in the markup. The following rule, for instance, gives a spoken prompt that conflicts with the displayed heading:

<h1 style="prompt: 'say yes or no'">Say black or white</h1>

Prompts can also be defined using external Web formats such as W3C's Speech Synthesis Markup Language[6], multimedia presentations expressed in SMIL[12] or scalable vector graphics represented in SVG[11]. References to such resources are expressed using the CSS url() syntax as follows:

body { prompt: url(welcome.ssml) }

For dynamically computed prompts, you can use an expression that returns a typed value such as a string, or an XML resource like SSML. For example:

body { prompt: expr("Welcome back " + userName) }

The expression is evaluated dynamically just prior to the prompt being presented. An open question is whether the syntax for expressions should be described as part of CSS or whether an arbitrary ECMAScript expression is acceptable.

The special value auto is reserved for indicating that the prompt is to be automatically constructed based upon the selected element, for instance, based upon the associated label element in XForms[13] or the title attribute in XHTML[3]. The precise means for constructing the prompt is dependent on the markup language.

To allow for styling of prompts, you can use the ::prompt pseudo-element. This can be combined with CSS pseudo-classes as in:

#search:focus::prompt { voice-family: female }

The special value none can be used to suppress a cascaded prompt when no prompt is intended.
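As a sketch of how this might be used (the selectors below are illustrative, not taken from a real page), a generic rule attaching automatic prompts to every field can be switched off for a particular one:

```css
/* generic rule: construct prompts automatically for all fields */
input:focus { prompt: auto }

/* exception: this field should have no prompt at all */
#password:focus { prompt: none }
```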

Many speech interfaces allow the user to start talking before the prompt has finished. This is often referred to as "barge in", and it may be desirable to inhibit such behavior, forcing the user to wait until the prompt has finished. This can be supported through the barge-in property, e.g.

body {
 prompt: url(disclaimer.ssml);
 barge-in: avoid;
}
A spoken prompt has a natural duration. This may not be the case when the prompt is rendered visually. The user may thus be able to respond immediately regardless of the barge-in property. An open question is whether there should be a way for application designers to set a minimum prompt duration to ensure that users are given sufficient time to read legal notices etc. This time interval could be specified as part of the barge-in property.

3.2 The 'grammar' property

   grammar: none | any | url(<address>) | expr(<expression>) | <type> | <abnf>

The grammar property is used to enable text input that is subject to the syntactic and semantic constraints specified by the grammar property value. The grammar is activated according to the selector in the same way as for prompts, for instance on the onload or onfocus events. If barge-in is inhibited, then activation is delayed until the prompt has finished.

The special value any is reserved for use with free text entry. Some speech recognizers may be unable to support this. For constrained text entry, simple grammars can be expressed inline using a subset of the W3C Speech Recognition Grammar Specification[7] ABNF notation, for example:

#weekday:focus {
 prompt: "what day do you prefer: monday, tuesday or wednesday?";
 grammar: monday | tuesday | wednesday;
}
The syntax allowed for ABNF is restricted to self-contained rule expansions, i.e. expansions that do not contain references to further rules. The BNF definition is as follows:

  abnf ::=  expansion [ "|" expansion ]*
  expansion ::= (leaf | "(" expansion ["|" expansion ]* ")") [ tag ]
  leaf ::= word | quoted-string
  tag ::= "{" tag-content "}"

A word is defined here to be a string that doesn't contain whitespace characters or any of the other special characters including brackets of all kinds, semicolons, vertical bars, quotation marks and other punctuation symbols. The tag-content is a string that cannot contain a "}" character. In the absence of an explicit tag, the input string is used in its place. Note the use of empty curly braces "{}" to return a null string as the result. More complex normalizations, e.g. those requiring calculations, may be done using SRGS[7] together with the Semantic Interpretation specification[10].

Here is an example showing how tags can be used to normalize input values:

#soda:focus {
 prompt: "what kind of soda do you prefer: pepsi or coke?";
 grammar: pepsi {pepsi cola} | coke {coca cola};
}
If the user enters "pepsi" the input value is normalized to "pepsi cola", while "coke" is normalized to "coca cola".
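The normalization described above can be sketched in a few lines of ECMAScript. This is an illustrative matcher for the flat case only (alternatives of single words, each with an optional tag); it does not handle the parenthesized expansions allowed by the full BNF:

```javascript
// Illustrative matcher for the flat subset of the inline grammar
// notation: alternatives of single words, each with an optional {tag}
// used to normalize the input. Parenthesized expansions from the full
// BNF above are not handled; this is a sketch, not an implementation.
function matchGrammar(grammar, input) {
  for (const alt of grammar.split("|")) {
    const m = alt.trim().match(/^(\S+)(?:\s*\{([^}]*)\})?$/);
    if (!m) continue;               // skip alternatives we can't parse
    if (m[1] === input) {
      // with no explicit tag, the input string is used in its place
      return m[2] !== undefined ? m[2] : input;
    }
  }
  return null;                      // no match
}
```

For instance, matchGrammar("pepsi {pepsi cola} | coke {coca cola}", "coke") yields "coca cola", while an input matching an untagged alternative is returned unchanged.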

The ABNF format could in principle be extended to allow for the use of regular expressions for constraining text input, and for using InkML[8] as a way to bind pen gestures to semantic results. For this, InkML would be used to define examples of gestures that can be matched against the user's input.

Larger grammars are supported by reference to external resources, e.g.

#callee:focus {
 prompt: "who do you want to call?";
 grammar: url(directory.srgs);
}
It may be worth allowing for a small number of built-in grammars to reduce the need for complex grammars and to enable the platform to use a platform specific input method. In the following example, a built-in calendar control could be used for convenience in selecting a date.

#birthday:focus {
 prompt: "what is your date of birth?";
 grammar: type(date);
}
Possible types include date, as in the example above, and number, together with the special value auto.

The use of type(auto) is intended for markup languages like XForms[13] where the type information is supplied as part of the markup, or where it is practical to automatically create the grammar from the set of labels, e.g. for the XForms <select> and <select1> elements.

You can also use an expression for dynamically computed grammars, with the same syntax for expressions as for prompts. Similarly, the special value none can be used to indicate that no input is expected.

The default processing of the input value depends on the nature of the element to which the associated grammar property is bound. For grammars associated with hypertext links or buttons, any non-null value activates the link or button. For elements acting as check boxes, or as choices in a multiple selection list, any non-null value activates the selection, while a null value de-activates the selection.

A complication is the need to distinguish between setting the focus and filling out a text field. The reason for this is that speech recognition and handwriting recognition are imperfect. The chances of success are enhanced if the user is constrained in what they say or write. In a mobile Web application, the current Web page may have only a few hypertext links and form controls, while a text field on that page may be associated with a relatively large grammar, for instance, the set of names of US airports.

A solution is to make the behavior dependent on whether the selected element has the input focus. If it has the focus, the element is updated with the returned value, otherwise, the input focus is given to the selected element and its value is left unchanged. In essence, this means that a text field must already have the focus if the field's value is to be updated.
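This default behavior can be sketched as follows, using a plain object in place of the browser's real focus and field state (the shape of the ui object is an assumption made purely for illustration):

```javascript
// Sketch of the default match processing: if the selected element
// already has the input focus, fill in its value; otherwise treat the
// input as an indication of intent and just move the focus.
function applyMatch(ui, elementId, value) {
  if (ui.focus === elementId) {
    ui.values[elementId] = value;  // focused field: update its value
  } else {
    ui.focus = elementId;          // not focused: set focus, keep value
  }
  return ui;
}
```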

This default processing may be overridden with a scripted event handler. This could be used, for example, to interpret EMMA[9] documents representing annotated interpreted input.

An open question is what level of control to provide over how grammars are de-activated. In the "tap and talk" idiom, the user of a pen enabled device taps on a field and then speaks to fill it out. The grammar is then de-activated upon a no input, a no match or a match event. In some situations, the grammar should remain active. It can then raise a succession of match events as the user's speech is matched against the grammar. This behavior could be enabled through an additional CSS property.

3.3 The 'reprompt' property

   reprompt: none | [<time>] [<action> [<action>]]

The reprompt property enables reprompting when the user doesn't respond within a specified timeout following the presentation of the prompt, or when the user's input doesn't match the associated grammar. The property has no effect unless the associated element has the input focus and a non-empty value for the grammar property.

The value of the reprompt property is a timeout followed by the action to be taken on no input, or unexpected input that doesn't match the associated grammar (no match). If the action is missing, the default is to re-enter the current state. In the following example, the prompt is repeated after 3 seconds if the user doesn't provide a number:

#checked-bags:focus {
    prompt: "how many bags do you want to check in?";
    grammar: type(number);
    reprompt: 3s;
}
The action designates a dialog state to transit to, written state(name) as described in section 3.5.

If only one action is provided it will apply to both the no input and no match events. If two actions are given (separated by whitespace), the first applies to no input and the second to no match events. As for other properties, the special value none can be used to suppress cascades.
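For instance, the following rule repeats after five seconds, transiting to one state on no input and another on no match (the state names and grammar URL are illustrative; named states are described in section 3.5):

```css
#city:focus {
   prompt: "which city are you flying to?";
   grammar: url(cities.srgs);
   /* first action applies to no input, second to no match */
   reprompt: 5s state(still-there) state(try-again);
}
```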

An open question is whether this property should be split into separate properties for no input and no match events. One reason for doing so would be to provide finer-grained control over different kinds of speech time outs. For instance, a babble time out occurs when the user is still talking as the time out fires, but the recognizer hasn't yet found a match with the associated grammar. It might be worth using different properties for time out values and actions to allow for cascading of time outs, and to allow them to be overridden by a user style sheet.

3.4 The 'next' property

   next: none | [<time>] [filter(<expression>)] [ [<condition> <action>]* [<action>] ]

The next property is used to handle match events where the user's input matches the associated grammar. It only affects elements that have values other than 'none' for either the grammar or prompt properties. It gives authors the means to automatically advance the focus to a specified element or document, thereby overriding the normal flow of interaction.

The time parameter may be used to specify a time interval after which the actions designated by the property take effect. This time interval gives users the chance to take the initiative before the application steps in to give them a hand. Of course, users don't have to follow directions, and may choose to do something different, e.g. by tapping with the stylus on a different field, or by speaking a navigation command.

The filter expression may be used to filter the input before evaluating the conditions. The expression should return a string value that will be used in place of the recognition result. This gives considerable flexibility for dealing with structured results when these are expressed in EMMA[9]. For example, when driving a printing application, the user might say "print 3 copies, A4, best quality". The semantic interpretation rules associated with the grammar could be used to map this to a structured interpretation in which the number of copies, the paper size and the print quality appear as separate values.


A simple scripted function can pick out these values and use them to fill out each of the corresponding form fields. In principle, such a script could be replaced by a mechanism to match the interpretation with the form using some kind of unification algorithm. The 'next' property could then provide a syntax such as unify(#form-identifer). However, the flexibility of W3C's extensible multimodal annotation language (EMMA) is such that it would be impractical to provide declarative solutions for all cases. The filter mechanism and scripting are thus still of value even if such a unification mechanism is provided.
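Such a scripted filter can be sketched as follows, with a plain object standing in for the parsed EMMA interpretation (the slot and field names here are illustrative assumptions):

```javascript
// Illustrative filter: copy named slots from a structured
// interpretation into the corresponding form fields, and return a
// string to be used in place of the raw recognition result.
function fillFromInterpretation(interp, fields) {
  for (const name of Object.keys(interp)) {
    if (name in fields) {
      fields[name] = String(interp[name]);  // fill the matching field
    }
  }
  return Object.values(interp).join(" ");   // flattened result string
}
```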

Conditions are either quoted strings to be matched against the input value, or boolean expressions written as on(expression).

If several conditions evaluate to true, only the action associated with the first such condition will be taken. Actions include setting the focus to a named element (e.g. #beverage), transiting to a named dialog state with state(name), submitting the form with submit, and evaluating a script with do(expression).

If the action is missing, the default is the next element in the document defined tab order that can accept text input. The tab order is the order in which you can move through the input controls by pressing the tab key or equivalent. This can be controlled via the XHTML tabindex attribute.

The on(expression) and do(expression) forms enable the author to determine which actions to take according to the application state rather than just the current text value. They can also be used in combination with markup components exposed to scripting via binding mechanisms such as XBL.

Here is an example that determines which element to set the focus to based upon the user's input:

#wants-drink:focus {
    prompt: "do you want a drink?";
    grammar: yes | no;
    reprompt: 3s;
    next: "yes" #beverage "no" #food;
}

This example asks whether the user wants a drink. If the user responds "yes", the focus will be set to the element with the XML ID of "beverage"; if "no", the focus will be set to the element with the XML ID of "food". If the final action lacks a preceding condition string, it acts as the default action for the situation where none of the previous condition strings match the result. If this default action is not provided, the default will be the next appropriate element in the document tab order.

For menus, check boxes and radio buttons etc. the text value is the internal value of the field. For the XHTML <input> and <option> elements this is given by the value attribute. If the condition string is missing, any non-empty string will be matched. The timeout only applies after a match is detected.
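The condition matching rules above can be sketched as follows, with plain objects standing in for the parsed property value (the data shapes are illustrative, not part of the proposal):

```javascript
// Sketch of 'next' condition matching: the first matching condition
// wins; a rule without a condition matches any non-empty value; the
// fallback stands for the next element in the document tab order.
function evaluateNext(value, rules, fallback) {
  for (const { condition, action } of rules) {
    const matches = condition === undefined
      ? value !== ""              // missing condition: any non-empty
      : condition === value;      // otherwise an exact string match
    if (matches) return action;
  }
  return fallback;
}
```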

Consider the situation where a user has filled out a field in a way that matches the grammar, and later wants to change the value. Upon setting the focus back to that field, it would be inappropriate for the focus to be immediately moved away on account of a matching condition in the 'next' property, since this would preclude the ability of the user to update the field's value. To avoid this happening, there is the precondition that after an element gets the focus, it has to receive input before the conditions given by the 'next' property are evaluated. This precondition does not apply if the element is associated with a null grammar using the special value none.

The submit mechanism should be conditional on required fields having been filled out. What is the most convenient way to deal with this? One possibility is a property that indicates that the associated field is required. The submit action would then set the focus to the first unfilled required field, where "first" is defined in terms of the document specific tab order. A further idea would be to extend the submit action to name dependent fields, e.g. a voice command might require a pen gesture input via a scribble control.

In principle, the CSS @media rule can be used to tailor the timeouts and control flow depending on the media type. The CSS media type "speech" is applicable when the modes of interaction are restricted to speech and DTMF. See also the CSS3 Speech[5] properties for styling the rendering of XML to speech.


@media speech {

  /* rules applicable to the speech media type */
}

@media handheld {

  /* rules applicable to the handheld media type */
}

Further work is needed to enrich CSS Media Queries to fully realize their potential for customizing interaction to match user preferences and device capabilities. The current set of media names in CSS 2.1 are inadequate for expressing the possible choices of input and output modes and their combination with other device characteristics.

3.5 The ':state' pseudo-class

The :state(name) pseudo-class allows you to define named states for use in defining dialogs. Here is an example where it is used to change the prompt when reprompting:

  #sign:focus {
     prompt: "What is your star sign?";
     grammar: Aries | Taurus | Gemini | twins {Gemini} | Cancer;
     reprompt: 1.5s state(retry);
     next: submit;
  }

  #sign:focus:state(retry) {
     prompt: "Please tell me your astrological sign?";
  }

The CSS cascade ensures that the retry state inherits the properties defined with :focus. In the above example, this applies to the 'grammar', 'reprompt' and 'next' properties, but not to 'prompt', which is overridden.

The next example shows how :state can be used to provide an apology after an unrecognized input:

  #sign:focus {
     prompt: "What is your star sign?";
     grammar: Aries | Taurus | Gemini | Cancer;
     reprompt: 1.5s state(shy) state(sorry);
     next: submit;
  }

  #sign:focus:state(shy) {
     prompt: "Don't be shy, tell me your star sign?";
  }

  #sign:focus:state(sorry) {
     prompt: "I didn't quite get that, what is your star sign?";
  }

The following shows how :state can be used to provide a prompt after a successful match against the input grammar:

  #sign:focus {
     prompt: "What is your star sign?";
     grammar: Aries | Taurus | Gemini | Cancer;
     reprompt: 1.5s;
     next: state(epilog);
  }

  #sign:focus:state(epilog) {
     prompt: "please wait while we retrieve your horoscope";
     grammar: none; /* no input is expected at this point */
     next: submit;

The ability to create named sub-states was inspired by David Harel's work on statecharts[14], which forms part of the UML standard. Harel's work allows for hierarchically nested states and for the concurrent activation of multiple states. This can be used to formalize the way in which CSS selectors and properties are used in this paper to bind behaviors to XML markup.

3.6 Dealing with uncertain input

A speech or handwriting recognizer may have difficulties in correctly identifying what the user said. In some circumstances, it is sufficient to assume the most likely recognition hypothesis, and to ask the user for confirmation before, say, submitting the form. In other cases, it will be necessary to ask the user to indicate which interpretation was intended. How is this to be supported?

Many systems provide the application designer with access to the N-best list of recognition hypotheses along with the associated confidence scores. The designer can then apply some kind of thresholding on these scores to determine how to proceed. The problem with this is that the scores tend to be platform dependent, making interoperability problematic. One way around this is to leave the thresholds to the platform. Here is one possibility:

   confidence: <high> <medium> <low>

Where <high>, <medium> and <low> are the actions to be taken if the last input was above the upper threshold, between the upper and lower thresholds, or below the lower threshold, respectively. The actions would give the names of interaction states as defined in section 3.5.
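To make this concrete, the 'confidence' property might be combined with the ':state' pseudo-class from section 3.5 along the following lines. This is a sketch only: the property and the state names are illustrative, not standardized.

```
#sign:focus {
   prompt: "What is your star sign?";
   grammar: Aries | Taurus | Gemini | Cancer;
   /* accept, confirm or reject depending on which band
      the recognition confidence score falls into */
   confidence: state(accept) state(confirm) state(rejected);
}

#sign:focus:state(confirm) {
   prompt: "I think I heard your sign, is that right?";
}

#sign:focus:state(rejected) {
   prompt: "Sorry, I didn't catch that. What is your star sign?";
}
```

Leaving the actual threshold values to the platform avoids baking platform-dependent scores into the style sheet, at the cost of some loss of fine control for the designer.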

The absolute confidence score is not the only cue that matters. If there are several matches with similar confidence scores, then disambiguation will be needed. Some of the proposed matches may be improbable given the current application state. In principle, an application's script can use the current state to re-rank the matches proposed by the recognizer; this can be done using the filter mechanism. Finally, if the user is having problems with one mode of input, it may be much better to encourage the user to switch to another mode, or to combine more than one mode.
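As a sketch of how such a filter script might re-rank the recognizer's hypotheses, consider the following. The N-best list format and the notion of a set of "plausible" values supplied by the application state are assumptions made for illustration, not part of any recognizer API.

```javascript
// Hypothetical filter: re-rank an N-best list of recognition
// hypotheses using the current application state. The state
// supplies the set of values that are plausible at this point
// in the dialog; implausible matches have their scores penalized.
function rerank(nbest, plausibleValues) {
  return nbest
    .map(function (hyp) {
      // halve the score of hypotheses that the application
      // state considers implausible
      var boost = plausibleValues.indexOf(hyp.value) >= 0 ? 1.0 : 0.5;
      return { value: hyp.value, score: hyp.score * boost };
    })
    .sort(function (a, b) { return b.score - a.score; });
}

// Example: "Taurus" is implausible given the current state,
// so "Gemini" moves to the top despite a lower raw score.
var ranked = rerank(
  [{ value: "Taurus", score: 0.8 }, { value: "Gemini", score: 0.6 }],
  ["Gemini", "Cancer"]
);
```

A real filter would combine such state-based re-ranking with the confidence bands described above before deciding whether to accept, confirm or reprompt.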

3.7 Full example

Here is a small but complete example:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Daily Horoscope</title>
    <style type="text/css">
      body {
         color: white;
         background-image: url(stars.gif);
      }
      #sign:focus {
         prompt: "What is your star sign?";
         grammar: Aries | Taurus | Gemini | twins {Gemini} | Cancer;
         reprompt: 1.5s state(second);
         next: submit;
      }
      #sign:focus:state(second) {
         prompt: "Please tell me your astrological sign?";
         reprompt: 1.5s state(third);
      }
      #sign:focus:state(third) {
         prompt: "Are you Aries, Taurus, Gemini or Cancer?";
         reprompt: 1.5s;
      }
    </style>
  </head>
  <body>
    <form action="http://example.com/horoscope">
      Your star sign?  <input id="sign" type="text" name="sign" />
    </form>
  </body>
</html>


This paper started with design principles, covering modality independence, the complementary roles of declarative representations and scripting, and the need to separate the user interface from the application. W3C's work on CSS and XForms helps, but isn't really adequate when it comes to enabling effective multimodal user interfaces. This paper explores the view that interaction is really a matter of styling, and proposes a set of extensions to CSS based upon a study of different patterns of interaction. Regrettably, there isn't enough space in this paper to make a comparison with other proposals such as SALT and X+V. More information about these can be found via the references.

The emergence of embedded speech is creating an opportunity for extending mobile devices to support multimodal interaction. To encourage widespread adoption and a vigorous growth in the available content, it will be critically important to provide an effective end-user experience. The approach taken in this paper is to try to simplify the design effort needed for such applications. Whilst this paper has focused on extending CSS, another possibility would be an XML language for interaction sheets.

The authors plan to demonstrate the CSS approach at the conference using an implementation based upon Internet Explorer and SALT, with scripts for interpreting the CSS extensions and dynamically compiling them to SALT. This work has shown the viability of using scripting to explore declarative approaches, and further experiments are under consideration.


[1] The SALT specification is available from the SALT Forum at http://www.saltforum.org/.

[2] The XHTML+Voice specification is available from IBM at http://www.ibm.com/software/pervasive/multimodal/x%2Bv/11/spec.htm

[3] Pemberton et al., XHTML 1.0 The Extensible HyperText Markup Language (Second Edition), 1st August 2002, available from the W3C at http://www.w3.org/TR/xhtml1/

[4] Bos et al., Cascading Style Sheets, level 2 revision 1 CSS 2.1 Specification, 25 February 2004, available from W3C at http://www.w3.org/TR/CSS21/

[5] Raggett et al., CSS3 Speech Module, 27th July 2004, available from the W3C at http://www.w3.org/TR/css3-speech/

[6] Burnett et al., Speech Synthesis Markup Language (SSML) Version 1.0, 7th September 2004, available from the W3C at http://www.w3.org/TR/speech-synthesis/

[7] Hunt et al., Speech Recognition Grammar Specification Version 1.0, 16th March 2004, available from the W3C at http://www.w3.org/TR/speech-grammar/

[8] Yi-Min Chee et al., Ink Markup Language, 28th September 2004, available from the W3C at http://www.w3.org/TR/InkML/

[9] Wu Chou et al., EMMA: Extensible MultiModal Annotation markup language, 1st September 2004, available from the W3C at http://www.w3.org/TR/emma/

[10] Luc Van Tichelen, Semantic Interpretation for Speech Recognition, 8th November 2004, available from the W3C at http://www.w3.org/TR/semantic-interpretation/

[11] Tolga Capin, Mobile SVG Profiles: SVG Tiny and SVG Basic, 14th January 2003, available from the W3C at http://www.w3.org/TR/SVGMobile/

[12] J Ayars et al., Synchronized Multimedia Integration Language (SMIL 2.0), 7th August 2001, available from the W3C at http://www.w3.org/TR/smil20/

[13] M. Dubinko et al., XForms 1.0, 14th October 2003, available from the W3C at http://www.w3.org/TR/xforms/

[14] D. Harel, Statecharts: A Visual Formalism for Complex Systems, Science of Computer Programming, 8:231--274, 1987.


During the W3C Multimodal Interaction workshop, held in Sophia Antipolis in July 2004, one of the participants suggested that W3C should try to develop a simple standard for authoring multimodal applications for use on mobile devices. In discussions on how to address this goal after the workshop had ended, the idea came up of extending CSS to describe multimodal interaction. This seemed like an intriguing possibility, well worth exploring. The authors would like to thank Debbie Dahl and Bert Bos for their helpful suggestions.

A copy of this paper can be found at http://www.w3.org/2004/10/css-mmi/