Common Sense Suggestions for Developing Multimodal User Interfaces

W3C Working Group Note 11 September 2006

This version:
Latest version:
Previous version:
This is the first publication.
Jim Larson, Intel


This document is based on the accumulated experience of several years of developing multimodal applications. It provides a collection of common sense advice for developers of multimodal user interfaces.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is a W3C Working Group Note. It represents the views of the W3C Multimodal Interaction Working Group at the time of publication. The document may be updated as new technologies emerge or mature. Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document is one of a series produced by the Multimodal Interaction Working Group (Member Only Link), part of the W3C Multimodal Interaction Activity. The MMI activity statement can be seen at http://www.w3.org/2002/mmi/Activity.

Comments on this document can be sent to www-multimodal@w3.org, the public forum for discussion of the W3C's work on Multimodal Interaction. To subscribe, send an email to www-multimodal-request@w3.org with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. This document is informative only. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.



When fonts were first introduced, many messages looked like ransom notes from kidnappers. When color was introduced, many reports looked like they had barely survived an explosion in a paint factory. To avoid these annoying user interfaces, developers adopted suggestions and best practices for using fonts and colors.

With the introduction of multiple modes of input (voice, pen, and keys), inexperienced developers may design loud, confusing, and annoying user interfaces that result in low user performance and high user discontent. This document enumerates a collection of commonsense suggestions for developing multimodal user interfaces that users perform well with and prefer. The suggestions draw on techniques and principles from many diverse disciplines.

This set of suggestions originated in a brainstorming session with some of my students at the Oregon Graduate Institute of the Oregon Health & Science University. I categorized the suggestions and showed them to several multimodal application developers, who contributed further suggestions. These have been reviewed and revised by the W3C Multimodal Interaction Working Group. The suggestions will be reviewed by other relevant W3C working groups including Accessibility, Internationalization, and Mobile Web Initiative Best Practices.

Again, these are commonsense suggestions. You may think that no one would ever develop user interfaces that violate these suggestions, but developers have violated commonsense suggestions before and will likely do so again. Use these suggestions as a checklist when you design a multimodal interface. These suggestions should help you to construct a multimodal user interface that improves user performance and satisfaction, so that the intended users can use your application easily and effectively.

These suggestions can be used as follows:

  1. Review the suggestions before designing a multimodal user interface. The suggestions will assist you in making decisions as you design your multimodal user interface.

  2. Review the suggestions after designing a multimodal user interface. Use the suggestions as a checklist to assess your design after it is completed. Some designers rank their user interface with respect to each suggestion, giving a high score if the user interface conforms to the suggestion and a low score if it does not.

  3. The suggestions are only suggestions. Every rule has situations in which it should be overridden, and these suggestions are no exception. If there are good reasons for not following a suggestion, then ignore the suggestion.

  4. Some users will want to configure their user interface to satisfy their personal preferences. We encourage the use of configuration dialogs to help the user achieve the configuration that is best for him or her. We also note that many users are afraid of configuration, and are happy to use the user interface "as is," without ever configuring the system.

Four Major Principles

The suggestions are organized into four major principles of user interface design. The following four principles determine how quickly users are able to learn and how effectively they are able to perform desired tasks with the user interface:

  1. Satisfy real-world constraints
  2. Communicate clearly, concisely, and consistently with users
  3. Help users recover quickly and efficiently from errors
  4. Make users comfortable

Multimodal user interface developers should follow the above four principles and apply the following suggestions to avoid many of the potential usability problems caused by using modes incorrectly.

1. Satisfy Real-world Constraints

Real-world constraints limit what users may achieve with an application. These limitations may be due to the nature of the task the user intends to perform, other activities the user is performing, physical limitations of the user, and conditions of the environment in which the user will perform the task. The user interface should be designed to compensate for these limitations.

Task-oriented Suggestions

The nature of the task influences the mode (or modes) users select to perform the task. Tasks which are easy to perform in one mode may be difficult or impossible to perform using another mode. Task-oriented suggestions indicate which tasks lend themselves best to data entry using various modes of entry.

New mobile devices will enable users to enter data by speaking into a microphone, writing with a stylus, and pressing keys on a small keypad. These input modes can be used to perform the following four basic manipulation tasks:

  1. Select objects
  2. Enter text
  3. Enter symbols
  4. Enter sketches or illustrations

There are other basic tasks, but the tasks mentioned above are performed most frequently in common applications using handheld computers.

Table 1 summarizes how users perform the four basic tasks using the following popular input modes:

  1. Voice
  2. Pen
  3. Keyboard/keypad
  4. Mouse/joystick

Table 1: Performing the four basic manipulation tasks using four popular input modes, ranked from easiest (1) to most difficult (4)

| Content Manipulation Task | Voice Mode | Pen Mode | Keyboard/keypad | Mouse/Joystick |
|---|---|---|---|---|
| Select objects | (3) Speak the name of the object | (1) Point to or circle the object | (4) Press keys to position the cursor on the object and press the select key | (2) Point to and click on the object or drag to select text |
| Enter text | (2) Speak the words in the text | (3) Write the text | (1) Press keys to spell the words in the text | (4) Spell the text by selecting letters from a soft keyboard |
| Enter symbols | (3) Say the name of the symbol and where it should be placed | (1) Draw the symbol where it should be placed | (4) Enter one or more characters that together represent the symbol | (2) Select the symbol from a menu and indicate where it should be placed |
| Enter sketches or illustrations | (2) Verbally describe the sketch or illustration | (1) Draw the sketch or illustration | (4) Impossible | (3) Create the sketch by moving the mouse so it leaves a trail (similar to an Etch-a-Sketch™) |

Select objects. Object selection is easy with a pen: just point to or circle the desired object. When using voice, just say the name of the desired object, assuming the object has a name. With a keyboard, press keys to position the cursor on the desired object and press the select key.

Enter text. Each of the four modes can be used for text entry: the user speaks words into a microphone, handwrites the words using a pen, presses keys on a keypad to spell the words, or selects letters from a soft keyboard. Most users can speak and write easily. However, some training and practice may be necessary to use a keyboard or mouse efficiently.

Enter symbols. Entering mathematical equations, special characters, and signatures is easy with a pen, awkward and time-consuming with a mouse, and most difficult with speech.

Enter sketches or illustrations. Drawing simple illustrations and maps is easy with a pen, awkward with a mouse, and nearly impossible with speech. When speaking, users must verbally describe the illustration or map.

Each input mode has its strengths and weaknesses. Voice is good for describing attributes. The pen is good for pointing and sketching. Keys are good for entering text, numbers, and symbols. A useful and efficient multimodal system uses the appropriate mode for each entry.
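As an illustration only (it is not part of the original Note), the rankings in Table 1 can be encoded as data and used to order whatever modes a device actually offers. The task keys, mode names, and the function `easiest_modes` are hypothetical names chosen for this sketch.

```python
# Rankings copied from Table 1 (1 = easiest, 4 = most difficult).
# The dictionary keys are hypothetical identifiers chosen for this sketch.
TABLE_1_RANK = {
    "select_objects": {"voice": 3, "pen": 1, "keyboard": 4, "mouse": 2},
    "enter_text": {"voice": 2, "pen": 3, "keyboard": 1, "mouse": 4},
    "enter_symbols": {"voice": 3, "pen": 1, "keyboard": 4, "mouse": 2},
    "enter_sketches": {"voice": 2, "pen": 1, "keyboard": 4, "mouse": 3},
}


def easiest_modes(task, available_modes):
    """Return the modes available on the device for a task, easiest first."""
    ranks = TABLE_1_RANK[task]
    usable = [mode for mode in available_modes if mode in ranks]
    return sorted(usable, key=lambda mode: ranks[mode])


# A device with only a microphone and a keypad:
print(easiest_modes("enter_text", ["voice", "keyboard"]))
```

Note that Table 1 marks sketching with a keyboard as impossible, so a real design would probably exclude that combination rather than merely rank it last.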

1.1. Suggestion: For each task, use the easiest modes available on the device.

Suggestion examples include:

Physical Suggestions

Different physical devices exhibit different usability characteristics. The device's size, shape, and weight affect how it may be used. Most important, the placement of a microphone and speaker, the size of the display and writing surface, and the size of keys in a keypad all affect the ease with which a user can enter information by speaking, writing, or pressing keys. Table 2 summarizes the four modes of input with respect to physical usability issues.

Table 2: Physical usability issues for the four most popular modes of information entry

| Device Usability Issues | Voice Mode | Pen Mode | Keystroke Mode | Mouse/joystick Mode |
|---|---|---|---|---|
| Required number of user hands | None (plus possibly one to hold the device) | One (plus possibly one to hold the device) | One or two | One |
| Required use of eyes | No | Yes | Frequently, but some users can operate familiar keyboards without looking at them | Yes |
| Portable | Yes, especially when walking | Yes, but difficult while walking | Yes, but difficult while walking | Yes, but difficult while walking |

Required number of user hands. A user's hands may be required when operating machinery, assembling parts into a device, or creating an object of art. No hands are needed to speak and listen to a voice user interface. A pen requires one hand to hold the pen. A mouse requires one hand to hold the mouse and in most cases requires a surface for the mouse to rest on. By their nature, handheld devices also may require a hand to hold the device. A 12-key keypad requires one hand to enter data, while a QWERTY keypad requires two hands to enter data efficiently. Some users become skilled at holding a small QWERTY keyboard with both hands and using their thumbs to type.

1.2. Suggestion: If the user's hands are unavailable for use, then make speech available.

Suggestion examples include:

Required use of eyes. A user's eyes should be focused primarily on the road while driving a vehicle, on a physical device to be constructed or repaired, or on subjects and their activities while observing an experiment. Usually, users must look at what they are writing with a pen or typing on a keypad. However, the user's eyes may be free to observe his or her environment while speaking.

1.3. Suggestion: If the user's eyes are busy or not available, then make speech available.

Suggestion examples include:

Portable. Speech and pen devices are very portable. Users may use them while sitting, standing, walking, and sometimes while running. Traditionally, keyboard devices are used only while the user is not moving. Keypads requiring only one hand, like those frequently found on handheld devices and telephones, can be used while sitting or standing.

1.4. Suggestion: If the user may be walking, then make speech available.

Suggestion examples include:

Environmental Suggestions

People work in environments that may not be ideal for some modes of user interfaces. The environment might be noisy or quiet, hot or cold, light or dark, or moving or stationary with a variety of distractions and possible dangers. Multimodal user interfaces must be designed to work in the environments where they will be used. Table 3 summarizes the environmental usability issues with respect to four popular input modes.

Table 3: Environmental usability issues for the four popular modes of information entry

| Device Usability Issues | Voice Mode | Pen Mode | Keystroke Mode | Mouse/joystick Mode |
|---|---|---|---|---|
| Noisy environment | Works poorly in a noisy environment | Works well in a noisy environment | Works well in a noisy environment | Works well in a noisy environment |
| Other environmental concerns | Works well independently of gloves | Does not work well when users must wear thick gloves | Does not work well when users must wear thick gloves | Does not work well when users must wear thick gloves |

Noisy environment. Because speech recognition systems pick up background sounds, they often make mistakes if the user speaks in a noisy environment.

1.5. Suggestion: If the user may be in a noisy environment, then use a pen, keys, or mouse.

Suggestion examples include:

Other environmental concerns. Pen and keyboard devices are difficult to use if the user must wear thick gloves, such as in a cold environment or when protecting hands from rough objects.

1.6. Suggestion: If the user's manual dexterity may be impaired, then use speech.

A suggestion example is:
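Suggestions 1.2 through 1.6 can be read together as a small rule set over the user's situation. The following minimal sketch is an illustration that is not part of the original Note; the `Context` fields, the mode names, and the function `recommended_input_modes` are all hypothetical. It also assumes that a condition favoring speech rules out the manual modes, which is stricter than "make speech available."

```python
from dataclasses import dataclass

MANUAL_MODES = {"pen", "keys", "mouse"}


@dataclass
class Context:
    """Hypothetical description of the user's situation."""
    hands_busy: bool = False    # Suggestion 1.2
    eyes_busy: bool = False     # Suggestion 1.3
    walking: bool = False       # Suggestion 1.4
    noisy: bool = False         # Suggestion 1.5
    thick_gloves: bool = False  # Suggestion 1.6


def recommended_input_modes(ctx):
    """Return the set of input modes that suit the given context."""
    modes = {"speech"} | MANUAL_MODES
    if ctx.noisy:
        # 1.5: speech recognition works poorly in noise.
        modes.discard("speech")
    if ctx.hands_busy or ctx.eyes_busy or ctx.walking or ctx.thick_gloves:
        # 1.2, 1.3, 1.4, 1.6: speech is the mode that still works.
        modes -= MANUAL_MODES
    return modes


print(recommended_input_modes(Context(eyes_busy=True)))
print(recommended_input_modes(Context(noisy=True)))
```

An empty result (for example, a noisy environment combined with thick gloves) means the suggestions conflict for that context. The Note does not say how to resolve such a conflict, so the sketch leaves that decision to the designer.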

2. Communicate Clearly, Concisely, and Consistently with Users

Efficient communication is required if teams of people are to achieve success in joint activities. Likewise, effective communication between the user and the device is necessary for achieving the user's goals. The multimodal user interface is the conduit for all communication between the user and the device. Communication should be clear and concise, avoiding ambiguities and confusion. Communication styles should be consistent and systematic so users know what to expect and can leverage the patterns and rhythms in the dialog.

Consistency Suggestions

Consistency enables users to leverage conversational patterns to accelerate their interaction. For example, users can follow a consistent conversational rhythm without having to pause to adjust to heterogeneous dialog styles.

Consistent prompts. If prompts are worded inconsistently, then users must pause to decode each wording format. Users must spend additional time and mental effort to respond to differently structured questions. When prompts are consistently worded, users can concentrate on the answers to questions rather than trying to understand the questions.

2.1. Suggestion: Phrase all prompts consistently.

Suggestion examples include:

Consistent command format. The current state of the art of speech recognition and natural language technology does not always accurately recognize and understand arbitrary complete sentences. Keyword recognition is much faster and more accurate. Many tasks lend themselves to keyword commands better than natural language sentences.

2.2. Suggestion: Enable the user to speak keyword utterances rather than natural language sentences.
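A minimal sketch of keyword handling, offered as an illustration rather than as a recognizer design: it scans already-recognized text for the first known keyword and ignores everything else. The keyword set and the function `spot_keyword` are hypothetical; a real system would more likely express the keywords in the speech recognizer's own grammar format.

```python
import string

# Hypothetical command keywords for this sketch.
COMMAND_KEYWORDS = {"call", "save", "delete", "next", "previous"}


def spot_keyword(utterance, keywords=COMMAND_KEYWORDS):
    """Return the first command keyword found in the utterance, or None."""
    for word in utterance.lower().split():
        token = word.strip(string.punctuation)
        if token in keywords:
            return token
    return None


print(spot_keyword("Save"))
print(spot_keyword("um, please delete that one."))
```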

Switching modes. Switching modes can be jarring and sometimes surprising. For example, a user who has just answered three verbal questions will be surprised if a textual question suddenly pops up.

2.3. Suggestion: Switch presentation modes only when the information is not easily presented in the current mode.

Suggestion examples include:

Command consistency. Using different commands for the same purpose confuses users, as does using the same command for multiple functions.

2.4. Suggestion: Make commands consistent.

Users tend to use the wording which is visually presented. Include the command names shown on buttons and other navigational elements in the grammar for the voice mode. All voice commands that achieve the same functionality should have the same grammar. Users also tend to use commands they know from their daily use of computers. Incorporate these commands into the grammar, even if they are not visually presented in the GUI.

Suggestion examples:
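One way to keep voice and GUI commands aligned is to derive the voice grammar from the GUI labels, add familiar phrases, and reject any phrase bound to two different functions. The sketch below is an illustration, not part of the original Note; `build_voice_grammar`, the command identifiers, and the plain phrase-to-command dictionary standing in for a grammar are assumptions.

```python
def build_voice_grammar(gui_labels, familiar_phrases):
    """Map each spoken phrase to exactly one command identifier.

    gui_labels: {command_id: label shown on the button or menu item}
    familiar_phrases: {command_id: [phrases users know from other software]}
    """
    grammar = {}

    def add(phrase, command):
        key = phrase.strip().lower()
        # The same phrase must never trigger two different functions.
        if grammar.get(key, command) != command:
            raise ValueError(
                f"'{key}' is bound to both {grammar[key]} and {command}"
            )
        grammar[key] = command

    for command, label in gui_labels.items():
        add(label, command)
    for command, phrases in familiar_phrases.items():
        for phrase in phrases:
            add(phrase, command)
    return grammar


grammar = build_voice_grammar(
    {"save": "Save", "close": "Close"},
    {"close": ["exit", "quit"]},
)
print(grammar)
```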

2.5. Suggestion: Make the focus consistent across modes.

If the user is prompted to speak a value for a field, then highlight that field in the GUI.

Suggestion examples:

Organizational Suggestions

Grade school teachers always teach that organizing your thoughts before writing a composition will dramatically improve its understandability. The same principle applies to user interfaces. Organizing information and transitioning between topics will improve the users' comprehension of and performance with the multimodal interface. Information should be structured and organized in ways that are familiar to the user.

Content structure. Audio cues help users understand audio information. For example, use a click to introduce each item of a bulleted list, increase the volume to emphasize highlighted text, or use a whisper to speak parenthetical text.

2.6. Suggestion: Use audio and/or visual icons to indicate the content structure.

There are generally accepted icons to represent content structure. For example, a clock may indicate that an application is busy, arrows may represent next and previous pages, etc.

Because there are no standard assignments of meanings for sounds, common sense and user testing should guide the dialog designer. Here are suggestions for items that lend themselves to non-speech sounds:

Chunks of information. Users comprehend audio information more easily if it is presented as blocks, or chunks, of information. For example, users may not recognize "six, one, seven, two, two, five, four, three, seven, six" as a telephone number, but they will recognize "six, one, seven (pause) two, two, five (pause) four, three, seven, six" as either an American or Canadian telephone number.

2.7. Suggestion: Use pauses to divide information into natural "chunks."

Suggestion examples include:
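As a hedged illustration that is not part of the original Note, the telephone number example can be chunked programmatically and rendered with pause markup. The sketch assumes the output goes to a speech synthesizer that accepts the SSML `break` element; the function names, the 3-3-4 grouping default, and the 400 ms pause are assumptions. Only an SSML fragment is produced, without the enclosing `speak` element.

```python
def chunk_digits(digits, group_sizes=(3, 3, 4)):
    """Split a digit string into groups, with digits spoken one at a time."""
    if sum(group_sizes) != len(digits):
        raise ValueError("group sizes must cover every digit exactly once")
    chunks = []
    start = 0
    for size in group_sizes:
        chunks.append(" ".join(digits[start:start + size]))
        start += size
    return chunks


def with_pauses(chunks, pause_ms=400):
    """Join chunks with an SSML break element between them."""
    pause = f'<break time="{pause_ms}ms"/>'
    return f" {pause} ".join(chunks)


print(with_pauses(chunk_digits("6172254376")))
```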

Transitions. A user may become disoriented if the information content suddenly changes. Writers are well aware of the need for transitions between topics. Similar transitions are needed for visual and verbal information.

2.8. Suggestion: Use animation and sound to show transitions.

Suggestion examples:

2.9. Suggestion: Use voice navigation to reduce the number of screens.

Modality synchronization. Multiple modalities should be appropriately synchronized. Here are some examples:

  1. Stop talking/listening when the visual browser is minimized or exited.
  2. The visual and verbal browsers should present the same information at the same time.
  3. In a multifield form, the focus field of the visual browser should correspond to the field prompt currently presented by the verbal browser.

2.10. Suggestion: Synchronize multiple modalities appropriately.
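The synchronization examples above can be sketched as a single controller that owns the form state, so that the visual focus and the spoken prompt are always derived from the same field. This is an illustration under stated assumptions, not an implementation from the Note: the class `SynchronizedForm`, its method names, and the prompt wording are hypothetical.

```python
class SynchronizedForm:
    """Keeps the visual focus and the voice prompt on the same field."""

    def __init__(self, fields):
        if not fields:
            raise ValueError("at least one field is required")
        self.fields = list(fields)
        self.index = 0
        self.visual_active = True

    def _state(self):
        field = self.fields[self.index]
        # Stop talking when the visual browser is minimized or exited.
        prompt = f"Please say the {field}." if self.visual_active else None
        return {"focus": field, "prompt": prompt}

    def current(self):
        return self._state()

    def next_field(self):
        if self.index < len(self.fields) - 1:
            self.index += 1
        return self._state()

    def minimize(self):
        self.visual_active = False
        return self._state()


form = SynchronizedForm(["departure city", "arrival city"])
print(form.current())
print(form.next_field())
```

Because both outputs come from one state object, the two modalities cannot drift apart, which also covers suggestion 2.5 on consistent focus.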

Simplicity. Complex user interfaces are confusing to the user and lead to errors. While this rule applies to all user interfaces, it is especially important to multimodal user interfaces.

2.11. Suggestion: Keep the user interface as simple as possible.

3. Help Users Recover Quickly and Efficiently from Errors

The user interface must help users recover quickly and efficiently from errors. All users, especially novice users, will occasionally fail to respond to a prompt appropriately. The user interface must be designed to detect such errors and assist users to recover naturally. The multimodal interface also should help users learn how to use the user interface to achieve the desired results quickly and efficiently.

Conversational Suggestions

Principles of conversational discourse suggest that the conventions governing the nature, content, and format of information exchanged between two humans may also be applied to information exchanged between a human and a computer.

Reflexive principle. The reflexive principle states that people tend to respond in the same manner that they are prompted. For example, if users are given long rambling prompts, they will likely reply with long rambling responses.

3.1. Suggestion: Enable users to use the same mode that was used to prompt them.

Suggestion examples include:

Verbal help. Speech is more immediate and does not obscure screen contents.

3.2. Suggestion: If privacy is not a concern, use speech as output to provide commentary or help.

Suggestion examples include:

When privacy is not a concern, consider using speech for help and error messages about the current contents of the display, possibly augmenting the display by highlighting the area in which the error occurs.

Directed user interface. While user-directed and mixed initiative user interfaces may be useful for experienced users, they are confusing and inhibiting for novice users. Directed user interfaces always work for all classes of users. Directed search provides the user with results they want quickly and accurately.

3.3. Suggestion: Use directed user interfaces unless the user is always knowledgeable and experienced in the domain.

Context sensitive help. As an application becomes more complex, offering the user more choices, offering help becomes mandatory. For a simple application with fewer choices, the user may need help only the first time the application is run. A novice user may not know the meaning of a field or command.

3.4. Suggestion: Always provide context sensitive help for every field and command.

Enable users to learn the purpose and function of every field, and what values can be entered into the field.

Suggestion example:

One advantage of combining verbal and visual modalities is that help can be offered using speech, the GUI, or both.

Reliability Suggestions

Few situations are more frustrating to users than to have a device at hand but not be able to use it.

Operational status. Users need to know when the device is listening to them speak and when the device is not listening.

3.5. Suggestion: The user always should be able to easily determine if the device is listening to the user.

Operational status can be indicated by a light or an icon on the device.

Power status. One especially frustrating situation is when the device suddenly goes dead because the batteries are low.

3.6. Suggestion: For devices with batteries, the user always should be able to easily determine how much longer the device will be operational.

A suggestion example is:

Backup mode. In Section 1, Table 1 summarized the various strengths and weaknesses of using voice, pen, and keys as input methods. Because user tasks, environmental situations, and user distractions change, users should be able to switch modes when it becomes inconvenient or impossible to use the primary mode of input.

3.7. Suggestion: Support at least two input modes so one input mode can be used when the other cannot.

Suggestion examples include:

Visual feedback. Sometimes speech recognition systems misrecognize the words which a user speaks. It is useful to present words recognized by the speech recognition system to the user who can verify their correctness. In speech only systems, the tiresome phrase "Did you say ...?" is the only option. However, in multimodal systems, the recognized word can be presented on a display.

3.8. Suggestion: Present words recognized by the speech recognition system on the display so the user can verify they are correct.

Correction mode. When the speech recognition fails, the user needs to correct the error by entering the correct word. While the user could simply speak again, a better approach is to display the n-best list (the list of words the speech recognizer heard but did not select) so the user can select from among these options rather than speak again (and possibly experience the same error).

3.9. Suggestion: Display the n-best list to enable easy speech recognition error correction.
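Suggestions 3.8 and 3.9 can be combined: show the top hypothesis for verification and keep the remaining hypotheses ready as correction choices. The following minimal sketch assumes the recognizer returns hypotheses as `(text, confidence)` pairs. That shape, the function `correction_choices`, and the limit of five alternatives are assumptions for illustration, not the API of any particular recognizer.

```python
def correction_choices(n_best, max_alternatives=5):
    """Split recognizer hypotheses into the displayed word and alternatives.

    n_best: list of (text, confidence) pairs in any order.
    Returns (top_text, [alternative texts]), or (None, []) if nothing was heard.
    """
    if not n_best:
        return None, []
    ranked = sorted(n_best, key=lambda hypothesis: hypothesis[1], reverse=True)
    top = ranked[0][0]
    alternatives = []
    for text, _confidence in ranked[1:]:
        # Skip duplicates so the user sees each choice once.
        if text != top and text not in alternatives:
            alternatives.append(text)
    return top, alternatives[:max_alternatives]


shown, choices = correction_choices(
    [("Austin", 0.61), ("Boston", 0.72), ("Houston", 0.40)]
)
print(shown, choices)
```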

Response time. Response times greater than 5 seconds will significantly reduce usage. If a response time exceeds this limit, inform the user that the computer is busy processing the request.

3.10. Suggestion: Try to keep response times less than 5 seconds. Inform the user of longer response times.
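A minimal sketch of the "inform the user" half of this suggestion, assuming a Python application in which the slow work runs as an ordinary function call: a standard-library `threading.Timer` fires a notice only if the work outlasts the threshold. The function name `run_with_busy_notice`, the `notify` callback, and the message text are hypothetical.

```python
import threading
import time


def run_with_busy_notice(task, notify, threshold_s=5.0):
    """Run task(); call notify(message) if it is still running after threshold_s."""
    timer = threading.Timer(
        threshold_s, notify, args=("Processing, please wait",)
    )
    timer.start()
    try:
        return task()
    finally:
        # Has no effect if the notice has already been delivered.
        timer.cancel()


def slow_lookup():
    time.sleep(0.2)  # stands in for a slow request
    return "done"


# A short threshold is used here only so the example finishes quickly.
print(run_with_busy_notice(slow_lookup, print, threshold_s=0.05))
```

The notice is delivered from the timer's thread, so a GUI toolkit that requires updates on its main thread would need to forward the call accordingly.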

4. Make Users Feel Comfortable

Users often judge a computer application by its user interface. If users do not like the user interface, the application will not be used. If the user interface is not easy to learn and easy to use, the application cannot be used successfully.

Listening mode

There are several possible listening modes, including

In theory, always listening would be the preferred listening mode. However, this mode doesn't always work very well, and it makes heavy use of computer resources. So the generally preferred mode is push to activate.

4.1. Suggestion: Use the push-to-activate listening mode when speaking to a mobile device.

It is easy for users to press a speak key before talking. This is similar to asking for permission to speak by raising your hand. However, while speaking, it is desirable to concentrate on what is being said without worrying about holding down a key or pressing a key when finished speaking.
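The behavior described above can be sketched as a two-state machine: one key press starts listening, and listening ends on its own once the user stops talking. The Note does not say how the end of speech is detected; the sketch assumes a silence timeout over audio frames already labeled as speech or non-speech. The class name, the 800 ms default, and the 20 ms frame length are hypothetical.

```python
from enum import Enum


class ListenState(Enum):
    IDLE = "idle"
    LISTENING = "listening"


class PushToActivateListener:
    def __init__(self, end_silence_ms=800):
        self.end_silence_ms = end_silence_ms
        self.state = ListenState.IDLE
        self._heard_speech = False
        self._silence_ms = 0

    def press_speak_key(self):
        """A single press starts listening; the key need not be held."""
        self.state = ListenState.LISTENING
        self._heard_speech = False
        self._silence_ms = 0

    def on_audio_frame(self, is_speech, frame_ms=20):
        """Feed one audio frame; stop listening after enough trailing silence."""
        if self.state is not ListenState.LISTENING:
            return
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0
            return
        self._silence_ms += frame_ms
        if self._heard_speech and self._silence_ms >= self.end_silence_ms:
            self.state = ListenState.IDLE

    @property
    def indicator(self):
        """Text for an on-screen cue so the user can tell if the device listens."""
        if self.state is ListenState.LISTENING:
            return "listening"
        return "not listening"


listener = PushToActivateListener()
listener.press_speak_key()
print(listener.indicator)
```

The `indicator` property also serves suggestion 3.5, which asks that the user always be able to tell whether the device is listening.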

System Status

Users need feedback to determine whether the computer is processing input data, is waiting for input, or is malfunctioning.

4.2. Suggestion: Always present the current system status to the user.

Some suggestions for indicating if the computer is idle or busy are shown in Table 4.

Table 4: Suggested indicators for the current system status

| Mode | Idle | Busy | Error |
|---|---|---|---|
| Text | "Ready for next input" | "Processing, please wait" | Explanation for the cause of the error and how to fix it |
| Icons | Green* | Red* | Blinking "danger" icon |
| Audio | Silence | Sounds of a clicking clock or a percolating coffee pot | Emergency vehicle siren |

* Note: because about 6 per cent of the male population has some degree of color blindness, always use another feature in addition to color. For example, use a "standing person" icon that is green to indicate that the device is idle, and a "walking person" icon that is red to indicate that the device is busy.
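Table 4 and the note on color can be captured as a lookup table in which every icon carries a shape as well as a color. This is a sketch for illustration only; the dictionary layout, the key names, and the function `status_indicators` are assumptions, while the texts, icon shapes, colors, and sounds are taken from the table and note above.

```python
# Indicator values taken from Table 4; the structure is hypothetical.
STATUS_INDICATORS = {
    "idle": {
        "text": "Ready for next input",
        "icon_shape": "standing person",
        "icon_color": "green",
        "audio": None,  # silence
    },
    "busy": {
        "text": "Processing, please wait",
        "icon_shape": "walking person",
        "icon_color": "red",
        "audio": "clicking clock",
    },
    "error": {
        "text": None,  # filled in with a specific explanation
        "icon_shape": "blinking danger icon",
        "icon_color": None,
        "audio": "emergency vehicle siren",
    },
}


def status_indicators(status, error_explanation=None):
    """Return the text, icon, and audio cues for a system status."""
    entry = dict(STATUS_INDICATORS[status])
    if status == "error":
        if not error_explanation:
            raise ValueError("an error status needs an explanation and a fix")
        entry["text"] = error_explanation
    return entry


print(status_indicators("busy"))
```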

Human-memory Constraints

Normally, human short-term memory holds only a limited number of items, so it is necessary to keep verbal lists short. Instead of reading a list of options to users, display the list so users will not forget the spoken information.

4.3. Suggestion: Use the screen to ease stress on the user's short-term memory.

Suggestion examples include:
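As one hedged illustration, not taken from the original Note, an application can decide whether to read a list aloud or to point the user to the screen based on its length. The threshold of four spoken items is an assumption made for this sketch (the Note gives no number), and `present_options` is a hypothetical name.

```python
def present_options(options, max_spoken=4):
    """Always display the options; read them aloud only if the list is short."""
    options = list(options)
    if len(options) <= max_spoken:
        speech = "Choose one of: " + ", ".join(options) + "."
    else:
        speech = f"Choose one of the {len(options)} options shown on the screen."
    return {"speech": speech, "display": options}


print(present_options(["economy", "business", "first"])["speech"])
```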

Social Suggestions

Social customs among people point to suggestions for user interfaces between users and devices.

Privacy. Speech presented by the device is not private. Others in close proximity can hear the computer's speech. The display provides greater privacy.

4.4. Suggestion: If the user may need privacy and the user is not using a headset, use a display rather than render speech.

Speech uttered by the user is not private. Others in close proximity can hear the user. The keyboard/mouse and pen provide greater privacy. Also, present asterisks for password fields.

4.5. Suggestion: If the user may need privacy while he/she enters data, use a pen or keys.

Suggestion examples include:

A related suggestion is to present asterisks instead of displaying private information (e.g., passwords) entered by the user.
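Suggestions 4.4 and 4.5 and the masking advice can be sketched as three small helpers. The function names and the mode labels are hypothetical, and the logic is a simplified reading of the suggestions rather than a complete privacy policy.

```python
def choose_output_mode(needs_privacy, has_headset):
    """4.4: prefer the display when speech output could be overheard."""
    if needs_privacy and not has_headset:
        return "display"
    return "speech"


def choose_input_modes(needs_privacy):
    """4.5: offer only pen or keys when spoken input could be overheard."""
    if needs_privacy:
        return ["pen", "keys"]
    return ["speech", "pen", "keys"]


def mask_private_value(value, mask_char="*"):
    """Show one asterisk per character instead of the private value."""
    return mask_char * len(value)


print(choose_output_mode(needs_privacy=True, has_headset=False))
print(mask_private_value("s3cret"))
```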

Acceptance in meetings. Pen devices are accepted in meetings. They replace a pen and pad of paper for taking notes. Keyboards and keypads are becoming acceptable with the widespread use of laptops. However, key sounds should be turned off. Usually, devices that speak or are spoken to are not accepted in meetings without the use of earphones; and, in some cases, earphones may imply that the user is not interested in the current discussion.

4.6. Suggestion: If the device may be used during a business meeting or in a public place, and no headset is used, then use a pen or keys (with the keyboard sounds turned off).

Advertising Suggestions

Techniques from the field of advertising can be applied to user interfaces to make them more appealing and interesting to the user.

Important messages. Users must notice important messages.

4.7. Suggestion: Use animation and sound to attract the user's attention.

A suggestion example is:

Caution: Users tire of animation and sound quickly. Do not overuse animation and sound.

Navigational aids. It is easy for a user to become "lost in space" when using multimodal applications.

4.8. Suggestion: Use landmarks to help the user know where he or she is.

Example Suggestions include:

Ambience Suggestion

Television and movie directors set the mood with set design, lighting, and background music. Screen layout, colors, and background music also create moods in multimodal user interfaces. However, in some cases, moods and emotion may not be appropriate in productivity applications.

4.9. Suggestion: Use audio and graphics design to set the mood and convey emotion in games and entertainment applications.

Suggestion examples include:

Accessibility Suggestions

Some users have special needs that, when fulfilled, enable them to gain all the benefits of computing generally available to users without special needs. Users with limited or no sight, limited or no hearing, or a cognitive impairment should be able to access the computer.

4.10. Suggestion: For each traditional output technique, provide an alternative output technique.

Suggestion examples include:

4.11. Suggestion: Enable users to adjust the output presentation.

Example suggestions include:

Designing user interfaces to support accessibility generally results in better usability for all users.


Use these suggestions as a checklist when you first construct a multimodal user interface. However, the final decisions about the usefulness and friendliness of the user interface rest on extensive, iterative usability testing. If users do not like or cannot use the user interface, it does not matter if the suggestions were followed. The user interface needs to be changed so users will like it and be productive with it, even if some suggestions are not followed as a result. The users' needs should be the foremost concern for multimodal user interface designers and developers.


The following members of the W3C Multimodal Interaction Working Group contributed suggestions to this Note:

Deborah Dahl, W3C Invited Expert, contributed points that were raised during a tutorial on Multimodal Interfaces presented at the Spring 2006 SpeechTEK/AVIOS meeting.

Ingmar Kliche, T-Systems, contributed suggestions based on his work with developers of multimodal applications at T-Systems.

Gerald McCobb, IBM, contributed suggestions based on his work with developers of multimodal applications at IBM.