From Multimodal to Natural Interactions

Kuansan Wang, Natural Interaction Service Division, Microsoft Corporation


Ever since the pointing device was introduced alongside the keyboard in the graphical user interface (GUI), users have been interacting with computers in a multimodal fashion: the pointing device enables users to point and click on icons that are given specific meanings by application developers, while the keyboard lets them enter more unconstrained text. Despite the successful adoption of the GUI, computers are still regarded as too difficult for the masses to use. Although its graphical nature offers great flexibility for innovative presentation, the GUI still puts the burden on users to discover how to express their intents by interacting with graphical objects in ways foreseen by the application designers. As computers become feature-rich to meet ever increasing and diverse needs, feature discoverability becomes critical to usability. Users are frustrated when the computer cannot fulfill their wishes; quite often, it is not that the application lacks the capabilities the user wants, but that the user has a hard time discovering and remembering how to use them.

Microsoft is committed to making computers more user friendly and accessible to a larger user population. We believe a logical step toward making computers easier to use is to shift some of the responsibility for communication from the user to the computer. Instead of only having users learn and discover how to express themselves in the application designer’s terms, why not also allow them to specify their intents the way they communicate with another human being, and use the growing computational power of the modern computer to understand what they mean? This idea, known as the natural user interface or natural interactions, inevitably requires the computer to interact with the user in modalities such as speech, vision, and gesture that are common in human-to-human communication. The computer must also adopt an interaction style with apparent intelligence, a quality humans exhibit in their communications. And while users have grown accustomed to being bound to the keyboard and pointing device, natural interactions should be able to take place anywhere and anytime, bringing the notion of ubiquitous computing closer to reality.

The “natural” modalities such as speech, vision, and gesture bring unique technical challenges. Whereas the keyboard and pointing device deliver keystrokes and coordinates faithfully and unambiguously, this is not so for the natural modalities. There, the raw signal generated by the user (e.g., a speech waveform or hand gesture) must be recognized into discrete patterns, and state-of-the-art technologies employ pattern recognition algorithms that make errors, especially when the signal is corrupted by environmental noise. Even when noise is well controlled, the natural modalities can produce ambiguous recognition results. For example, it is quite common for a speech recognizer to report that the user might have said “Austin” or “Boston” with different degrees of uncertainty. In contrast, a GUI developer seldom needs to worry whether the user really clicked a button or typed a specific key when such events are reported, whereas confirmation of what the user has just said or written is often quite necessary. On the other hand, the natural modalities offer the user greater expressiveness. For example, a simple spoken command such as “Send the email from Mary to John” can replace a series of search, query, point-and-click, or cut-and-paste operations that, in tandem, are tedious and error-prone for users. The key in designing natural interactions is therefore to leverage the strengths of the GUI and the natural modalities so that they complement rather than compete against each other.

Another issue for natural interactions is the apparent intelligence the computer is expected to possess. How to facilitate intelligent interactions has been an extensively researched topic, with abundant literature in the area of spoken dialog (e.g., [1-3]). A key lesson from the successful implementations is that the flow of the interaction is better computed dynamically from the context of the human-computer exchange than hard-coded into programs a priori in the painstaking step-by-step detail with which many commercial applications are crafted today. Ideally, the human interactions should be merely a natural outcome, or “side effect,” of the computer attempting to infer the user’s intentions and achieve the common goal. To some extent, this principle has been embraced on the Web. Although there are still manually authored Web pages with static links among them, many applications now generate pages dynamically based on up-to-date domain data, application logic, and interaction history. Instead of deploying a collection of Web pages with a fixed cross-linking topology, designers often program into their Web applications a plan that manages user interactions on the fly. For many applications, intelligent interactions are not a luxury but a necessity.
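The contrast between a hard-coded flow and a dynamically computed one can be sketched in a few lines. This is a minimal illustration, not any system's actual logic; the slot names and prompts are hypothetical.

```python
# A minimal sketch of dynamically computed interaction flow: instead of a
# fixed sequence of prompts, the next system action is derived at run time
# from whatever information is still missing. Slot names are hypothetical.

def next_prompt(state, required=("recipient", "subject", "body")):
    """Pick the next system action from the current dialog context."""
    missing = [slot for slot in required if slot not in state]
    if not missing:
        return "Confirm: send the message?"
    # Ask only for the first piece of information the plan still needs.
    return "What is the %s?" % missing[0]

# The flow adapts to the interaction history rather than following
# fixed links between predefined pages or states:
print(next_prompt({}))                                   # asks for recipient
print(next_prompt({"recipient": "John", "subject": "lunch"}))  # asks for body
```

The same function serves every dialog state, so adding a new slot changes the declared plan rather than the control flow.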

Case Study: MiPad

MiPad, which stands for Multimodal Interactive Personal Assistance Device, is a speech-enabled personal information manager for mobile devices. Users can access their email, calendar, and contact information in a multimodal fashion using a stylus or naturally spoken commands.

To facilitate intelligent interactions, MiPad employs an event-driven interaction manager [4] in which user interactions are driven dynamically by “semantic” events, such as the instantiation of semantic objects that represent the meaning of the user’s actions. The interaction manager constantly evaluates these semantic objects as they arrive, initiating user interactions based on the outcome of the semantic evaluation. The core of MiPad’s interaction manager implements the axioms of collaborative communication agents described in [1-3], implying that the interaction is goal oriented and every user action represents a goodwill intention to assist the computer in achieving the goal. In other words, the semantic objects received by MiPad should always lead to an outcome in which MiPad is closer to completing a domain-specific task. Introducing a new feature to MiPad therefore amounts to declaring the domain knowledge that helps the interaction manager compute how far a dialog state is from completing a task. An XML implementation is described in [5].
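The event-driven scheme can be sketched as follows. This is a hypothetical simplification, not MiPad's implementation: the domain knowledge is reduced to a set of required slots, and the "distance" from task completion is just the number of slots still unfilled.

```python
# Hypothetical sketch of an event-driven interaction manager: semantic
# objects are evaluated as they arrive, and the system's next move is
# computed from how far the dialog state is from completing the task.

class InteractionManager:
    def __init__(self, task_slots):
        self.task_slots = set(task_slots)  # declared domain knowledge
        self.state = {}                    # slots filled so far

    def distance_to_goal(self):
        """How far the current dialog state is from completing the task."""
        return len(self.task_slots - set(self.state))

    def on_semantic_object(self, name, value):
        """Evaluate one semantic event and return the system's next move."""
        if name in self.task_slots:
            self.state[name] = value       # each event should reduce distance
        if self.distance_to_goal() == 0:
            return "execute-task"
        # Still missing information: prompt for an unfilled slot.
        return "prompt:" + sorted(self.task_slots - set(self.state))[0]

mgr = InteractionManager(["recipient", "message"])
print(mgr.on_semantic_object("recipient", "John"))   # still needs the message
print(mgr.on_semantic_object("message", "hello"))    # task is now complete
```

Adding a feature in this style means declaring new slots (domain knowledge), not writing new control flow, which mirrors the declarative approach the paragraph describes.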

In the first version, the semantic evaluation process for speech did not take place until the user had finished speaking the whole utterance. While this interaction style adheres to the turn-taking model common in most spoken dialog systems, it sometimes leads to suboptimal results from the system’s point of view. To address this issue, a new approach to extracting semantic objects from speech, called semantic synchronous understanding (SSU) [6-7], was introduced. SSU allows the semantic evaluation process to provide context to the speech modality, improving semantic object recognition accuracy and reducing ambiguity, and the semantic objects conveyed through speech are reported as soon as they occur in the utterance so that the semantic evaluation process can update the context in a timely manner. One unique consequence of SSU is that the interaction style no longer has to follow the turn-taking model. Under the event-driven model, MiPad can execute on the semantic objects and immediately update the display before the user even finishes the utterance, overlaying the computer’s turn on top of the user’s turn. Furthermore, because the semantic context is constantly updated amid a user utterance, the interpretation of a speech utterance may differ depending on user activities in non-speech modalities. For example, the utterance “Send email to John and Mary” may be interpreted as meaning that both John and Mary, or John alone, are primary recipients, depending on whether the user has clicked on the “CC” field when saying “Mary” (online demo available at [8]). User experiments were conducted to assess the impact of SSU on multimodal interactions. Although SSU always intrudes on the user’s utterance, this interaction style was preferred by a statistically significant margin. The immediate feedback also avoided problems in handling errors caused by spontaneous speech or noise, and simplified the confirmation strategy [6-7]. Since these are critical issues to address whenever a system includes modalities that may produce uncertain results, we are encouraged by these findings from the SSU technology.
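The “Send email to John and Mary” example can be made concrete with a small sketch of incremental interpretation. The event encoding and field names here are hypothetical; the point is only that a GUI action arriving mid-utterance changes how the remaining speech is interpreted.

```python
# A sketch of SSU-style incremental interpretation: recognized names are
# assigned to a field as soon as they occur, so a tap on "CC" in the
# middle of the utterance reroutes the names spoken after it. The event
# encoding and field names are hypothetical illustrations.

def interpret(events):
    """events: interleaved speech tokens ('say', name) and taps ('tap', field)."""
    fields = {"To": [], "CC": []}
    active = "To"                         # default target for recipients
    for kind, value in events:
        if kind == "tap":
            active = value                # GUI action updates the context...
        else:
            fields[active].append(value)  # ...before the utterance finishes
    return fields

# No tap: both names are primary recipients.
print(interpret([("say", "John"), ("say", "Mary")]))
# Tapping "CC" mid-utterance: only John is a primary recipient.
print(interpret([("say", "John"), ("tap", "CC"), ("say", "Mary")]))
```

A whole-utterance recognizer would see the same token sequence in both cases; it is the timely context update that disambiguates them.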


  1. Allen J.F., Ferguson G., Miller B., and Ringger E., “Trains as an embodied natural language system,” Proc. AAAI-95 Symposium on Embodied Language and Action, 1995.
  2. Sadek M.D., Ferrieux A., Cozannet A., Bretier P., Panaget F., and Simonin J., “Effective human-computer cooperative spoken dialogue: The AGS demonstrator,” Proc. ICSLP-96, 1996.
  3. Cohen P.R. and Levesque H.J., “Communicative actions for artificial agents,” Proc. ICMAS-95, San Francisco, 1995.
  4. Wang K., “An event driven model for dialog systems,” Proc. ICSLP-98, Sydney, Australia, 1998.
  5. Wang K., “Implementation of a multimodal dialog system using extensible markup language,” Proc. ICSLP-2000, Beijing, China, 2000.
  6. Wang K., “A study of semantic synchronous understanding on speech interface design,” Proc. UIST-2003, Vancouver, Canada, 2003.
  7. Wang K., “A detection based approach to robust speech understanding,” Proc. ICASSP-2004, Montreal, Canada, 2004.