BBN Technologies - Speech and Language Processing
Our previous research investigated multimodal interactive error correction for speech user interfaces. Multimodal error correction combines pen, speech, mouse and keyboard input with visual output. Repeated recognition errors are avoided by allowing the user to switch modality for error correction. The first section of this paper describes our prototype multimodal dictation system that enhances a large-vocabulary, continuous speech dictation system with multimodal error correction.
Based on the experience gained during the design, iterative implementation and evaluation of this prototype, we develop position statements on the three workshop focus points in the second section of this paper. The second section first clarifies important terminology. Based on a survey of multimodal systems and our own research, we then identify situations that benefit from multimodal speech and pen input, and to some degree combined visual and verbal interaction in general. Addressing the second workshop issue, we modify the standard interface design methodology - conceptual design, followed by iterative design - to address the additional requirements of multimodal systems, and present an initial set of design principles. Finally, we present open research problems that currently hinder successful commercialization of visual and verbal interfaces.
Automatic dictation systems have become available commercially; however, correction of speech recognition errors is limited to variations of repeating input using continuous speech, choosing from alternative words, and typing. In a multimodal dictation system, the user also corrects recognition errors interactively, but unlike in conventional dictation systems, the user can switch to alternative speech and pen input modalities for error correction.
Correction of recognition errors proceeds in two phases: locating misrecognized words and correcting them. In our prototype, recognition output is displayed on the writing-sensitive screen, and the user locates recognition errors by touching misrecognized words. The user corrects recognition errors by deleting, inserting, or replacing words. In multimodal error correction, the user can choose among alternative speech and pen correction modalities: repeating input using continuous speech (respeaking), verbal spelling, handwriting, and editing using gestures drawn on the display. In addition to these modalities, the user can choose one of the conventional correction methods, i.e., choosing from alternative words and typing. Assuming that the reader is familiar with the conventional methods, we describe the novel speech and pen correction modalities in more detail below.
Multimodal error correction includes interactive methods to correct by repeating input, using speech or pen input. In correction by respeaking, the user simply respeaks verbatim the words (or items) that were misrecognized; in correction by spelling, the user spells the misrecognized words verbally; and in correction by handwriting, the user writes in cursive script on the writing-sensitive display. To achieve high correction accuracy, correction input must be correlated with repair context (for more details, see ). Figure 1 below shows inserting by handwriting.
Figure 1: Example for multimodal error correction, inserting the word "made" by handwriting.
Multimodal error correction also includes editing using pen gestures. In particular, the user can delete words with an X or scratching gesture, or change the position of the cursor with a caret gesture. Additional gestures are available to perform editing on the level of characters within a word (so-called partial-word correction).
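The correction operations described above (deleting, inserting, or replacing words, each triggered through one of several modalities) can be illustrated with a minimal sketch. This is not the prototype's implementation: the names `Modality`, `Operation`, `Correction` and `apply_correction` are hypothetical, the recognition step is assumed to have already turned the user's correction input into words, and the example sentence is invented to mirror the insertion of "made" shown in Figure 1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Modality(Enum):
    """Correction modalities named in the text (identifiers are hypothetical)."""
    RESPEAKING = "respeaking"
    SPELLING = "verbal spelling"
    HANDWRITING = "handwriting"
    GESTURE = "pen gesture"
    CHOICE_LIST = "choosing from alternative words"
    TYPING = "typing"


class Operation(Enum):
    DELETE = "delete"
    INSERT = "insert"
    REPLACE = "replace"


@dataclass
class Correction:
    operation: Operation
    modality: Modality
    position: int                      # index of the located word (or caret position for INSERT)
    words: Optional[List[str]] = None  # recognized correction input; unused for DELETE
    length: int = 1                    # number of located words (DELETE and REPLACE only)


def apply_correction(hypothesis: List[str], c: Correction) -> List[str]:
    """Return a corrected copy of the recognition hypothesis."""
    if not 0 <= c.position <= len(hypothesis):
        raise IndexError("correction position is outside the hypothesis")
    result = list(hypothesis)  # leave the original hypothesis untouched
    new_words = c.words or []
    if c.operation is Operation.DELETE:
        del result[c.position:c.position + c.length]
    elif c.operation is Operation.INSERT:
        result[c.position:c.position] = new_words
    else:  # Operation.REPLACE
        result[c.position:c.position + c.length] = new_words
    return result


# Invented example: the recognizer dropped "made"; the user places the caret
# before "a" and inserts the missing word by handwriting.
hypothesis = ["we", "a", "reservation"]
insert_made = Correction(Operation.INSERT, Modality.HANDWRITING, position=1, words=["made"])
corrected = apply_correction(hypothesis, insert_made)
print(" ".join(corrected))
```

Keeping the modality separate from the operation reflects the point made in the text: the same edit can be reached through different input channels, so the user can switch modality without changing what the edit does.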
Experts in the field contributed to an extension of the EAGLES handbook on spoken language systems  that covers multimedia and multimodal systems. Insights from this work may help to focus the discussion in the present workshop.
The EAGLES experts recommend a taxonomy of multimodal systems that distinguishes on the highest level between multimodality and multimedia. Multimodality refers to several modalities on the input side, while multimedia refers to several modalities on the output side of a human-computer interface. Using this terminology, the present workshop encompasses multimedia with audio and visual output, and multimodality with speech and pen input. This paper focuses on multimodal speech and pen input.
Other topics addressed by the EAGLES initiative that are relevant for this workshop include: recommendations for interfaces that combine speech input and output with other modalities, and available standards and resources (databases).
A workshop on multimedia and multimodal interface design attempted to identify situations in which multimodal interaction helps . These areas include: increasing interpretation accuracy and presentation clarity, enabling new applications, offering freedom of choice, and increasing naturalness of human-computer interaction. Have nine years of research on multimodal interfaces since substantiated these claims, suggesting specific situations in which multimodal interaction helps?
By surveying previous research on multimodal systems, we identified the following four specific situations in which a combination of pen and speech input appears to be useful: references to visually displayed objects, command control, document manipulation, and video analysis. First, deictic references to objects are easier to express using gestures than in speech alone, provided that the user can see the objects. Initially demonstrated in , the ease of resolving deictic object references using combinations of visual and verbal interaction is exploited in a multitude of research systems that support interaction with maps. Such systems have been developed for city maps , real estate maps , geographic maps , and calendars . Second, pen input can facilitate command control because gestures can determine both type and scope of the operation intended by the user. Research systems in this category include drawing and sketching tools [9, 12]. Other research systems combined visual and verbal interaction in applications that manipulate graphic documents [5, 7] and for analysis of video and image data [3, 15]. However, only a few formal evaluations of multimodal interfaces have been conducted.
Our own research  shows that combining visual and speech interaction is useful in two other situations: error correction and data entry. In interactive correction of recognition errors, switching modality avoids repeated errors. In particular, for correction of speech recognition errors, correction accuracy is significantly higher if the user switches from speech to other modalities, such as pen gestures, handwriting, and choosing from visually displayed lists of alternative interpretations. Concerning the second situation, data entry is faster if the input modality can be switched, because data input efficiency depends on both modality and data type. For example, while speech is very efficient for text input, pen input is faster for entry of numeric data and manipulation of graphic data.
Human-computer interface design generally proceeds from a conceptual design to several iterations of redesign and informal user tests. In our experience, this methodology can be successfully applied to the design of multimodal interfaces (in particular interfaces that combine pen and speech input) with appropriate modifications.
An important problem in the conceptual design of multimedia and multimodal interfaces is choosing which modalities are available for user input and system output. In this process, the designer can draw on the following three knowledge sources (discussed in the following paragraphs): knowledge of available hardware and recognition systems, predictions from models of multimodal interaction, and emerging design principles for multimodal interfaces.
The design of interfaces that use pen and speech input needs to circumvent limitations of hardware and recognition systems, which otherwise lead to severe usability problems. For speech input, headsets provide the sound quality required for highly accurate recognition, but are obtrusive to the user. Displays that capture pen input are much harder to handle than pen and paper. Furthermore, the interface design needs to compensate for imperfect recognition performance.
Models of multimodal interaction may help the designer in choosing the set of modalities. In addition, such models can predict the impact of the chosen modalities on performance. Some initial models have recently been published [10, 13]. Our model of multimodal recognition-based interaction predicts input speed, including the time required for error correction, based on parameters that characterize the modality and the recognition system used.
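To make the idea of such a performance model concrete, the following is a minimal sketch of how input speed including correction time might be estimated. It is not the model referred to above. It is a simplified illustration under stated assumptions: every misrecognized word is corrected individually, each correction attempt takes a fixed time and succeeds independently with a fixed probability, and the parameter names (`input_wpm`, `accuracy`, `correction_seconds`, `correction_accuracy`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityParams:
    """Hypothetical parameters characterizing a modality and its recognizer."""
    input_wpm: float            # raw input rate in words per minute
    accuracy: float             # probability that a word is recognized correctly
    correction_seconds: float   # time for one correction attempt on one word
    correction_accuracy: float  # probability that a correction attempt succeeds


def effective_wpm(p: ModalityParams) -> float:
    """Estimate correct words per minute, including time spent on correction.

    Assumption: correction attempts are repeated until one succeeds, so the
    expected number of attempts per error is 1 / correction_accuracy.
    """
    if p.input_wpm <= 0 or p.correction_seconds < 0:
        raise ValueError("input_wpm must be positive and correction_seconds non-negative")
    if not 0.0 <= p.accuracy <= 1.0 or not 0.0 < p.correction_accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1] and correction_accuracy in (0, 1]")
    seconds_per_word = 60.0 / p.input_wpm
    expected_attempts = 1.0 / p.correction_accuracy
    expected_correction = (1.0 - p.accuracy) * p.correction_seconds * expected_attempts
    return 60.0 / (seconds_per_word + expected_correction)


# Invented numbers, for illustration only: a fast but error-prone modality
# compared with the same modality paired with a more reliable correction method.
same_modality_repair = ModalityParams(120.0, 0.90, 5.0, 0.50)
switched_repair = ModalityParams(120.0, 0.90, 5.0, 0.90)
print(round(effective_wpm(same_modality_repair), 1))
print(round(effective_wpm(switched_repair), 1))
```

Even this crude sketch shows why correction accuracy matters as much as raw input speed: with the invented values above, raising only the success rate of correction attempts noticeably raises the estimated overall input rate.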
The accumulated knowledge of multimodal and multimedia interfaces suggests areas that design principles for visual/verbal interfaces should address. These areas include a characterization of the situations in which each modality is beneficial, advice on how to circumvent known limitations of recognition technology, for example through error correction, and guidance on the implementation of such interfaces, e.g., how to capture different input modalities and how to process multimodal input. Based on our experience with the multimodal dictation system, we propose an initial set of design principles for each of these three areas.
Principles for choosing the set of modalities:
Principles to circumvent limitations of recognition technology:
Principles for the implementation of Pen-Speech Interfaces:
To facilitate the design and development of interfaces with combined visual and verbal interaction, future research must address hardware, multimodal toolkits, and conceptual understanding of multimodal and multimedia applications. First, available hardware to capture speech and pen input still leads to severe usability problems. Second, development of multimodal applications is to date limited to a few research labs, because the required recognition technology is difficult to obtain and requires substantial expert knowledge. Toolkits that support the development of multimodal applications would free resources that currently must be spent on the technology, making them available for application design and user interface issues. Vo's work  is only a first step in this direction. Finally, knowledge about the application situations in which multimodality helps, and about how to go about implementing it, must be gathered and made available to the whole user interface community. We consider initiatives such as EAGLES or this workshop to be initial steps in the right direction.
1. Blattner, M.M. and Dannenberg, R., "CHI '90 Workshop on Multimedia and Multimodal Interface Design." Bulletin of the ACM Special Interest Group on Computer-Human Interaction, 1990. 22(2): pp. 54-57.
2. Bolt, R.A., "Put-That-There: Voice and Gesture at the Graphics Interface." Computer Graphics Journal of the Association of Computing and Machinery, 1980. 14(3): pp. 262-270.
3. Cheyer. "MVIEWS: Multimodal Tools for the Video Analyst," in International Conference on Intelligent User Interfaces. 1997. San Francisco (CA): ACM Press. pp. 55-62.
4. Cheyer, A. and Julia, L. "Multimodal Maps: An agent-based approach," in International Conference on Cooperative Multimodal Communication. 1995. Eindhoven (NL):
5. Fauré, C. and Julia, L. "Interaction Homme-Machine par la Parole et le Geste pour l'edition de documents: TAPAGE," in International Conference on Interfaces to Real and Virtual Worlds. 1993. pp. 171-180.
6. Gibbon, D., Moore, R., and Winski, R., eds. Handbook of Standards and Resources for Spoken Language Systems. 1997, Mouton de Gruyter: Berlin, New York.
7. Hauptmann, A.G. "Speech and Gestures for Graphic Image Manipulation," in International Conference on Computer-Human Interaction. 1989. Austin (TX): ACM. 1. pp. 241-245.
8. Koons, D.B., Sparrell, C.J., and Thorisson, K.R., "Integrating Simultaneous Input from Speech, Gaze, and Hand Gestures," in Intelligent Multimedia Interfaces, M. Maybury, Editor 1993, Morgan Kaufmann. pp. 257-275.
9. Landay, J., Interactive Sketching for the Early Stages of User Interface Design. Ph.D., Computer Science Carnegie Mellon, 1996, Pittsburgh (PA).
10. Mellor, B. and Baber, C. "Modelling of Speech-based User Interfaces," in European Conference on Speech Communication and Technology. 1997. Rhodes (Greece): ESCA. 4. pp. 2263-2266.
11. Oviatt, S., DeAngeli, A., and Kuhn, K. "Integration and Synchronization of Input modes during multimodal Human-Computer Interaction," in International Conference on Computer-Human Interaction. 1997. Atlanta (GA): ACM. 1. pp. 415-422.
12. Rubine, D., "Specifying Gestures by Example." ACM Journal on Computer Graphics, 1991. 25(4): pp. 329-337.
13. Suhm, B., Multimodal Interactive Error Recovery for Non-Conversational Speech User Interfaces. PhD, Computer Science Fredericiana, 1998, Karlsruhe. 210.
14. Vo, M.T., A Framework and Toolkit for the Construction of Multimodal Learning Interfaces. Ph.D., Computer Science Carnegie Mellon, 1998, Pittsburgh. 195.
15. Waibel, A., et al. "Multimodal Interfaces for Multimedia Information Agents," in International Conference on Acoustics, Speech and Signal Processing. 1997. Munich (Germany): IEEE Signal Processing Society.