Position Paper for the W3C Workshop: "Voice Browsers"

October 8, 1998

Nikko Ström
nikko@sls.lcs.mit.edu

MIT Laboratory for Computer Science
Spoken Language Systems Group
http://www.sls.lcs.mit.edu

Background

The Spoken Language Systems Group at the MIT Laboratory for Computer Science is devoted to research that will lead to the development of interactive conversational systems. We formulate and test computational models and develop algorithms suitable for human-computer interaction through spoken dialogue. These research results are funneled into the development of experimental conversational systems with varying capabilities. For example, our GALAXY system handles queries in three domains: weather, air travel, and city guide; its architecture uses a Java-enabled web browser as a graphical user interface and a telephone line for spoken interaction. Our Jupiter system provides conversational access to weather information for more than 500 cities worldwide via a standard telephone.

Research Interests

Voice-enabled browsing, in the sense of speaking the contents of a page to the user and/or letting the user speak links aloud, is a valuable objective, particularly for furthering web accessibility. However, HTML is designed with a graphical user interface in mind, and it is clear that this simple voice-browsing model does not exploit the full power of conversational interaction. The graphical user interface is asymmetrical in the sense that the information flowing from the system to the user typically has much higher bandwidth than the user's responses. The speech modality, on the other hand, is more symmetrical. Thus, to unleash the full power of speech interaction, one needs to take advantage of the possibility of richer user input, and filter the information from the system accordingly.

The FORM element is the most input-dense element of HTML and is therefore the natural choice for extensions that would give user input a more prominent role. In fact, many of the domains that we are currently exploring can be reduced to an "E-form" as a first approximation. For example, our Jupiter weather information system can be handled by an E-form with fields for the location, time, and category of information (e.g., "full forecast", "temperature", "chance of rain", etc.). While highly complex domains may not reduce elegantly to an E-form, we believe that many domains, particularly those that we expect can be reasonably handled by conversational systems technology available today and in the near future, are suitable for an E-form formulation.
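As an illustration, the weather query described above might be expressed as an ordinary HTML form (a hypothetical sketch for this paper; the field names and the /weather-query URL are illustrative, not taken from the Jupiter system):

```html
<!-- Hypothetical E-form for a Jupiter-style weather query -->
<FORM ACTION="/weather-query" METHOD="POST">
  Location: <INPUT TYPE="text" NAME="location">
  Time:     <INPUT TYPE="text" NAME="time">
  Category:
  <SELECT NAME="category">
    <OPTION>full forecast</OPTION>
    <OPTION>temperature</OPTION>
    <OPTION>chance of rain</OPTION>
  </SELECT>
  <INPUT TYPE="submit" VALUE="Ask Jupiter">
</FORM>
```

A dialogue manager could treat each field as a slot to be filled by conversational interaction, submitting the form once the query is complete.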

Although the FORM element may already suffice to represent all the information of a user query, there is currently poor support for expressing constraints and relations between fields. Such constraints and relations would make it easier for an automated dialogue manager to find a suitable strategy for completing the form through conversational interaction with the user. A simple example would be to formalize the common convention of indicating which fields are required to complete the form. More complex relations could guide the dialogue manager as to which fields to clear and which to keep across dialogue turns; e.g., in a weather information system: keep the "location" if the user changes only the "time" of the query.
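Such extensions might take the form of new attributes on the input elements. In the sketch below, the REQUIRED and KEEP attributes are purely hypothetical inventions for illustration; they are not part of HTML:

```html
<!-- Hypothetical extended E-form; REQUIRED and KEEP are invented attributes -->
<FORM ACTION="/weather-query" METHOD="POST">
  <!-- REQUIRED: the dialogue manager must elicit this field before submitting -->
  Location: <INPUT TYPE="text" NAME="location" REQUIRED>
  <!-- KEEP: when "time" changes at a dialogue turn, retain the current "location" -->
  Time:     <INPUT TYPE="text" NAME="time" KEEP="location">
  Category: <INPUT TYPE="text" NAME="category">
  <INPUT TYPE="submit" VALUE="Ask Jupiter">
</FORM>
```

Declarative annotations of this kind would let a generic dialogue manager derive a completion strategy directly from the markup, rather than requiring domain-specific dialogue logic for each form.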