Voice Browsing the Web for Information Access

Rajeev Agarwal, Yeshwant Muthusamy, and Vishu Viswanathan

Media Technologies Lab
Texas Instruments Incorporated
P.O. Box 655303, MS 8374, Dallas, TX 75265
[rajeev | yeshwant | vishu]@csc.ti.com
 

Abstract

There is a large amount of information on the World Wide Web that is at the fingertips of anyone with access to the internet.  However, so far this information has primarily been used by people who connect to the web via a traditional computer.  This is about to change.  Recent advances in wireless communication, speech recognition, and speech synthesis technologies have made it possible to access this information from any place, at any time, using only a cellular phone.  Some possible applications are browsing the web, getting stock quotes, verifying flight schedules, getting maps and directions for various locations, and checking E-mail.  In this paper, we discuss different types of web-based applications, briefly describe our system architecture with examples of applications we have developed, and discuss some of the key issues in building spoken dialog applications for the web.

1. Introduction

The World Wide Web offers a plethora of information that is already at the fingertips of anyone who has access to the internet.  However, so far this information has primarily been used by people who connect to the web via a traditional computer.  This is about to change.  Recent advances in wireless communication, speech recognition, and text-to-speech (TTS) synthesis technologies have made it possible to access this information from any place, at any time, using only a cellular phone.  Some possible applications are browsing the web, getting stock quotes, verifying flight schedules, getting maps and directions for various locations, and checking E-mail.  The use of speech recognition for input and speech synthesis for output provides an easy-to-use interface that is ideal for hands-free, eyes-free operation.  In addition, it is important to have some degree of dialog management built into the system to handle erroneous, ambiguous, or incomplete utterances from the user and to facilitate naturally spoken input.

In this paper, we outline different categories of applications for voice browsers and present suggestions that should be helpful in setting standards for voice browsers.  We also give an overview of our work in this area and discuss some design issues that are important in building portable, usable, and robust spoken interfaces for web-based applications.

2.  Application Categories

We classify the variety of applications that can be built using voice browsers into three categories: web browsing, limited information access, and spoken dialog applications.

Web browsing applications satisfy the basic need for accessing information on the web by voice.  The user may go to any web page and then speak any of the links on that page.  At Texas Instruments, we have developed a voice-driven web browsing tool called SAM (Speech-Aware Multimedia) that incorporates features such as speakable commands, speakable bookmarks, speakable links, and "smart" pages.  It also supports the Java Speech API [JSAPI, 1998], allowing users to run speech-enabled Java applets.  For a detailed description of SAM and related web-based speech applications, please refer to [Hemphill and Thrift, 1995] and [Hemphill and Muthusamy, 1997].
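To illustrate what a speech-enabled page or applet can do with the Java Speech API, the sketch below follows the standard JSAPI usage pattern: it loads a small JSGF grammar of speakable commands and links (the file name links.gram is a hypothetical placeholder) and simply prints whatever the recognizer accepts; a real voice browser would instead follow the spoken link.

import java.io.FileReader;
import java.util.Locale;
import javax.speech.*;
import javax.speech.recognition.*;

public class SpeakableLinks extends ResultAdapter {

    // Called when the recognizer accepts an utterance; print the best tokens.
    public void resultAccepted(ResultEvent e) {
        Result r = (Result) e.getSource();
        ResultToken[] tokens = r.getBestTokens();
        StringBuffer spoken = new StringBuffer();
        for (int i = 0; i < tokens.length; i++) {
            spoken.append(tokens[i].getSpokenText()).append(' ');
        }
        System.out.println("Heard: " + spoken.toString().trim());
    }

    public static void main(String[] args) throws Exception {
        // Create and allocate a recognizer for English.
        Recognizer rec = Central.createRecognizer(new EngineModeDesc(Locale.ENGLISH));
        rec.allocate();

        // Load a JSGF grammar of speakable commands and links
        // ("links.gram" is a hypothetical file name).
        RuleGrammar grammar = rec.loadJSGF(new FileReader("links.gram"));
        grammar.setEnabled(true);
        rec.commitChanges();

        // Listen for results and start recognition.
        rec.addResultListener(new SpeakableLinks());
        rec.requestFocus();
        rec.resume();
    }
}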

Limited information access applications take voice browsers one step further by giving the user the ability to provide spoken input to form-filling applications.  Once the user has specified the values of certain fields in a form, the information from that page may be used to query either a CGI script or a Java servlet in order to access the desired information.  Although this is a very useful capability, there is only a limited set of tasks that one can accomplish with the small, fixed grammars of these systems.  For most "real" applications, it is important to provide the user with a natural interface, and this requires a spoken dialog system, our third category.  We believe that this is where the largest number of information access applications are possible.  We have done considerable work both in building Java applet-based limited information access systems and in building spoken dialog systems.
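As a concrete illustration of the form-filling query step, the sketch below assembles two field values into a CGI query and reads back the HTML response.  The URL and parameter names (www.example.com, from, to) are hypothetical placeholders rather than an actual service.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class FormQuery {
    public static void main(String[] args) throws Exception {
        // Field values obtained from the speech-enabled form.
        String from = "Dallas";
        String to = "San Jose";

        // Build the CGI query string (URL and parameter names are hypothetical).
        String query = "http://www.example.com/cgi-bin/flights?from="
                + URLEncoder.encode(from, "UTF-8")
                + "&to=" + URLEncoder.encode(to, "UTF-8");

        // Issue the request and read back the HTML response, which the
        // application would then parse and convert to text for the TTS synthesizer.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(query).openStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}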

In the next section, we present some suggestions for modifications/extensions to HTML4 or CSS2 that may help in setting standards for voice browsers.  Some of these suggestions may already be well understood, but we include them here for the sake of completeness.  In Section 4, we provide an overview of our work in developing several prototype spoken dialog systems.  Finally, in Section 5, we discuss additional design issues that are important in building such systems.

3.  Suggestions for HTML4/CSS2 Modifications

The voice browsers that exist today have little in the way of standards for integrating the speech recognizer and the TTS synthesizer with the web browser.  In this section, we address some of the issues in such an integration and provide suggestions that may be useful in setting standards for this process.

For example, a speech grammar (such as one written in the Java Speech Grammar Format [JSGF, 1998]) could be associated with a page through a STYLE element that carries a grammar content type and a speech media type:

<STYLE type="text/grammar" media="speech"> ... </STYLE>
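If such markup were adopted, a voice browser could hand the embedded grammar directly to its recognizer.  The Java sketch below assumes the grammar text has already been extracted from the STYLE element; it then uses the standard JSAPI calls for loading and enabling a JSGF grammar.

import java.io.StringReader;
import javax.speech.recognition.Recognizer;
import javax.speech.recognition.RuleGrammar;

public class PageGrammarLoader {

    // Load a page-embedded JSGF grammar into an already-allocated recognizer.
    // 'grammarText' is assumed to hold the content of the page's
    // STYLE element with type="text/grammar".
    public static RuleGrammar loadPageGrammar(Recognizer rec, String grammarText)
            throws Exception {
        RuleGrammar grammar = rec.loadJSGF(new StringReader(grammarText));
        grammar.setEnabled(true);   // activate the page's rules
        rec.commitChanges();        // apply the grammar change to the engine
        return grammar;
    }
}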

4.  Our Voice-Driven Information Access Prototypes

At Texas Instruments, we have created an architecture for enabling the development of spoken dialog systems.  Our architecture is built specifically for web-based information access via speech.  The overall architecture is client-server in nature: the client could be embedded in a cellular phone, while the server could be at any remote location.  Speech synthesis and speech recognition are currently done solely on the client, but each would eventually be split between the client and the server.  The client also has a Global Positioning System (GPS) receiver (a Garmin 12XL), which greatly enhances the array of useful applications that can be created.  The server, on the other hand, has a portable dialog manager that can easily be used in different applications.  The dialog manager is responsible for identifying all useful information in the user's utterance, generating a CGI query on the fly, parsing the HTML that comes back as the response to the query, and generating the textual message that must be spoken back to the user.  In addition, it handles any errors, ambiguities, or incompleteness in the user's requests, helps constrain the grammars for the speech recognizer, and provides context-sensitive help.
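As a rough illustration only, the processing steps described above can be summarized by an interface like the one below; the interface and method names are hypothetical and are not taken from the actual dialog manager.

import java.util.Map;

// Illustrative sketch of the dialog-manager pipeline described above.
// The interface and method names are hypothetical, not the actual implementation.
public interface DialogManager {

    // Identify the useful information (slots) in the user's utterance.
    Map<String, String> extractSlots(String utterance);

    // Generate a CGI query on the fly from the extracted slots.
    String buildQuery(Map<String, String> slots);

    // Parse the HTML that comes back in response to the query.
    Map<String, String> parseResponse(String html);

    // Generate the text to be spoken back to the user, including prompts
    // that resolve errors, ambiguities, or missing information.
    String generateReply(Map<String, String> parsedResponse);
}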

The sample applications we have developed include an InfoPhone for getting flight, weather, and stock information; remote access to E-mail and voice mail; voice navigation to get maps and directions to different locations; and a business locator service to find businesses near the user's current location.  The InfoPhone application belongs to the second category, limited information access, while the others fall under the category of spoken dialog systems.

5.  Additional Design Issues

In Section 3, we described some of the issues related to the proper integration of a speech recognizer and a speech synthesizer into a browser.  We now present additional design issues related to the development of better speech applications for the web.  We have already addressed some of these issues in our application design, as discussed below.

6.  Conclusions

We believe that there will be an explosion of speech applications for the web in the next few years.  It is important that some standards be set now so that all voice browsers can provide at least some features in a consistent manner.  We have also discussed additional design issues which, although not directly related to specification of HTML4/CSS2 standards, are nevertheless important to consider when building such applications.

References

[Agarwal, 1997] "Towards a PURE spoken dialogue system for information access," by R. Agarwal. In Proceedings of the ACL/EACL Workshop on Interactive Spoken Dialog Systems, pp. 90-97, Madrid, Spain, July 1997.

[Hemphill and Thrift, 1995]  "Surfing the Web by Voice", by C. Hemphill and P. Thrift. In Proceedings of ACM Multimedia, San Francisco, CA, November 7-9, 1995, pp. 215-222.

[Hemphill and Muthusamy, 1997] "Developing Web-based Speech Applications", by C. Hemphill and Y. K. Muthusamy. In Proceedings of Eurospeech '97, Rhodes, Greece, September 1997, Vol. 2, pp. 895-898.

[JSAPI, 1998] "Java Speech Application Programming Interface", Beta Version 0.7. Located at http://java.sun.com/products/java-media/speech, June 1998.

[JSGF, 1998]  "Java Speech Grammar Format Specification", Beta Version 0.6. Located at http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html, April 1998.

[JSML, 1998] "Java Speech Markup Language Specification", Beta Version 0.5. Located at http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html, August 1998.

[Marx, 1995] "Towards Effective Conversational Messaging", by M. T. Marx, Master's Thesis, MIT, June 1995.

[Raggett and Ben-Natan, 1998] "Voice Browsers", by D. Raggett and O. Ben-Natan. Located at http://www.w3.org/TR/NOTE-voice, 1998.